RStudio Cloud cannot read in 1GB file?

rstudio
rstudioconnect
rstudiocloud

#1

Not sure if anyone has experienced this, but I was trying to read in a 1GB file (uploaded to the cloud in my home dir), and it seems like my script just stops due to memory issues. What’s odd is I can easily run this on my laptop with 16 GB, so if this server has more than that, how is it having memory issues? Any help would be appreciated. Thanks.

https://rstudio.cloud/project/7073

sessioninfo is below:

devtools::session_info()
Session info -------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.2 (2017-09-28)
 system   x86_64, linux-gnu           
 ui       RStudio (1.1.383)           
 language (EN)                        
 collate  C.UTF-8                     
 tz       Zulu                        
 date     2017-11-28                  

Packages -----------------------------------------------------------------------------------------------
 package   * version date       source        
 base      * 3.4.2   2017-09-28 local         
 compiler    3.4.2   2017-09-28 local         
 datasets  * 3.4.2   2017-09-28 local         
 devtools  * 1.13.4  2017-11-09 local         
 digest      0.6.12  2017-01-27 CRAN (R 3.3.2)
 graphics  * 3.4.2   2017-09-28 local         
 grDevices * 3.4.2   2017-09-28 local         
 knitr       1.17    2017-08-10 CRAN (R 3.4.2)
 memoise     1.1.0   2017-04-21 CRAN (R 3.4.2)
 methods   * 3.4.2   2017-09-28 local         
 stats     * 3.4.2   2017-09-28 local         
 tools       3.4.2   2017-09-28 local         
 utils     * 3.4.2   2017-09-28 local         
 withr       2.1.0   2017-11-01 CRAN (R 3.4.2)
 yaml        2.1.14  2016-11-12 CRAN (R 3.3.1)

#2

Are you sure that you have access to 16 GB of RAM on your RStudio Cloud instance? That sounds like a really, really, really generous amount of RAM for free cloud resources :open_mouth:


#3

More than 16gb:


#4

Hmmm. Perhaps the server RStudio Cloud is running on is limiting the memory available to any single project? Maybe it has several project instances sharing a server?


#5

Try cat /sys/fs/cgroup/memory/memory.limit_in_bytes instead.
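The same check can be done from inside the R session. This is only a minimal sketch: `read_memory_limit` is a hypothetical helper name, and it assumes the container exposes the cgroup v1 path mentioned above (if the file is missing it simply returns `NA`).

```r
# Read the cgroup (v1) memory limit from inside an R session.
# Assumption: the cgroup v1 file from the post above exists in this container;
# if it does not, NA is returned instead of raising an error.
read_memory_limit <- function(path = "/sys/fs/cgroup/memory/memory.limit_in_bytes") {
  if (!file.exists(path)) {
    return(NA_real_)
  }
  as.numeric(readLines(path, n = 1, warn = FALSE))
}

limit_bytes <- read_memory_limit()
limit_gb <- limit_bytes / 1024^3  # convert bytes to GiB
print(limit_gb)
```

A limit of 1073741824 bytes divided by 1024^3 is exactly 1, i.e. a 1 GiB cap.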


#6

I see, now I get: 1073741824 bytes (~1GB). So is it still possible to read in a 1GB file on this, then? Are the number of processors, memory, and CPUs capped per project?


#7

You are correct: during the RStudio.cloud Alpha, memory and CPU resources are capped. We’re currently restricting memory usage to 1GB, which may limit your ability to read large files. Obviously this is subject to change, as we learn more about the type of workloads people are looking at running. We really appreciate the feedback!


#8

@pgensler This may go without saying, but while you can theoretically read a 1GB file into a system with ~1GB of memory, in practice this is much more challenging due to R’s inherent copy-on-modify semantics and the memory that is allocated to other things (e.g. R’s own operations).

Although it can be a little irritating at the outset, working in a lower-memory environment can encourage the user to learn ways around their hardware’s limitations. Some areas I might suggest for exploration:

  • The readr functions that focus on chunked sets of records (so you never hold the whole file in memory)
  • (In concert with the above) moving the data into a database, e.g. SQLite, even. Not sure if this is an option in RStudio.cloud. There is a good article on this that I am having trouble locating…
  • Spark (via sparklyr) locally or in a cluster

The latter two can be especially useful when you control the entire system (i.e. not sure how relevant they are for RStudio.cloud Alpha) and can facilitate lots of interesting learning! Admittedly, they can also be less than ideal situations that take longer to process information.

I lived in this world for a year or so, doing data analysis on a 10-yr-old hobby laptop with 4gb RAM. It was not always pleasant, but it taught me a lot about ways to circumvent hardware limitations when resources ($$) were limited! There are also lots of free credits at AWS / Google Cloud for spinning up databases, VMs, etc. to offload work onto, as well! It’s just a matter of figuring out the proper combination of convenience, cost efficiency, and learning for your situation.
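As a minimal sketch of the first suggestion above (the chunked readr functions), the idea is to reduce each chunk to something small before it is kept. Everything specific here is an assumption for illustration: `iris` stands in for the large file, `summarise_chunk` is a hypothetical helper, and the chunk size of 40 is arbitrary.

```r
library(readr)

# Stand-in for a large file; `iris` is used only to keep the sketch reproducible.
path <- tempfile(fileext = ".csv")
write_csv(iris, path)

# Reduce each chunk to one small summary row per species,
# so the full data set is never held in memory at once.
summarise_chunk <- function(chunk, pos) {
  counts <- table(chunk$Species)
  data.frame(
    Species = names(counts),
    n = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

per_chunk <- read_csv_chunked(
  path,
  callback = DataFrameCallback$new(summarise_chunk),
  chunk_size = 40,
  col_types = cols(.default = col_double(), Species = col_character())
)

# Combine the per-chunk summaries into the final counts.
totals <- aggregate(n ~ Species, data = per_chunk, FUN = sum)
print(totals)
```

Only the per-chunk summaries are accumulated, so memory use depends on the chunk size rather than the file size.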


#9

It definitely seems like a very powerful tool, but I’ve never been able to get it to work properly (with files >1GB).
I’ve always struggled to grasp how exactly to use readr’s chunking mechanism. I was trying to recreate a function that someone helped me with (probably not the best example to use, as there are two different kinds of chunks in use), but I kept running into an error (link to full question here):

ReadEmAndWeep <- function(file, chunk_size) {
  f <- function(chunk, pos) {
    data_frame(text = chunk) %>%
      filter(text != "") %>%
      separate(text, c("var", "value"), ":", extra = "merge") %>%
      mutate(
        chunk_id = rep(1:(nrow(.) / 9), each = 9),
        value = trimws(value)
      ) %>%
      spread(var, value)
  }

  read_lines_chunked(file, DataFrameCallback$new(f), chunk_size = chunk_size)
}


Error in read_lines_chunked_(ds, locale, na, chunk_size, callback, progress) : 
  Evaluation error: Column `chunk_id` must be length 92858 (the number of rows) or one, not 92853

This is probably a very convoluted example, so maybe the better approach for this is to use something like
https://github.com/jeremystan/tidyjson inside the function passed to DataFrameCallback$new(f) to read in the file.

Curious to get your thoughts on whether this is how read_lines_chunked is supposed to work. At first, this seems like an encoding issue, but maybe I’m just doing too much inside the function?


#10

That error message looks like it is coming from the definition of chunk_id in your mutate statement. I will try to overview my understanding / intuition - if I get a chance later, I will take a look at your specific question in particular. In any case, the way I think about any of the chunked functions is as follows:

  • The read_*_chunked function itself takes care of “chopping the file up” into chunks, where a chunk is processed as the read_* part of read_*_chunked indicates (i.e. read_delim_chunked will pass each chunk to read_delim before it goes into the callback)
  • Whatever callback I choose (which defines the callback output) will be called for each chunk with the chunk and pos parameters (where pos is the starting line for the chunk, and chunk is the value of the chunk from the previous bullet)
  • Output that is returned from the function depends on the callback that I have selected (i.e. SideEffectChunkCallback has no output)

Perhaps an example that is more illustrative (I truncated some output for brevity):

library(readr)
write_csv(iris, 'tmp_iris.csv')

# simple function to print values
f <- function(chunk, pos) {
  print(pos)
  print(chunk)
}

# here - each chunk is processed by `read_delim`
# there is no output (just calling for a side-effect each time)
read_delim_chunked(
  file = 'tmp_iris.csv',
  callback = SideEffectChunkCallback$new(f),
  delim = ',',
  chunk_size = 10
)
#> Parsed with column specification:
#> cols(
#>   Sepal.Length = col_double(),
#>   Sepal.Width = col_double(),
#>   Petal.Length = col_double(),
#>   Petal.Width = col_double(),
#>   Species = col_character()
#> )
#> [1] 1
#> # A tibble: 10 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl>   <chr>
#>  1          5.1         3.5          1.4         0.2  setosa
#>  2          4.9         3.0          1.4         0.2  setosa
#>  3          4.7         3.2          1.3         0.2  setosa
#>  4          4.6         3.1          1.5         0.2  setosa
#>  5          5.0         3.6          1.4         0.2  setosa
#>  6          5.4         3.9          1.7         0.4  setosa
#>  7          4.6         3.4          1.4         0.3  setosa
#>  8          5.0         3.4          1.5         0.2  setosa
#>  9          4.4         2.9          1.4         0.2  setosa
#> 10          4.9         3.1          1.5         0.1  setosa
#> [1] 11
#> # A tibble: 10 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl>   <chr>
#>  1          5.4         3.7          1.5         0.2  setosa
#>  2          4.8         3.4          1.6         0.2  setosa
#>  3          4.8         3.0          1.4         0.1  setosa
#>  4          4.3         3.0          1.1         0.1  setosa
#>  5          5.8         4.0          1.2         0.2  setosa
#>  6          5.7         4.4          1.5         0.4  setosa
#>  7          5.4         3.9          1.3         0.4  setosa
#>  8          5.1         3.5          1.4         0.3  setosa
#>  9          5.7         3.8          1.7         0.3  setosa
#> 10          5.1         3.8          1.5         0.3  setosa
...
                                                                
# here - each chunk is processed by `read_lines`                
# there is no output  (just calling for a side-effect each time)
read_lines_chunked(file='tmp_iris.csv'                          
, callback=SideEffectChunkCallback$new(f)                       
, chunk_size = 10)                                              
#> [1] 1
#>  [1] "Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species"
#>  [2] "5.1,3.5,1.4,0.2,setosa"                                   
#>  [3] "4.9,3,1.4,0.2,setosa"                                     
#>  [4] "4.7,3.2,1.3,0.2,setosa"                                   
#>  [5] "4.6,3.1,1.5,0.2,setosa"                                   
#>  [6] "5,3.6,1.4,0.2,setosa"                                     
#>  [7] "5.4,3.9,1.7,0.4,setosa"                                   
#>  [8] "4.6,3.4,1.4,0.3,setosa"                                   
#>  [9] "5,3.4,1.5,0.2,setosa"                                     
#> [10] "4.4,2.9,1.4,0.2,setosa"                                   
#> [1] 11
#>  [1] "4.9,3.1,1.5,0.1,setosa" "5.4,3.7,1.5,0.2,setosa"
#>  [3] "4.8,3.4,1.6,0.2,setosa" "4.8,3,1.4,0.1,setosa"  
#>  [5] "4.3,3,1.1,0.1,setosa"   "5.8,4,1.2,0.2,setosa"  
#>  [7] "5.7,4.4,1.5,0.4,setosa" "5.4,3.9,1.3,0.4,setosa"
#>  [9] "5.1,3.5,1.4,0.3,setosa" "5.7,3.8,1.7,0.3,setosa"
...
                                                                
                                                                
# simple function that returns chunk                            
return_chunk <- function(chunk, pos) {                          
return(chunk)                                                   
}                                                               
                                                                
# here - processed by `read_delim`                              
# output is a data.frame (aggregate all chunks together)        
output <- read_delim_chunked(file='tmp_iris.csv'                
, callback=DataFrameCallback$new(return_chunk)                  
, delim=','                                                     
, chunk_size=10)                                                
#> Parsed with column specification:
#> cols(
#>   Sepal.Length = col_double(),
#>   Sepal.Width = col_double(),
#>   Petal.Length = col_double(),
#>   Petal.Width = col_double(),
#>   Species = col_character()
#> )
                                                                
print(output)                                                   
#> # A tibble: 150 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl>   <chr>
#>  1          5.1         3.5          1.4         0.2  setosa
#>  2          4.9         3.0          1.4         0.2  setosa
#>  3          4.7         3.2          1.3         0.2  setosa
#>  4          4.6         3.1          1.5         0.2  setosa
#>  5          5.0         3.6          1.4         0.2  setosa
#>  6          5.4         3.9          1.7         0.4  setosa
#>  7          4.6         3.4          1.4         0.3  setosa
#>  8          5.0         3.4          1.5         0.2  setosa
#>  9          4.4         2.9          1.4         0.2  setosa
#> 10          4.9         3.1          1.5         0.1  setosa
#> # ... with 140 more rows
                                                                
# here - processed by `read_lines`                              
# output is coerced to a data.frame (ugly)                      
output2<- read_lines_chunked(file='tmp_iris.csv'                
, callback=DataFrameCallback$new(return_chunk)                  
, chunk_size=10)                                                
                                                                
print(output2)                                                  
#>       [,1]                                                       
#>  [1,] "Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species"
#>  [2,] "4.9,3.1,1.5,0.1,setosa"                                   
#>  [3,] "5.1,3.8,1.5,0.3,setosa"                                   
#>  [4,] "4.7,3.2,1.6,0.2,setosa"                                   
#>  [5,] "5.1,3.4,1.5,0.2,setosa"                                   
#>  [6,] "5,3.3,1.4,0.2,setosa"                                     
#>  [7,] "5.2,2.7,3.9,1.4,versicolor"                               
#>  [8,] "5.6,2.5,3.9,1.1,versicolor"                               
#>  [9,] "5.7,2.6,3.5,1,versicolor"                                 
#> [10,] "5.5,2.5,4,1.3,versicolor"                                 
#> [11,] "5.7,2.8,4.1,1.3,versicolor"                               
#> [12,] "7.2,3.6,6.1,2.5,virginica"                                
#> [13,] "6,2.2,5,1.5,virginica"                                    
#> [14,] "7.2,3,5.8,1.6,virginica"                                  
#> [15,] "6.9,3.1,5.4,2.1,virginica"                                
#> [16,] "5.9,3,5.1,1.8,virginica"                                  
#>       [,2]                         [,3]                        
#>  [1,] "5.1,3.5,1.4,0.2,setosa"     "4.9,3,1.4,0.2,setosa"      
#>  [2,] "5.4,3.7,1.5,0.2,setosa"     "4.8,3.4,1.6,0.2,setosa"    
#>  [3,] "5.4,3.4,1.7,0.2,setosa"     "5.1,3.7,1.5,0.4,setosa"    
#>  [4,] "4.8,3.1,1.6,0.2,setosa"     "5.4,3.4,1.5,0.4,setosa"    
#>  [5,] "5,3.5,1.3,0.3,setosa"       "4.5,2.3,1.3,0.3,setosa"    
#>  [6,] "7,3.2,4.7,1.4,versicolor"   "6.4,3.2,4.5,1.5,versicolor"
#>  [7,] "5,2,3.5,1,versicolor"       "5.9,3,4.2,1.5,versicolor"  
#>  [8,] "5.9,3.2,4.8,1.8,versicolor" "6.1,2.8,4,1.3,versicolor"  
#>  [9,] "5.5,2.4,3.8,1.1,versicolor" "5.5,2.4,3.7,1,versicolor"  
#> [10,] "5.5,2.6,4.4,1.2,versicolor" "6.1,3,4.6,1.4,versicolor"  
#> [11,] "6.3,3.3,6,2.5,virginica"    "5.8,2.7,5.1,1.9,virginica" 
#> [12,] "6.5,3.2,5.1,2,virginica"    "6.4,2.7,5.3,1.9,virginica" 
#> [13,] "6.9,3.2,5.7,2.3,virginica"  "5.6,2.8,4.9,2,virginica"   
#> [14,] "7.4,2.8,6.1,1.9,virginica"  "7.9,3.8,6.4,2,virginica"   
#> [15,] "6.7,3.1,5.6,2.4,virginica"  "6.9,3.1,5.1,2.3,virginica" 
#> [16,] "5.9,3,5.1,1.8,virginica"    "5.9,3,5.1,1.8,virginica"   
#>       [,4]                         [,5]                        
#>  [1,] "4.7,3.2,1.3,0.2,setosa"     "4.6,3.1,1.5,0.2,setosa"    
#>  [2,] "4.8,3,1.4,0.1,setosa"       "4.3,3,1.1,0.1,setosa"      
#>  [3,] "4.6,3.6,1,0.2,setosa"       "5.1,3.3,1.7,0.5,setosa"    
#>  [4,] "5.2,4.1,1.5,0.1,setosa"     "5.5,4.2,1.4,0.2,setosa"    
#>  [5,] "4.4,3.2,1.3,0.2,setosa"     "5,3.5,1.6,0.6,setosa"      
#>  [6,] "6.9,3.1,4.9,1.5,versicolor" "5.5,2.3,4,1.3,versicolor"  
#>  [7,] "6,2.2,4,1,versicolor"       "6.1,2.9,4.7,1.4,versicolor"
#>  [8,] "6.3,2.5,4.9,1.5,versicolor" "6.1,2.8,4.7,1.2,versicolor"
#>  [9,] "5.8,2.7,3.9,1.2,versicolor" "6,2.7,5.1,1.6,versicolor"  
#> [10,] "5.8,2.6,4,1.2,versicolor"   "5,2.3,3.3,1,versicolor"    
#> [11,] "7.1,3,5.9,2.1,virginica"    "6.3,2.9,5.6,1.8,virginica" 
#> [12,] "6.8,3,5.5,2.1,virginica"    "5.7,2.5,5,2,virginica"     
#> [13,] "7.7,2.8,6.7,2,virginica"    "6.3,2.7,4.9,1.8,virginica" 
#> [14,] "6.4,2.8,5.6,2.2,virginica"  "6.3,2.8,5.1,1.5,virginica" 
#> [15,] "5.8,2.7,5.1,1.9,virginica"  "6.8,3.2,5.9,2.3,virginica" 
#> [16,] "5.9,3,5.1,1.8,virginica"    "5.9,3,5.1,1.8,virginica"   
...

All that is important for the callback function (f or return_chunk above) is that it knows how to handle a chunk of data, however it is being passed (i.e. by read_delim or read_lines). I may be misunderstanding how you are doing things, but chunk_id is probably easier to determine by using the pos variable. pos + nrow(chunk) or length(chunk) gives you the end row of the chunk, as well, depending on the function you are using. It is easy enough to determine an “iteration number” from those values if you fix chunk_size.
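To make the pos arithmetic above concrete, here is a rough sketch. The helper `tag_chunk` and the column names `iteration` and `row_id` are made up for illustration, and it assumes (as in the printed output earlier) that pos is the 1-based position of the first row in the chunk.

```r
library(readr)

# Stand-in file, only so the sketch is reproducible.
path <- tempfile(fileext = ".csv")
write_csv(iris, path)

chunk_size <- 40

# Tag each row with the chunk ("iteration") it came from and its overall row number.
tag_chunk <- function(chunk, pos) {
  chunk$iteration <- (pos - 1) %/% chunk_size + 1
  chunk$row_id <- pos + seq_len(nrow(chunk)) - 1
  chunk
}

out <- read_csv_chunked(
  path,
  callback = DataFrameCallback$new(tag_chunk),
  chunk_size = chunk_size,
  col_types = cols(.default = col_double(), Species = col_character())
)

# Number of rows that arrived in each chunk.
print(table(out$iteration))
```

The last chunk is shorter than chunk_size, which is why the iteration number is derived from pos rather than from the number of rows in the chunk.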

Now you do have to be careful, because if you are not subsetting with filter, the DataFrameCallback will just stream the whole file into memory (that’s what I did above). A classic example is to use SideEffectChunkCallback and insert the rows into a database - if I were doing so in these examples I never would have held more than 10 rows in memory at a given time. The other thing I have done before is stored the line numbers that were interesting to me with the DataFrameCallback and then gone back over the file in a subsequent pass to extract the interesting rows (I could not be certain that the “chunk” of data I wanted would live within one of the “readr chunks”… large, MULTI-line JSON objects!). Lots of interesting approaches to explore!
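For the database idea, a hedged sketch might look like the following. It assumes the DBI and RSQLite packages are installed; the table name "iris", the helper `insert_chunk`, and the temporary file paths are placeholders chosen for the example.

```r
library(readr)
library(DBI)

# Stand-in file, only so the sketch is reproducible.
path <- tempfile(fileext = ".csv")
write_csv(iris, path)

# Assumption: RSQLite is available; the database lives in a temporary file.
con <- dbConnect(RSQLite::SQLite(), tempfile(fileext = ".sqlite"))

# Append each chunk to the table; nothing is kept in R between chunks.
insert_chunk <- function(chunk, pos) {
  dbWriteTable(con, "iris", as.data.frame(chunk), append = TRUE)
}

read_csv_chunked(
  path,
  callback = SideEffectChunkCallback$new(insert_chunk),
  chunk_size = 10,
  col_types = cols(.default = col_double(), Species = col_character())
)

# Check how many rows ended up in the database.
n_rows <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM iris")$n
dbDisconnect(con)
print(n_rows)
```

With chunk_size = 10, R only ever holds ten parsed rows at a time, and later queries can be pushed to the database instead of loading everything back.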


#11

@cole I see you’ve done a bit of work on tidyjson…are you maintaining the project now? The reason I ask is because I wonder if it would be worthwhile to have a function that could attempt to read in a flattened out JSON file like my example. This data actually inspired the function spread_all in the tidyjson package.

What’s odd is that changing the data_frame line from above to use read_lines works:
data_frame(text = read_lines(x))
and then just calling the function normally: data <- f(<my_file.txt>)

However, when I try to refactor using the Chunk methods, it fails:

f <- function(dataframe, pos) {                                     
  data_frame(text = dataframe) %>%
    filter(text != "") %>%
    separate(text, c("var", "value"), ":", extra = "merge") %>%
    mutate(
      chunk_id = rep(1:(nrow(.) / 13), each = 13),
      value = trimws(value)
    ) %>%
    spread(var, value)
}
file <- read_lines_chunked(Ratebeer, callback = DataFrameCallback$new(f), chunk_size = 100000)
Error in read_lines_chunked_(ds, locale, na, chunk_size, callback, progress) : 
  Column `chunk_id` must be length 92858 (the number of rows) or one, not 92846

At first this seems like it could be an encoding issue (some odd whitespace char in the string), but the data is UTF-8, so I’m not sure what exactly I’m doing wrong here.


#12

What happens if you remove the chunk_id = ... column from mutate? I really think that computation is your problem. I haven’t used the RStudio debugging tool with read_lines_chunked, but you might try something like the following and see if stepping through it gives you a better idea of what is going on. It also will give you access to see what is going into the callback when it is called:

debugonce(f)
file <- read_lines_chunked...

On the tidyjson front, I’m not quite sure how to answer. The best answer is perhaps that I would like to be maintaining it (or to contribute to someone else maintaining it). I haven’t been able to get in touch with the current maintainer for ~ 6 months, so I’m a bit stymied on how to proceed. I have added a bunch of features / fixes to colearendt/tidyjson, but the current “master” is still at jeremystan/tidyjson (sailthru/tidyjson is long deprecated, although I think it represents the most recent CRAN version). If you go look at colearendt/tidyjson, I have to apologize for how badly I have documented / organized my changes…

I am a big fan of the package, in general, though! It makes working with JSON data much nicer in R. If you have a simple example you can throw in an issue at jeremystan/tidyjson, as well as the functionality you would be looking for, I think that would be a good place to catalog the idea!


#13

No doubt that’s the problematic line of code, but when I remove that line, I get an error about duplicate identifiers for rows, which makes sense because the key:value pair schema repeats every 13 lines.
Sorry, the reprex output is incredibly long:

> library(tidyverse)
> #read data into one massive column, and operate on that in chunks? 
> data <- "/Users/petergensler/Desktop/Beer_Analysis/CleanData/Beeradvocate.txt"
> 
> 
> f <- function(dataframe, pos) {                                     
+   data_frame(text = dataframe) %>%
+     filter(text != "") %>%
+     separate(text, c("var", "value"), ":", extra = "merge") %>%
+     spread(var, value)
+ }
> file <- read_lines_chunked(data, callback = DataFrameCallback$new(f), chunk_size = 100000)
Error in read_lines_chunked_(ds, locale, na, chunk_size, callback, progress) : 
  Duplicate identifiers for rows (4, 17, 30, 43, 56, 69, 82, 95, 108, 121, 134, 147, 160, 173, 186, 199, 212, 225, 238, 251, 264, 277, 290, 303, 316, 329, 342, 355, 368, 381, 394, 407, 420, 433, 446, 459, 472, 485, 498, 511, 524, 537, 550, 563, 576, 589, 602, 615, 628, 641, 654, 667, 680, 693, 706, 719, 732, 745, 758, 771, 784, 797, 810, 823, 836, 849, 862, 875, 888, 901, 914, 927, 940, 953, 966, 979, 992, 1005, 1018, 1031, 1044, 1057, 1070, 1083, 1096, 1109, 1122, 1135, 1148, 1161, 1174, 1187, 1200, 1213, 1226, 1239, 1252, 1265, 1278, 1291, 1304, 1317, 1330, 1343, 1356, 1369, 1382, 1395, 1408, 1421, 1434, 1447, 1460, 1473, 1486, 1499, 1512, 1525, 1538, 1551, 1564, 1577, 1590, 1603, 1616, 1629, 1642, 1655, 1668, 1681, 1694, 1707, 1720, 1733, 1746, 1759, 1772, 1785, 1798, 1811, 1824, 1837, 1850, 1863, 1876, 1889, 1902, 1915, 1928, 1941, 1954, 1967, 1980, 1993, 2006, 2019, 2032, 2045, 2058, 2071, 2084, 2097, 2110, 2123, 2136, 2149, 2162, 2175, 2188, 2201, 2214, 2227, 2240, 2253,

#14

Perhaps I am thinking about this incorrectly, but to create a unique id, what about something like:

... %>%
mutate(row_id = row_number() + pos - 1) %>%
...

You are defining chunk_id separately from the chunk being passed in by the read_lines_chunked function, right? If so, I would advocate for trying a different naming convention (I know the mixing there confused me a little). Apologies for not testing these thoughts - this is the data from the SO post you referenced a while back? Is the expected output there as well? I will try to take a more focused look later if this does not resolve!


#15

Yep, and I think you may be spot on that I am using the wrong logic to make the row_id. My concern is that the error (which comes from dplyr) is not truly ‘failing fast’, or hard enough to tell me what’s going on when the error is spit out (like that’s not the true problem).

I just tested that out, and that seems to work just fine, which is good news!

No worries, the naming here is terrible (it’s not my code), but I think you have the right idea. Could you also do something with a ceiling function to get the correct schema? Like ceiling(row_number()/13)?


#16

Interesting thought. Referencing the data in the SO post, it does not seem that ceiling(row_number()/13) would really do what you want. In order for that to work, you would either need your readr chunk_size set to only read one section of lines at a time, or you would need to already be using group_by for a section of lines, which is begging the question / circular reasoning / assuming your conclusion.

Further, I am not 100% sure what it is that you want as output, so it is hard to make hard-and-fast suggestions. In any case, my thoughts on changing up your approaches:

  1. If “group” size is consistent, consider setting readr chunk_size to be the size of the groups, so that you are only reading one group of rows at a time and can operate on them accordingly. In this case, the pos parameter will be your group_id of sorts
  2. The most straightforward approach that allows varying readr chunk_size, in my opinion, is to read through the file twice… once for determining where group breaks are (aggregating a vector of where the blank lines are), and then another time either explicitly reading groups of rows by setting start / end on the normal read_lines, or using that acquired vector to define groups of records / assign group_ids as you go.
  3. The data format you are working with here is ALMOST yaml. As such, you could also conceive of finding a way to parse the YAML as you go (using the yaml package; the one trick here is preserving the names when you’re throwing it into a data_frame). This would require using the approach of the previous item without knowing the group breaks up front, so it would be a little tricky
  4. Obviously you could do that same approach without parsing the file as YAML.
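A rough sketch of the two-pass idea in item 2, under some loud assumptions: the sample file contents are invented, the helpers `find_blanks` and `label_lines` are hypothetical, records are separated by single blank lines, and read_lines_chunked passes blank lines through to the callback (which the `filter(text != "")` step earlier in the thread suggests).

```r
library(readr)

# Invented stand-in for the real file: "key: value" records separated by blank lines.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "beer/name: Alpha", "beer/ABV: 5.0", "",
  "beer/name: Beta", "beer/ABV: 6.5", "review/text: ok", "",
  "beer/name: Gamma", "beer/ABV: 4.2"
), path)

# Pass 1: collect the absolute line numbers of the blank lines, chunk by chunk.
find_blanks <- function(chunk, pos) {
  data.frame(line = pos + which(chunk == "") - 1)
}
blanks <- read_lines_chunked(
  path, DataFrameCallback$new(find_blanks), chunk_size = 4
)$line

# Pass 2: label every non-blank line with the record it belongs to.
# A line's record number is one more than the count of blank lines before it.
label_lines <- function(chunk, pos) {
  line <- pos + seq_along(chunk) - 1
  keep <- chunk != ""
  data.frame(
    record_id = findInterval(line[keep], blanks) + 1,
    text = chunk[keep],
    stringsAsFactors = FALSE
  )
}
labelled <- read_lines_chunked(
  path, DataFrameCallback$new(label_lines), chunk_size = 4
)
print(labelled)
```

Because record_id comes from absolute line numbers, it stays correct even when a record straddles two readr chunks, which is the case that breaks the `rep(1:(nrow(.) / 13), each = 13)` approach.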

A random thought on item 4 to help you on your way:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
library(tidyverse)
library(yaml)

writeLines(...data from SO
, con='Beeradvocate.txt')

data <- "Beeradvocate.txt"

# Attempt 1
...
#> Warning: Too few values at 1 locations: 10
#> Error in read_lines_chunked_(ds, locale, na, chunk_size, callback, progress): Evaluation error: Column `chunk_id` must be length 19 (the number of rows) or one, not 13.

# Attempt 2                                                                                                                                                                                                                                                                                                                                                                                                                             
...                                                                                                                                                                                                                                                                                                                                                                                                   
#> Warning: Too few values at 1 locations: 10
#> Error in read_lines_chunked_(ds, locale, na, chunk_size, callback, progress): Evaluation error: Duplicate identifiers for rows (5, 15), (9, 19), (6, 16), (7, 17), (8, 18), (1, 11), (3, 13), (2, 12), (4, 14).
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
# Attempt 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
f <- function(dataframe, pos) {
  data_frame(text = dataframe) %>%
    filter(text != "") %>%
    mutate(row_id = row_number() + pos - 1) %>%
    separate(text, c("var", "value"), ":", extra = "merge") %>%
    spread(var, value)
}

file <- read_lines_chunked(data, callback = DataFrameCallback$new(f), chunk_size = 100000)
#> Warning: Too few values at 1 locations: 10
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
# Not sure what your goal is?  Maybe worth rethinking approach 
# (and remember that sharing your desired output helps others help you)                                                                                                                                                                                                                                                                                                                                    
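f <- function(chunk, pos) {
  # print(chunk)
  # print(pos)

  # Indices of blank lines, relative to the start of this chunk
  blank_lines <- str_which(chunk, "^\\s*$")

  # Group boundaries index into `chunk`, so they must be chunk-relative:
  # start from 1L rather than `pos` (the chunk's position in the file),
  # otherwise every chunk after the first is indexed out of range
  group_def <- c(1L, blank_lines, length(chunk))
  group_def_df <- data_frame(
    start = group_def[1:(length(group_def) - 1)],
    end = group_def[2:length(group_def)]
  )

  # Collapse each group of lines into one string and parse it as YAML
  collapsed <- mapply(
    FUN = function(character, start, end) {
      coll <- paste(character[start:end], collapse = "\n")
      parsed <- yaml.load(coll)
      data_frame(parsed = list(parsed), start = start, end = end, raw = coll)
    },
    character = list(chunk),
    start = group_def_df$start,
    end = group_def_df$end,
    SIMPLIFY = FALSE
  )

  bind_rows(collapsed)
}

read_lines_chunked(data, callback = DataFrameCallback$new(f), chunk_size = 100000)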
#> # A tibble: 2 x 4
#>       parsed start   end
#>       <list> <int> <int>
#> 1 <list [9]>     1    10
#> 2 <list [9]>    10    19
#> # ... with 1 more variables: raw <chr>

My handling of “groups of records” here does not account for a group that is split across chunks, i.e. one that starts in the previous chunk or continues into the next. You would need to handle that in the output data_frame and mash the partial rows together afterwards, because a callback in read_*_chunked cannot see previous or later chunks (which is the reason for approaches 1/2).
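A minimal sketch of the "mash the rows together afterwards" idea, on a made-up toy file rather than your data. The file contents, `tag_lines`, and the `record_id` column are all hypothetical, and it assumes blank lines come through as empty strings (as the code above already relies on). Each chunk only tags its lines with a file-level line number; the grouping into records happens once all chunks are combined, so it does not matter where the chunk boundaries fell. Note that this keeps every line in memory, so it illustrates the stitching, not a memory saving.

```r
library(readr)

# Hypothetical toy file: three 2-line records separated by blank lines
lines <- c("a: 1", "b: 2", "", "a: 3", "b: 4", "", "a: 5", "b: 6")
path <- tempfile(fileext = ".txt")
write_lines(lines, path)

# Per chunk: keep each line together with its line number in the whole file
tag_lines <- function(chunk, pos) {
  data.frame(
    line_no = seq_along(chunk) + pos - 1,
    text = chunk,
    stringsAsFactors = FALSE
  )
}

# chunk_size = 4 deliberately splits the second record across two chunks
all_lines <- read_lines_chunked(path, DataFrameCallback$new(tag_lines), chunk_size = 4)

# After combining: every blank line starts a new record, regardless of chunking
all_lines$record_id <- cumsum(all_lines$text == "") + 1
records <- all_lines[all_lines$text != "", ]
```

From `records` you could then separate/spread (or parse) per `record_id`, since each record's lines are back together even though one of them straddled a chunk boundary.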

I think one issue here is that you are still thinking about read_lines_chunked a bit incorrectly, presuming that it will “know” when it encounters a group of records. It does not: chunks are cut by line count (chunk_size), not at record boundaries. You can make the two line up (with items 1 or 2 above), but it is not guaranteed.
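To see this for yourself, here is a small sketch on a made-up file (the contents and the `chunk_summary` helper are hypothetical, not from your project). The callback only reports where each chunk starts and which lines it begins and ends with, so you can compare a chunk_size that ignores the record length with one that happens to match it.

```r
library(readr)

# Hypothetical toy file: three 2-line records separated by blank lines,
# so each record plus its separator spans 3 lines
lines <- c("a: 1", "b: 2", "", "a: 3", "b: 4", "", "a: 5", "b: 6")
path <- tempfile(fileext = ".txt")
write_lines(lines, path)

# Report the start position, size, and first/last line of every chunk
chunk_summary <- function(chunk, pos) {
  data.frame(
    pos = pos,
    n_lines = length(chunk),
    first_line = chunk[1],
    last_line = chunk[length(chunk)],
    stringsAsFactors = FALSE
  )
}

# 4 is not a multiple of 3: chunk boundaries fall inside records
misaligned <- read_lines_chunked(path, DataFrameCallback$new(chunk_summary), chunk_size = 4)

# 3 matches the record length: chunk boundaries fall on the blank separators
aligned <- read_lines_chunked(path, DataFrameCallback$new(chunk_summary), chunk_size = 3)

print(misaligned)
print(aligned)
```

With chunk_size = 4 the first chunk should end on "a: 3", in the middle of the second record; with chunk_size = 3 each chunk should end on a blank separator. That alignment only holds while every record has exactly the same number of lines, which is why it is not something to rely on blindly.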


#17

Yeah so I have two datasets which are structured like this…one has data in chunks of 9, the other in chunks of 13, so I’m presuming the same methodology can be used between files. Thanks for your help on this!