I can reliably generate a stack imbalance warning or crash when running fread on the latest dev builds of data.table. (Note that fread is parallelized.) After raising the issue and troubleshooting a lot of different fixes with Matt, I'm fairly sure that this is an RStudio issue, not a data.table one:
Stack imbalance in fread · Issue #2481 · Rdatatable/data.table · GitHub
To summarize, the stack imbalance does not occur in an earlier version of RStudio, nor when running the same code from Rgui, the terminal, or the command prompt. I believe the problem is limited to Windows.
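When comparing front ends like this, it can help to record which one is hosting each session alongside the result. A minimal sketch using base R only; describe_frontend is a hypothetical helper name, not part of RStudio or data.table, and the RSTUDIO environment variable check assumes RStudio sets it to "1" in its own sessions:

```r
# Hypothetical helper: report which front end is hosting this R session.
describe_frontend <- function() {
  list(
    gui       = .Platform$GUI,                          # e.g. "RStudio", "Rgui", "RTerm"
    rstudio   = identical(Sys.getenv("RSTUDIO"), "1"),  # assumed RStudio marker
    os        = .Platform$OS.type,                      # "windows" or "unix"
    r_version = R.version.string
  )
}

# Print the details so they can be pasted next to each test result.
str(describe_frontend())
```

Running this in RStudio, Rgui, and a terminal session before each fread call makes it easier to tie a warning to a specific environment.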
The stack imbalance can be reproduced on Windows RStudio version 1.1.383 by unzipping
ABS-data/inbox/SA2-by-DJZ-2011.zip at master · HughParsonage/ABS-data · GitHub
then running the following in that working directory:
library(data.table)
#> data.table 1.10.5 IN DEVELOPMENT built 2017-11-13 02:46:28 UTC; appveyor
#> The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
#> Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
#> Release notes, videos and slides: http://r-datatable.com
fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", verbose = TRUE)
Result:
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 12 threads (omp_get_max_threads()=12, nth=12)
NAstrings = [<<>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file SA2-by-DJZ-2011.csv
File opened, size = 349.4MB (366418725 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Australian Bureau of Statistic>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 89 lines of 4 fields using quote rule 0
Detected 4 columns on line 12. This line is either column names or first data row. Line starts as: <<"Goulburn","110018063",3499,>>
Quote rule picked = 0
fill=false and the most number of columns found is 4
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to false
Number of sampling jump points = 101 because (366418375 bytes from row 1 to eof) / (2 * 1457 jump0size) == 125744
Type codes (jump 000) : 1551 Quote rule 0
Type codes (jump 100) : 11051 Quote rule 0
=====
Sampled 10027 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 12 to the end of last row: 366418143
Line length: mean=16.02 sd=0.21 min=16 max=29
Estimated number of rows: 366418143 / 16.02 = 22877178
Initial alloc = 25164895 rows (22877178 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 11051
[10] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 25164895 rows
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
[12] Finalizing the datatable
Read 22885380 rows x 4 columns from 349.4MB (366418725 bytes) file in 00:02.550 wall clock time
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
Final type counts
0 : drop
1 : bool8
0 : bool8
0 : bool8
0 : bool8
1 : int32
0 : int64
0 : float64
0 : float64
0 : float64
2 : string
Rereading 1 columns due to out-of-sample type exceptions.
Column 1 ("") bumped from 'bool8' to 'string' due to <<"Goulburn">> on row 0
[11] Read the data
jumps=[0..360), chunk_size=1017828, total_size=366418143
Read 94%. ETA 00:00 Warning: stack imbalance in '$', 27 then 28
Read 98%. ETA 00:00
[12] Finalizing the datatable
Reread 22885380 rows x 1 columns in 00:00.991
Read 22885380 rows. Exactly what was estimated and allocated up front
=============================
0.006s ( 0%) Memory map 0.341GB file
0.011s ( 0%) sep=',' ncol=4 and header detection
0.002s ( 0%) Column type detection using 10027 sample rows
0.328s ( 9%) Allocation of 22885380 rows x 4 cols (0.469GB)
3.194s ( 90%) Reading 360 chunks of 0.971MB (63547 rows) using 12 threads
= 0.001s ( 0%) Finding first non-embedded \n after each jump
+ 0.362s ( 10%) Parse to row-major thread buffers
+ 1.963s ( 55%) Transpose
+ 0.868s ( 25%) Waiting
0.991s ( 28%) Rereading 1 columns due to out-of-sample type exceptions
3.541s Total
Warning: stack imbalance in 'withVisible', 3 then 5
Warning messages:
1: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
Starting data input on line 12 <<"Goulburn","110018063",3499,>> with 4 fields and discarding line 11 <<"Main Statistical Area Structu>> before it because it has a different number of fields (3).
2: In fread("SA2-by-DJZ-2011.csv", header = FALSE, na.strings = "", :
Found the last consistent line but text exists afterwards. Consider fill=TRUE and/or blank.lines.skip=TRUE. First 200 characters of discarded line: <<"Dataset: 2011 Census of Population and Housing">>
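Since fread is parallelized, one possible further check is to repeat the read with data.table restricted to a single thread and see whether the warning still appears. This is only a sketch of a diagnostic under stated assumptions: read_single_threaded is a hypothetical helper, and it assumes the installed build provides getDTthreads() and setDTthreads().

```r
library(data.table)

# Hypothetical helper: run fread with data.table limited to one thread,
# then restore the previous thread setting even if fread fails.
read_single_threaded <- function(path, ...) {
  old <- getDTthreads()
  on.exit(setDTthreads(old), add = TRUE)
  setDTthreads(1)
  fread(path, ...)
}

# Same file and arguments as the reproduction above; skipped if the
# file is not in the working directory.
csv <- "SA2-by-DJZ-2011.csv"
if (file.exists(csv)) {
  dt <- read_single_threaded(csv, header = FALSE, na.strings = "")
}
```

If the imbalance disappears with one thread but returns with the default thread count, that would point toward an interaction between the threaded read and the front end rather than the parsing itself. That interpretation is a guess, not a confirmed diagnosis.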