Better guess_max value for read_xlsx in readxl package?

Can anyone explain why the default guess_max = min(1000, n_max)
is a good idea for read_xlsx? Doesn’t this trade off a few milliseconds of time to guess an answer that is likely wrong on larger files? Isn’t this an example of Knuth’s “premature optimization”?

I received hundreds of errors in trying to read an Excel file with ~88,000 lines because the default guess_max is so small. I then needed to find out how many lines are in the file, or just pick a very large number – perhaps 2^20 since that’s an Excel limit. Why can’t I specify guess_max = “n_max” without specifying a number? Why isn’t guess_max = “n_max” the default given the speed of modern computers?

I received no errors in reading the 88,000-line Excel file with guess_max = 2^20, and that only took a few seconds.

I think the default has been made to 1000L to be consistent with readr from where the idea is from. I can't explain why this is this value but I think most of the time this is enough for the guessing mechanism.
Also, if you have an excel with lots of column, and you select by default every row, there is a chance you'll be limit in RAM to load the data. The default may be there to prevent issue and work in all case.

For other case, it is very easy to change the default and use a maximum number of line (with guess_max=Inf). You'll have a warning because there is a limit to .Machine$integer.max%/%100.

datasets <- readxl_example("datasets.xlsx")
read_excel(datasets, guess_max = Inf)
#> Warning: `guess_max` is a very large value, setting to `21474836` to avoid
#> exhausting memory
#> readxl works best with a newer version of the tibble package.
#> # A tibble: 150 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <chr>  
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # ... with 140 more rows

Created on 2018-12-28 by the reprex package (v0.2.1)

You can also easily provide the column type if you know them and bypass the guessing mechanism.

Hope it helps getting more context.

We'll see if someone from the tidyverse team has a better explanation about the limit of 1000. :smile:

1 Like

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.