Why does sum(pop) give NAs but sum(pop/1) does not? (gapminder dataset)

dplyr

#1

I was playing with the gapminder dataset (library(gapminder)).

If I run

gapminder %>%
        group_by(year) %>% 
        summarise(totalPop = sum(pop))

[image: nas]

As you can see, I get only NAs. But if I run

gapminder %>%
        group_by(year) %>% 
        summarise(totalPop = sum(pop/1))

then I get

[image: nas_2]

Why does this happen?


#2

Got it.
pop is stored as an integer, so sum(pop) is an integer sum, while pop/1 coerces it to numeric before summing.


#3

Edit: Ah, I understand now what is happening.

x <- c(1L:3L, NA)
sum(x)
sum(x/1)

Both give NA, as they should.

However, in your case the numbers get too big to be represented as integers, so the coercion to double does the trick, since a double can store larger numbers. A more explicit way to state what you are doing would be sum(as.numeric(pop)). You should have gotten an “integer overflow” warning when you ran your code.
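To separate the two ways sum can return NA, here is a minimal base R sketch. The vectors with_missing and big are made-up illustrations, not gapminder data. The point is that na.rm = TRUE helps with a real missing value but not with overflow:

```r
# Case 1: the NA comes from a missing value in the data
with_missing <- c(1:3, NA)                 # integer vector containing NA
sum_missing <- sum(with_missing, na.rm = TRUE)

# Case 2: no missing values, but the total does not fit in an integer
big <- c(.Machine$integer.max, 1L)
has_na <- anyNA(big)                       # FALSE: the input itself is complete

# suppressWarnings() hides the "integer overflow" warning for this demo
sum_overflow <- suppressWarnings(sum(big, na.rm = TRUE))

# Coercing to double before summing avoids the overflow
sum_double <- sum(as.numeric(big))

print(sum_missing)
print(has_na)
print(sum_overflow)
print(sum_double)
```

So if anyNA() on the column is FALSE and sum still returns NA, overflow is the likely culprit.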


#4

What’s been said covers the problem and the solution, but to lay out how to figure that out: look at the other tab of output from the code chunk, which contains a bunch of warnings:

library(dplyr)

gapminder::gapminder %>% 
    group_by(year) %>% 
    summarise(pop = sum(pop))
#> Warning in summarise_impl(.data, dots): integer overflow - use sum(as.numeric(.))
#> Warning in summarise_impl(.data, dots): integer overflow - use sum(as.numeric(.))
#> Warning in summarise_impl(.data, dots): integer overflow - use sum(as.numeric(.))
#> Warning in summarise_impl(.data, dots): integer overflow - use sum(as.numeric(.))
#> Warning in summarise_impl(.data, dots): integer overflow - use sum(as.numeric(.))
#> Warning in summarise_impl(.data, dots): integer overflow - use sum(as.numeric(.))
#> Warning in summarise_impl(.data, dots): integer overflow - use sum(as.numeric(.))
#> Warning in summarise_impl(.data, dots): integer overflow - use sum(as.numeric(.))
#> Warning in summarise_impl(.data, dots): integer overflow - use sum(as.numeric(.))
#> Warning in summarise_impl(.data, dots): integer overflow - use sum(as.numeric(.))
#> Warning in summarise_impl(.data, dots): integer overflow - use sum(as.numeric(.))
#> Warning in summarise_impl(.data, dots): integer overflow - use sum(as.numeric(.))
#> # A tibble: 12 x 2
#>     year   pop
#>    <int> <int>
#>  1  1952    NA
#>  2  1957    NA
#>  3  1962    NA
#>  4  1967    NA
#>  5  1972    NA
#>  6  1977    NA
#>  7  1982    NA
#>  8  1987    NA
#>  9  1992    NA
#> 10  1997    NA
#> 11  2002    NA
#> 12  2007    NA

As it happens, they’re pretty good warnings that say what’s happening: the result of sum is larger than the largest value R’s integer type can hold. The limit, about 2.1 billion, is stored in .Machine$integer.max:

.Machine$integer.max
#> [1] 2147483647
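As a quick check of where that number comes from, the sketch below (plain base R, variable names are my own) compares it with the 32-bit signed limit and shows what happens one step past it in integer versus double arithmetic:

```r
int_max <- .Machine$integer.max

# R integers are 32-bit signed, so the largest value is 2^31 - 1
is_32bit_limit <- int_max == 2^31 - 1

# Integer arithmetic past the limit yields NA (the warning is suppressed here)
past_limit_int <- suppressWarnings(int_max + 1L)

# Adding a double (1, not 1L) promotes the result to double, which has room
past_limit_dbl <- int_max + 1

print(is_32bit_limit)
print(past_limit_int)
print(typeof(past_limit_dbl))
print(past_limit_dbl)
```

Doubles represent whole numbers exactly up to 2^53, which is far more than any population total needs.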

The warnings also tell you how to avoid integer overflow: coerce to numeric. Doing so shows that the world population for all these years is, in fact, above 2.1 billion:

gapminder::gapminder %>% 
    group_by(year) %>% 
    summarise(pop = sum(as.double(pop)))
#> # A tibble: 12 x 2
#>     year        pop
#>    <int>      <dbl>
#>  1  1952 2406957150
#>  2  1957 2664404580
#>  3  1962 2899782974
#>  4  1967 3217478384
#>  5  1972 3576977158
#>  6  1977 3930045807
#>  7  1982 4289436840
#>  8  1987 4691477418
#>  9  1992 5110710260
#> 10  1997 5515204472
#> 11  2002 5886977579
#> 12  2007 6251013179
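If the column will be summed more than once, one option is to check its type and convert it a single time up front. This is only a sketch on a hypothetical toy tibble (the toy values are invented, not gapminder figures), assuming dplyr is installed:

```r
library(dplyr)

# Hypothetical toy data: each value fits in an integer, but group totals do not
toy <- tibble(
  year = c(1952L, 1952L, 1957L, 1957L),
  pop  = c(2000000000L, 1000000000L, 2000000000L, 1500000000L)
)

# Diagnose first: an "integer" column is what makes sum() overflow
pop_type <- typeof(toy$pop)

# Convert once, then every later sum() works in double precision
totals <- toy %>%
  mutate(pop = as.numeric(pop)) %>%
  group_by(year) %>%
  summarise(totalPop = sum(pop))

print(pop_type)
print(totals)
```

The same mutate() step should work on gapminder itself, since typeof(gapminder::gapminder$pop) reports an integer column there, as the <int> header in the output above suggests.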