Help with dplyr to calculate sum total based on repeated variable

aarontimo · November 15, 2022, 1:01am

Hi there,

I am trying to calculate the population based on postcode from Census data. I have combined census data with a list of Queensland postcodes. Here is a sample of the data:

data.frame(
  stringsAsFactors = FALSE,
          Postcode = c(4161L,4161L,4161L,4161L,
                       4161L,4161L,4161L,4161L,4161L,4161L,4161L,4161L,
                       4161L,4161L,4161L,4161L,4161L,4161L,4161L,4161L,4161L,
                       4161L,4161L,4161L,4161L,4161L,4161L,4161L,4161L,
                       4161L,4161L,4161L,4161L,4161L,4161L,4161L,4161L,
                       4161L,4161L,4161L,4161L,4161L,4161L,4161L,4161L,
                       4155L,4155L,4155L,4155L,4155L),
          SA2.name = c("Alexandra Hills",
                       "Alexandra Hills","Alexandra Hills","Alexandra Hills",
                       "Alexandra Hills","Alexandra Hills","Alexandra Hills",
                       "Alexandra Hills","Alexandra Hills","Alexandra Hills",
                       "Alexandra Hills","Alexandra Hills","Alexandra Hills",
                       "Alexandra Hills","Alexandra Hills","Alexandra Hills",
                       "Alexandra Hills","Alexandra Hills","Alexandra Hills",
                       "Alexandra Hills","Alexandra Hills","Alexandra Hills",
                       "Alexandra Hills","Alexandra Hills","Alexandra Hills",
                       "Alexandra Hills","Alexandra Hills","Alexandra Hills",
                       "Alexandra Hills","Alexandra Hills","Alexandra Hills",
                       "Alexandra Hills","Alexandra Hills","Alexandra Hills",
                       "Alexandra Hills","Alexandra Hills","Alexandra Hills",
                       "Alexandra Hills","Alexandra Hills","Alexandra Hills",
                       "Alexandra Hills","Alexandra Hills","Alexandra Hills",
                       "Alexandra Hills","Alexandra Hills","Belmont - Gumdale",
                       "Belmont - Gumdale","Belmont - Gumdale","Belmont - Gumdale",
                       "Belmont - Gumdale"),
        Population = c(419L,335L,589L,381L,202L,
                       249L,278L,455L,547L,409L,547L,255L,330L,497L,466L,
                       535L,423L,524L,442L,299L,469L,378L,502L,225L,
                       294L,407L,269L,455L,401L,608L,425L,192L,434L,354L,
                       324L,453L,296L,272L,439L,454L,359L,338L,4L,0L,
                       372L,246L,455L,275L,527L,425L)
)

I'm just not sure how to interrogate the data to tally all the values in the $Population column that share the same postcode.

I'm not sure whether to pivot_wider so all the population values are in a row or if I need to extract the variables.

I've tried the following code:

``` r
postcode_pop <- joined_SA2 %>%
  group_by(Postcode) %>%
  count(Population)
#> Error in joined_SA2 %>% group_by(Postcode) %>% count(Population): could not find function "%>%"

But I get:

# A tibble: 6 × 3
# Groups:   Postcode [1]
  Postcode Population     n
     <int>      <int> <int>
1     4000          0     8
2     4000          3     8
3     4000          4     8
4     4000         17     8
5     4000         32     8
6     4000        156     8

when I want something more like this:

# Groups:   Postcode [1]
  Postcode Population    
     <int>      <int> 
1     4000         25956  
2     4001         13245

Your help or advice is (as always) much appreciated!

technocrat · November 15, 2022, 1:52am

You want sum instead of count

aarontimo · November 15, 2022, 2:04am

Thank you @technocrat. When I change the code to:

postcode_pop <- joined_SA2 %>%
  group_by(Postcode) %>%
  sum(Population)

I get the following error:

Error in joined_SA2 %>% group_by(Postcode) %>% sum(Population) :
object 'Population' not found

I've check the column name and it is Population:

> colnames(joined_SA2)
 [1] "ID"                   "Postcode"             "Locality"             "State"               
 [5] "Longitude"            "Latitude"             "Category"             "Type"                
 [9] "SA3"                  "SA3.Name"             "SA4"                  "SA4.Name"            
[13] "Region"               "Status"               "Lat..Google."         "Long..Google."       
[17] "SA1.MAINCODE.2011"    "SA12016.x"            "SA22016"              "SA2.NAME.2016"       
[21] "RA.2011"              "RA.2016"              "MMM.2015"             "MMM.2019"            
[25] "CED"                  "Altitude"             "Charge.Zone"          "PHN.Code"            
[29] "PHN.Name"             "LGA.Region"           "Electorate"           "Electorate.Rating"   
[33] "X2016_SA1_population" "Year"                 "SA12016.y"            "SA2.name"            
[37] "Population"

So confused...

technocrat · November 15, 2022, 2:29am

sum() is wrong, because it doesn't "catch" the implicit namespace of the data frame; for that we needed to wrap in summarise (for summarize if rebelling against Commonwealth usage)s.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
dat <- data.frame(
  Postcode = c(4161L,4161L,4161L,4161L,
               4161L,4161L,4161L,4161L,4161L,4161L,4161L,4161L,
               4161L,4161L,4161L,4161L,4161L,4161L,4161L,4161L,4161L,
               4161L,4161L,4161L,4161L,4161L,4161L,4161L,4161L,
               4161L,4161L,4161L,4161L,4161L,4161L,4161L,4161L,
               4161L,4161L,4161L,4161L,4161L,4161L,4161L,4161L,
               4155L,4155L,4155L,4155L,4155L),
  SA2.name = c("Alexandra Hills",
               "Alexandra Hills","Alexandra Hills","Alexandra Hills",
               "Alexandra Hills","Alexandra Hills","Alexandra Hills",
               "Alexandra Hills","Alexandra Hills","Alexandra Hills",
               "Alexandra Hills","Alexandra Hills","Alexandra Hills",
               "Alexandra Hills","Alexandra Hills","Alexandra Hills",
               "Alexandra Hills","Alexandra Hills","Alexandra Hills",
               "Alexandra Hills","Alexandra Hills","Alexandra Hills",
               "Alexandra Hills","Alexandra Hills","Alexandra Hills",
               "Alexandra Hills","Alexandra Hills","Alexandra Hills",
               "Alexandra Hills","Alexandra Hills","Alexandra Hills",
               "Alexandra Hills","Alexandra Hills","Alexandra Hills",
               "Alexandra Hills","Alexandra Hills","Alexandra Hills",
               "Alexandra Hills","Alexandra Hills","Alexandra Hills",
               "Alexandra Hills","Alexandra Hills","Alexandra Hills",
               "Alexandra Hills","Alexandra Hills","Belmont - Gumdale",
               "Belmont - Gumdale","Belmont - Gumdale","Belmont - Gumdale",
               "Belmont - Gumdale"),
  Population = c(419L,335L,589L,381L,202L,
                 249L,278L,455L,47L,409L,547L,255L,330L,497L,466L,
                 535L,423L,524L,442L,299L,469L,378L,502L,225L,
                 294L,407L,269L,455L,401L,608L,425L,192L,434L,354L,
                 324L,453L,296L,272L,439L,454L,359L,338L,4L,0L,
                 372L,246L,455L,275L,527L,425L)
)

# we left out a keyword
dat %>% group_by(Postcode) %>% summarise(sum(Population))
#> # A tibble: 2 × 2
#>   Postcode `sum(Population)`
#>      <int>             <int>
#> 1     4155              1928
#> 2     4161             16406

aarontimo · November 15, 2022, 3:27am

I can't thank you enough! Makes sense when you see the solution.

system · November 22, 2022, 3:28am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.