Help needed! Aggregated counts in R

NNN123 · April 2, 2019, 8:47am

Hi All,

I've just set out using R Studio and wondered if someone could help me create a column that counts the eligible responses to a survey we have sent out at the NHS.

I have three columns:

Council code
Question Response (1-4 or 'NA')
Total Eligible Responses

I need the 'Total Eligible Responses' column to be the number of valid Question Responses per council. So a count of 'Question Response' (where not NA) grouped by 'Council Code'.

For example if in council A there were 500 valid responses, I would like 500 to be repeated for the whole column alongside council A

I've tried attaching an example but it won't let me!

Cheers for any help!

KathleenC · April 2, 2019, 12:11pm

Welcome!

I tried to take your descriptionof the data and created a sample input dataset. Note the below is just a screenshot. The code to generate that is below and is assigned to the "dataset" variable.

To do this I used datapasta to paste from Sheets to R. I recommend this fabulous post on how to generate a reprex reprex-ing with {datapasta} 🍝 · Mara Averick.

#read in sample data from Sheets using datapasta
library(tidyverse)
dataset <- data.frame(stringsAsFactors=FALSE,
   CouncilCode = c("x", "x", "x", "x", "x", "y", "y", "y", "y", "z", "z", "z",
                   "z", "z", "z"),
      Response = c(1L, 1L, NA, NA , 3L, 4L, 4L, 3L, 1L, 1L, 3L, 3L, 2L, 2L, NA)
)

On to the actual computation. What I ended up doing here is grouping by CouncilCode and summing across Response where Response is not NA.

What I end up with is a column TotalEligibleResponses that is equal to the sum of responses that are not NA.

#computation
dataset %>% group_by(CouncilCode) %>%
  mutate(TotalEligibleResponses = sum(!is.na(Response)))

Check out the TotalEligibleResponses column.

NNN123 · April 2, 2019, 1:40pm

Hey thanks for your help Kathleen! Been really struggling with this.

In your example the TotalEligibleResponses column = 12 for all councils, would it be possible to code it so the column returned 3 for 'X', 4 for 'Y' and 5 for 'Z' ? i.e. separate count per council

KathleenC · April 2, 2019, 1:56pm

This is the output in the console I'm seeing. Are you running the suggested code snippet? Could you please share what you are running to get "12" in TotalEligibleResponses?

dataset %>% group_by(CouncilCode) %>%
+   mutate(TotalEligibleResponses = sum(!is.na(Response)))
Source: local data frame [15 x 3]
Groups: CouncilCode [3]

   CouncilCode Response TotalEligibleResponses
         <chr>    <int>                  <int>
1            x        1                      3
2            x        1                      3
3            x       NA                      3
4            x       NA                      3
5            x        3                      3
6            y        4                      4
7            y        4                      4
8            y        3                      4
9            y        1                      4
10           z        1                      5
11           z        3                      5
12           z        3                      5
13           z        2                      5
14           z        2                      5
15           z       NA                      5
>

The code I am running is:

library(tidyverse)
dataset <- data.frame(stringsAsFactors=FALSE,
                      CouncilCode = c("x", "x", "x", "x", "x", "y", "y", "y", "y", "z", "z", "z",
                                      "z", "z", "z"),
                      Response = c(1L, 1L, NA, NA , 3L, 4L, 4L, 3L, 1L, 1L, 3L, 3L, 2L, 2L, NA)
)

dataset %>% group_by(CouncilCode) %>%
  mutate(TotalEligibleResponses = sum(!is.na(Response)))

NNN123 · April 2, 2019, 2:07pm

Hi Kathleen, how odd! The code i'm running is identical I think as I've copied and pasted straight out of your reply:

library(tidyverse)
dataset <- data.frame(stringsAsFactors=FALSE,
CouncilCode = c("x", "x", "x", "x", "x", "y", "y", "y", "y", "z", "z", "z",
"z", "z", "z"),
Response = c(1L, 1L, NA, NA , 3L, 4L, 4L, 3L, 1L, 1L, 3L, 3L, 2L, 2L, NA)
)

dataset %>% group_by(CouncilCode) %>%
mutate(TotalEligibleResponses = sum(!is.na(Response)))

but it is returning this:
CouncilCode Response TotalEligibleResponses

1 x 1 12
2 x 1 12
3 x NA 12
4 x NA 12
5 x 3 12
6 y 4 12
7 y 4 12
8 y 3 12
9 y 1 12
10 z 1 12
11 z 3 12
12 z 3 12
13 z 2 12
14 z 2 12
15 z NA 12

KathleenC · April 2, 2019, 2:57pm

Hmmmm...I'm super perplexed. I've tried it on another machine and the results continue to be different integers. I'm wondering if it's some sort of package issue.

Try changing library(tidyverse) to library(dplyr). I wonder what would happen in that case on your end?

Not sure why this would make a difference though... perhaps someone with a deeper understanding of tidyverse packages and functions might weigh in.

NNN123 · April 2, 2019, 3:10pm

Hi Kathleen yes it's still returning the same results for me with dplyr unfortunately! Really annoying as it would be the perfect solution for me if my outputs only matched yours!
Thank you for all your help on this

mara · April 2, 2019, 3:11pm

Could you please include a self-contained reprex (short for reproducible example)? It will help us help you if we can be sure we're all working with/looking at the same stuff.

install.packages("reprex")

If you've never heard of a reprex before, you might want to start by reading the tidyverse.org help page. The reprex dos and don'ts are also useful.

There's also a nice FAQ on how to do a minimal reprex for beginners, below:

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

What to do if you run into clipboard problems

If you run into problems with access to your clipboard, you can specify an outfile for the reprex, and then copy and paste the contents into the forum.

reprex::reprex(input = "fruits_stringdist.R", outfile = "fruits_stringdist.md")

For pointers specific to the community site, check out the reprex FAQ.

NNN123 · April 3, 2019, 8:58am

Morning Mara,

I am not able to access my virtual machine today (at the NHS we require VMs in order to be able to download packages) but when I'm back in the office tomorrow I will do this and create a specific example using the reprex package.
Thank you for your ongoing help.
James

system · April 24, 2019, 8:58am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.