Issue when using the group_by function in tidyverse package

Hello! So I am relatively new to R and have been using the group_by function to work on a dataset and compare variables. I have run into 2 different issues when using the function that may be due to incorrect syntax.

In the first, I turned a numeric variable (the duration of a project) into a factored categorical variable by percentile:

Percentile_00 = min(DSI_kickstarterscrape_dataset$duration)
Percentile_33 = quantile(DSI_kickstarterscrape_dataset$duration, 0.33333)
Percentile_67 = quantile(DSI_kickstarterscrape_dataset$duration, 0.66667)
Percentile_100 = max(DSI_kickstarterscrape_dataset$duration)
RB = rbind(Percentile_00, Percentile_33, Percentile_67, Percentile_100)

dimnames(RB)[[2]] = "Value"

RB
DSI_kickstarterscrape_dataset$GroupDuration[DSI_kickstarterscrape_dataset$duration >= Percentile_00 & DSI_kickstarterscrape_dataset$duration < Percentile_33] = "Lower_third"
DSI_kickstarterscrape_dataset$GroupDuration[DSI_kickstarterscrape_dataset$duration >= Percentile_33 & DSI_kickstarterscrape_dataset$duration < Percentile_67] = "Middle_third"
DSI_kickstarterscrape_dataset$GroupDuration[DSI_kickstarterscrape_dataset$duration >= Percentile_67 & DSI_kickstarterscrape_dataset$duration <= Percentile_100] = "Upper_third"
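
For comparison, the same kind of three-level split can probably be written more compactly with base R's cut() and quantile(). This is only a minimal sketch on made-up data: the toy data frame and its values are hypothetical, not the Kickstarter dataset. Note that cut() stops with an error if two break points coincide, which can happen when many projects share the same duration.

```r
# Hypothetical toy data standing in for DSI_kickstarterscrape_dataset
toy <- data.frame(duration = c(10, 15, 20, 30, 30, 45, 60, 75, 90))

# Break points at the minimum, the two tercile boundaries, and the maximum
breaks <- quantile(toy$duration, probs = c(0, 1/3, 2/3, 1))

# right = FALSE builds [lower, upper) intervals, matching the >= / < logic above;
# include.lowest = TRUE closes the last interval so the maximum is not dropped
toy$GroupDuration <- cut(
  toy$duration,
  breaks = breaks,
  labels = c("Lower_third", "Middle_third", "Upper_third"),
  right = FALSE,
  include.lowest = TRUE
)

# Number of rows that landed in each third
table(toy$GroupDuration)
```

The result is a factor, so the levels keep their Lower/Middle/Upper order in later summaries.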

Then I used the group_by function to summarise the mean of a variable (pledges for projects) by the percentiles in the GroupDuration variable:

mean_by_GroupDuration <- my_data %>%
  group_by(DSI_kickstarterscrape_dataset$GroupDuration) %>%
  summarize(mean(DSI_kickstarterscrape_dataset$pledged, na.rm = TRUE))

Printing mean_by_GroupDuration gave a table that showed the same mean pledge for each third of the duration variable. The code ran, but does this result make sense statistically?

In the second example I ran the same code, only swapping the variables. I grouped by a goal variable and then wanted to summarize a categorical variable with 5 values.

This code returns an error that the second variable needs to have a length of 1 rather than 5.

What I want to do is see what the spread of the categorical variable (status) was for the different goal percentiles.

Should I use a different function than group_by or write it differently to get the result that I want?

Thank you.

Hi!

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:


Write your code like this:
my_data %>%
  group_by(group_duration) %>%
  summarise(means = mean(pledged, na.rm = TRUE))

NOTE:

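  1. With tidyverse (dplyr) you don't need the dollar signs; it picks the variables from my_data. Writing DSI_kickstarterscrape_dataset$pledged inside summarise() passes the whole column regardless of the grouping, which is why every group showed the same mean.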

  2. I am assuming that you have group_duration and pledged variables in the data set (my_data) and that pledged is of class numeric or integer.
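
To see why the dollar signs matter, here is a minimal sketch on made-up data. The values are hypothetical, and the column names are simply the ones assumed in the notes above.

```r
library(dplyr)

# Hypothetical toy data, not the real Kickstarter dataset
my_data <- data.frame(
  group_duration = c("Lower_third", "Lower_third", "Upper_third", "Upper_third"),
  pledged = c(10, 20, 100, 200)
)

# With `$`, mean() receives the full column in every group,
# so each row of the result repeats the overall mean
with_dollar <- my_data %>%
  group_by(group_duration) %>%
  summarise(means = mean(my_data$pledged, na.rm = TRUE))

# With the bare column name, mean() only sees the rows of the current group
bare <- my_data %>%
  group_by(group_duration) %>%
  summarise(means = mean(pledged, na.rm = TRUE))

with_dollar
bare
```

In this toy case with_dollar shows the overall mean twice, while bare shows a different mean per group, which is the behaviour the original question was after.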

Best
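
On the second question (the spread of status across the goal groups): summarise() expects each summary to return one value per group, which is likely where the length error came from, and a categorical variable has no mean to report anyway. Counting the combinations is probably closer to what is wanted. A minimal sketch, assuming a grouping column called GroupGoal (a hypothetical name built the same way as GroupDuration) and made-up status values:

```r
library(dplyr)

# Hypothetical toy data; GroupGoal and the status values are assumptions
my_data <- data.frame(
  GroupGoal = c("Lower_third", "Lower_third", "Lower_third", "Upper_third", "Upper_third"),
  status = c("successful", "failed", "failed", "successful", "canceled")
)

status_by_goal <- my_data %>%
  count(GroupGoal, status) %>%     # one row per (goal group, status) pair
  group_by(GroupGoal) %>%
  mutate(share = n / sum(n)) %>%   # proportion of each status within its goal group
  ungroup()

status_by_goal
```

If a wide cross-table is easier to read, base R's table(my_data$GroupGoal, my_data$status) gives the same counts.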
