Issue when using the group_by function in tidyverse package

Ishan16.D · July 29, 2019, 8:27pm

Hello! So I am relatively new to R and have been using the group_by function to work on a dataset and compare variables. I have run into 2 different issues when using the function that may be due to incorrect syntax.

In the first, I turned a numeric variable (the duration of a project) into a factored categorical variable by percentile:

Percentile_00 = min(DSI_kickstarterscrape_dataset$duration)
Percentile_33 = quantile(DSI_kickstarterscrape_dataset$duration, 0.33333)
Percentile_67 = quantile(DSI_kickstarterscrape_dataset$duration, 0.66667)
Percentile_100 = max(DSI_kickstarterscrape_dataset$duration)
RB = rbind(Percentile_00, Percentile_33, Percentile_67, Percentile_100)

dimnames(RB)[[2]] = "Value"

RB
DSI_kickstarterscrape_dataset$GroupDuration[DSI_kickstarterscrape_dataset$duration >= Percentile_00 & DSI_kickstarterscrape_dataset$duration < Percentile_33] = "Lower_third"
DSI_kickstarterscrape_dataset$GroupDuration[DSI_kickstarterscrape_dataset$duration >= Percentile_33 & DSI_kickstarterscrape_dataset$duration < Percentile_67] = "Middle_third"
DSI_kickstarterscrape_dataset$GroupDuration[DSI_kickstarterscrape_dataset$duration >= Percentile_67 & DSI_kickstarterscrape_dataset$duration <= Percentile_100] = "Upper_third"

Then I used the group_by function to summarise the mean of a variable (pledges for projects) by the percentiles in the GroupDuration variable:

mean_by_GroupDuration <- my_data %>%
group_by(DSI_kickstarterscrape_dataset$GroupDuration) %>%
summarize(mean(DSI_kickstarterscrape_dataset$pledged,na.rm = TRUE)) %>%

Calling mean_by_GroupDuration printed a table that showed the same mean pledge for each percentile group of the duration variable. The code appears to have run, but does this result make sense statistically?

In the second example I ran all of the previous code the same way, the only change being that I swapped the variables. I grouped by a goal variable and then wanted to summarize a categorical variable with 5 values.

This code returns an error that the second variable needs to have a length of 1 rather than 5.

What I want to do is see what the spread of the categorical variable (status) was for the different goal percentiles.

Should I use a different function than group_by or write it differently to get the result that I want?

Thank you.

andresrcs · July 29, 2019, 9:33pm

Hi!

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

FAQ: How to do a minimal reproducible example (reprex) for beginners

Samuel_Onyango · August 6, 2019, 6:38am

Write your code like this:
my_data %>%
  group_by(group_duration) %>%
  summarise(means = mean(pledged, na.rm = TRUE))
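
To see why the version with dollar signs gave the same mean in every row, here is a small sketch on made-up data. The data frame and its values are invented for illustration; only the column names follow the code above.

```r
library(dplyr)

# Made-up data standing in for the Kickstarter data set (hypothetical values)
my_data <- tibble(
  group_duration = c("Lower_third", "Lower_third", "Upper_third", "Upper_third"),
  pledged = c(10, 20, 100, 200)
)

# `my_data$pledged` is always the whole column, so every group
# receives the overall mean instead of its own mean
with_dollar <- my_data %>%
  group_by(group_duration) %>%
  summarise(means = mean(my_data$pledged, na.rm = TRUE))

# Bare `pledged` refers only to the rows of the current group
without_dollar <- my_data %>%
  group_by(group_duration) %>%
  summarise(means = mean(pledged, na.rm = TRUE))
```

Printing with_dollar and without_dollar side by side should show one repeated value in the first table and a separate mean per group in the second.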

NOTE:

With tidyverse (dplyr) you don't need the dollar signs; it picks the variables from my_data.
I am assuming that you have group_duration and pledged variables in the data set (my_data) and also that pledged is of either numeric or integer class.
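
For the second question (the spread of status within each goal group), one possible approach is to count the combinations rather than summarise to a single value, since the error reported that summarise received something of length 5 where it wanted length 1. This is only a sketch: the column names goal_group and status and the data values are assumptions, not taken from the real data set.

```r
library(dplyr)

# Made-up data; `goal_group` and `status` are assumed column names
my_data <- tibble(
  goal_group = c("Lower_third", "Lower_third", "Lower_third",
                 "Upper_third", "Upper_third"),
  status = c("successful", "failed", "successful", "failed", "failed")
)

status_by_goal <- my_data %>%
  count(goal_group, status) %>%   # one row per goal_group/status pair, count in `n`
  group_by(goal_group) %>%
  mutate(share = n / sum(n)) %>%  # proportion of each status within its goal group
  ungroup()
```

The result has one row per combination that occurs in the data, with the count in n and the within-group proportion in share.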

Best
