Hello! I am relatively new to R and have been using the group_by function to work on a dataset and compare variables. I have run into two different issues with the function that may be due to incorrect syntax.
In the first, I turned a numeric variable (the duration of a project) into a factored categorical variable by percentile:
Percentile_00 = min(DSI_kickstarterscrape_dataset$duration)
Percentile_33 = quantile(DSI_kickstarterscrape_dataset$duration, 0.33333)
Percentile_67 = quantile(DSI_kickstarterscrape_dataset$duration, 0.66667)
Percentile_100 = max(DSI_kickstarterscrape_dataset$duration)
RB = rbind(Percentile_00, Percentile_33, Percentile_67, Percentile_100)
dimnames(RB)[[2]] = "Value"
RB
DSI_kickstarterscrape_dataset$GroupDuration[DSI_kickstarterscrape_dataset$duration >= Percentile_00 & DSI_kickstarterscrape_dataset$duration < Percentile_33] = "Lower_third"
DSI_kickstarterscrape_dataset$GroupDuration[DSI_kickstarterscrape_dataset$duration >= Percentile_33 & DSI_kickstarterscrape_dataset$duration < Percentile_67] = "Middle_third"
DSI_kickstarterscrape_dataset$GroupDuration[DSI_kickstarterscrape_dataset$duration >= Percentile_67 & DSI_kickstarterscrape_dataset$duration <= Percentile_100] = "Upper_third"
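For comparison, the same three-way split could probably be written more compactly with base R's cut(), which also returns a true factor rather than a character column. This is only a minimal sketch: the data frame `toy` and its values are made up for illustration, and cut() will stop with an error if two of the quantile break points happen to be equal.

```r
# Hypothetical toy data standing in for DSI_kickstarterscrape_dataset
toy <- data.frame(duration = c(5, 10, 15, 20, 25, 30, 35, 40, 45))

# Break points at the minimum, the 1/3 and 2/3 quantiles, and the maximum
breaks <- quantile(toy$duration, probs = c(0, 1/3, 2/3, 1), na.rm = TRUE)

# right = FALSE makes the intervals closed on the left, [a, b);
# include.lowest = TRUE then closes the last interval so the maximum is kept
toy$GroupDuration <- cut(
  toy$duration,
  breaks = breaks,
  labels = c("Lower_third", "Middle_third", "Upper_third"),
  right = FALSE,
  include.lowest = TRUE
)

# Number of rows that fall in each third
table(toy$GroupDuration)
```

The interval settings are chosen to mirror the >= / < comparisons in the code above, with the top group including the maximum.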
Then I used the group_by function to summarise the mean of a variable (pledges for projects) by the percentile groups in the GroupDuration variable:
mean_by_GroupDuration <- my_data %>%
  group_by(DSI_kickstarterscrape_dataset$GroupDuration) %>%
  summarize(mean(DSI_kickstarterscrape_dataset$pledged, na.rm = TRUE))
Calling mean_by_GroupDuration produced a table that showed the same mean pledge for each percentile group of the duration variable. The code ran without errors, but does this result make sense statistically?
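A likely reason for the identical means: inside summarize(), an expression like DSI_kickstarterscrape_dataset$pledged pulls the whole column from the full data frame and ignores the grouping, so every group receives the overall mean. Referring to the column by its bare name lets dplyr evaluate it within each group. The sketch below illustrates the difference on a made-up data frame `toy` (the values are hypothetical, and it assumes my_data and DSI_kickstarterscrape_dataset are meant to be the same data).

```r
library(dplyr)

# Hypothetical toy data; column names mirror the ones in the post
toy <- data.frame(
  GroupDuration = c("Lower_third", "Lower_third", "Middle_third",
                    "Middle_third", "Upper_third", "Upper_third"),
  pledged = c(10, 20, 30, 50, 100, NA)
)

# toy$pledged bypasses the grouping: every group gets the overall mean
same_means <- toy %>%
  group_by(GroupDuration) %>%
  summarise(mean_pledged = mean(toy$pledged, na.rm = TRUE))

# Bare column names are evaluated separately within each group
mean_by_GroupDuration <- toy %>%
  group_by(GroupDuration) %>%
  summarise(mean_pledged = mean(pledged, na.rm = TRUE))

print(same_means)
print(mean_by_GroupDuration)
```

In the first table all three rows repeat one number; in the second each group has its own mean, which is the kind of comparison the grouping is meant to give.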
In the second example, I ran all of the previous code the same way; the only change was swapping the variables. I grouped by a goal variable and then wanted to summarize a categorical variable with 5 values.
This code returns an error saying that the second variable needs to have a length of 1 rather than 5.
What I want is to see how the categorical variable (status) is distributed across the different goal percentile groups.
Should I use a different function than group_by or write it differently to get the result that I want?
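One possible way to get that kind of breakdown is to count rows per combination of goal group and status, instead of asking summarize() for a single value per group. The sketch below is only an illustration: `toy`, the column name `GroupGoal`, and the status values are hypothetical stand-ins, since the second piece of code is not shown here.

```r
library(dplyr)

# Hypothetical toy data; GroupGoal and the status values are made up
toy <- data.frame(
  GroupGoal = c("Lower_third", "Lower_third", "Lower_third",
                "Upper_third", "Upper_third", "Upper_third"),
  status = c("successful", "failed", "failed",
             "successful", "successful", "canceled")
)

status_by_goal <- toy %>%
  count(GroupGoal, status, name = "n") %>%  # one row per group/status pair
  group_by(GroupGoal) %>%
  mutate(share = n / sum(n)) %>%            # proportion within each goal group
  ungroup()

print(status_by_goal)
```

For a quick cross-tabulation without dplyr, table(toy$GroupGoal, toy$status) should give the same counts in a wide layout.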
Thank you.