Hello! I am relatively new to R and have been using the group_by function to compare variables in a dataset. I have run into two different issues when using the function that may be due to incorrect syntax.

In the first, I turned a numeric variable (the duration of a project) into a categorical variable with three groups, split at the 33rd and 67th percentiles:

Percentile_00 = min(DSI_kickstarterscrape_dataset$duration)
Percentile_33 = quantile(DSI_kickstarterscrape_dataset$duration, 0.33333)
Percentile_67 = quantile(DSI_kickstarterscrape_dataset$duration, 0.66667)
Percentile_100 = max(DSI_kickstarterscrape_dataset$duration)

RB = rbind(Percentile_00, Percentile_33, Percentile_67, Percentile_100)
dimnames(RB)[[2]] = "Value"
RB

DSI_kickstarterscrape_dataset$GroupDuration[DSI_kickstarterscrape_dataset$duration >= Percentile_00 & DSI_kickstarterscrape_dataset$duration < Percentile_33] = "Lower_third"
DSI_kickstarterscrape_dataset$GroupDuration[DSI_kickstarterscrape_dataset$duration >= Percentile_33 & DSI_kickstarterscrape_dataset$duration < Percentile_67] = "Middle_third"
DSI_kickstarterscrape_dataset$GroupDuration[DSI_kickstarterscrape_dataset$duration >= Percentile_67 & DSI_kickstarterscrape_dataset$duration <= Percentile_100] = "Upper_third"
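
For comparison, the same three-way split can probably be written more compactly with base R's cut(). This is only a sketch on made-up numbers, not the code I actually ran: the duration vector below is an assumption standing in for the real column, and cut() returns a factor, whereas the assignments above create a character column.

```r
# Made-up durations standing in for DSI_kickstarterscrape_dataset$duration
duration <- c(10, 20, 30, 40, 50, 60)

# Break points at the minimum, the two tercile boundaries, and the maximum
breaks <- quantile(duration, probs = c(0, 1/3, 2/3, 1))

# right = FALSE gives [a, b) bins; include.lowest = TRUE closes the
# last bin at the maximum, matching the >= / < / <= logic above
GroupDuration <- cut(
  duration,
  breaks = breaks,
  labels = c("Lower_third", "Middle_third", "Upper_third"),
  right = FALSE,
  include.lowest = TRUE
)

table(GroupDuration)
```

One caveat with this approach: cut() stops with an error if two break points coincide, which can happen when many projects share the same duration.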

Then I used the group_by function to summarise the mean of a variable (pledges for projects) for each group in the GroupDuration variable:

mean_by_GroupDuration <- my_data %>%
  group_by(DSI_kickstarterscrape_dataset$GroupDuration) %>%
  summarize(mean(DSI_kickstarterscrape_dataset$pledged, na.rm = TRUE))

Printing mean_by_GroupDuration gives a table that shows the same mean pledge for every duration group. The code appears to have run, but does this result make sense statistically?
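
A likely reason for the identical means: inside summarize(), an expression such as DSI_kickstarterscrape_dataset$pledged refers to the whole column rather than to the rows of the current group, so every group receives the overall mean. The sketch below illustrates the difference on made-up data; the toy data frame and its values are assumptions, not the real Kickstarter dataset.

```r
library(dplyr)

# Toy data standing in for the Kickstarter dataset (values are made up)
toy <- data.frame(
  GroupDuration = c("Lower_third", "Lower_third", "Upper_third", "Upper_third"),
  pledged       = c(100, 300, 1000, 3000)
)

# Bare column name: mean() only sees the rows of the current group
per_group <- toy %>%
  group_by(GroupDuration) %>%
  summarise(mean_pledged = mean(pledged, na.rm = TRUE))

# toy$pledged: mean() sees the full column, so each group gets the overall mean
overall <- toy %>%
  group_by(GroupDuration) %>%
  summarise(mean_pledged = mean(toy$pledged, na.rm = TRUE))

per_group
overall
```

With bare column names the two groups get different means, while the `$` version repeats one number for every group, which matches the table I am seeing.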

In the second example I ran all of the previous code the same way, the only change being the variables: I grouped by a goal variable and then tried to summarize a categorical variable with 5 values.

This code returns an error saying that the second variable needs to have a length of 1 rather than 5.

What I want to do is see how the categorical variable (status) is distributed across the different goal percentile groups.

Should I use a different function than group_by or write it differently to get the result that I want?
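
To make the target concrete, here is a sketch on made-up data of the kind of table I am after. It counts rows per combination of group and status with dplyr's count() instead of calling summarize() on the categorical variable, since the error suggests summarize() wants a single value per group. The column name GroupGoal and the status values are assumptions for illustration, not taken from the real dataset.

```r
library(dplyr)

# Toy data; GroupGoal and the status values are hypothetical
toy <- data.frame(
  GroupGoal = c("Lower_third", "Lower_third", "Lower_third", "Upper_third", "Upper_third"),
  status    = c("successful", "failed", "failed", "successful", "canceled")
)

# One row per (goal group, status) pair, with a count and a within-group share
status_by_goal <- toy %>%
  count(GroupGoal, status) %>%
  group_by(GroupGoal) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

status_by_goal
```

A base R cross-tabulation such as table(toy$GroupGoal, toy$status) would show the same counts in wide form.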

Thank you.