In this data frame X represents The location, and Y represents the type of offence corresponding to that location.
My aim is to find out the most common offence for a particular type of location and then present the answer of that in a summary table. How can i go about doing this when I have thousands of data points?
For this example, of course the most common offence in croydon is theft. But how can I use code to find this answer.
p.s. I am a new user to R and have only had experience with tidyverse, dply and ggplot2.
Here is a way to do it using the tidyverse. No guarantees that this is the best/most efficient way, but it works!
As above, I'm calling your sample data frame df (thanks for providing reproducible sample data, by the way!
# For clarity, rename X and Y to more descriptive names
df <- structure(list(X = c("Camden", "Bromley", "Croydon", "Bromley",
"Croydon", "Camden"), Y = c("Theft", "Theft", "Theft", "Burglary",
"Theft", "Robbery")), class = "data.frame", row.names = c(NA,
-6L)) %>% rename("location" = X, "offense" = Y)
# Calculate the number of times each offense occurred in each location
tallied <- df %>% group_by(location, offense) %>%
summarize(frequency = n()) %>%
ungroup()
# Filter down to the most common offense for each location
summ <- tallied %>%
group_by(location) %>%
filter(frequency == max(frequency)) %>%
ungroup()
You could definitely combine the steps above into a single pipeline if you prefer; I just left them separate so it's a little easier to read.
Note also that my solution above preserves both answers if there are ties. If you want to select just a single offense in the case of a tie, you might try using slice() to take e.g. only the first or only the last row after grouping by location.