Difficulty Calculating Percentiles?

omario · December 30, 2022, 6:39pm

I have the following dataset:

library(dplyr)

set.seed(123)
gender <- factor(sample(c("Male", "Female"), 5000, replace=TRUE, prob=c(0.45, 0.55)))
status <- factor(sample(c("Immigrant", "Citizen"), 5000, replace=TRUE, prob=c(0.3, 0.7)))
country <- factor(sample(c("A", "B", "C", "D"), 5000, replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25)))
disease <- factor(sample(c("Yes", "No"), 5000, replace=TRUE, prob=c(0.4, 0.6)))

my_data <- data.frame(gender, status, disease, country, var1 = rnorm(5000, 5000, 5000), var2 = rnorm(5000, 5000, 5000))

I then have this function used to calculate arbitrary percentiles for variables :

ptile <- function(x, n_percentiles) {
  # Calculate the percentiles
  pct <- quantile(x, probs = seq(0, 1, 1/n_percentiles))

  # Create a character vector to store the labels
  labels <- sprintf("%.2f to %.2f percentile %d",
                    head(pct, -1), tail(pct, -1), seq_len(n_percentiles))

  cut(x, breaks = pct, labels = labels, include.lowest = TRUE)
}

When I use this function sometimes:

# error not produced on this dataset, but on other datasets

na.omit(my_data) %>%
    group_by(gender, status, country) %>%
    mutate(result1 = ptile(var1, 10), result2 = ptile(var1, 5))

I get one of two errors:

Error in cut.default(x, breaks = pct, lables = labels, include.lowest = TRUE): invalid number of intervals
Error in cut.default(x, breaks = pct, labels = labels, include.lowest = TRUE) : 'breaks' are not unique

At first I thought that these errors are being produced because I am using this function on "groups of rows" - and in some of these rows, there might be too few rows for the desired percentile to be calculated?

I had originally thought that perhaps I could fix this problem by excluding "groups" with an insufficient number of rows:

na.omit(my_data) %>%
    group_by(gender, status, country) %>%
filter(n() < 5) %>%
    mutate(result1 = ptile(var1, 10), result2 = ptile(var1, 5))

But the same errors still persist.

I was wondering - is there some way to modify this percentile function such that when percentiles might not be able to be calculated at the desired level, the next closest level of percentiles can be calculated?

As an example, if I want percentiles at groups of 10 and this is not possible - perhaps percentiles at groups of 15 or groups of 20 might be possible?

As another example, suppose some group of observations (e.g. male, immigrant, country A) only has 1 observation and I want percentiles in groups of 10 - without knowing that such a group exists in advance, is it possible to modify this ptile function such that it either ignores this group or just calculates the closest possible percentile (e.g. places everything into 1)?

In general, how can I change this ptile function so that these errors can be fixed?

Can someone please suggest a way to do this?

Thanks!

omario · December 31, 2022, 1:55am

Various attempts to solve this problem on my own:

Attempt 1:

final_answer = my_data %>% group_by(gender, status, country) %>% 
  mutate(result1 = case_when(ntile(var1, 10) == 1 ~ paste0(round(min(var1), 2), " to ", round(quantile(var1, 0.1), 2), " group 1"),
                            ntile(var1, 10) == 2 ~ paste0(round(quantile(var1, 0.1), 2), " to ", round(quantile(var1, 0.2), 2), " group 2"),
                            ntile(var1, 10) == 3 ~ paste0(round(quantile(var1, 0.2), 2), " to ", round(quantile(var1, 0.3), 2), " group 3"),
                            ntile(var1, 10) == 4 ~ paste0(round(quantile(var1, 0.3), 2), " to ", round(quantile(var1, 0.4), 2), " group 4"),
                            ntile(var1, 10) == 5 ~ paste0(round(quantile(var1, 0.4), 2), " to ", round(quantile(var1, 0.5), 2), " group 5"),
                            ntile(var1, 10) == 6 ~ paste0(round(quantile(var1, 0.5), 2), " to ", round(quantile(var1, 0.6), 2), " group 6"),
                            ntile(var1, 10) == 7 ~ paste0(round(quantile(var1, 0.6), 2), " to ", round(quantile(var1, 0.7), 2), " group 7"),
                            ntile(var1, 10) == 8 ~ paste0(round(quantile(var1, 0.7), 2), " to ", round(quantile(var1, 0.8), 2), " group 8"),
                            ntile(var1, 10) == 9 ~ paste0(round(quantile(var1, 0.8), 2), " to ", round(quantile(var1, 0.9), 2), " group 9"),
                            ntile(var1, 10) == 10 ~ paste0(round(quantile(var1, 0.9), 2), " to ", round(max(var1), 2), " group 10"))) %>% 
  mutate(result2 = case_when(ntile(var2, 10) == 1 ~ paste0(round(min(var2), 2), " to ", round(quantile(var2, 0.1), 2), " group 1"),
                            ntile(var2, 10) == 2 ~ paste0(round(quantile(var2, 0.1), 2), " to ", round(quantile(var2, 0.2), 2), " group 2"),
                            ntile(var2, 10) == 3 ~ paste0(round(quantile(var2, 0.2), 2), " to ", round(quantile(var2, 0.3), 2), " group 3"),
                            ntile(var2, 10) == 4 ~ paste0(round(quantile(var2, 0.3), 2), " to ", round(quantile(var2, 0.4), 2), " group 4"),
                            ntile(var2, 10) == 5 ~ paste0(round(quantile(var2, 0.4), 2), " to ", round(quantile(var2, 0.5), 2), " group 5"))

Attempt 2:

percentile_classifier <- function(x, n_percentiles) {
  # Calculate the percentiles
  percentiles <- quantile(x, probs = seq(0, 1, 1/n_percentiles))

  # Create a character vector to store the labels
  labels <- character(length(x))

  # Loop through each percentile and assign the corresponding label to each element in the vector
  for (i in 1:length(percentiles)) {
    lower <- percentiles[i]
    upper <- ifelse(i == length(percentiles), max(x), percentiles[i+1])
    label <- paste0(round(lower, 2), " to ", round(upper, 2), " percentile ", i)
    labels[x >= lower & x < upper] <- label
  }

  # Return the labels
  return(labels)
}

final <- my_data %>% group_by(gender, status, country) %>% mutate(result1 = percentile_classifier(var1, 10))  %>% mutate(result2 = percentile_classifier(var2, 5))

Anyone have any thoughts? Are these approaches correct?

Thanks!

nirgrahamuk · January 1, 2023, 9:32pm

It would be easier mire convenuent to support you if yiur issues were readily reproducible

error not produced on this dataset, but on other datasets

system · February 12, 2023, 9:33pm

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.