How can I create a column that displays what percentile of the data is a value in a different column of that dataframe

So my data looks like this, with # of rows = 6000 approx:

        pidp      avgy06
    1 68160489 20182.36849
    2 68575973 13845.49024
    3 69180553    35.61806
    4 69786365 13117.26465
    5 69815605 15791.40283
    6 69833973 10327.94531

I would like to know if there's a way to apply the quantile() function, so as to add another column that gives me what percentile of the data is each value of avgy06 . For example, this is the 100 percentiles of avgy06 :

      0%         1%         2%         3%         4%         5%         6%         7%         8%         9%        10% 
    0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     0.0000   189.0078 
       11%        12%        13%        14%        15%        16%        17%        18%        19%        20%        21% 
  790.2671  1505.4875  2364.4903  2900.0230  3441.0689  3680.2787  4246.6805  4595.0131  4704.8372  4904.6381  5217.9201 
       22%        23%        24%        25%        26%        27%        28%        29%        30%        31%        32% 
 5421.2263  5621.4581  6166.7022  6673.1660  6851.0085  7261.1324  7588.7569  7947.6250  8292.3789  8606.2774  8938.2232 
       33%        34%        35%        36%        37%        38%        39%        40%        41%        42%        43% 
 9286.9695  9665.7901  9885.2171 10035.7984 10280.0676 10423.1376 10633.2589 10886.2913 11205.7540 11411.0259 11581.6681 
       44%        45%        46%        47%        48%        49%        50%        51%        52%        53%        54% 
11763.5549 11926.4006 12210.2935 12434.3433 12581.4526 12781.9956 13135.6904 13305.6350 13666.1352 13814.4657 14046.4000 
       55%        56%        57%        58%        59%        60%        61%        62%        63%        64%        65% 
14258.2219 14431.6258 14631.6608 14940.7309 15168.2559 15385.1055 15583.7370 15757.0793 15906.4169 16094.3642 16448.5898 
       66%        67%        68%        69%        70%        71%        72%        73%        74%        75%        76% 
16683.5195 16817.0613 17049.2498 17361.5975 17663.5911 18004.6763 18309.8879 18614.3184 18871.4102 19220.2478 19529.0051 
       77%        78%        79%        80%        81%        82%        83%        84%        85%        86%        87% 
19962.8668 20249.0984 20526.2794 20690.6686 20896.2913 21135.7998 21396.8414 21763.6818 22070.5915 22494.2696 23000.0000 
       88%        89%        90%        91%        92%        93%        94%        95%        96%        97%        98% 
23486.0340 24206.6486 25106.3743 26261.0410 26593.7715 27402.7684 28079.6456 28910.4655 30315.5573 32447.8075 39225.6094 
       99%       100% 
41759.9540 57456.0758 

I would like an extra column in my dataframe that reads the value of avgy06 and returns to what percentile of the data does such value correspond to (e.g 75th, 63th...). If there's another way without using the quantile() function, let me know.

Many thanks!

You could use rank() to get this. If you divide the rank() of the value by how many there are, it should give you the quantiles. If you data.frame is called df, you can use

df$quantiley06 = rank(df$avgy06)/nrow(df)
1 Like

I think what you are looking for is the empirical cumulative distribution function. ecdf() generates a distribution function, which you can then pass the same vector back into to get the actual quantiles.

df$quantile <- ecdf(df$avgy06)(df$avgy06)
1 Like

Thanks a lot for the solution. If I can just ask, why is the second (df$avgy06) needed in that expression?

ecdf() returns a function (the cumulative distribution function). To get the quantiles themselves you need to pass the specific values back into it. Look at the documentation to understand exactly what it does and how it is used.

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.