How to analyse ranking data in R please?

Hello there,

I am a new R user, and you will probably realise this from the silly questions that I ask. I'll apologise for this now.

I have an Excel CSV file that I have read into R and am doing some descriptive statistics on this data set. The data set has 264 observations (i.e. 264 rows) and 50 variables (i.e. 50 columns).

I have 2 ranking questions in this data set.
1). The first ranking question asks the 264 people to rank the following: model, location, education, fee, income; with 1 being the most important and 5 being the least important (respondents can't assign the same number to more than one option). The whole 50-column data set contains these 5 variables in 5 different columns with the same column names.
I would like to please work out the mean average ranking for each of the 5 variables (i.e. model, location, education, fee, income) to work out the overall importance rankings for the 5 variables in the data set i.e. the most important variable down to the least important variable. I know that the mean average ranking can be worked out as the sum of (the ranking numbers (column headers) times their respective count frequency/proportions).

2). The second ranking question asks the 264 people to rank the following colour variables: red, blue, green, yellow, black; with 1 being the most preferred colour and 5 being the least preferred colour (respondents can't assign the same number to more than one option). The whole 50-column data set contains these 5 variables in 5 different columns with the same column names.
I would like to please work out the mean average ranking for each of the 5 variables (i.e. red, blue, green, yellow, black) to work out the overall preference ranking for these 5 variables in the data set, i.e. the most preferred colour down to the least preferred colour. Again I would want to use the mean average ranking that is worked out as the sum of (the ranking numbers (column headers) times their respective count frequency/proportions).

My beginner level is showing here, sorry.
Firstly, I'm not even sure how to isolate the required variables to work out each of my 2 ranking questions.
Secondly, I don't know what R package or function to use to work out the 2 ranking questions.

Any help is gratefully accepted and appreciated please.

Many thanks.

Hi!

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

Hi andresrcs,

I have tried to prepare a reprex below.
Again my beginner level may mean that my reprex is not great.
Also I didn't include any R code in the reprex as this is part of my problem i.e. I don't know what R package or function to use to get the mean average rank for my 2 ranking questions below.

#This is an example of my data set i.e 1st 5 respondents for whole data set:
head (mydata)
#> Error in head(mydata): object 'mydata' not found
data.frame(
                         Progress = c(100L, 100L, 100L, 100L, 100L),
            Duration..in.seconds. = c(1770L, 1030L, 644L, 3988L, 1292L),
                               Id = c(1L, 2L, 3L, 4L, 5L),
                            model = c(4L, 2L, 1L, 3L, 2L),
                         location = c(1L, 3L, 2L, 2L, 3L),
                        education = c(3L, 1L, 3L, 1L, 1L),
                              fee = c(2L, 5L, 4L, 5L, 4L),
                           income = c(5L, 4L, 5L, 4L, 5L),
                              red = c(4L, 1L, 2L, 2L, 4L),
                             blue = c(3L, 2L, 1L, 1L, 3L),
                            green = c(1L, 4L, 4L, 3L, 2L),
                           yellow = c(2L, 5L, 3L, 4L, 1L),
                            black = c(5L, 3L, 5L, 5L, 5L),
                              Age = c(47L, 47L, 51L, 50L, 38L),
                    Recorded.Date = as.factor(c("15/06/2018 21:29",
                                                "16/06/2018 15:47",
                                                "18/06/2018 19:07", "19/06/2018 20:29",
                                                "20/06/2018 13:59")),
                              RID = as.factor(c("R_Djkev4OH9F3RuIp",
                                                "R_2vY3qfyS8vNWvCH",
                                                "R_1Rr1Eh9iCI3wznj", "R_T1rPDENUBBntTCF",
                                                "R_3inja17CkIpsjHr")),
                     Distribution = as.factor(c("anonymous", "anonymous",
                                                "anonymous", "anonymous",
                                                "anonymous")),
                            Block = as.factor(c("A", "C", "A", "C", "B")),
   Difficulty.of.choice.questions = as.factor(c("Moderately easy",
                                                "Moderately easy",
                                                "Extremely difficult", "Extremely difficult",
                                                "Extremely difficult")),
                           Gender = as.factor(c("Female", "Male", "Female",
                                                "Female", "Female")),
                        Ethnicity = as.factor(c("Other", "European",
                                                "European", "Other",
                                                "European")),
        Current.job.satisfaction4 = as.factor(c("Somewhat dissatisfied",
                                                "Extremely satisfied",
                                                "Somewhat satisfied",
                                                "Somewhat satisfied", "Somewhat satisfied")),
        Current.job.satisfaction2 = as.factor(c("Dissatisfied", "Satisfied",
                                                "Satisfied", "Satisfied",
                                                "Satisfied"))
)
#>   Progress Duration..in.seconds. Id model location education fee income
#> 1      100                  1770  1     4        1         3   2      5
#> 2      100                  1030  2     2        3         1   5      4
#> 3      100                   644  3     1        2         3   4      5
#> 4      100                  3988  4     3        2         1   5      4
#> 5      100                  1292  5     2        3         1   4      5
#>   red blue green yellow black Age    Recorded.Date               RID
#> 1   4    3     1      2     5  47 15/06/2018 21:29 R_Djkev4OH9F3RuIp
#> 2   1    2     4      5     3  47 16/06/2018 15:47 R_2vY3qfyS8vNWvCH
#> 3   2    1     4      3     5  51 18/06/2018 19:07 R_1Rr1Eh9iCI3wznj
#> 4   2    1     3      4     5  50 19/06/2018 20:29 R_T1rPDENUBBntTCF
#> 5   4    3     2      1     5  38 20/06/2018 13:59 R_3inja17CkIpsjHr
#>   Distribution Block Difficulty.of.choice.questions Gender Ethnicity
#> 1    anonymous     A                Moderately easy Female     Other
#> 2    anonymous     C                Moderately easy   Male  European
#> 3    anonymous     A            Extremely difficult Female  European
#> 4    anonymous     C            Extremely difficult Female     Other
#> 5    anonymous     B            Extremely difficult Female  European
#>   Current.job.satisfaction4 Current.job.satisfaction2
#> 1     Somewhat dissatisfied              Dissatisfied
#> 2       Extremely satisfied                 Satisfied
#> 3        Somewhat satisfied                 Satisfied
#> 4        Somewhat satisfied                 Satisfied
#> 5        Somewhat satisfied                 Satisfied


#This is the subset that I want to use to work out my first ranking question i.e. ranking of the 5 variables: model,
#location, education, fee, income (with 1 being most important to 5 being least important)
#I don't know how to get this subset from my whole data set to work out the ranking question here:
head (mydata, 5)[, c('Id', 'model', 'location', 'education', 'fee', 'income')]
#> Error in head(mydata, 5): object 'mydata' not found
data.frame(
          Id = c(1L, 2L, 3L, 4L, 5L),
       model = c(4L, 2L, 1L, 3L, 2L),
    location = c(1L, 3L, 2L, 2L, 3L),
   education = c(3L, 1L, 3L, 1L, 1L),
         fee = c(2L, 5L, 4L, 5L, 4L),
      income = c(5L, 4L, 5L, 4L, 5L)
)
#>   Id model location education fee income
#> 1  1     4        1         3   2      5
#> 2  2     2        3         1   5      4
#> 3  3     1        2         3   4      5
#> 4  4     3        2         1   5      4
#> 5  5     2        3         1   4      5



#This is the subset that I want to use to work out my second ranking question i.e. ranking of the 5 colour variables: 
#red, blue, green, yellow, black (with 1 being most preferred to 5 being least preferred).
#Again, I don't know how to get this subset from my whole data set to work out the ranking question here:
head (mydata, 5)[, c('Id', 'red', 'blue', 'green', 'yellow', 'black')]
#> Error in head(mydata, 5): object 'mydata' not found
data.frame(
          Id = c(1L, 2L, 3L, 4L, 5L),
         red = c(4L, 1L, 2L, 2L, 4L),
        blue = c(3L, 2L, 1L, 1L, 3L),
       green = c(1L, 4L, 4L, 3L, 2L),
      yellow = c(2L, 5L, 3L, 4L, 1L),
       black = c(5L, 3L, 5L, 5L, 5L)
)
#>   Id red blue green yellow black
#> 1  1   4    3     1      2     5
#> 2  2   1    2     4      5     3
#> 3  3   2    1     4      3     5
#> 4  4   2    1     3      4     5
#> 5  5   4    3     2      1     5


# I would like to use mean average ranking i.e. the sum of (the ranking numbers times their respective count frequency/proportions) for these 2 ranking questions.
#1). Firstly, I don't know how to isolate the required variables to work out each of my 2 ranking questions from my whole data set.
#Secondly, I don't know what R package or function to use to work out the mean average ranking for these 2 ranking questions.
#Thank you

Here is an example of working with your data. It uses the functions select(), gather(), group_by() and summarize()

  • select() chooses columns.
  • gather() combines many columns into two, one column to label the data and one to store the value
  • group_by() makes subsets of the data, one subset for each value of the chosen column(s).
  • summarize() does calculations on the subsets created by group_by()

Look at the intermediate data frames at each calculation step and see if it makes sense. This is a lot to digest at one time, so do not be surprised if you are confused.

library(dplyr)
library(tidyr)
Dat <- data.frame(
  Progress = c(100L, 100L, 100L, 100L, 100L),
  Duration..in.seconds. = c(1770L, 1030L, 644L, 3988L, 1292L),
  Id = c(1L, 2L, 3L, 4L, 5L),
  model = c(4L, 2L, 1L, 3L, 2L),
  location = c(1L, 3L, 2L, 2L, 3L),
  education = c(3L, 1L, 3L, 1L, 1L),
  fee = c(2L, 5L, 4L, 5L, 4L),
  income = c(5L, 4L, 5L, 4L, 5L),
  red = c(4L, 1L, 2L, 2L, 4L),
  blue = c(3L, 2L, 1L, 1L, 3L),
  green = c(1L, 4L, 4L, 3L, 2L),
  yellow = c(2L, 5L, 3L, 4L, 1L),
  black = c(5L, 3L, 5L, 5L, 5L),
  Age = c(47L, 47L, 51L, 50L, 38L),
  Recorded.Date = as.factor(c("15/06/2018 21:29",
                              "16/06/2018 15:47",
                              "18/06/2018 19:07", "19/06/2018 20:29",
                              "20/06/2018 13:59")),
  RID = as.factor(c("R_Djkev4OH9F3RuIp",
                    "R_2vY3qfyS8vNWvCH",
                    "R_1Rr1Eh9iCI3wznj", "R_T1rPDENUBBntTCF",
                    "R_3inja17CkIpsjHr")),
  Distribution = as.factor(c("anonymous", "anonymous",
                             "anonymous", "anonymous",
                             "anonymous")),
  Block = as.factor(c("A", "C", "A", "C", "B")),
  Difficulty.of.choice.questions = as.factor(c("Moderately easy",
                                               "Moderately easy",
                                               "Extremely difficult", "Extremely difficult",
                                               "Extremely difficult")),
  Gender = as.factor(c("Female", "Male", "Female",
                       "Female", "Female")),
  Ethnicity = as.factor(c("Other", "European",
                          "European", "Other",
                          "European")),
  Current.job.satisfaction4 = as.factor(c("Somewhat dissatisfied",
                                          "Extremely satisfied",
                                          "Somewhat satisfied",
                                          "Somewhat satisfied", "Somewhat satisfied")),
  Current.job.satisfaction2 = as.factor(c("Dissatisfied", "Satisfied",
                                          "Satisfied", "Satisfied",
                                          "Satisfied"))
  )

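# Calculate mean ranks for the first ranking question
Columns1 <- Dat %>% select(model:income)  # keep only the five ranking columns
Col1_tall <- Columns1 %>% gather(key = Feature, value = Rank, model:income)  # one row per respondent and feature
Stats1 <- Col1_tall %>% group_by(Feature) %>% summarize(Avg = mean(Rank))  # average rank per feature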
Stats1
#> # A tibble: 5 x 2
#>   Feature     Avg
#>   <chr>     <dbl>
#> 1 education   1.8
#> 2 fee         4  
#> 3 income      4.6
#> 4 location    2.2
#> 5 model       2.4

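# Same steps for the colour ranking question
Colors <- Dat %>% select(red:black)  # keep only the five colour columns
Colors_tall <- Colors %>% gather(key = Color, value = Rank, red:black)  # one row per respondent and colour
ColorStats <- Colors_tall %>% group_by(Color) %>% summarize(Avg = mean(Rank))  # average rank per colour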
ColorStats
#> # A tibble: 5 x 2
#>   Color    Avg
#>   <chr>  <dbl>
#> 1 black    4.6
#> 2 blue     2  
#> 3 green    2.8
#> 4 red      2.6
#> 5 yellow   3

Created on 2019-09-17 by the reprex package (v0.2.1)
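
For readers who prefer not to load extra packages, the same mean ranks can be sketched in base R. The block below uses only the five example respondents shown above, picks the ranking columns by name, and also checks that the plain column mean agrees with the "rank value times proportion" formula from the original question. The object names (feature_cols, mean_rank, weighted_rank) are illustrative, not part of the original answer.

```r
# First five respondents from the thread (first ranking question only).
Dat <- data.frame(
  model     = c(4L, 2L, 1L, 3L, 2L),
  location  = c(1L, 3L, 2L, 2L, 3L),
  education = c(3L, 1L, 3L, 1L, 1L),
  fee       = c(2L, 5L, 4L, 5L, 4L),
  income    = c(5L, 4L, 5L, 4L, 5L)
)

# Isolate the ranking columns by name (hypothetical helper vector).
feature_cols <- c("model", "location", "education", "fee", "income")

# Mean rank per column, sorted so the lowest mean (most important) comes first.
mean_rank <- sort(colMeans(Dat[, feature_cols]))

# Same quantity computed as sum(rank value * proportion of respondents giving it).
weighted_rank <- sapply(Dat[, feature_cols], function(x) {
  props <- prop.table(table(factor(x, levels = 1:5)))
  sum(1:5 * as.numeric(props))
})

print(mean_rank)
# Both routes should give the same numbers for every feature.
print(all.equal(mean_rank, weighted_rank[names(mean_rank)]))
```

The colour question works the same way after swapping in the colour column names.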


Hello FJCC,

Wow, thank you so much for the help.
I have tried this script and yes it works for my data. I am able to work out importance ranking for the attributes and preference ranking for the colours.

I do have another question please.
Now that we have identified the rankings for the 2 ranking questions.
What is the appropriate statistical test to use next please to determine if there is actually a difference in the respondents' ranking?
i.e.: H0: There is no difference in the respondents’ ranks.
H1: There is a difference in the respondents’ ranks.
How would I run this statistical test in R please?

Again, any help/feedback is appreciated.

Thank you.

Hi @R19,

You need to be a bit more precise about your null hypothesis. Do you mean testing differences in the mean of the respondents' ranks, or some other summary? If it is the mean (and in fact for any t-test), you can check the function t.test.


Hi Valeri,

Thank you for the reply.

I think I want to check if there is a difference between any of the respondent ranks. I have read some info that relates to testing if the ranking is uniform, i.e. another way of testing if there actually is a difference in the ranking. However, I'm not very familiar with this concept and am not sure of the name of this type of test or how to do this in R. Possibly this could be done via a chi-square test to test if the ranking is uniformly distributed, but again my beginner level means I am not sure how to actually do this in R with my data set for the 2 ranking questions.

Any help/suggestions are appreciated.

Thank you.

Hello again FJCC,

Again, thank you very much for your help.

As the mean ranks for some of the variables in the 2 ranking questions are quite close together, I think that I may get some overlapping confidence intervals for the mean ranking scores.

The problem is that I don't know how to calculate the 95% confidence intervals in R for the ranking solutions that you provided above. Are you able to please help with this?

Many thanks.

The Standard Error (SE) of the mean can be estimated by

data_standard_deviation/n^0.5

where n is the number of data points. You can calculate the standard deviation of the data at the same time as the average by changing the summarize function to

summarize(Avg = mean(Rank), StdDev = sd(Rank))

The 95% confidence limits can be calculated from

Mean +/- 1.96 * SE

I have never had to work with ranked data, so I may have missed some important aspect of the analysis.
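
As a rough sketch of the formulas above, the block below computes the standard error and both a normal-approximation interval (the 1.96 multiplier) and a t-based interval for one item. It uses the five 'model' ranks from the example data; the variable names are illustrative. Treating ranks as if they were independent numeric measurements is an assumption here, so the intervals are best read as a descriptive guide.

```r
# 'model' ranks from the five example respondents in this thread.
ranks <- c(4, 2, 1, 3, 2)

n   <- length(ranks)
avg <- mean(ranks)
se  <- sd(ranks) / sqrt(n)   # standard error of the mean

# Normal approximation: mean +/- 1.96 * SE.
ci_normal <- avg + c(-1, 1) * 1.96 * se

# t-based interval: wider for small n, nearly identical to the normal
# interval once n is as large as the full 264 respondents.
ci_t <- avg + c(-1, 1) * qt(0.975, df = n - 1) * se

print(c(mean = avg, se = se))
print(ci_normal)
print(ci_t)
```

With only five rows the t-based interval is noticeably wider; on the full data set the two should be close.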

Hi FJCC,

Thank you for the reply and advice. I've never worked with ranked data either, hence all my questions.

I've been able to work out the 95% confidence intervals using the Rmisc package and the CI function. I run the CI function on each of the variables and for each variable it displays the average mean of the variable (which is used to rank the variable) and the upper and lower level of the 95% CI of each variable.

As I mentioned in my reply to Valeri, I would like to now test if the ranking is uniform for the set of attribute variables, i.e. another way of testing if there actually is a difference in the ranking. I would also like to test this for the set of colour variables. However, I don't know how to do this in R. I think that this can be done via a chi-square test to test if the ranking is uniformly distributed, but again my beginner level means I don't know how to actually do this in R with the variables in the 2 ranking questions in my data set.

If anyone can please help with this, it is much appreciated.

Thank you.
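
One possible reading of "testing if the ranking is uniform" is a chi-square goodness-of-fit test for a single item: are the five rank positions chosen equally often for, say, red? The sketch below shows that reading with base R's chisq.test, using the five example rows only. This interpretation is an assumption, and the test looks at one item at a time rather than at all five items jointly.

```r
# Ranks given to 'red' by the five example respondents.
red <- c(4L, 1L, 2L, 2L, 4L)

# Count how often each rank position 1..5 was used (zero counts are kept).
counts <- table(factor(red, levels = 1:5))

# Goodness-of-fit test against equal probability for each rank position.
# With only five rows the expected counts are tiny, so R warns that the
# chi-square approximation may be poor; the warning is silenced here only
# to keep the sketch quiet. On the full 264 rows the expected count per
# rank would be 52.8 and the warning should not appear.
result <- suppressWarnings(chisq.test(counts, p = rep(1 / 5, 5)))

print(counts)
print(result)
```

A small p-value would suggest that respondents do not spread their ranks for that item evenly across positions 1 to 5.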

I am still not quite sure what the formal null hypothesis is here. I have an example which may help you if that is the type of hypothesis you are trying to test. In my case I have 2 groups, A and B, which could be different genders, age groups, etc. They have some data (responses) associated with them, which could be answers, their blood pressure, anything really. I have simply drawn some random numbers with a different mean for each group in my example. Now, my null hypothesis here is that there is no difference in the mean response: H_0: \mu_A = \mu_B, where \mu is the population mean. You would then test this as follows:

set.seed(1)

responses_A <- rnorm(100, 0, 1) # Group A has a mean 0
responses_B <- rnorm(100, 1, 1) # Group B has a mean 1

t_test <- t.test(responses_A, responses_B)
t_test
#> 
#>  Welch Two Sample t-test
#> 
#> data:  responses_A and responses_B
#> t = -6.4983, df = 197.19, p-value = 6.492e-10
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -1.1122615 -0.5943477
#> sample estimates:
#> mean of x mean of y 
#> 0.1088874 0.9621919

The test statistic is t = -6.4983 with a p-value of 6.492e-10, so in this case, as expected, the means in groups A and B are statistically different (at any conventional significance level), i.e. the null hypothesis is rejected.
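
The t-test above compares two independent groups. In the ranking questions from this thread, every respondent ranks all five items, so the items are not independent samples. A technique often used for that design, though not one proposed in this thread, is the Friedman test, available in base R as friedman.test. The sketch below applies it to the five example rows, assuming each row is one respondent and each column is one ranked item.

```r
# Rows are respondents, columns are the ranked items (example data from the thread).
ranks <- cbind(
  model     = c(4, 2, 1, 3, 2),
  location  = c(1, 3, 2, 2, 3),
  education = c(3, 1, 3, 1, 1),
  fee       = c(2, 5, 4, 5, 4),
  income    = c(5, 4, 5, 4, 5)
)

# H0: the items receive the same ranks on average across respondents.
# friedman.test treats each matrix row as a block and each column as a group.
ft <- friedman.test(ranks)
print(ft)
```

Five respondents are far too few for a dependable conclusion; the call is meant to be rerun on the full 264-row matrix of ranking columns.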

A post was split to a new topic: Test for uniform ranking (Likert-type data)
