Iterate Calculations for Multiple Data Tables to New Data Table

I am working with some sensitive student data so I apologize for not posting it here. I will explain my process here though. I am new to R so I would expect that my process probably needs some revision in order to make it more efficient. Therefore, I am very open to changing the methods here.

I have nine datasets for each year of several that I am working with. I would like to calculate several statistics from each data table and store it in a new data table. Here is a list of the names of each data table for the year 2018-19 which will be the rows:

grtbl1819
grtblF1819
grtblS1819
grtblF1819reqs
grtblS1819reqs
grtblF1819elec
grtblS1819elec
grtblF1819APs
grtblS1819APs

Each of these data tables have the same variables, I am mainly using the "percent" variable for most of these calculations. I would like to store the results of these calculations in a new data table (rows represent each data table listed above and columns represent these calculations). This list below is the calculations for just the first data table above. These below will be the columns.

no_grtbl1819 <- nrow(grtbl1819)
IQR_grtbl1819 <- mean(grtbl1819$percent)
25perc_grtbl1819 <- quantile(grtbl1819$percent)[2]
75perc_grtbl1819 <- quantile(grtbl1819$percent)[4]
extlow_grtbl1819 <- 25perc_grtbl1819 - (1.5 * IQR_grtbl1819)
outliers_grtbl1819 <- filter(grtbl1819, percent <= extlow_grtbl1819)
nooutliers_grtbl1819 <- nrow(outliers_grtbl1819)
meanoutliers_grtbl1819 <- mean(outliers_grtbl1819$percent)

What would be the best approach to produce a new data table with these new variables for each row listed above? In the long run, I also want to include other year's data in this data table as well.

I think we need some sample data, perhaps a couple of very simple mock data tables and a bit of code. At the moment we have no idea of what format the data is in (tibble, data.frame, list, matrix?) and I am not sure that some of your code will run. For example, I do not belief that one can start a variable name with a numeric.

25perc_grtbl1819 <- quantile(grtbl1819$percent)[2]

There are probably at least four or five ways to approach the problem but we need a bit more information about the data structure in particular. Come to think of it, are the data tables even currently in an R format or do we need to read them in?

1 Like

Fair enough. I cleaned any individual identifiers off the data table and loaded everything into this workspace. I created all the data tables for school year 18-19 using the "TidyTheData.R" script. I began making the variables for the table I want to build in the other script.

I am completely open to new ways of building the entire project, including redoing the first script.

It looks like I managed to grab the ,csv data file and the TestGrades.r file and I'll have a look but it's about 17:30 here so I may not get much done until tomorrow.

With any luck some real R guru may have the perfect solution before then :slight_smile:

John

For some reason you did not rename “grading.termName' so I did “term = grading.termName”.

I do not understand what you are doing here.


IQR_grtbl1819 <- mean(grtbl1819$percent)

extlow_grtbl1819 <- 25perc_grtbl1819 - (1.5 * IQR_grtbl1819)

An IQR is not calculated using a mean.

I am having a problem with


grtblF1819elec = filter(grtblF1819, courseNo == "SST420" | courseNo == "SST422" | courseNo == "SST430" | courseNo == "SST408" | courseNo == "SST325" | courseNo == "SST345" | courseNo == "SST306" | courseNo == "SST425" | courseNo == "SST451") grtblS1819elec = filter(grtblS1819, courseNo == "SST421" | courseNo == "SST423" | courseNo == "SST431" | courseNo == "SST409" | courseNo == "SST325" | courseNo == "SST345" | courseNo == "SST306" | courseNo == "SST425" | courseNo == "SST451") 

grtblF1819APs = filter(grtblF1819, courseNo == "SST420" | courseNo == "SST422" | courseNo == "SST430" | courseNo == "SST408" | courseNo == "SST451") grtblS1819APs = filter(grtblS1819, courseNo == "SST421" | courseNo == "SST423" | courseNo == "SST431" | courseNo == "SST409" | courseNo == "SST451") 

It appears that grtblF1819elec and grtblF1819APs are the same courses. Are elecs and APs actually the same?

I was trying to recode some of the data in hopes of not having to do as many tables but discovered that AP and electives appear ta be the same.

It seemed to me that it would be better to keep the raw data in one data.frame/tibble. Here is what I was playing around with when I suddenly discovered this,

library(tidyverse); library(reshape2); library(lubridate); 
## read data in as a tibble

dat1 <- read_csv("grades1819.csv")

dat1 <- rename(dat1,
                 ID = student.enrollmentID,
                 gender = student.gender,
                 grade = student.grade,
                 course_no = grading.courseNumber,
                 course_name = grading.courseName,
                 teacher = grading.teacherDisplay,
                 term = grading.termName,
                 letter_grd = grading.progressScore,
                 percent = grading.progressPercent,
)



# ID is not a numeric so I converted it to "character".
dat1$ID  <-  as.character(dat1$ID)


# generate a few stats


dat1 %>%  
            group_by(term, course_no)   %>%  
             summarise(n = n(), mean = mean(percent), sd = sd(percent), median = median(percent),  IQR = IQR(percent),
                       Upper =  quantile(percent)[4] + 1.5 * IQR(percent), Lower =  quantile(percent)[2] - 1.5 * IQR(percent) )
 
# create course status variable


 dat1  <-   dat1   %>%  mutate(status = case_when(course_no == "SST110" | course_no == "SST210" | course_no == "SST310" | course_no == "SST415" |
       course_no == "SST111" | course_no == "SST211" | course_no == "SST311" | course_no == "SST415" ~ "req",
       course_no == "SST420" | course_no == "SST422" | course_no == "SST430" | course_no == "SST408" | course_no == "SST325" | course_no == "SST345" |
       course_no == "SST306" | course_no == "SST425" | course_no == "SST451"|course_no == "SST421" | course_no == "SST423" | course_no == "SST431" | 
       course_no == "SST409" |course_no == "SST325" | course_no == "SST345" | course_no == "SST306" | course_no == "SST425" | course_no == "SST451"~ "elec",)
       )
 

Every R problem can be thought of to advantage as the interaction of three objects: an existing object, x , a desired object, y , and a function,f that will return a value of y given x as an argument. In R everything is an object and objects may contain other objects. Functions are first class objects, just as in f(g(x)).

With f(x)=y in mind, this problem has

x, a data frame of student records arranged in a data frame by course and semester and, optionally year. If there are separate data frames, additional preprocessing will be needed to use the script below.

y, a data frame containing summary statistics by course and semester for n, the number of students with grading.progressPercent below a cutoff and the mean of grading.progressPercent for such students. The cutoff is based on those values greater than the second quartile minus 1.5 times the interquartile difference of the fourth and second quartile. NOTE: this is the Tukey IQR implemented in the fivenum function. The IQR function uses a different algorithm. Cf. the difference between the boxplot shown from ggplot2::geom_boxplot with that returned by the boxplot function.

f could be created b composing the steps below into a single function, but it is preferable to leave them standalone to be able to follow the composition step by step.

In outline:

  1. Read in x and discard variables not relevant to the task
  2. Create a composite course/semester variable
  3. Determine the cutoff for classifying the variable
  4. Identify the percent values less than that cutoff
  5. Take the mean of those values
  6. Roll everything into a single data frame, keeping a list for each course/semester combination of the percent values

An intermediate byproduct of the aggregation method is the creation of multi-level objects. Hence, the ugly subset operators. It will be worthwhile reviewing the syntax to see when to use object[row,column], object[element], object[row,] (all columns) and object[[1]], etc. Once internalized, it simplifies this class of problem.

There is an unresolved situation with course_nest[5,] that should be hand-checked.

suppressPackageStartupMessages({
  library(dplyr)
  library(ggplot2)
  library(purrr)
  library(tibble)
})

# identify points more than 1.5 IQR less than lower hinge, using
# Tukey method

get_cutoff <- function(x) {
  fivenum(x)[2] - 
    (fivenum(x)[4] - fivenum(x)[2]) * 1.5
}

get_course_cutoff <- function(x) get_cutoff(course_nest$data[[x]][[1]])

get_out_mean <- function(x) {
  mean(course_nest$data[[x]][[1]][which(course_nest$data[[x]][[1]] < course_nest[3][[1]][x])], na.rm = TRUE)
}

get_out_no <- function(x) {
  length(which(course_nest$data[[x]][[1]] < course_nest[3][[1]]))
}

# data downloaded from "this workspace by OP" 

readr::read_csv("/home/roc/Desktop/grist.csv") %>% 
  mutate(course = 
          paste0(grading.courseNumber,"_",grading.termName)) %>% 
  rename(percent = grading.progressPercent) %>% 
  select(course, percent) -> x
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   student.enrollmentID = col_double(),
#>   student.gender = col_character(),
#>   student.grade = col_double(),
#>   grading.courseNumber = col_character(),
#>   grading.courseName = col_character(),
#>   grading.teacherDisplay = col_character(),
#>   grading.termName = col_character(),
#>   grading.progressScore = col_character(),
#>   grading.progressPercent = col_double()
#> )

# display global outliers

ggplot(x,aes(percent)) + 
  geom_boxplot() + 
  coord_flip() + 
  theme_minimal()

# extract lower outlier threshold of all records, if needed

global_cut_off <- get_cutoff(x$percent)


x %>% nest_by(course) -> course_nest
map(1:nrow(course_nest),get_course_cutoff) %>% 
  unlist() %>% 
  add_column(cutoff =., course_nest) -> course_nest
map(1:nrow(course_nest),get_out_no) %>% 
  unlist() %>% 
  add_column(n =., course_nest) -> course_nest
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length

#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length

#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length

#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length

#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length

#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length

#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length

#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length

#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length

#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length

#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length

#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length

#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length

#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length

#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length

#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length

#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length

#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length

#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length

#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length

#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
map(1:nrow(course_nest),get_out_mean) %>% 
  unlist() %>% 
  add_column(mu =., course_nest) -> course_nest
course_nest %>% mutate(mu = ifelse(is.nan(mu),NA,mu)) -> y

y %>% print(n = Inf)
#> # A tibble: 23 x 5
#> # Rowwise:  course
#>    course                  data cutoff     n    mu
#>    <chr>     <list<tbl_df[,1]>>  <dbl> <int> <dbl>
#>  1 SST110_S1          [440 × 1]   34.9    37  27.2
#>  2 SST111_S2          [431 × 1]   37.9    32  27.6
#>  3 SST210_S1          [339 × 1]   41.0    24  30.4
#>  4 SST211_S2          [335 × 1]   38.0    28  22.1
#>  5 SST306_S2           [13 × 1]  -49.6     6  NA  
#>  6 SST310_S1          [198 × 1]   43.7    11  30.8
#>  7 SST311_S2          [204 × 1]   43.1    15  26.9
#>  8 SST325_S1           [58 × 1]   59.6     0  54.2
#>  9 SST325_S2           [88 × 1]   38.4     5  27.9
#> 10 SST345_S1           [62 × 1]   42.8     3  38.1
#> 11 SST345_S2           [28 × 1]   51.7     0  NA  
#> 12 SST408_S1           [62 × 1]   61.0     1  48.2
#> 13 SST409_S2           [61 × 1]   60.3     2  37.0
#> 14 SST415_S1           [92 × 1]   42.4     4  34.2
#> 15 SST415_S2          [162 × 1]   48.7     3  21.5
#> 16 SST420_S1           [25 × 1]   45.3     1  38.8
#> 17 SST421_S2           [23 × 1]   38.7     1  NA  
#> 18 SST422_S1           [57 × 1]   53.1     3  40.3
#> 19 SST423_S2           [47 × 1]   63.6     2  51.2
#> 20 SST425_S1           [38 × 1]   63.7     0  NA  
#> 21 SST430_S1           [38 × 1]   60.9     0  57.4
#> 22 SST431_S2           [39 × 1]   60.1     0  58.2
#> 23 SST451_S1           [41 × 1]   59.1     0  NA

Created on 2020-12-29 by the reprex package (v0.3.0.9001)

1 Like

I did not rename "grading.termName" because I delete that variable later when the tables are split by term.

As for the IQR and lower quartile calculations, I caught that after I posted but not before you downloaded the script. I corrected that as I accidentally copied my formulas incorrectly. The workspace now reflects the change.

AP courses are electives so they are included in that group but there are non-AP electives as well.

Just to clarify, "cutoff" is 1.5 x IQR below the 25th percentile and "n" is the number of outliers, correct?

That's right for both. BTW, these are calculated separately for each course/semester record; the global is only for reference.

Does this give you some of what you want? Note from the boxpot technocrat supplied there does not seem any outliers above the upper hinge.

dat1 <- read_csv("grades1819.csv")

dat1 <- rename(dat1,
                 ID = student.enrollmentID,
                 gender = student.gender,
                 grade = student.grade,
                 course_no = grading.courseNumber,
                 course_name = grading.courseName,
                 teacher = grading.teacherDisplay,
                 term = grading.termName,
                 letter_grd = grading.progressScore,
                 percent = grading.progressPercent,
)



# ID is not a numeric so I converted it to "character".
dat1$ID  <-  as.character(dat1$ID)


# generate a few stats
## edited below to correct a typo,

dat2 <-  dat1 %>%  
            group_by(term, course_no)   %>%  
             summarise(n = n(), mean = mean(percent), sd = sd(percent), median = median(percent),  IQR = IQR(percent),
                       Upper =  quantile(percent)[4] + 1.5 * IQR(percent), Lower =  quantile(percent)[2] - 1.5 * IQR(percent),
                       N_Below = length(which(percent < Lower)
                       )
             )

dat2   %>%  print(n = Inf)

BTW my thanks to Technocrat for that print statement.

1 Like

This is definitely what I needed. I will dig in to this a little more but I think I understand most everything so that I can replicate this for future years and even add more variables if that becomes necessary.

Yes, there will not be any extreme outliers above the range as very few students earn any scores above 100%.

Thank you both Jrkrideau and Technocrat for your guidance!

1 Like

I realised earlier today that there was an error in the program I supplied yesterday.

I seemed to have copied a bit of an early debugging version. I corrected it today but just to make sure earlier readers are aware the corrected version.
The corrected verrsion should read

library(tidyverse); library(reshape2); library(lubridate);
## read data in as a tibble

dat1 <- read_csv("grades1819.csv")

dat1 <- rename(dat1,
                 ID = student.enrollmentID,
                 gender = student.gender,
                 grade = student.grade,
                 course_no = grading.courseNumber,
                 course_name = grading.courseName,
                 teacher = grading.teacherDisplay,
                 term = grading.termName,
                 letter_grd = grading.progressScore,
                 percent = grading.progressPercent,
)



# ID is not a numeric so I converted it to "character".
dat1$ID  <-  as.character(dat1$ID)


# generate a few stats



dat2 <-  dat1 %>%  
            group_by(term, course_no)   %>%  
             summarise(n = n(), mean = mean(percent), sd = sd(percent), median = median(percent),  IQR = IQR(percent),
                       Upper =  quantile(percent)[4] + 1.5 * IQR(percent), Lower =  quantile(percent)[2] - 1.5 * IQR(percent),
                       N_Below = length(which(percent < Lower)
                       )
             )

dat2   %>%  print(n = Inf)


The last calculation that I need to make is to determine the number of As, Bs, Cs, Ds and Fs in each course. Essentially, I just want to build on the suggestions here to add additional columns to the data table course_nest.

Technocrat created this code to build a new table, course_nest, that includes data by course number:

x %>% nest_by(course) -> course_nest
map(1:nrow(course_nest),get_course_cutoff) %>%
  unlist() %>%
  add_column(cutoff =., course_nest) -> course_nest
map(1:nrow(course_nest),get_out_no) %>%
  unlist() %>%
  add_column(n =., course_nest) -> course_nest

Output:

# A tibble: 23 x 5
# Rowwise:  course
   course                  data cutoff     n    mu
   <chr>     <list<tbl_df[,1]>>  <dbl> <int> <dbl>
 1 SST110_S1          [440 x 1]   34.9    28  27.2
 2 SST111_S2          [431 x 1]   37.9    28  27.6
 3 SST210_S1          [339 x 1]   41.0    18  30.4
 4 SST211_S2          [335 x 1]   38.0    23  22.1
 5 SST306_S2           [13 x 1]  -49.6     6 NaN  
 6 SST310_S1          [198 x 1]   43.7     9  30.8
 7 SST311_S2          [204 x 1]   43.1    12  26.9
 8 SST325_S1           [58 x 1]   59.6     1  54.2
 9 SST325_S2           [88 x 1]   38.4     5  27.9
10 SST345_S1           [62 x 1]   42.8     3  38.1
# ... with 13 more rows

This works great and I am thinking that I can adapt it to add some more columns. I built another function in order to count the number of As in a table here:

get_as <- function(x) {
  sum(between(z, 89.5, 105))
}

My thinking is that this will count the number of grades from 89.5% to 105%. When I try to use it with the x data table I am getting this error:

Error in between(z, 89.5, 105) : 
  'list' object cannot be coerced to type 'double'
In addition: Warning message:
 Error in between(z, 89.5, 105) : 
  'list' object cannot be coerced to type 'double'

Seems like it is saying that I cannot use a function which requires a type 'double' with a list of strings. I am not sure how to work around this though. Can someone please give me an idea of where I am going wrong here?

I'll look at this later tonight. The essence of the problem is that z is buried in the data column, which is a list of tibbles, which contain columns, and somewhere is a list object between get_as and the z variable. So the challenge is to dig it out.

Could you post the output of

str(x[1,2])

What are the actual ranges for the letter grades ?

How many letters are there?

Sorry, z is a mistake. I created another table using that object and forgot to switch back to x as you had in yours.

 $ percent: num 92.5```

Cutoffs would be:

89.5 and above - A
79.5 to 89.49 - B
69.5 to 79.49 - C
59.5 to 69.49 - D
below 59.5 - F

I think if I understand how to make one column I should be able to make the rest.

I modified that function to this:

get_as <- function(x) { sum(between(x, 89.5, 105)) }

Then ran this code block without an error and created a new data table, course_grades:

x %>% nest_by(course) -> course_grades
map(1:nrow(course_grades),get_as) %>%
  unlist() %>%
  add_column(A_Grades =., course_grades) -> course_grades

But the result was a new column with a sum of 0 for every row:

# A tibble: 23 x 3
# Rowwise:  course
   course                  data A_Grades
   <chr>     <list<tbl_df[,1]>>    <int>
 1 SST110_S1          [440 x 1]        0
 2 SST111_S2          [431 x 1]        0
 3 SST210_S1          [339 x 1]        0
 4 SST211_S2          [335 x 1]        0
 5 SST306_S2           [13 x 1]        0
 6 SST310_S1          [198 x 1]        0
 7 SST311_S2          [204 x 1]        0
 8 SST325_S1           [58 x 1]        0
 9 SST325_S2           [88 x 1]        0
10 SST345_S1           [62 x 1]        0
# ... with 13 more rows

I also get a warning when I just run this get_as(x[2,2]):

[1] 0
Warning message:
between() called on numeric vector with S3 class

It occurred to me we already have letter grades, in a variable named letter_grd.

If we strip off the pluses and minuses perhaps we can just make a table. So....

It looks like the table is including every course for every term, even if it was not offered which I suppose makes sense in term of a data table.

Anyway is this of use?

Forgot, the output is a list, not a data.frame.

library(tidyverse); library(reshape2); library(lubridate);
## read data in as a tibble

dat1 <- read_csv("grades1819.csv")

dat1 <- rename(dat1,
                 ID = student.enrollmentID,
                 gender = student.gender,
                 grade = student.grade,
                 course_no = grading.courseNumber,
                 course_name = grading.courseName,
                 teacher = grading.teacherDisplay,
                 term = grading.termName,
                 letter_grd = grading.progressScore,
                 percent = grading.progressPercent,
)



# ID is not a numeric so I converted it to "character".
dat1$ID  <-  as.character(dat1$ID)

library(stringr)  
library(janitor)

# subset dat1 so we don't mess up our main data.frame
dat3  <-  dat1  %>%  
             select(term,  course_no, letter_grd)

 # Strip off the pluses and minuses
dat3$letter_grd  <- str_sub(dat3$letter_grd, 1, 1)


letter_grades  <-   dat3   %>%  tabyl(course_no, letter_grd, term)
 

I had thought of this also. However, not all letter grades are perfectly aligned with the categories listed above. On occasion a student may receive a slightly higher letter grade than their percentage should earn them. As a result, it is not quite an accurate measure. Also, down the road I may want to look at other ranges that do not fall into the predefined letter categories.