Every `R`

problem can be thought of to advantage as the interaction of three objects: an existing object, x , a desired object, y , and a function,f that will return a value of y given x as an argument. In `R`

*everything* is an object and objects may contain other objects. Functions are first class objects, just as in f(g(x)).

With f(x)=y in mind, this problem has

x, a data frame of student records arranged in a `data frame`

by course and semester and, optionally year. If there are separate `data frames`

, additional preprocessing will be needed to use the script below.

y, a `data frame`

containing summary statistics by course and semester for `n`

, the number of students with `grading.progressPercent`

below a cutoff and the mean of `grading.progressPercent`

for such students. The cutoff is based on those values greater than the second quartile minus 1.5 times the interquartile difference of the fourth and second quartile. NOTE: this is the Tukey IQR implemented in the `fivenum`

function. The `IQR`

function uses a different algorithm. *Cf.* the difference between the boxplot shown from `ggplot2::geom_boxplot`

with that returned by the `boxplot`

function.

f could be created b composing the steps below into a single function, but it is preferable to leave them standalone to be able to follow the composition step by step.

In outline:

- Read in x and discard variables not relevant to the task
- Create a composite course/semester variable
- Determine the cutoff for classifying the
`variable`

- Identify the
`percent`

values less than that cutoff
- Take the mean of those values
- Roll everything into a single data frame, keeping a list for each course/semester combination of the
`percent`

values

An intermediate byproduct of the aggregation method is the creation of multi-level objects. Hence, the ugly subset operators. It will be worthwhile reviewing the syntax to see when to use object[row,column], object[element], object[row,] (all columns) and object[[1]], etc. Once internalized, it simplifies this class of problem.

There is an unresolved situation with `course_nest[5,]`

that should be hand-checked.

```
suppressPackageStartupMessages({
library(dplyr)
library(ggplot2)
library(purrr)
library(tibble)
})
# identify points more than 1.5 IQR less than lower hinge, using
# Tukey method
get_cutoff <- function(x) {
fivenum(x)[2] -
(fivenum(x)[4] - fivenum(x)[2]) * 1.5
}
get_course_cutoff <- function(x) get_cutoff(course_nest$data[[x]][[1]])
get_out_mean <- function(x) {
mean(course_nest$data[[x]][[1]][which(course_nest$data[[x]][[1]] < course_nest[3][[1]][x])], na.rm = TRUE)
}
get_out_no <- function(x) {
length(which(course_nest$data[[x]][[1]] < course_nest[3][[1]]))
}
# data downloaded from "this workspace by OP"
readr::read_csv("/home/roc/Desktop/grist.csv") %>%
mutate(course =
paste0(grading.courseNumber,"_",grading.termName)) %>%
rename(percent = grading.progressPercent) %>%
select(course, percent) -> x
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> student.enrollmentID = col_double(),
#> student.gender = col_character(),
#> student.grade = col_double(),
#> grading.courseNumber = col_character(),
#> grading.courseName = col_character(),
#> grading.teacherDisplay = col_character(),
#> grading.termName = col_character(),
#> grading.progressScore = col_character(),
#> grading.progressPercent = col_double()
#> )
# display global outliers
ggplot(x,aes(percent)) +
geom_boxplot() +
coord_flip() +
theme_minimal()
```

```
# extract lower outlier threshold of all records, if needed
global_cut_off <- get_cutoff(x$percent)
x %>% nest_by(course) -> course_nest
map(1:nrow(course_nest),get_course_cutoff) %>%
unlist() %>%
add_column(cutoff =., course_nest) -> course_nest
map(1:nrow(course_nest),get_out_no) %>%
unlist() %>%
add_column(n =., course_nest) -> course_nest
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
#> Warning in course_nest$data[[x]][[1]] < course_nest[3][[1]]: longer object
#> length is not a multiple of shorter object length
map(1:nrow(course_nest),get_out_mean) %>%
unlist() %>%
add_column(mu =., course_nest) -> course_nest
course_nest %>% mutate(mu = ifelse(is.nan(mu),NA,mu)) -> y
y %>% print(n = Inf)
#> # A tibble: 23 x 5
#> # Rowwise: course
#> course data cutoff n mu
#> <chr> <list<tbl_df[,1]>> <dbl> <int> <dbl>
#> 1 SST110_S1 [440 × 1] 34.9 37 27.2
#> 2 SST111_S2 [431 × 1] 37.9 32 27.6
#> 3 SST210_S1 [339 × 1] 41.0 24 30.4
#> 4 SST211_S2 [335 × 1] 38.0 28 22.1
#> 5 SST306_S2 [13 × 1] -49.6 6 NA
#> 6 SST310_S1 [198 × 1] 43.7 11 30.8
#> 7 SST311_S2 [204 × 1] 43.1 15 26.9
#> 8 SST325_S1 [58 × 1] 59.6 0 54.2
#> 9 SST325_S2 [88 × 1] 38.4 5 27.9
#> 10 SST345_S1 [62 × 1] 42.8 3 38.1
#> 11 SST345_S2 [28 × 1] 51.7 0 NA
#> 12 SST408_S1 [62 × 1] 61.0 1 48.2
#> 13 SST409_S2 [61 × 1] 60.3 2 37.0
#> 14 SST415_S1 [92 × 1] 42.4 4 34.2
#> 15 SST415_S2 [162 × 1] 48.7 3 21.5
#> 16 SST420_S1 [25 × 1] 45.3 1 38.8
#> 17 SST421_S2 [23 × 1] 38.7 1 NA
#> 18 SST422_S1 [57 × 1] 53.1 3 40.3
#> 19 SST423_S2 [47 × 1] 63.6 2 51.2
#> 20 SST425_S1 [38 × 1] 63.7 0 NA
#> 21 SST430_S1 [38 × 1] 60.9 0 57.4
#> 22 SST431_S2 [39 × 1] 60.1 0 58.2
#> 23 SST451_S1 [41 × 1] 59.1 0 NA
```

^{Created on 2020-12-29 by the reprex package (v0.3.0.9001)}