To divide a dataset or not: How to approach multiple analyses from one data frame

Seeking a nudge in the right direction here.

I'm a novice R programmer who's leaped into it and the tidyverse with gusto. I've created a project for myself, the goal of which is to build a system for comparing archers' results across a variety of archery disciplines. To that end, I've assembled a sample of approximately 11,000 individual scores from the last three years of some major archery competitions in the U.S. The dataset can be downloaded from my website here: 2016-2018_all_scores.csv

Once downloaded, the following code loads the dataset and sets the appropriate column names:

library(tidyverse)

results <- read_csv("2016-2018_all_scores.csv",
                    col_names = c("year", "event", "class", "division", "gender", "org", "round", "score"),
                    col_types = "fffffffd")

The columns of most interest are class, round, and score. I need to analyze the scores for each archery round and equipment class in the dataset. There are 17 unique archery rounds and 2 equipment classes, for a total of 34 separate analyses. For each of the 34 combinations I need to do the following (a rough sketch for a single combination appears just after this list):

  1. Remove the scores below the 10th percentile and above the 90th percentile.
  2. Calculate a CDF for the remaining values.
  3. Fit a linear or exponential model to the CDF (depending on how the curve looks).
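
For reference, here's roughly what those three steps might look like for a single round/class combination (the "Vegas" and "Compound" values are just one combination that appears in the data, the CDF is a base-R ecdf, and only the linear fit is shown):

one_combo <- results %>%
  filter(round == "Vegas", class == "Compound")

# Step 1: drop scores below the 10th and above the 90th percentile
trimmed <- one_combo %>%
  filter(score > quantile(score, 0.1), score < quantile(score, 0.9))

# Step 2: empirical CDF of the remaining scores
cdf_fun <- ecdf(trimmed$score)
cdf_df <- tibble(score = sort(trimmed$score),
                 p = cdf_fun(sort(trimmed$score)))

# Step 3: fit a model to the CDF (linear shown; an exponential shape would need something like nls())
fit <- lm(p ~ score, data = cdf_df)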

I've described this process in more detail in a short article, The Performance Method: An Improved Archer Ranking System for Determining a “Shooter of the Year”. (All created in an R Notebook. What an amazing tool!)

The big question

My first instinct was to use dplyr's filter and mutate to slice and dice myself 34 data frames out of the original dataset, which would then be further processed to arrive at each individual model.

Realization #1: I hate repeating myself

The thought of typing out all of the piped commands to create those 34 data frames makes even my novice R programmer skin crawl. There has to be a better way.

Realization #2: Apply functions (including purrr) are cool

I can see in principle how useful it would be to apply the functional programming approach to this problem. Is it possible to programmatically create those 34 data frames by creating a function to filter the dataset? Here's a function that does the job.

# Filter dataset to get results for a specific round and equipment class
# The divisor parameter converts a two-day event to a one-day event (e.g., divisor = 2)
filter_results <- function(results, archery_round, equipment_class, divisor = 1) {
  df <- results %>%
    filter(round == archery_round,
           class == equipment_class) %>%
    mutate(score = score / divisor)

  return(df)
}
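
Calling it for a single combination is straightforward enough (the round and class values here are just examples from the data, and divisor = 2 is only there to illustrate the two-day conversion):

field_compound <- filter_results(results, "Field", "Compound")
vegas_compound <- filter_results(results, "Vegas", "Compound", divisor = 2)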

But how do I apply that function to create 34 new data frames that don't already exist? I'm stuck.

Realization #3: Do I need 34 data frames after all?

Can I filter, trim, create the CDFs and build those models without creating separate data frames? WWHD? (What Would Hadley Do?)

That's where I find myself. I could have done everything the manual way by now, but I wouldn't learn anything new doing it that way.

So I'm asking the experienced R programmers here: how would you do it? What path would you send me down?

I think a good place to start is the "Many models" chapter in R4DS: https://r4ds.had.co.nz/many-models.html

The main idea is that you definitely want to group by your 2 columns and get 34 rows in a data frame that you can purrr over with the function you need. As a piece of advice, start with the simplest function (i.e., the first step of your three-step process) to get a feel for what is happening before moving on with the rest of the data cleansing.
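
As a minimal sketch of that first step only (assuming the column names from your read_csv call), something like this nests each round/class subset and trims it:

library(tidyverse)

results %>%
  group_by(round, class) %>%
  nest() %>%
  mutate(trimmed = map(data, ~ filter(.x,
                                      score > quantile(score, 0.1),
                                      score < quantile(score, 0.9))))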


What I'd suggest:

  1. Write a function that takes a subset of your data and returns a model
  2. Create a new dataset where each "observation unit" is a class/round combination, with another column being a list containing the respective subsets.
  3. Apply the analysis function from step 1 along the list column, saving the results in another list column.

An example with the CO2 dataset, using tidyr::nest to create the subdata column and modeling the relationship between conc and uptake within each Plant/Type group:

library(dplyr)
library(tidyr)

head(CO2)

#   Plant   Type  Treatment conc uptake
# 1   Qn1 Quebec nonchilled   95   16.0
# 2   Qn1 Quebec nonchilled  175   30.4
# 3   Qn1 Quebec nonchilled  250   34.8
# 4   Qn1 Quebec nonchilled  350   37.2
# 5   Qn1 Quebec nonchilled  500   35.3
# 6   Qn1 Quebec nonchilled  675   39.2

model_relation <- function(subdata) {
  lm(uptake ~ conc, data = subdata)
}

analysis <- CO2 %>%
  tidyr::nest(-Plant, -Type, .key = "subdata") %>%
  mutate(model = lapply(subdata, model_relation))

as_tibble(analysis)
# # A tibble: 12 x 4
#    Plant Type        subdata              model   
#    <ord> <fct>       <list>               <list>  
#  1 Qn1   Quebec      <data.frame [7 x 3]> <S3: lm>
#  2 Qn2   Quebec      <data.frame [7 x 3]> <S3: lm>
#  3 Qn3   Quebec      <data.frame [7 x 3]> <S3: lm>
#  4 Qc1   Quebec      <data.frame [7 x 3]> <S3: lm>
#  5 Qc2   Quebec      <data.frame [7 x 3]> <S3: lm>
#  6 Qc3   Quebec      <data.frame [7 x 3]> <S3: lm>
#  7 Mn1   Mississippi <data.frame [7 x 3]> <S3: lm>
#  8 Mn2   Mississippi <data.frame [7 x 3]> <S3: lm>
#  9 Mn3   Mississippi <data.frame [7 x 3]> <S3: lm>
# 10 Mc1   Mississippi <data.frame [7 x 3]> <S3: lm>
# 11 Mc2   Mississippi <data.frame [7 x 3]> <S3: lm>
# 12 Mc3   Mississippi <data.frame [7 x 3]> <S3: lm>

With list columns, functions like purrr::map and purrr::pluck help with any subsequent manipulations.
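
For example, continuing with the analysis object from above (the slope extraction is just an illustration):

# Pull the conc coefficient out of each fitted model into a regular numeric column
analysis %>%
  mutate(conc_slope = purrr::map_dbl(model, ~ coef(.x)[["conc"]]))

# Or reach into a single nested element directly
purrr::pluck(analysis, "model", 1)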


This example, using your data, could be a starting point for you as well as a reprex for your question.

library(tidyverse)

results <- read_csv(url("http://archerytoolkit.net/soty/2016-2018_all_scores.csv"),
                    col_names = c("year", "event", "class", "division", "gender", "org", "round", "score"),
                    col_types = "fffffffd")
results %>% 
    group_by(round, class) %>%
    filter(score > quantile(score, 0.1), score < quantile(score, 0.9)) %>% # Filter by percentiles
    mutate(n = row_number()) %>%
    group_nest() %>% # Nest your data into individual dataframes per group
    mutate(model = map(data, ~lm(score ~ n, data = .))) # Apply a linear model to each group
#> # A tibble: 20 x 4
#>    round               class    data                 model   
#>    <fct>               <fct>    <list>               <list>  
#>  1 Field               Compound <tibble [246 × 7]>   <S3: lm>
#>  2 Field               Recurve  <tibble [12 × 7]>    <S3: lm>
#>  3 Hunter              Compound <tibble [244 × 7]>   <S3: lm>
#>  4 Hunter              Recurve  <tibble [12 × 7]>    <S3: lm>
#>  5 Animal              Compound <tibble [242 × 7]>   <S3: lm>
#>  6 Animal              Recurve  <tibble [38 × 7]>    <S3: lm>
#>  7 NFAA Indoor         Compound <tibble [1,120 × 7]> <S3: lm>
#>  8 NFAA Indoor         Recurve  <tibble [134 × 7]>   <S3: lm>
#>  9 NFAA 900            Compound <tibble [95 × 7]>    <S3: lm>
#> 10 NFAA 900            Recurve  <tibble [16 × 7]>    <S3: lm>
#> 11 NFAA 600            Compound <tibble [101 × 7]>   <S3: lm>
#> 12 NFAA 600            Recurve  <tibble [16 × 7]>    <S3: lm>
#> 13 USA Archery Indoor  Compound <tibble [702 × 7]>   <S3: lm>
#> 14 USA Archery Indoor  Recurve  <tibble [926 × 7]>   <S3: lm>
#> 15 USA Archery Outdoor Compound <tibble [169 × 7]>   <S3: lm>
#> 16 USA Archery Outdoor Recurve  <tibble [255 × 7]>   <S3: lm>
#> 17 Vegas               Compound <tibble [3,587 × 7]> <S3: lm>
#> 18 Vegas               Recurve  <tibble [1,042 × 7]> <S3: lm>
#> 19 25-Meter            Compound <tibble [231 × 7]>   <S3: lm>
#> 20 25-Meter            Recurve  <tibble [38 × 7]>    <S3: lm>

Created on 2019-04-29 by the reprex package (v0.2.1)


Thanks for the thoughtful replies. I'm going to give these a go and see what I come up with.

I'm trying to understand the various mapping functions better, but I keep hitting a wall.

In the helpful example you provided, how would I add an additional variable via mutate that would contain the highest score value in each grouped data frame? (The variable is called score.)

Your sample data is no longer available, could you check on that?

library(tidyverse)
results <- read_csv(url("http://archerytoolkit.net/soty/2016-2018_all_scores.csv"),
                        col_names = c("year", "event", "class", "division", "gender", "org", "round", "score"),
                        col_types = "fffffffd")
#> Warning in open.connection(con, "rb"): cannot open URL 'http://
#> archerytoolkit.net/soty/2016-2018_all_scores.csv': HTTP status was '404 Not
#> Found'
#> Error in open.connection(con, "rb"): cannot open the connection to 'http://archerytoolkit.net/soty/2016-2018_all_scores.csv'

@andresrcs, sorry about that. The file is there now, but I'm afraid the content has changed slightly with ongoing work on the project. There's a header row that gives the column names. The group_by variables are now round and equipment_class. The scores are still in the score column.

Is this what you want to do?

library(tidyverse)

results <- read_csv(url("http://archerytoolkit.net/soty/2016-2018_all_scores.csv"),
                    col_names = TRUE)
results %>% 
    group_by(round, equipment_class) %>%
    filter(score > quantile(score, 0.1), score < quantile(score, 0.9)) %>%
    mutate(n = row_number()) %>%
    group_nest() %>%
    mutate(model = map(data, ~lm(score ~ n, data = .)),
           max_score = map_dbl(data, ~max(.$score, na.rm = TRUE)))
#> # A tibble: 79 x 5
#>    round                equipment_class  data               model max_score
#>    <chr>                <chr>            <list>             <lis>     <dbl>
#>  1 15 meters (36 arrow… Compound         <tibble [2 × 19]>  <lm>        272
#>  2 18 meters (60 arrow… Barebow          <tibble [14 × 19]> <lm>        465
#>  3 18 meters (60 arrow… Compound         <tibble [575 × 19… <lm>        576
#>  4 18 meters (60 arrow… Compound Limited <tibble [4 × 19]>  <lm>        539
#>  5 18 meters (60 arrow… Recurve          <tibble [70 × 19]> <lm>        534
#>  6 20 meters (36 arrow… Compound         <tibble [14 × 19]> <lm>        351
#>  7 20 meters (36 arrow… Recurve          <tibble [8 × 19]>  <lm>        317
#>  8 20 meters (72 arrow… Compound         <tibble [9 × 19]>  <lm>        621
#>  9 25 meters (36 arrow… Compound         <tibble [16 × 19]> <lm>        341
#> 10 25 meters (36 arrow… Recurve          <tibble [6 × 19]>  <lm>        275
#> # … with 69 more rows

Yes, thank you. I was missing the tilde in the map_dbl function. I'm still trying to wrap my brain around the various map functions even after multiple reviews of the iteration chapter in R for Data Science. The functional programming model is very different from what I'm used to.

Is it possible to add the score coefficient of the lm model to the results? I tried this, but it didn't work.

results %>% 
    group_by(round, equipment_class) %>%
    filter(score > quantile(score, 0.1), score < quantile(score, 0.9)) %>%
    mutate(n = row_number()) %>%
    group_nest() %>%
    mutate(model = map(data, ~lm(score ~ n, data = .)),
           score_param = coef(model)[["score"]],
           max_score = map_dbl(data, ~max(.$score, na.rm = TRUE))
    )

There's no error message. The score_param column just doesn't show up. Is this possible?

If I understand you correctly, because of the formula score ~ n, you are trying to get the intercept:

library(tidyverse)
library(broom)
results <- read_csv(url("http://archerytoolkit.net/soty/2016-2018_all_scores.csv"),
                    col_names = TRUE)

results %>% 
    group_by(round, equipment_class) %>%
    filter(score > quantile(score, 0.1), score < quantile(score, 0.9)) %>%
    mutate(n = row_number()) %>%
    group_nest() %>%
    mutate(model = map(data, ~lm(score ~ n, data = .)),
           intercept = map_dbl(model, ~tidy(.)$estimate[1]),
           max_score = map_dbl(data, ~max(.$score, na.rm = TRUE)))
#> # A tibble: 79 x 6
#>    round           equipment_class  data          model intercept max_score
#>    <chr>           <chr>            <list>        <lis>     <dbl>     <dbl>
#>  1 15 meters (36 … Compound         <tibble [2 ×… <lm>       216.       272
#>  2 18 meters (60 … Barebow          <tibble [14 … <lm>       440.       465
#>  3 18 meters (60 … Compound         <tibble [575… <lm>       544.       576
#>  4 18 meters (60 … Compound Limited <tibble [4 ×… <lm>       524        539
#>  5 18 meters (60 … Recurve          <tibble [70 … <lm>       483.       534
#>  6 20 meters (36 … Compound         <tibble [14 … <lm>       337.       351
#>  7 20 meters (36 … Recurve          <tibble [8 ×… <lm>       272.       317
#>  8 20 meters (72 … Compound         <tibble [9 ×… <lm>       574.       621
#>  9 25 meters (36 … Compound         <tibble [16 … <lm>       321.       341
#> 10 25 meters (36 … Recurve          <tibble [6 ×… <lm>       221.       275
#> # … with 69 more rows

Interesting. The broom package and tidy() function are new to me. Very useful indeed!
