Title Here

This is a companion discussion topic for the original entry at https://www.rstudio.com/blog/title-here


Often, real-life data requires some level of cleaning. R packages help us transform data into a tidy, analyzable structure. However, if we jump into cleaning data without pausing and thinking about it beforehand, we may run into issues down the road. We may realize we need a complicated workflow to get the data into the correct format, or that we repeat ourselves too much, or we end up finding that the package we tried is not quite right and then we have to find an alternative.

We propose a few questions to guide you to the right tool for the job:

  • What does the data look like?

Part of the challenge is that the structure varies from dataset to dataset.

Tidy datasets are all alike but every messy dataset is messy in its own way. — Tidy Data, Hadley Wickham

Having a principled approach to data cleaning helps. If we know we want our dataset to be in a tidy format — where each variable forms a column, each observation forms a row, and each type of observational unit forms a table — then we can determine what transformations will lead to our desired result.

For some datasets, tidying the dataset can be as simple as pivoting it into a long format. For others, we may see that columns have multiple variables stored. Or, we notice that our data is in separate Excel sheets. And so on.

Taking a close look at the dataset can help us decide whether we need conditional statements or another approach. [how to make this determination?]

  • How often will I be doing this analysis?

If we are running an analysis only once, we may decide that a long, “not pretty” script is enough for what we need to do. However, if we know that we will be doing the same analysis next year, it may be worth taking the time to set ourselves up for the future.

A Real-Life Dataset

This blog post will be going through a real-life example from the Washington State Department of Education. The DOE stores their state test scores in a CVS spreadsheet. It contains 110 columns for each school’s Reading, Writing, Science, and Math tests for grades 1 - 14:

library(readr)

wa_head <-
read_csv(here::here("2022-02-17-title-here", "data", "wa_head.csv"))

wa_head

## # A tibble: 6 × 110
##   SchoolYear ESD       CountyNumber County CountyDistrictN… District GradeTested
##                                              
## 1 2010-11    Educatio…            1 Adams              1109 Washtuc…           3
## 2 2010-11    Educatio…            1 Adams              1109 Washtuc…           4
## 3 2010-11    Educatio…            1 Adams              1109 Washtuc…           5
## 4 2010-11    Educatio…            1 Adams              1109 Washtuc…           6
## 5 2010-11    Educatio…            1 Adams              1109 Washtuc…           7
## 6 2010-11    Educatio…            1 Adams              1109 Washtuc…           8
## # … with 103 more variables: SubGroup , ReadingTotalTestedNotTested ,
## #   ReadingTotalMSP_HSPETested ,
## #   ReadingMetStandardIncludingPrevPass ,
## #   ReadingPercentMetStandardIncludingPrevPass ,
## #   ReadingMetStandardWithoutPrevPass ,
## #   ReadingPercentMetStandardWithoutPrevPass ,
## #   ReadingPercentMetStandardExcludingNoScore , ReadingLevel1 , …

Let’s take a look at the column names:

wa_colnames <-
  read_rds(here::here("2022-02-17-title-here", "data", "wa_colnames.Rds"))

wa_colnames[c(24, 25, 49, 50)] # Taking four as an example

## [1] "ReadingLevel4"        "ReadingPercentLevel4" "MathLevel4"          
## [4] "MathPercentLevel4"

We can see that a column contains subject, measure, and grade level:

For the purposes of this post, let’s look at a simplified table:

## # A tibble: 6 × 12
##   SchoolYear ESD     GradeTested SubGroup ReadingLevel4 ReadingPercentLevel4
##                                               
## 1 2010-11    ESD 123           3 Male                23                 14.5
## 2 2010-11    ESD 123           4 Male                16                 11.5
## 3 2010-11    ESD 123           5 Male                37                 24.6
## 4 2010-11    ESD 123           3 Female              36                 26  
## 5 2010-11    ESD 123           4 Female              30                 20.5
## 6 2010-11    ESD 123           5 Female              39                 33  
## # … with 6 more variables: MathLevel4 , MathPercentLevel4 ,
## #   WritingLevel4 , WritingPercentLevel4 , ScienceLevel4 ,
## #   SciencePercentLevel4 

First, let’s think about what the data looks like. This data is not tidy. First, the data is a “wide” format, so we need to make it “long” if we want to use it in packages like ggplot2. We can do this using the tidyr function pivot_longer().

dat_melt <- tidyr::pivot_longer(
  dat,
  cols = ReadingLevel4:SciencePercentLevel4,
  names_to = "variable",
  values_to = "value"
)

Second, multiple variables are stored together in columns.

Nested if-else statements

If-else statements are an option once our data is in a long format (that is, the column names are in a single column). The if statement determines whether a statement based on a certain condition. Otherwise, it will do something else.

Nested if statements mean an if statement inside an if statement.

dat_melt
## # A tibble: 48 × 6
##    SchoolYear ESD     GradeTested SubGroup variable             value
##                                        
##  1 2010-11    ESD 123           3 Male     ReadingLevel4         23  
##  2 2010-11    ESD 123           3 Male     ReadingPercentLevel4  14.5
##  3 2010-11    ESD 123           3 Male     MathLevel4            18  
##  4 2010-11    ESD 123           3 Male     MathPercentLevel4     11.3
##  5 2010-11    ESD 123           3 Male     WritingLevel4         NA  
##  6 2010-11    ESD 123           3 Male     WritingPercentLevel4  NA  
##  7 2010-11    ESD 123           3 Male     ScienceLevel4         NA  
##  8 2010-11    ESD 123           3 Male     SciencePercentLevel4  NA  
##  9 2010-11    ESD 123           4 Male     ReadingLevel4         16  
## 10 2010-11    ESD 123           4 Male     ReadingPercentLevel4  11.5
## # … with 38 more rows
# Or, using reshape2::melt
# dat_melt <-
#   reshape2::melt(dat, id = c("SchoolYear", "ESD", "GradeTested", "SubGroup"))

Since our variable is now contained under “variable”, we can separate all the pieces into individual columns. We can run our if_else() statements:

dat_melt$Subject <- ifelse(grepl("Reading", dat_melt$variable), "Reading",
                                  ifelse(grepl("Math", dat_melt$variable), "Math",
                                         ifelse(grepl("Writing", dat_melt$variable), "Writing", "Science")))

However, there are a few drawbacks. Multiple if-else statements need to be “nested”, or defined within the previous function. While this works well on our simplified dataset, this becomes very hard to manage if we have many columns. The text starts to wrap and we lose track of parentheses, so it’s easy to get lost:

\[image\]

Are there other options?

dplyr::case_when() and data.table::fcase()

The case_when() and fcase() functions solve the issue of nesting functions. [how they work rather than nesting]

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)

dat_melt %>%
mutate(variable = as.character(variable)) %>%
mutate(
Subject = case_when(
str_detect(variable, "Math") ~ "Math",
str_detect(variable, "Reading") ~ "Reading",
str_detect(variable, "Writing") ~ "Writing",
str_detect(variable, "Science") ~ "Science"
)
)

## # A tibble: 48 × 7
##    SchoolYear ESD     GradeTested SubGroup variable             value Subject
##                                           
##  1 2010-11    ESD 123           3 Male     ReadingLevel4         23   Reading
##  2 2010-11    ESD 123           3 Male     ReadingPercentLevel4  14.5 Reading
##  3 2010-11    ESD 123           3 Male     MathLevel4            18   Math   
##  4 2010-11    ESD 123           3 Male     MathPercentLevel4     11.3 Math   
##  5 2010-11    ESD 123           3 Male     WritingLevel4         NA   Writing
##  6 2010-11    ESD 123           3 Male     WritingPercentLevel4  NA   Writing
##  7 2010-11    ESD 123           3 Male     ScienceLevel4         NA   Science
##  8 2010-11    ESD 123           3 Male     SciencePercentLevel4  NA   Science
##  9 2010-11    ESD 123           4 Male     ReadingLevel4         16   Reading
## 10 2010-11    ESD 123           4 Male     ReadingPercentLevel4  11.5 Reading
## # … with 38 more rows

The advantages of this approach is that it is much “cleaner” — it’s easy to see what is being changed.

[disadvantages?]

A Problem Solved by Joins