Create columns with unique values

Ani · March 23, 2019, 1:27am

I have student-level data where a single column lists for each student their unique major(s), minor(s), and any certificate(s) they may have earned while graduating.

The raw data look like this:

mydf <- structure(list(id = c("21996", "21996", "21996", "21996", "21996", 
"25294", "25294", "25294", "25294", "25294", "62410", "62410", 
"62410", "62410", "62410", "62656", "62656", "62656", "62656", 
"62656", "00073", "00073", "00073", "00073", "00073", "31337", 
"31337", "31337", "31337", "31337"), dtype = c("BB", "BB", "BB", 
"CT", "OR", "BS", "CT", "OR", "OR", "OR", "BA", "BA", "CT", "CT", 
"OR", "BC", "CT", "OR", "OR", "OR", "BS", "OR", "OR", "OR", "OR", 
"BS", "CT", "OR", "OR", "OR"), code = c("6120", "6126", "6158", 
"LEAD", "COMS", "6472", "DIAB", "2121", "3311", "6360", "4201", 
"4409A", "ISLM", "LJCU", "4211", "5363", "SALE", "6125", "6127", 
"BSAD", "7269", "4231", "4233", "7270", "BSAD", "6380", "SALE", 
"6127", "6181", "BSAD")), row.names = c(NA, -30L), class = c("tbl_df", 
"tbl", "data.frame"))

For each unique id I would like to create columns listing their first major, second major, third major, and so on, followed by their first minor, second minor, and so on, and closing with their first certificate, second c certificate, and so on.

dtype values of CT represent a certificate
dtype values of OR represent a minor
!dtype %in% c("CT, "OR) represent a major

Each unique value of code represents the unique major/minor/certificate

The resulting data frame will be as follows:

out.df <- structure(list(id = c(21996, 25294, 62410, 62656, 73, 31337), 
    major1 = c(6120, 6472, 4201, 5363, 7269, 6380), major2 = c("6126", 
    NA, "4409A", NA, NA, NA), major3 = c(6158, NA, NA, NA, NA, 
    NA), major4 = c(NA, NA, NA, NA, NA, NA), minor1 = c("COMS", 
    "2121", "4211", "6125", "4231", "6127"), minor2 = c(NA, 3311, 
    NA, 6127, 4233, 6181), minor3 = c(NA, "6360", NA, "BSAD", 
    "7270", "BSAD"), minor4 = c(NA, NA, NA, NA, "BSAD", NA), 
    cert1 = c("LEAD", "DIAB", "ISLM", "SALE", NA, "SALE"), cert2 = c(NA, 
    NA, "LJCU", NA, NA, NA), cert3 = c(NA, NA, NA, NA, NA, NA
    ), cert4 = c(NA, NA, NA, NA, NA, NA)), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

Looks to me like I should be able to create placeholder columns that represent majors, minors, certs 1 through 4, and then pull in unique values of code for each dtype, inserting NA if the student has only one major, one minor, etc. However, the secret sauce is eluding me. I have a feeling that I can combine tidyr and dplyr to solve this but am drawing blanks. As usual, any and all leads are mucho mucho appreciated.

andresrcs · March 23, 2019, 3:07am

I think this is pretty close to what you want

library(dplyr)
library(tidyr)

mydf <- data.frame(stringsAsFactors=FALSE,
                   id = c("21996", "21996", "21996", "21996", "21996", "25294", "25294",
                          "25294", "25294", "25294", "62410", "62410", "62410", "62410",
                          "62410", "62656", "62656", "62656", "62656", "62656", "00073",
                          "00073", "00073", "00073", "00073", "31337", "31337", "31337",
                          "31337", "31337"),
                   dtype = c("BB", "BB", "BB", "CT", "OR", "BS", "CT", "OR", "OR", "OR",
                             "BA", "BA", "CT", "CT", "OR", "BC", "CT", "OR", "OR", "OR",
                             "BS", "OR", "OR", "OR", "OR", "BS", "CT", "OR", "OR", "OR"),
                   code = c("6120", "6126", "6158", "LEAD", "COMS", "6472", "DIAB",
                            "2121", "3311", "6360", "4201", "4409A", "ISLM", "LJCU",
                            "4211", "5363", "SALE", "6125", "6127", "BSAD", "7269", "4231",
                            "4233", "7270", "BSAD", "6380", "SALE", "6127", "6181", "BSAD")
)

mydf %>% 
    mutate(diploma = case_when( dtype == "CT" ~ "cert",
                                dtype == "OR" ~ "minor",
                                TRUE ~ "major")) %>% 
    arrange(id, diploma, code) %>% 
    group_by(id, diploma) %>% 
    mutate(name = paste0(diploma, row_number())) %>% 
    ungroup() %>%
    select(-diploma, -dtype) %>% 
    spread(name, code) %>% 
    select(id, starts_with("major"), starts_with("minor"), starts_with("cert"))
#> # A tibble: 6 x 10
#>   id    major1 major2 major3 minor1 minor2 minor3 minor4 cert1 cert2
#>   <chr> <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> <chr>
#> 1 00073 7269   <NA>   <NA>   4231   4233   7270   BSAD   <NA>  <NA> 
#> 2 21996 6120   6126   6158   COMS   <NA>   <NA>   <NA>   LEAD  <NA> 
#> 3 25294 6472   <NA>   <NA>   2121   3311   6360   <NA>   DIAB  <NA> 
#> 4 31337 6380   <NA>   <NA>   6127   6181   BSAD   <NA>   SALE  <NA> 
#> 5 62410 4201   4409A  <NA>   4211   <NA>   <NA>   <NA>   ISLM  LJCU 
#> 6 62656 5363   <NA>   <NA>   6125   6127   BSAD   <NA>   SALE  <NA>

^{Created on 2019-03-23 by the reprex package (v0.2.1)}

Ani · March 23, 2019, 1:22pm

Perfect solution! Thanks @andresrcs

andresrcs · March 23, 2019, 1:24pm

If your question's been answered (even by you!), would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems. Here’s how to do it:

system · March 30, 2019, 1:24pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.