Behavior of "select" and str_split differs from df$variable

I'm writing a function that will generate co-occurrence lists (trying to generalize an existing script to turn it into a package), and the beginning of the function requires me to select the column with comma-delimited IDs, and then use str_split() to separate them into a list of character vectors.

If I use the format df$variable to start, I get the expected output - a list of character vectors (list of 15, in this case), and clean strings.

If I use df %>% select(variable) %>% to start, I get a list of one. This is expected, I guess, but I'm not sure how to work around it. It also doesn't appear to split the strings in the same way (e.g., the first element comes out as "c(\"4023" rather than "4023"), and removing whitespace etc. hasn't helped.

In my naive way I tend to think of df$variable and df %>% select(variable) as accomplishing the same thing in other contexts, though I know that isn't really the case. I'm not sure what select() is doing in the background, or how to get what I want: the results of df$variable, but generalized into a function. If anyone has any insight, that would be marvelous. The output of this goes into a map(expand.grid()) function, if that is helpful.

Results from reprex():

Using select:

library(tidyverse)
library(stringr)
library(nycflights13)

set.seed(12)
test <- 
  sample_n(flights, 1000)
flts <- aggregate(flight ~ carrier, paste, data = test, collapse = ",") 
str(flts)
#> 'data.frame':    14 obs. of  2 variables:
#>  $ carrier: chr  "9E" "AA" "AS" "B6" ...
#>  $ flight : chr  "4023,2906,4135,3807,3459,3443,3525,4192,2934,3445,4120,3913,3367,4065,3992,3970,3400,3311,4220,3540,3367,4120,2"| __truncated__ 

flts %>%
  select(flight) %>% 
  str_split(",") 
#> [[1]]
#>    [1] "c(\"4023"  "2906"      "4135"      "3807"      "3459"     
#>    [6] "3443"      "3525"      "4192"      "2934"      "3445"     
#>   [11] "4120"      "3913"      "3367"      "4065"      "3992"     
#>   [16] "3970"      "3400"      "3311"      "4220"      "3540"     
#>   [21] "3367"      "4120"      "2908"      "3310"      "3357"     
#>   [26] "4305"      "4060"      "3319"      "3393"      "4220"     
#>   [31] "2912"      "3321"      "3353"      "4127"      "4178"     
#>   [36] "3881"      "4178"      "3304"      "3523"      "3538"     
#>   [41] "4275"      "3795"      "3325"      "3410"      "3855"     
#>   [46] "3393"      "4060"      "3347"      "2951"      "3354"     
#>   [51] "3439"      "3470"      "3910"      "3405"      "3623"     
#>   [56] "3932"      "4218\""    " \"1357"   "2314"      "1925"  
#>   [61] "325"       "211"       "1623"      "321"       "1103"     
#>   [66] "1769"      "854"       "655"       "1850"      "1073"     
#>   [71] "345"       "1999"      "565"       "2019"      "269"      
#>   [76] "33"        "715"       "145"       "413"       "117"      

...(truncated)

#' Created on 2018-03-14 by the reprex package (v0.2.0).

And using df$variable:

library(tidyverse)
library(stringr)
library(nycflights13)

set.seed(12)
test <- 
  sample_n(flights, 1000)
flts <- aggregate(flight ~ carrier, paste, data = test, collapse = ",") 
str(flts)
#> 'data.frame':    14 obs. of  2 variables:
#>  $ carrier: chr  "9E" "AA" "AS" "B6" ...
#>  $ flight : chr  "4023,2906,4135,3807,3459,3443,3525,4192,2934,3445,4120,3913,3367,4065,3992,3970,3400,3311,4220,3540,3367,4120,2"| __truncated__ 

flts$flight %>%
  str_split(",")
#> [[1]]
#>  [1] "4023" "2906" "4135" "3807" "3459" "3443" "3525" "4192" "2934" "3445"
#> [11] "4120" "3913" "3367" "4065" "3992" "3970" "3400" "3311" "4220" "3540"
#> [21] "3367" "4120" "2908" "3310" "3357" "4305" "4060" "3319" "3393" "4220"
#> [31] "2912" "3321" "3353" "4127" "4178" "3881" "4178" "3304" "3523" "3538"
#> [41] "4275" "3795" "3325" "3410" "3855" "3393" "4060" "3347" "2951" "3354"
#> [51] "3439" "3470" "3910" "3405" "3623" "3932" "4218"
#> 
#> [[2]]
#>  [1] "1357" "2314" "1925" "325"  "211"  "1623" "321"  "1103" "1769" "854" 
#> [11] "655"  "1850" "1073" "345"  "1999" "565"  "2019" "269"  "33"   "715" 
#> [21] "145"  "413"  "117"  "1750" "1327" "1621" "301"  "1769" "172"  "717" 
#> [31] "2314" "371"  "269"  "707"  "1757" "313"  "731"  "341"  "145"  "1507"
#> [41] "3"    "145"  "84"   "145"  "739"  "1762" "1410" "84"   "19"   "305" 
#> [51] "119"  "307"  "1999" "1709" "359"  "269"  "543"  "300"  "19"   "1837"
#> [61] "1073" "133"  "745"  "33"   "269"  "59"   "1709" "1145" "1223" "1357"
#> [71] "753"  "1"    "2279" "85"   "19"   "2279" "84"   "1611" "753"  "19"  
(...truncated)

#' Created on 2018-03-14 by the reprex package (v0.2.0).

The two ways you've described are not the same.

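Most dplyr verbs, select() included, return a data frame. When you use $ to extract a variable, you get a vector. So str_split() was handed a one-column data frame rather than a character vector, and the whole column was coerced into a single string. That is why you get a list of one, with a stray c(\" at the start.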
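
To see the difference concretely, here is a small sketch on a made-up two-row data frame (`toy` and its values are hypothetical stand-ins for flts, not your data). The last line shows what base R does when a data frame is coerced to character; I'm assuming that is the same kind of coercion str_split() ended up applying in your reprex.

```r
library(dplyr)

# Hypothetical toy data, standing in for flts
toy <- data.frame(
  carrier = c("9E", "AA"),
  flight  = c("4023,2906", "1357,2314"),
  stringsAsFactors = FALSE
)

as_df  <- toy %>% select(flight)  # still a data frame, with one column
as_vec <- toy$flight              # a plain character vector

class(as_df)
#> [1] "data.frame"
class(as_vec)
#> [1] "character"

# A data frame is a list of columns, so coercing it to character
# deparses each column into one string that starts with c(
as.character(as_df)
```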

If you want to mimic what $ is doing, but stay with dplyr, you can use dplyr::pull. It'll output a vector.
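
Since you want to wrap this in a function, here is a minimal sketch of how pull() might fit in. `split_ids()` and the toy data are names I made up for illustration, and the `{{ }}` syntax assumes a reasonably recent dplyr/rlang, so treat it as one possible shape rather than the way to do it.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data, standing in for flts
toy <- data.frame(
  carrier = c("9E", "AA"),
  flight  = c("4023,2906", "1357,2314"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pull one column out as a vector, then split it
split_ids <- function(df, col, sep = ",") {
  df %>%
    pull({{ col }}) %>%   # vector, like df$col
    str_split(sep)        # list with one character vector per row
}

ids <- split_ids(toy, flight)
length(ids)
#> [1] 2
ids[[1]]
#> [1] "4023" "2906"
```

Because pull() returns a vector, the result has one list element per row, the same as with flts$flight.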

Also, you are almost there with reprex, but something went wrong with the formatting. I suspect you copied the text from the "View" pane in RStudio, which messed it up a bit. Once you run reprex(), the result is on your clipboard, so you can paste it directly here.

Finally, what you are trying to do can be done with the following slight modification:

library(tidyverse)
library(stringr)
library(nycflights13)

set.seed(12)
test <- sample_n(flights, 1000)
flts <- aggregate(flight ~ carrier, paste, data = test, collapse = ",")

res <- flts %>%
  select(flight) %>%
  dplyr::mutate(splitted = str_split(flight, ","))

res$splitted[[1]]
#>  [1] "4023" "2906" "4135" "3807" "3459" "3443" "3525" "4192" "2934" "3445"
#> [11] "4120" "3913" "3367" "4065" "3992" "3970" "3400" "3311" "4220" "3540"
#> [21] "3367" "4120" "2908" "3310" "3357" "4305" "4060" "3319" "3393" "4220"
#> [31] "2912" "3321" "3353" "4127" "4178" "3881" "4178" "3304" "3523" "3538"
#> [41] "4275" "3795" "3325" "3410" "3855" "3393" "4060" "3347" "2951" "3354"
#> [51] "3439" "3470" "3910" "3405" "3623" "3932" "4218"

Created on 2018-03-14 by the reprex package (v0.2.0).


Using pull works perfectly, thanks. I knew the two weren't equivalent, but I couldn't quite figure out how to get around it.

And I did just paste whatever reprex() put on my clipboard, but for some reason that didn't seem to carry the ```r with it...I'll have to add it next time.

Thank you!