Creating a Source column based on partial filename using Purrr

realhiphop · December 13, 2018, 8:05pm

I have a bunch of .csv files that I am merging into a single data.frame . I'm looking to create a source column that identifies the row by a piece of the filename. In this case, my file names are set-up as "zagats_food.csv". I'm looking to get the piece before the _, and have that under a new source column. In this case it would be zagats. Here's the code I have so far using here, purrr, and readr.

library(here)
library(tidyverse)


restaurants <- fs::dir_ls(here::here("input_data"), regexp = "^.*food.*") %>% 
  map_dfr(read_csv)

I am then doing all of my data wrangling in a pipe after the map_dfr command.

Thanks in advance!

jennybryan · December 13, 2018, 11:17pm

Without code I can run, I can just offer advice.

I would insert a step in which the filepaths get named. With their own basename() or fs::path_file(), probably. You could do you regex work here if you like. Then use the .id argument of map_dfr() to have those names show up as a column in the final result.

realhiphop · December 14, 2018, 2:20am

Thanks. Sorry about the above example. I'm still relatively new to R. How can I produce a more easily example while using here? I tried below:

library(tidyverse)
library(fs)

data_dir <- "input_data"

fs::dir_ls(data_dir)

input_data/michelin_drink.csv input_data/michelin_food.csv  input_data/zagat_drink.csv    
input_data/zagat_food.csv

restaurants <- data_dir %>% 
  dir_ls(regexp = "^.*food.*") %>% 
  map_dfr(read_csv, .id = "source")

jennybryan · December 14, 2018, 6:41am

Re: how to make an example, you can create 2 or 3 dummy files, just to read back in. We don't need to work with your actual files, we just need a realistic scenario.

You need to capture the list of files and apply names to it, based on its values.

Those names will propagate to downstream work in a nice way and that is a general fact.

Something along these lines will work:

library(tidyverse)

writeLines("x,y\n1,a", "a.csv")
writeLines("x,y\n2,b", "b.csv")
writeLines("x,y\n3,c", "c.csv")

(my_files <- list.files(pattern = "*.csv"))
#> [1] "a.csv" "b.csv" "c.csv"

my_files %>%
  set_names() %>% 
  map_dfr(read_csv, .id = "source") 
#> Parsed with column specification:
#> cols(
#>   x = col_double(),
#>   y = col_character()
#> )
#> Parsed with column specification:
#> cols(
#>   x = col_double(),
#>   y = col_character()
#> )
#> Parsed with column specification:
#> cols(
#>   x = col_double(),
#>   y = col_character()
#> )
#> # A tibble: 3 x 3
#>   source     x y    
#>   <chr>  <dbl> <chr>
#> 1 a.csv      1 a    
#> 2 b.csv      2 b    
#> 3 c.csv      3 c

^{Created on 2018-12-13 by the reprex package (v0.2.1.9000)}

realhiphop · December 14, 2018, 10:53am

That will definitely help me! Using your example, is there a way that I can name the source column into an abbreviated fashion so that I can strip zagat, from the filename before the map_dfr function?

library(tidyverse)

writeLines("x,y\n1,a", "zagat_food.csv")
writeLines("x,y\n2,b", "zagat_drink.csv")
writeLines("x,y\n3,c", "michelin_food.csv")

(my_files <- list.files(pattern = "^.*food.*"))
#> [1] "michelin_food.csv" "zagat_food.csv"


my_files %>%
  set_names() %>% 
  map_dfr(read_csv, .id = "source") 
#> Parsed with column specification:
#> cols(
#>   x = col_integer(),
#>   y = col_character()
#> )
#> Parsed with column specification:
#> cols(
#>   x = col_integer(),
#>   y = col_character()
#> )
#> # A tibble: 2 x 3
#>   source                x y    
#>   <chr>             <int> <chr>
#> 1 michelin_food.csv     3 c    
#> 2 zagat_food.csv        1 a

Created on 2018-12-14 by the reprex package (v0.2.1)

rexevan · December 17, 2018, 6:52am

If you strip the name before map_dfr(read_csv, .id = "source") then R can't perform read_csv() to read your data. You can strip the word zagat and michelin after it reading it.

library(tidyverse)

writeLines("x,y\n1,a", "zagat_food.csv")
writeLines("x,y\n2,b", "zagat_drink.csv")
writeLines("x,y\n3,c", "michelin_food.csv")

my_files <- list.files(pattern = "^.*food.*")

my_files %>%
  set_names() %>% 
  map_dfr(read_csv, .id = "source") %>%
  mutate(new_name  = word(source, sep = "_"))
#> Parsed with column specification:
#> cols(
#>   x = col_double(),
#>   y = col_character()
#> )
#> Parsed with column specification:
#> cols(
#>   x = col_double(),
#>   y = col_character()
#> )
#> # A tibble: 2 x 4
#>   source                x y     new_name
#>   <chr>             <dbl> <chr> <chr>   
#> 1 michelin_food.csv     3 c     michelin
#> 2 zagat_food.csv        1 a     zagat

^{Created on 2018-12-17 by the reprex package (v0.2.1)}

realhiphop · December 17, 2018, 3:05pm

Thanks I used parts of both your and jennybryan's suggestion.

mutate(source = word(basename(source), sep = "_"), allowed me to parse from the full path filename given to me in the .id field from the here package.

system · December 24, 2018, 3:07pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.