I have a bunch of .csv files that I am merging into a single data.frame . I'm looking to create a source column that identifies the row by a piece of the filename. In this case, my file names are set-up as "zagats_food.csv". I'm looking to get the piece before the _, and have that under a new source column. In this case it would be zagats. Here's the code I have so far using here, purrr, and readr.
I would insert a step in which the filepaths get named. With their own basename() or fs::path_file(), probably. You could do you regex work here if you like. Then use the .id argument of map_dfr() to have those names show up as a column in the final result.
Re: how to make an example, you can create 2 or 3 dummy files, just to read back in. We don't need to work with your actual files, we just need a realistic scenario.
You need to capture the list of files and apply names to it, based on its values.
Those names will propagate to downstream work in a nice way and that is a general fact.
Something along these lines will work:
library(tidyverse)
writeLines("x,y\n1,a", "a.csv")
writeLines("x,y\n2,b", "b.csv")
writeLines("x,y\n3,c", "c.csv")
(my_files <- list.files(pattern = "*.csv"))
#> [1] "a.csv" "b.csv" "c.csv"
my_files %>%
set_names() %>%
map_dfr(read_csv, .id = "source")
#> Parsed with column specification:
#> cols(
#> x = col_double(),
#> y = col_character()
#> )
#> Parsed with column specification:
#> cols(
#> x = col_double(),
#> y = col_character()
#> )
#> Parsed with column specification:
#> cols(
#> x = col_double(),
#> y = col_character()
#> )
#> # A tibble: 3 x 3
#> source x y
#> <chr> <dbl> <chr>
#> 1 a.csv 1 a
#> 2 b.csv 2 b
#> 3 c.csv 3 c
Created on 2018-12-13 by the reprex package (v0.2.1.9000)
That will definitely help me! Using your example, is there a way that I can name the source column into an abbreviated fashion so that I can strip zagat, from the filename before the map_dfr function?
library(tidyverse)
writeLines("x,y\n1,a", "zagat_food.csv")
writeLines("x,y\n2,b", "zagat_drink.csv")
writeLines("x,y\n3,c", "michelin_food.csv")
(my_files <- list.files(pattern = "^.*food.*"))
#> [1] "michelin_food.csv" "zagat_food.csv"
my_files %>%
set_names() %>%
map_dfr(read_csv, .id = "source")
#> Parsed with column specification:
#> cols(
#> x = col_integer(),
#> y = col_character()
#> )
#> Parsed with column specification:
#> cols(
#> x = col_integer(),
#> y = col_character()
#> )
#> # A tibble: 2 x 3
#> source x y
#> <chr> <int> <chr>
#> 1 michelin_food.csv 3 c
#> 2 zagat_food.csv 1 a
If you strip the name before map_dfr(read_csv, .id = "source") then R can't perform read_csv() to read your data. You can strip the word zagat and michelin after it reading it.
library(tidyverse)
writeLines("x,y\n1,a", "zagat_food.csv")
writeLines("x,y\n2,b", "zagat_drink.csv")
writeLines("x,y\n3,c", "michelin_food.csv")
my_files <- list.files(pattern = "^.*food.*")
my_files %>%
set_names() %>%
map_dfr(read_csv, .id = "source") %>%
mutate(new_name = word(source, sep = "_"))
#> Parsed with column specification:
#> cols(
#> x = col_double(),
#> y = col_character()
#> )
#> Parsed with column specification:
#> cols(
#> x = col_double(),
#> y = col_character()
#> )
#> # A tibble: 2 x 4
#> source x y new_name
#> <chr> <dbl> <chr> <chr>
#> 1 michelin_food.csv 3 c michelin
#> 2 zagat_food.csv 1 a zagat