tidy list of 120 csv files

StefanVoga · May 21, 2019, 8:23am

hello,

I've got a problem getting the data from https://www.kaggle.com/groffo/ads16-dataset into a usable form.
Here is an example file: https://www.dropbox.com/s/p5bfg3tssec0ow8/U0001-RT.csv?dl=0

I put all 120 *RT.csv files into one folder. Each file contains 300 ad ratings.
I imported them with:

temp = list.files(pattern = "\\.csv$")
myfiles = lapply(temp, read.delim, stringsAsFactors = FALSE, sep = ";")

My goal is to get one row per person, with the 300 ad ratings as columns. Maybe there is some way to reuse the existing labels [from Cat0 to Cat19] and split each of them into 15 sub-categories [Cat01, Cat02 ... Cat11, Cat12].

Maybe someone has a few tips for me (I've been playing around with this for a day already and am getting a serious headache now). That would be great!

makrez · May 21, 2019, 11:29am

Hi,

The code below is neither beautiful nor elegant, but it should do the trick and hopefully relieve your headache :)

library(dplyr)
library(tidyr)

temp <- list.files(pattern = "\\.csv$") # pattern is a regular expression, not a glob
myfiles <- lapply(temp, function(filename) {
  dat <- read.csv(filename, stringsAsFactors = FALSE, header = TRUE, sep = ";")
  dat$ID <- filename # tag each data frame with the file it came from
  dat
})

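transform_fun <- function(x){
  x %>% t() %>% as.data.frame() %>%  # transpose: one row per original column
    mutate(Cat = colnames(x)) %>%    # keep the original column names as a variable
    mutate(ID = x$ID[1]) %>%         # carry the file ID along
    filter(Cat != "ID") %>%          # drop the row that came from the ID column
    separate(col = 2, into = as.character(1:15), sep = ",") %>%  # split column 2 on commas into 15 pieces
    gather(subcategory, Value, 2:16) %>%                         # long format: one row per piece
    mutate(Category = paste(Cat, subcategory, sep = "_")) %>%    # label = column name + piece number
    select(Category, ID, Value) %>%
    spread(Category, Value)                                      # wide format: one row per ID
}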

df <- lapply(myfiles, transform_fun) %>% bind_rows()

Hope that helps!

Edit: I changed rbind to bind_rows
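
To make the target shape easier to see, here is a small self-contained sketch on made-up data. It assumes each file is a single row with one comma-separated string of ratings per category column, which may not match the real files exactly. The names toy_files, to_wide and toy_wide are hypothetical, and it uses pivot_longer()/pivot_wider() rather than the gather()/spread() calls above.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for two imported files (assumed layout: one row per file,
# one column per category, ratings stored as a comma-separated string).
toy_files <- list(
  data.frame(Cat1 = "1,2,3", Cat2 = "4,5,1", ID = "U0001-RT.csv",
             stringsAsFactors = FALSE),
  data.frame(Cat1 = "5,5,2", Cat2 = "3,1,4", ID = "U0002-RT.csv",
             stringsAsFactors = FALSE)
)

to_wide <- function(x) {
  x %>%
    # long format: one row per (ID, category)
    pivot_longer(-ID, names_to = "Cat", values_to = "ratings") %>%
    # one row per individual rating
    separate_rows(ratings, sep = ",") %>%
    # number the ratings within each category
    group_by(ID, Cat) %>%
    mutate(subcategory = row_number()) %>%
    ungroup() %>%
    mutate(Category = paste(Cat, subcategory, sep = "_"),
           Value = as.integer(ratings)) %>%
    select(ID, Category, Value) %>%
    # back to wide format: one row per ID, one column per rating
    pivot_wider(names_from = Category, values_from = Value)
}

toy_wide <- bind_rows(lapply(toy_files, to_wide))
```

With this toy input, toy_wide has one row per file and one column per rating (Cat1_1, Cat1_2, ...) plus ID.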

StefanVoga · May 21, 2019, 12:24pm

Wow, this is more than I expected. Thank you very, very much for your time!
I will analyze the code and hopefully learn to do it myself next time.

ambijat · June 4, 2019, 12:12pm

Hi makrez, the first part was indeed a great help, but I am unable to follow the second part. myfiles is now a large list in which each element is a data frame. I want to keep only certain columns from each data frame (there are 70 files in all).
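
One possible way to do that is to apply the same select() call to every element of the list with lapply(). The sketch below runs on toy data; toy_list stands in for myfiles, and keep_cols is a hypothetical choice of columns, so both would need to be adapted to the real files.

```r
library(dplyr)

# Toy stand-in for `myfiles`: a list of data frames with the same columns.
toy_list <- list(
  data.frame(Cat1 = "1,2", Cat2 = "3,4", ID = "a.csv", stringsAsFactors = FALSE),
  data.frame(Cat1 = "5,1", Cat2 = "2,2", ID = "b.csv", stringsAsFactors = FALSE)
)

# Hypothetical choice of columns to keep.
keep_cols <- c("ID", "Cat2")

# Apply the same column selection to every data frame in the list.
filtered <- lapply(toy_list, function(d) select(d, all_of(keep_cols)))

# Optionally stack the results into a single data frame.
filtered_df <- bind_rows(filtered)
```

all_of() raises an error if a requested column is missing from one of the data frames; any_of() would skip missing columns silently instead.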
