It is hard to describe in words. Therefore, made a reprex
with input, output and expected output below
How can we data wrangle
- When we function and mutate as shown below, there is ambiguity each time based on column name string
- how can we rbind these once we have unique column names
library(tidyverse)
# Basically, "." means ",". So, better we remove . and PC and convert to Numeric
df1 <- tribble(
~`ABC sales 01.01.2019 - 01.02.2019`, ~code,
"1.019 PC", 2000, # Actually, it 1019 (remove . and PC )
"100 PC", 2101,
"3.440 PC", 2002
)
df2 <- tribble(
~`ABC sales 01.03.2019 - 01.04.2019`, ~year,
"6.019 PC", 2019,
"20 PC", 2001,
"043.440 PC", 2002
)
df3 <- tribble(
~`ABC sales 01.05.2019 - 01.06.2019`, ~year,
"1.019 PC", 2000,
"701 PC", 2101,
"6.440 PC", 2002
)
# Input data
input_df = list(df1,df2,df3)
#### function to clean data
# str_replace is used twice because
# remove PC and dot
data_read = function(file){
df_ <- df %>% #glimpse()
# Select the column to remove PC, spaces and .
# Each time, column name differs so, `ABC sales 01.01.2019 - 01.02.2019` cannot be used
mutate_at(sales_dot = str_replace(select(contains('ABC')), "PC",""),
sales = str_replace(sales_dot, "\\.",""), # name the new column so that rbind can be applied later
sales_dot = NULL, # delete the old column
vars(contains("ABC")) = NULL # delete the old column
)
df_
}
# attempt to resolve
# To clean the data from dots and PC
output_df1 <- map(input_df, data_read) # or lapply ?
# rbind
output = map(output_df1, rbind) # or lapply ?
expected_output <- df3 <- tribble(
~sales, ~year,
"1019", 2000,
"100", 2101,
"3440", 2002,
"6019", 2019,
"20", 2001,
"043440", 2002,
"1019", 2000,
"701", 2101,
"6440", 2002
)