How to data wrangle and mutate at column containing specific string ?

It is hard to describe in words. Therefore, made a reprex
with input, output and expected output below

How can we data wrangle

  1. When we function and mutate as shown below, there is ambiguity each time based on column name string
  2. how can we rbind these once we have unique column names
library(tidyverse)
# Basically, "." means ",". So, better we remove . and PC and convert to Numeric
df1 <- tribble(
  ~`ABC sales 01.01.2019 - 01.02.2019`, ~code,
 "1.019 PC",   2000, # Actually, it 1019 (remove . and PC )
 "100 PC",   2101,
 "3.440 PC",   2002
)

df2 <- tribble(
  ~`ABC sales 01.03.2019 - 01.04.2019`, ~year,
  "6.019 PC",   2019, 
  "20 PC",   2001,
  "043.440 PC",   2002
)

df3 <- tribble(
  ~`ABC sales 01.05.2019 - 01.06.2019`, ~year,
  "1.019 PC",   2000, 
  "701 PC",   2101,
  "6.440 PC",   2002
)

# Input data
input_df = list(df1,df2,df3)

#### function to clean data
# str_replace is used twice because 
# remove PC and dot

data_read = function(file){

  df_ <- df %>% #glimpse()
    # Select the column to remove PC, spaces and .
    # Each time, column name differs so, `ABC sales 01.01.2019 - 01.02.2019` cannot be used
    mutate_at(sales_dot = str_replace(select(contains('ABC')), "PC",""),
              sales = str_replace(sales_dot, "\\.",""), # name the new column so that rbind can be applied later
              sales_dot = NULL, # delete the old column
              vars(contains("ABC")) = NULL # delete the old column
    )
  df_
}

# attempt to resolve
# To clean the data from dots and PC
output_df1 <- map(input_df, data_read) # or lapply ?
# rbind 
output = map(output_df1, rbind) # or lapply ?

expected_output <- df3 <- tribble(
  ~sales, ~year,
  "1019",   2000, 
  "100",   2101,
  "3440",   2002,
  "6019",   2019, 
  "20",   2001,
  "043440",   2002,
  "1019",   2000,
  "701",   2101,
  "6440",   2002
)
library(tidyverse)
df1 <- tribble(
  ~`ABC sales 01.01.2019 - 01.02.2019`, ~year,
  "1.019 PC",   2000, # Actually, it 1019 (remove . and PC )
  "100 PC",   2101,
  "3.440 PC",   2002
)

df2 <- tribble(
  ~`ABC sales 01.03.2019 - 01.04.2019`, ~year,
  "6.019 PC",   2019, 
  "20 PC",   2001,
  "043.440 PC",   2002
)

df3 <- tribble(
  ~`ABC sales 01.05.2019 - 01.06.2019`, ~year,
  "1.019 PC",   2000, 
  "701 PC",   2101,
  "6.440 PC",   2002
)

# Input data
input_df = list(df1,df2,df3)

data_read = function(df){
  df_ <- df %>% #glimpse()
    mutate_at(.vars = vars(contains('ABC')),
              .funs = str_replace_all, pattern = "PC|\\.| ", replacement = ""
    )  
  Nms <- colnames(df_)
  colnames(df_) <- str_replace(Nms, "^ABC.+", "ABC")
  df_
}

output_df1 <- map(input_df, data_read) # or lapply ?
dfFinal <- bind_rows(output_df1)
dfFinal
#> # A tibble: 9 x 2
#>   ABC     year
#>   <chr>  <dbl>
#> 1 1019    2000
#> 2 100     2101
#> 3 3440    2002
#> 4 6019    2019
#> 5 20      2001
#> 6 043440  2002
#> 7 1019    2000
#> 8 701     2101
#> 9 6440    2002

Created on 2020-01-27 by the reprex package (v0.3.0)

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.