Data Wrangling best option: separating variables identified by spreadsheet cell formatting

Hi everyone,

This is just a glimpse of my column that I need to transform...What would be the best solution for creating two new columns called Client and Country ?

If I need to clarify please tell me.

Thanks in advance,
Milos

Unless there is a uniform number of clients per country, I think you're going to have to wrestle with the highlighting you've been handed, which isn't something readxl handles.

When you're dealing with the use of formatting (e.g. cell colours) as a sort of data encoding, I'd recommend looking at a pair of packages by Duncan Garmonsway, tidyxl (despite the name, not actually part of the tidyverse) and unpivotr


Duncan's detailed patterns for this scenario in the write-up below:

https://nacnudus.github.io/spreadsheet-munging-strategies/an-example-formatting-lookup.html

1 Like

If you can make a list of the values in your yellow cells, the following non-tidy solution may work.

library(dplyr)
DF <- data.frame(Orig = c("A", "AAA", "CCC", "EEE", "TTT", 
                          "B", "WS", "ED", "RG", "YH", "UJ"), stringsAsFactors = FALSE)
DF
#>    Orig
#> 1     A
#> 2   AAA
#> 3   CCC
#> 4   EEE
#> 5   TTT
#> 6     B
#> 7    WS
#> 8    ED
#> 9    RG
#> 10   YH
#> 11   UJ
Countries <- c("A", "B")
DF <- DF %>% mutate(Tag = ifelse(Orig %in% Countries, "Country", "Client"))
DF
#>    Orig     Tag
#> 1     A Country
#> 2   AAA  Client
#> 3   CCC  Client
#> 4   EEE  Client
#> 5   TTT  Client
#> 6     B Country
#> 7    WS  Client
#> 8    ED  Client
#> 9    RG  Client
#> 10   YH  Client
#> 11   UJ  Client
DF$Country <- NA
DF[1, "Country"] <-  DF[1, "Orig"]
for(i in 2:nrow(DF)){
  if (DF[i, "Tag"] == "Client") {
    DF[i, "Country"] <- DF[i - 1, "Country"]
  } else  {
    DF[i, "Country"] <- DF[i, "Orig"]
  }
}
DF <- DF %>% filter(!Orig %in% Countries) %>% 
  select(Country, Client = Orig)
DF
#>   Country Client
#> 1       A    AAA
#> 2       A    CCC
#> 3       A    EEE
#> 4       A    TTT
#> 5       B     WS
#> 6       B     ED
#> 7       B     RG
#> 8       B     YH
#> 9       B     UJ

Created on 2019-09-03 by the reprex package (v0.2.1)

2 Likes

I'd like to offer a small modification to avoid the for loop:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
DF <- data.frame(Orig = c("A", "AAA", "CCC", "EEE", "TTT", 
                                                    "B", "WS", "ED", "RG", "YH", "UJ"), stringsAsFactors = FALSE)
Countries <- c("A", "B")
DF <- DF %>% mutate(Tag = ifelse(Orig %in% Countries, "Country", "Client"))

DF %>% 
    mutate(Country = ifelse(Tag == "Country", Orig, NA)) %>% 
    tidyr::fill(Country) %>% 
    filter(Tag != "Country") %>% 
    select(Country, Client = Orig)

#>   Country Client
#> 1       A    AAA
#> 2       A    CCC
#> 3       A    EEE
#> 4       A    TTT
#> 5       B     WS
#> 6       B     ED
#> 7       B     RG
#> 8       B     YH
#> 9       B     UJ

Created on 2019-09-03 by the reprex package (v0.3.0)

2 Likes

Thanks a lot Valeri, it works perfectly :slight_smile:

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.