how to split a column with no pattern

omarh2m · March 13, 2021, 2:40pm

for example: I have a single column data frame(single row) like this, what I want to do is to split it into several columns but the problem is that there is no pattern I can base the split upon.

PR7 <- c("PR7 - Autres produits d'exploitation 6.9 371 667 1 389")
bt_PR7 <- data.frame(PR7)
what i did is as follow :
gi<-transform(bt_PR7, variable = substr(PR7, 6, 36), montant_06_2020 = substr(PR7, 41, 44),
montant_06_2019 = substr(PR7,45,48),
montant_12_2019 = substr(PR7,49,54))
gi <- subset( gi, select = -PR7 )
the output==> Capture

is there an easiest way beacause this is an example of many ...thanks in advance i really appreciate it if you can help

fredoxvii · March 13, 2021, 4:06pm

Hello,

You could try str_split() from the stringr package.

technocrat · March 14, 2021, 1:00am

This almost does it, but there's no way to resolve the ambiguity introduced by embedded banks among the numeral strings to both delimit values and serve as a big.mark: "...1 389" could mean 1 and 389 or 1389, as strings.

suppressPackageStartupMessages({
  library(dplyr)
  library(stringr)
})

hyphen <- "-"
numerals <- "\\s[0-9\\.]+"
chars <- "^[:graph:]\\s+"

PR7 <- c("PR7 - Autres produits d'exploitation 6.9 371 667 1 389")

n_vars <- str_extract_all(PR7,numerals)[[1]] %>% str_squish() %>% as.numeric()
c_vars <- str_remove_all(PR7,numerals) %>% str_remove_all(.,hyphen) %>% str_squish()

n_vars
#> [1]   6.9 371.0 667.0   1.0 389.0
c_vars
#> [1] "PR7 Autres produits d'exploitation"

omarh2m · March 14, 2021, 9:55am

@ technocrat ,thanks i tried something like this myself but like you said with 1 389 could mean 1 and 389 or 1389, i made an ocr of an image and so so that's why its so hard
readbaddata <- function(x) {
dat <- read.table(text = x)
whatIsIt <- sapply(dat, typeof)
variable <- paste(dat[whatIsIt == "character"], collapse = " ")
res <- data.frame(
variable = variable,
dat[, whatIsIt != "character"])
names(res)[-1] <- c("note","Year2021", "Year2020", "Year2019")
return(res)
}
out <- lapply(PR7,readbaddata)

nirgrahamuk · March 14, 2021, 12:15pm

perhaps the image contains other format clues though, eg. cell borders ?

technocrat · March 14, 2021, 6:44pm

What I'd do in your situation is to save the data frame as you have it as a CSV. Edit the file to remove quotation marks and globally replace runs of blanks with a single comma. Eliminate spelling single quote marks or substitute with ZZ for later correction. Use readr::read_csv to read it back in.

The read function will give you the parsing failures— x columns expected, y columns found and you can use the warnings() function to locate the lines with offending data and go back to the CSV file to hand edit. (I know, but really what choice to you have other than training your own OCR?)

Read back in again and the remaining problem is to restore the ' marks and glue back together the character string and then proceed as suggested.

technocrat · March 15, 2021, 5:12am

Here is an ugly approach, which assumes that all of the blank-embedded records are in the last position. It exploits the behavior of readr::read_csv, which will produce NAs when one more entry than header columns is read-in.

suppressPackageStartupMessages({
  library(dplyr)
  library(glue)
  library(magrittr)
  library(readr)
  library(stringi)
  library(stringr)
  library(tidyr)
})

hyphen <- "-"
numerals <- "\\s[0-9\\.]+"
chars <- "^[:graph:]\\s+"
header <- c("id", "var1", "var2", "var3", "var4", "bad")
singlemark <- "\\'"
PR7 <- c("PR7 - Autres produits d'exploitation 6.9 371 667 1 389")

n_vars <- str_extract_all(PR7, numerals)[[1]] %>%
  str_squish() %>%
  paste(., collapse = ",")

 c_vars <- str_remove_all(PR7, numerals) %>%
  str_remove_all(., hyphen) %>%
  str_replace(., singlemark, "ZZ") %>%
  str_squish()


# read_csv(dta.csv) -> dta 
 # id,var1,var2,var3,var4,bad
 # PR7 Autres produits dZZexploitation,6.9,371,667,1,389
 # PR7 Autres produits dZZexploitation,6.9,371,667,1389,NA
 # PR7 Autres produits dZZexploitation,6.9,371,667,1,389
 # PR7 Autres produits dZZexploitation,6.9,371,667,1389,NA
 # PR7 Autres produits dZZexploitation,6.9,371,667,1,389
 # PR7 Autres produits dZZexploitation,6.9,371,667,1389,NA
 # PR7 Autres produits dZZexploitation,6.9,371,667,1,389
 # PR7 Autres produits dZZexploitation,6.9,371,667,1,389

dta <- data.frame(
  id =
    c("PR7 Autres produits dZZexploitation", 
      "PR7 Autres produits dZZexploitation", 
      "PR7 Autres produits dZZexploitation", 
      "PR7 Autres produits dZZexploitation", 
      "PR7 Autres produits dZZexploitation", 
      "PR7 Autres produits dZZexploitation", 
      "PR7 Autres produits dZZexploitation", 
      "PR7 Autres produits dZZexploitation"),
  var1 =
    c(6.9, 6.9, 6.9, 6.9, 6.9, 6.9, 6.9, 6.9),
  var2 =
    c(371, 371, 371, 371, 371, 371, 371, 371),
  var3 =
    c(667, 667, 667, 667, 667, 667, 667, 667),
  var4 =
    c(1, 1389, 1, 1389, 1, 1389, 1, 1),
  bad =
    c(389, NA, 389, NA, 389, NA, 389, 389))

dta %<>% mutate(id = str_replace(id,"ZZ","'"))

good <- filter(dta, is.na(bad))
bad  <- filter(dta, !is.na(bad))

bad %<>% mutate(var4 = paste0(var4,bad),
                var4 = as.numeric(var4))

add_row(good,bad) %>% select(-bad)
#>                                   id var1 var2 var3 var4
#> 1 PR7 Autres produits d'exploitation  6.9  371  667 1389
#> 2 PR7 Autres produits d'exploitation  6.9  371  667 1389
#> 3 PR7 Autres produits d'exploitation  6.9  371  667 1389
#> 4 PR7 Autres produits d'exploitation  6.9  371  667 1389
#> 5 PR7 Autres produits d'exploitation  6.9  371  667 1389
#> 6 PR7 Autres produits d'exploitation  6.9  371  667 1389
#> 7 PR7 Autres produits d'exploitation  6.9  371  667 1389
#> 8 PR7 Autres produits d'exploitation  6.9  371  667 1389

omarh2m · March 15, 2021, 10:29am

@ technocrat Thank you, this is probably the closest I've come to a good solution; thank you, I really appreciate your effort.

system · April 5, 2021, 10:29am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.