Manipulating a List in a Single Excel Cell

I have a list of items and settings with dates in a single Excel cell like so:

Item A (1993) Setting A, Item B (1994) Setting B, Item C (1995), Item D (1992) Setting D,...

I wish to parse the cell and take each setting that is located after a date and before a comma and place it in its own cell like so:

Setting A

Setting B

Setting D

I’m used to other programming languages, so I know how to do this there, but I’m trying to learn R and after reading the documentation, I did not see how I could do this within one or more steps. Hopefully some of you know what I’m looking for. Thanks in advance!

You can do it using "regular expressions" like in other programming languages.

library(stringr)

single_cell <- "Item A (1993) Setting A, Item B (1994) Setting B, Item C (1995), Item D (1992) Setting D"

str_extract_all(single_cell, "Setting\\s.")
#> [[1]]
#> [1] "Setting A" "Setting B" "Setting D"

Created on 2020-01-25 by the reprex package (v0.3.0.9000)

Sorry if I made it unclear with my example.

Pencil (1992) Texas, Toiler Paper (1983) New Jersey, Lasers and Lasagna (2020), Windshield Wiper Fluid (2002) New York City,...

Yields:

Texas

New Jersey

New York City

While the extract all function actually might be what I need to use, what is the expression I must use to extract the data between the date-parentheses and the comma.

I might see a way to solve it in multiple steps by taking each entry out one at a time with Item + (Date) + Setting and then taking two more steps to delete Item and Date, but is that the best way to do it?

Domain-specific knowledge is required to fine-tune a regular expression but this could work for you as a starting point.

library(stringr)

single_cell <- "Pencil (1992) Texas, Toiler Paper (1983) New Jersey, Lasers and Lasagna (2020), Windshield Wiper Fluid (2002) New York City"

str_extract_all(single_cell, "(?<=\\)\\s)[^[:punct:]]+(?=,?)")
#> [[1]]
#> [1] "Texas"         "New Jersey"    "New York City"
2 Likes

You're amazing! This almost did exactly what I wanted. The only reason it did not do it all was because I still did not give you a perfect example. I tried to parse exactly how "(?<=\)\s)[^[:punct:]]+(?=,?)" should be interpreted with my eyes, but I am just not there yet. Can you explain step-by-step how I should read this expression?

The reason it was not perfect because my example was still lacking. Hopefully, I get it right this time.

Pencil (1992) Texas, Toiler Paper (1983) New Jer., Lasers and Lasagna (2020), Wind. Wiper Fld. (2002) N.Y. C.,

Should Yield:

Texas

New Jer.

N. Y. C.

(Your expression ended up yielding:

Texas

New Jer

N)

Even if you cannot help beyond this point, I want to thank you for helping me along so far. You've done so much for me!

I can modify the "regular expression" to match your new example, but that does not guarantee that it is going to work for the rest of your data.

library(stringr)

single_cell <- "Pencil (1992) Texas, Toiler Paper (1983) New Jer., Lasers and Lasagna (2020), Wind. Wiper Fld. (2002) N.Y. C.,"

str_extract_all(single_cell, "(?<=\\)\\s).+?(?=,)")
#> [[1]]
#> [1] "Texas"    "New Jer." "N.Y. C."

(?<=\)\s) This part looks for a ")" and an empty space before the text you are going to extract
.+? This part matches any character one or more times but the shortest string possible.
(?=,) This looks for a "," following the text you are going to extract.

If you want to tailor this to your actual data I think you should look into "regular expressions" they are not an R feature but common to many programming languages.

1 Like

Regex is certainly a powerful language, but it can take study and practice to master.

Here is an attempt at an approach based on string splitting, only on exact matches, first on bracket going one way, and then bracket going the other way. i used fixed=TRUE parameter, to explicitly not use Regex style pattern, but exact symbol matching.

library(tidyverse)
library(purrr)
single_cell <- "Pencil (1992) Texas, Toiler Paper (1983) New Jersey, Lasers and Lasagna (2020), Windshield Wiper Fluid (2002) New York City"


split1<- strsplit(single_cell,")",fixed=TRUE)[[1]]
split2<- map(split1, ~ strsplit(.,"(",fixed=TRUE)[[1]][[1]])
result <- unlist(split2) %>% enframe(name=NULL)

> split1
[1] "Pencil (1992"                          " Texas, Toiler Paper (1983"           
[3] " New Jersey, Lasers and Lasagna (2020" ", Windshield Wiper Fluid (2002"       
[5] " New York City"                       
> split2
[[1]]
[1] "Pencil "

[[2]]
[1] " Texas, Toiler Paper "

[[3]]
[1] " New Jersey, Lasers and Lasagna "

[[4]]
[1] ", Windshield Wiper Fluid "

[[5]]
[1] " New York City"

> result
# A tibble: 5 x 1
  value                             
  <chr>                             
1 "Pencil "                         
2 " Texas, Toiler Paper "           
3 " New Jersey, Lasers and Lasagna "
4 ", Windshield Wiper Fluid "       
5 " New York City"

It worked! Thank you so much! I've started learning/playing with regex on regexr, but it certainly isn't as intuitive as some other things I've learned in programming. Hopefully it will click sooner or later.

Nearly all the cells were perfect except in weird cases where the term before the dated parentheses had its own string within a parentheses, example:

Before

Albuquerque (1992) Stat Med,

Lasagna (1993) J Cran,

Receiver-operating characteristic (ROC) plots: a tool in medicine (1993) Clin Chem

After

Stat Med

J Cran

plots: a tool in medicine (1993) Clin Chem [Needs to be Clin Chem]

I understand this occurred because of the unforeseen parentheses that came before the date, but I am wondering whether I need to refine the original regex and apply it to the "Before" column or would it be easier/possible to create a new regex that can be applied to the "After" column. I've been toying with the After column, but no matter what regex I formulate, it would also operate and change the cells I do not want to change.

I would do the edits manually if I did not have a list of 100k items to go through.

Sorry for bothering you with what must be clerical work for you, but all the assistance you've given me has been extremely helpful and exactly what I've needed.

Thanks again!

I would do this

library(stringr)

single_cell <- "Albuquerque (1992) Stat Med, Lasagna (1993) J Cran, Receiver-operating characteristic (ROC) plots: a tool in medicine (1993) Clin Chem"

str_extract_all(single_cell, "(?<=\\d{4}\\)\\s).+?(?=(,|$))")
#> [[1]]
#> [1] "Stat Med"  "J Cran"    "Clin Chem"
1 Like

Worked like a charm <3!

And I totally understand that you inserted d{4} so that your expression would only target the parentheses with a 4 digit number in them. I really should've thought of that after all your teaching so far :joy:. Much appreciated Mr. andresrcs.

Please let me know if there's something I can do for you aside from marking your response as the solution.

Your welcome! If you want to do something for people helping you in this forum, please make sure to familiarize with the forum guidelines before you post again here, that would help us help you and keep the forum tidy.

#meta:faq

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.