Split uneven length vectors to columns with tidyr

tidyverse

#1

Is there a ‘tidy’ approach to splitting data from text into columns, where each ‘vector of text’ does not contain the same number of elements?

I’m having trouble where stringr::str_view will recognize the string I want to split on, but I can’t get tidyr::seperate, to separate the data properly.
I would assume as I want to split where three spaces occur, that the easiest way would be to simply specify the spaces in brackets, but I don’t think tidyr likes that?

library(stringr)
library(tidyr)
data<-tibble::tribble(
  ~value,
  "RATINGS: 4   MEAN: 3.83/5.0   WEIGHTED AVG: 3.39/5   IBU: 35   EST. CALORIES: 204   ABV: 6.8%",
  "RATINGS: 89   WEIGHTED AVG: 3.64/5   EST. CALORIES: 188   ABV: 6.25%",
  "RATINGS: 8   MEAN: 3.7/5.0   WEIGHTED AVG: 3.45/5   IBU: 85   EST. CALORIES: 213   ABV: 7.1%"
)

separate(data, value, into = c("Ratings","Weighted Avg","IBU","Est Calories","abv"),sep="[//s]",fill = "right")

#this works
str_view_all(string = "RATINGS: 4   MEAN: 3.83/5.0   WEIGHTED AVG: 3.39/5   IBU: 35   EST. CALORIES: 204   ABV: 6.8%",pattern = "[   ]+")

str_view_all(string = "RATINGS: 4   MEAN: 3.83/5.0   WEIGHTED AVG: 3.39/5   IBU: 35   EST. CALORIES: 204   ABV: 6.8%",pattern = "   ")

#but I can't split on it with tidyr's seperate

separate(data, value, into = c("Ratings","Weighted Avg","IBU","est Calories","abv"),sep= "   ",extra = "merge")
Warning message:
Too few values at 3 locations: 1, 2, 3 

#2

your problem, in your example, is that there is no IBU in the second entry so there are a not enough parts for the expected columns. you could add a fill setting to separate then it would work.


#3

more generally, with what you are trying to do overall in getting the data into columns, if you have a unique identifier for each rating, the next steps are then gather the separated columns, separate on the colon, and spread them.


#4

Hi,

you can use tidyverse tools to get a fixed length column to separate. I would do it like this, using :package: purrr to help deal with list results from stringr

reprex::reprex_info()
#> Created by the reprex package v0.1.1.9000 on 2017-11-11

library(stringr)
library(tidyr)
library(purrr)
library(dplyr)
#> 
#> Attachement du package : 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
data<-tibble::tribble(
  ~string,
  "RATINGS: 4   MEAN: 3.83/5.0   WEIGHTED AVG: 3.39/5   IBU: 35   EST. CALORIES: 204   ABV: 6.8%",
  "RATINGS: 89   WEIGHTED AVG: 3.64/5   EST. CALORIES: 188   ABV: 6.25%",
  "RATINGS: 8   MEAN: 3.7/5.0   WEIGHTED AVG: 3.45/5   IBU: 85   EST. CALORIES: 213   ABV: 7.1%"
)

# you can split the string in several, here is a sample
str_split(data$string[[1]], "  ")[[1]]
#> [1] "RATINGS: 4"            " MEAN: 3.83/5.0"       " WEIGHTED AVG: 3.39/5"
#> [4] " IBU: 35"              " EST. CALORIES: 204"   " ABV: 6.8%"

data %>%
  # you can apply the previous to each row using map
  mutate(split = map(string, ~ str_split(.x, "  ")[[1]])) %>%
  # then unnest the column before further data prep
  unnest() %>%
  # you can now separate in a fixed 2 length vector 
  separate(split, c("type", "valeur"), ": ") %>%
  # then get the result in column with NA in cells where you did not have value in string
  spread(type, valeur) %>%
  # for printing do not show string column
  select(-string)
#> # A tibble: 3 x 6
#>   ` ABV` ` EST. CALORIES` ` IBU`  ` MEAN` ` WEIGHTED AVG` RATINGS
#> *  <chr>            <chr>  <chr>    <chr>           <chr>   <chr>
#> 1   6.8%              204     35 3.83/5.0          3.39/5       4
#> 2   7.1%              213     85  3.7/5.0          3.45/5       8
#> 3  6.25%              188   <NA>     <NA>          3.64/5      89

#5

The map isn’t strictly necessary, since str_split will produce a list of results already. So, the second line of the pipe can be replaced with:

mutate(split = str_split(string, "  ")) %>%

without any changes in the output.

Also, I’d suggest adding the following line at the end (or somewhere after creating the new columns:

rename_all(str_trim) %>%

That will remove the whitespace before some column names.


#6

Thanks @nick for these great suggestions ! I missed it with str_split result! Thanks a lot ! :raised_hands:

Here is the new cleaner solution as reprex to easy copy-paste with reprex_clean()

reprex::reprex_info()
#> Created by the reprex package v0.1.1.9000 on 2017-11-11

library(tidyverse)
data<-tibble::tribble(
  ~string,
  "RATINGS: 4   MEAN: 3.83/5.0   WEIGHTED AVG: 3.39/5   IBU: 35   EST. CALORIES: 204   ABV: 6.8%",
  "RATINGS: 89   WEIGHTED AVG: 3.64/5   EST. CALORIES: 188   ABV: 6.25%",
  "RATINGS: 8   MEAN: 3.7/5.0   WEIGHTED AVG: 3.45/5   IBU: 85   EST. CALORIES: 213   ABV: 7.1%"
)
data %>%
  # by @nick suggestion
  mutate(split = str_split(string, "  ")) %>%
  # then unnest the column before further data prep
  unnest() %>%
  # you can now separate in a fixed 2 length vector 
  separate(split, c("type", "valeur"), ": ") %>%
  # then get the result in column with NA in cells where you did not have value in string
  spread(type, valeur) %>%
  # by @nick suggestion
  rename_all(str_trim) %>%
  # for printing do not show string column
  select(-string)
#> # A tibble: 3 x 6
#>     ABV `EST. CALORIES`   IBU     MEAN `WEIGHTED AVG` RATINGS
#> * <chr>           <chr> <chr>    <chr>          <chr>   <chr>
#> 1  6.8%             204    35 3.83/5.0         3.39/5       4
#> 2  7.1%             213    85  3.7/5.0         3.45/5       8
#> 3 6.25%             188  <NA>     <NA>         3.64/5      89

#7

There are a couple of issues with your first separate:

separate(data, value, into = c("Ratings","Weighted Avg","IBU","Est Calories","abv"),sep="[//s]",fill = "right")

s = "[//s]" Probably isn’t doing what you intend it to. sep is a regular expression that is used to match the text between columns.

This “[//s]” and this “[/s]” equivalent regular expressions. It is what regular expressions call a character class and matches any ‘s’ or ‘/’ , i.e. match any letter s or forward slash.

It may be that you meant "[\s] but that would not have worked either. Also "[\s] and “\s” are equivalent regular expression and match a whitespace character.

Another issue is that sep matches specifically the text between columns… so the “RATINGS” at the beginning of each line would be considered to be coming after a zero width column because there is no value which preceeds it.

Another thing about separate is that is cannot figure out where columns belong or what column names are, it just separates thing as it finds in order. Fill just tells it what to do if it cannot find enough pieces of data for all the column names listed in col.

The following works but it probably isn’t what you want but it may give you a better idea of how separate work.

BTW the sep value is as complicated as it is because the separator text has the format that it does.

suppressPackageStartupMessages(library("tidyverse"))
# make tribble
data <- tibble::tribble(
    ~value,
    "RATINGS: 4   MEAN: 3.83/5.0   WEIGHTED AVG: 3.39/5   IBU: 35   EST. CALORIES: 204   ABV: 6.8%",
    "RATINGS: 89   WEIGHTED AVG: 3.64/5   EST. CALORIES: 188   ABV: 6.25%",
    "RATINGS: 8   MEAN: 3.7/5.0   WEIGHTED AVG: 3.45/5   IBU: 85   EST. CALORIES: 213   ABV: 7.1%"
)
# regular expression that matches the text between columns
sp <- "\\p{Letter}+\\s+\\p{Letter}+:\\s+|\\p{Letter}+\\.\\s+\\p{Letter}+:\\s+|\\p{Letter}+:\\s+"
# pipe to process data
sd <- data %>% 
#  separate strings in to columns
# the "skip" column catches the empty column that preceds "Ratings:"
  separate(value,
                 into = c("skip", "Ratings", "Mean", "Weighted Avg","IBU","Est Calories","abv"),
                 sep = sp, fill = "right") %>%
# throw the skip column
  select("Mean", "Weighted Avg", "IBU", "Est Calories", "abv")
# view the content of sd
sd
#> # A tibble: 3 x 5
#>          Mean `Weighted Avg`   IBU `Est Calories`   abv
#> *       <chr>          <chr> <chr>          <chr> <chr>
#> 1 3.83/5.0         3.39/5    35            204     6.8%
#> 2   3.64/5            188    6.25%           <NA>  <NA>
#> 3  3.7/5.0         3.45/5    85            213     7.1%

In the end it just has the values, in order, that it found in each line… because separate doesn’t know what column a piece of data if finds belongs to, is just distinguishes separator text from values.


#8

Could you use the pluck function from purr instead of the [[1]] at the end of the str_splitt to pull out the element?

Maybe this is just me, but to simply ‘unnest’ data from str_split, I came across (at least) four different functions to potentially use?

#want to unnest list to chr vector
options are:
  -flatten()
  -unnest()
  -unlist()
  -squash()
  -anything in purrr ?

@mara is is possible to get some clarity around when we should be using the above functions for what, and when? I feel like I haven’t seem a complete workflow of how to use these tools properly to know which one to pick for the job. Maybe it’s just painful because of these two issues in particular:

https://github.com/tidyverse/purrr/issues/265
and:
https://github.com/tidyverse/tidyr/issues/287


#9

I shall look into it.


#10

Without trying to step on @mara’s toes, here’s my understanding of how dealing with lists → vectors should typically fall out:

  • rlang::flatten and rlang::squash: Input is a list, flatten removes one “level” from the list, while squash removes all levels (roughly equivalent to running flatten repeatedly until the data has no lists left). unlist is the base version of these, but can be difficult to predict what the type of the output will be on unknown data. Ideally, you use flatten_*, like flatten_int, so that you can force the output into the form that you expect as part of the process, rather than potentially going through multiple conversions with the associated problems that can cause.
  • tidyr::unnest: Input is a data frame that includes at least one list column (containing vectors or data frames). It will duplicate all other columns so that each item of the vector (or each row of the data frame) gets its own row. It is inverted by tidyr::nest, which is one potential way that the list column could have been generated in the first place.

So, in general, if you are working in a data frame already, you use unnest (like in this example). If you are working on the list directly, you use flatten and friends. Where the Github issues that you pointed out run into problems is that unnest currently only handles list columns containing atomic vectors and data frames, not a list of lists. That’s not super common (at least from my usage), but it should be a future feature of tidyr, given this Github issue:
https://github.com/tidyverse/tidyr/issues/278

That shows a good toy example of where the current tools fall down, but I don’t think it really applies in this case.

Also, you are correct that the pluck function could be used instead of [[1]], though the later version of the code makes that unnecessary. You could also use map, since providing a number to it is similar to mapping pluck onto each element of the input. These four methods of using str_split give equivalent output:

suppressPackageStartupMessages(library(tidyverse))

fruits <- c(
  "apples and oranges and pears and bananas",
  "pineapples and mangos and guavas"
)

# Equivalent to original code
map(fruits, ~ str_split(.x, " and ")[[1]])

# With pluck
map(fruits,
    ~ str_split(.x, " and ") %>% 
      pluck(1))

# With a second map
map(fruits, ~ str_split(.x, " and ")) %>% 
  map(1)

# Using the standard str_split method
str_split(fruits, " and ")

Character Encoding within the tidyverse
#11

Pretty sure @nick covered all of this (and more), but, since I wrote it up anyway…

unlist(), flatten(), and squash()

unlist()

  • base R generic that will take a list, x, and produce a vector with all
    atomic components of x
  • output determined by highest type component in hierarchy
    NULL < raw < logical < integer < double < complex < character < list < expression

flatten()

  • both rlang::flatten() and rlang::squash() can turn a list of lists into
    a list or, simpler, type-stable vector
  • flatten() only removes one level of hierarchy from a list
  • purrr::flatten() applies as described above, but takes the argument .x
  • the unsuffixed flatten() will return a list regardless of contents of input
    list, but contents must match type for flatten_*() funtions

squash()

  • removes all levels of hierarchy from list of lists, and otherwise
    follows the same conventions as rlang::flatten(), with type-stable output
    returned for suffixed functions

tidyr::unnest()

unnest()

  • tidyr::unnest() takes a list column, and makes each element into its own row
  • list column contents can be atomic vectors, or data frames,
    but each row must have the same number of entries

Character Encoding within the tidyverse
#12

@cderv @nick, @mara thanks so much for you help, so much to learn. Turns out, I actually need an ID in my example so my data looks like this:

#Sample Data
sample_data <- tibble::tibble(
  id = c(390639,99472,361258),
   ~value,
  "RATINGS: 4   MEAN: 3.83/5.0   WEIGHTED AVG: 3.39/5   IBU: 35   EST. CALORIES: 204   ABV: 6.8%",
  "RATINGS: 89   WEIGHTED AVG: 3.64/5   EST. CALORIES: 188   ABV: 6.25%",
  "RATINGS: 8   MEAN: 3.7/5.0   WEIGHTED AVG: 3.45/5   IBU: 85   EST. CALORIES: 213   ABV: 7.1%"
    )
  )
)  

That way, I can see what stats are for each respective url where the data comes from.
The only problem with that, is when I run the new data in, I get:

In addition: Warning message:
Too many values at 4 locations: 1, 2, 3, 4

So I’m assuming I would just need to turn your code into a function, to then map it via purrr, correct?


#13

Your sample code does not seems correct. If I understand well, you have one id per line.
It works for me with the same code I provided but I changed your sampled data to make an id column.

reprex::reprex_info()
#> Created by the reprex package v0.1.1.9000 on 2017-11-18

sample_data <- tibble::tribble(
  ~id,  ~value,
  390639, "RATINGS: 4   MEAN: 3.83/5.0   WEIGHTED AVG: 3.39/5   IBU: 35   EST. CALORIES: 204   ABV: 6.8%",
  99472, "RATINGS: 89   WEIGHTED AVG: 3.64/5   EST. CALORIES: 188   ABV: 6.25%",
  361258, "RATINGS: 8   MEAN: 3.7/5.0   WEIGHTED AVG: 3.45/5   IBU: 85   EST. CALORIES: 213   ABV: 7.1%"
)
library(tidyverse)
sample_data %>%
  # by @nick suggestion
  mutate(split = str_split(value, "  ")) %>%
  # then unnest the column before further data prep
  unnest() %>%
  # you can now separate in a fixed 2 length vector 
  separate(split, c("type", "valeur"), ": ") %>%
  # then get the result in column with NA in cells where you did not have value in string
  spread(type, valeur) %>%
  # by @nick suggestion
  rename_all(str_trim) %>%
  #for printing do not show string column
  select(-value)
#> # A tibble: 3 x 7
#>       id   ABV `EST. CALORIES`   IBU     MEAN `WEIGHTED AVG` RATINGS
#> *  <dbl> <chr>           <chr> <chr>    <chr>          <chr>   <chr>
#> 1  99472 6.25%             188  <NA>     <NA>         3.64/5      89
#> 2 361258  7.1%             213    85  3.7/5.0         3.45/5       8
#> 3 390639  6.8%             204    35 3.83/5.0         3.39/5       4

Id column is never touched in the code so it stays in place until the end.
is this the kind of result you are looking for ?


#14

Yeah, this is exactly it, I think I created the tibble wrong in my modified code, as I don’t believe you can mix both formula, and vector assignment when creating a tibble. Thanks for your help.


#15

Yes you can’t do both. Either use tibble with col = vec_value or tribble that allows you to use formula and creates a dataframe by line.

Glad I could help!


#16

Woof, turns out this isen’t merely an tibble issue, more so R encoding woes. Apparently the strings on here are encoded as ‘unknown’, yet work with the function you supplied. My data is encoded as UTF-8, which I don’t think tidyr likes, which is odd.

Just to confirm:

> library(magrittr)
> library(tidyverse)
> #My data
> test <- tibble::tribble(
+      ~id,                                                                                           ~value,
+   390639,  "RATINGS: 4   MEAN: 3.83/5.0   WEIGHTED AVG: 3.39/5   IBU: 35   EST. CALORIES: 204   ABV: 6.8%",
+    99472,                           "RATINGS: 89   WEIGHTED AVG: 3.64/5   EST. CALORIES: 188   ABV: 6.25%",
+   361258,   "RATINGS: 8   MEAN: 3.7/5.0   WEIGHTED AVG: 3.45/5   IBU: 85   EST. CALORIES: 213   ABV: 7.1%"
+   )
> Encoding(test$value)
[1] "UTF-8" "UTF-8" "UTF-8"
> 
> #data on rstudio communitty-- encoding is 'unknown', but works in function?
> sample_data <- tibble::tribble(
+   ~id,  ~value,
+   390639, "RATINGS: 4   MEAN: 3.83/5.0   WEIGHTED AVG: 3.39/5   IBU: 35   EST. CALORIES: 204   ABV: 6.8%",
+   99472, "RATINGS: 89   WEIGHTED AVG: 3.64/5   EST. CALORIES: 188   ABV: 6.25%",
+   361258, "RATINGS: 8   MEAN: 3.7/5.0   WEIGHTED AVG: 3.45/5   IBU: 85   EST. CALORIES: 213   ABV: 7.1%"
+ )
> Encoding(sample_data$value)
[1] "unknown" "unknown" "unknown"
> 
> 
> ####### test with my data
> test %>%
+    # by @nick suggestion
+    mutate(split = str_split(value, "  ")) %>%
+    # then unnest the column before further data prep
+    unnest() %>%
+    # you can now separate in a fixed 2 length vector 
+    separate(split, c("type", "valeur"), ": ") %>%
+    # then get the result in column with NA in cells where you did not have value in string
+    spread(type, valeur) %>%
+    # by @nick suggestion
+    rename_all(str_trim) %>%
+    #for printing do not show string column
+    select(-value)
# A tibble: 3 x 2
      id           RATINGS
*  <dbl>             <chr>
1  99472 89   WEIGHTED AVG
2 361258          8   MEAN
3 390639          4   MEAN
Warning message:
Too many values at 3 locations: 1, 2, 3 
> 
>  
> ####### data on rstudio communitty
> sample_data %>%
+    # by @nick suggestion
+    mutate(split = str_split(value, "  ")) %>%
+    # then unnest the column before further data prep
+    unnest() %>%
+    # you can now separate in a fixed 2 length vector 
+    separate(split, c("type", "valeur"), ": ") %>%
+    # then get the result in column with NA in cells where you did not have value in string
+    spread(type, valeur) %>%
+    # by @nick suggestion
+    rename_all(str_trim) %>%
+    #for printing do not show string column
+    select(-value)
# A tibble: 3 x 7
      id   ABV `EST. CALORIES`   IBU     MEAN `WEIGHTED AVG` RATINGS
*  <dbl> <chr>           <chr> <chr>    <chr>          <chr>   <chr>
1  99472 6.25%             188  <NA>     <NA>         3.64/5      89
2 361258  7.1%             213    85  3.7/5.0         3.45/5       8
3 390639  6.8%             204    35 3.83/5.0         3.39/5       4

Looks like separate fails, probably due to encoding issues? I would think you would not want to encode a string as “unknown” to make it work properly.:grimacing:
Maybe this is just me, but shouldn’t you always have an encoding type for a string?


#17

When you post problematic code, it’d be super helpful if you could do so in the form of a reprex. The major advantage with something like this is that it makes it much easier for someone else to re-run (i.e. no manual deletion of prompts).

It doesn’t actually have to be a reprex if that’s unappealing for some reason, but, if you look at @cderv’s example above, you can see the same type of “re-runnable” formatting used in reprexes (repreces?!).

The Code that I can easily run criteria from Jenny’s reprex slides! :point_down:


#18

Thanks for the code. Like @mara said, you could try to use the reprex format, it is easier. However, I manage to get it back with copy and reprex::reprex_rescue (Thanks @jennybryan for this emergency function !! :smile: )

That being said, I do not have any problem running your code ( tibble test) - it works on my system. However, I can’t explain why it is encoding unknown. Encoding is always tricky and I am on windows. If you want me to test your code with your encoding, I think best idea is to share a script as UTF8 in a gist. That way I can source directly your UTF-8 file. Here I must copy paste and encoding could be wrong that way.

About the error you get, there is 2 weak points I see in my code that could be the cause and that you must take into account and adapt to your situation :

  1. str_split(value, " ") : maybe you must split by something different than two spaces, and also it could be better to use regex \s (any whitespace)
  2. separate(split, c("type", "valeur"), ": ") : here too it is possible you have to split by something different. just : for example

Try to adapt the code in your specific situation. I do not have all the context and regex is something to make generic or adapt specifically. Morever copy pasting code is not the most safe practice. I encourage you to learn reprex to minimise the difficulty and make reproductible example. Also, sharing a script (on https://gist.github.com/) could help sometime too.


#19

Jenny always helps me help others, but open-source troubleshooting (and all troubleshooting, I guess) is a cooperative endeavor! :wink: