Why did column values disappear after running tidyr::separate?

Greetings,

I have this data frame which is a result from reading a PDF file with tabulizer::extract_tables()

library(dplyr)
library(tidyr)

dat <- structure(list(Name = c("Product 1", "Product 2", "Product 3", 
  "Product 4", "Product 5", "Product 6"), Size = c(750, 1750, 1750, 
  50, 375, 375), PK = c("1   19.95", "6   36.95", "6", "12", "1", 
  "1   13.95"), Metric = c(NA, NA, 69.95, 1, 13.95, NA)), row.names = c(1L, 
  2L, 3L, 4L, 5L, 6L), class = "data.frame")
dat
#>        Name Size        PK Metric
#> 1 Product 1  750 1   19.95     NA
#> 2 Product 2 1750 6   36.95     NA
#> 3 Product 3 1750         6  69.95
#> 4 Product 4   50        12   1.00
#> 5 Product 5  375         1  13.95
#> 6 Product 6  375 1   13.95     NA

str(dat)
#> 'data.frame':    6 obs. of  4 variables:
#>  $ Name  : chr  "Product 1" "Product 2" "Product 3" "Product 4" ...
#>  $ Size  : num  750 1750 1750 50 375 375
#>  $ PK    : chr  "1   19.95" "6   36.95" "6" "12" ...
#>  $ Metric: num  NA NA 70 1 13.9 ...

I want to separate the PK column then merge with the Metric column (& remove all NAs) to get this final table

#>        Name Size PK Metric
#> 1 Product 1  750  1  19.95
#> 2 Product 2 1750  6  36.95
#> 3 Product 3 1750  6  69.95
#> 4 Product 4   50 12   1.00
#> 5 Product 5  375  1  13.95
#> 6 Product 6  375  1  13.95

I tried tidyr::separate() first but the values that I want to merge with Metric column went away

dat_sep <- dat %>% 
  separate(PK, into = c("c1", "c2"), sep = " ",
           convert = TRUE, remove = FALSE)
#> Warning: Expected 2 pieces. Additional pieces discarded in 3 rows [1, 2,
#> 6].

#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 3 rows [3,
#> 4, 5].
dat_sep
#>        Name Size        PK c1 c2 Metric
#> 1 Product 1  750 1   19.95  1 NA     NA
#> 2 Product 2 1750 6   36.95  6 NA     NA
#> 3 Product 3 1750         6  6 NA  69.95
#> 4 Product 4   50        12 12 NA   1.00
#> 5 Product 5  375         1  1 NA  13.95
#> 6 Product 6  375 1   13.95  1 NA     NA

How do I go about fixing this?

Thank you!

Hey @RuReady! Thanks for the well-formatted example :smiley:

So, when you separate(), some of your rows have both values (eg. row 1: 1 19.95) and some don't (eg. row 3: 6). By default, separate() should handle this by "filling" the missing values from the right with NA, and this is what the second warning you're getting you is telling you. You can control this behaviour using the fill argument: if you give it fill = 'right' it'll do the same thing explicitly, and if you give it fill = 'left' it'll fill the left-most missing columns with NA instead.

The other warning you're getting is telling you that it's throwing away extra columns. As far as I can tell, this is because you're providing the separator as one space (sep = " "), when in your provided data there are actually three spaces (sep = " "). I'd say it's interpreting this as three separators, not one, and so your 'second' figure ends up being, like fourth (and, because you gave into two arguments, it sees this as two too many). If you explicitly give sep three spaces, this warning goes away and you get the expected result.

The last part of this is the dplyr::coalesce() function, which can merge several columns by using values in one to fill in NAs in another. You give it the columns in the order of importance: values in the first argument are kept first, values in the second are used for NAs, and so on.

So with all this, you'd do something like:

dat_sep <- dat %>% 
  separate(PK, into = c("c1", "c2"), sep = "   ",
           convert = TRUE, remove = FALSE, fill = "right") %>%
  # c1 has your integer values from PK; c2 has the content to merge into Metric
  mutate(Metric = coalesce(Metric, c2)) %>%
  # get rid of the columns we no longer need
  select(-PK, -c2) %>%
  rename(PK = c1)

dat_sep
#>        Name Size PK    c2 Metric
#> 1 Product 1  750  1 19.95  19.95
#> 2 Product 2 1750  6 36.95  36.95
#> 3 Product 3 1750  6    NA  69.95
#> 4 Product 4   50 12    NA   1.00
#> 5 Product 5  375  1    NA  13.95
#> 6 Product 6  375  1 13.95  13.95

Does that work for you?

5 Likes

Thank you for your detailed answer! Much appreciated!

1 Like

@rensa: your code works well for the 1st example but won't work if the sep is not three spaces. Considering this case where there are mixes of one, two & three spaces in PK column


dat2 <- structure(list(Name = c("Product 1", "Product 2", "Product 3", 
  "Product 4", "Product 5", "Product 6"), Size = c(750, 1750, 1750, 
  50, 375, 375), PK = c("12 19.95", "6  36.95", "6", "12", "1", 
  "1   13.95"), Metric = c(NA, NA, 69.95, 1, 13.95, NA)), row.names = c(1L, 
  2L, 3L, 4L, 5L, 6L), class = "data.frame")
dat2

#>        Name Size        PK Metric
#> 1 Product 1  750  12 19.95     NA
#> 2 Product 2 1750  6  36.95     NA
#> 3 Product 3 1750         6  69.95
#> 4 Product 4   50        12   1.00
#> 5 Product 5  375         1  13.95
#> 6 Product 6  375 1   13.95     NA

dat_sep <- dat2 %>% 
  separate(PK, into = c("c1", "c2"), sep = "   ",
           convert = TRUE, remove = FALSE, fill = "right") %>%
  # c1 has your integer values from PK; c2 has the content to merge into Metric
  mutate(Metric = coalesce(Metric, c2)) %>%
  # get rid of the columns we no longer need
  select(-PK, c2)

dat_sep
#>        Name Size       c1    c2 Metric
#> 1 Product 1  750 12 19.95    NA     NA
#> 2 Product 2 1750 6  36.95    NA     NA
#> 3 Product 3 1750        6    NA  69.95
#> 4 Product 4   50       12    NA   1.00
#> 5 Product 5  375        1    NA  13.95
#> 6 Product 6  375        1 13.95  13.95

How can we make the solution more generic?

That's fair! sep is apparently interpreted as a regular expression, so you should be able to use the pattern [\s]+ to match one or more whitespace characters as a single separator (keeping in mind that you need to escape \s again for R's sake):

dat %>% 
  separate(PK, into = c("c1", "c2"), sep = "[\\s]+",
           convert = TRUE, remove = FALSE, fill = "right") %>%
  mutate(Metric = coalesce(Metric, c2)) %>%
  select(-PK, -c2) %>%
  rename(PK = c1)
#>        Name Size PK Metric
#> 1 Product 1  750  1  19.95
#> 2 Product 2 1750  6  36.95
#> 3 Product 3 1750  6  69.95
#> 4 Product 4   50 12   1.00
#> 5 Product 5  375  1  13.95
#> 6 Product 6  375  1  13.95

How's that?

3 Likes

Perfect! Thank you!!!

1 Like