Use function with dataframe in mutate


#1

I need to split a string in a dataframe at the last space before character 25.
I built this function, which works for a single text value but does not work in mutate.
What am I missing?

# Example 
# HLF 11/7/17
# Goal: Split text string at last space before char 25
#
suppressWarnings(library(tidyverse))
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats
suppressWarnings(library(stringr))
suppressWarnings(library(reprex))
splitdesc <- function(t){
  t1 <- str_sub(t, 1, 25)
  x1 <- str_locate_all(t1, " ") # get positions of all spaces
  x1df <- as.data.frame(x1) # convert to data frame to find nrow
  r1 <- nrow(x1df) # get position of last row
  s1 <- x1df[r1,1] # get value from last row
  desc1 <- str_trim(str_sub(t, 1, s1))
  return(desc1)
}
#
test1 <- c("now is the time for all good men to come...")
splitdesc(test1)  # this works
#> [1] "now is the time for all"
#
test2 <- tibble(CODE = c("a", "b"),
                DESC = c("apple peaches pumpkin pie for Thanksgiving",
                         "the quick brown fox jumped over the lazy dog"))
test3 <- mutate(test2, DESC3 = splitdesc(DESC)) # this does not work
#> Error in mutate_impl(.data, dots): Evaluation error: arguments imply differing number of rows: 3, 4.

#2

Note that if you run your command outside of mutate, as splitdesc(test2$DESC), you’ll get the same error. This indicates that the problem comes from the changed input (a vector of two strings instead of a single string), not from mutate itself.

Let’s try debugging the function. One way is to run the code outside of the function:

t <- test2$DESC
t1 <- str_sub(t, 1, 25)
x1 <- str_locate_all(t1, " ") # get positions of all spaces
x1df <- as.data.frame(x1) # convert to data frame to find nrow
# Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
#   arguments imply differing number of rows: 3, 4

What does x1 look like at that point?

[[1]]
     start end
[1,]     6   6
[2,]    14  14
[3,]    22  22

[[2]]
     start end
[1,]     4   4
[2,]    10  10
[3,]    16  16
[4,]    20  20
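
To see why the single-string case worked, it can help to compare the two inputs side by side. This is only a sketch, assuming stringr is loaded; the names one and two are made up for illustration.

```r
library(stringr)

# One string: str_locate_all() still returns a list, but it holds a single
# matrix, so as.data.frame() happens to succeed.
one <- str_locate_all(
  str_sub("now is the time for all good men to come...", 1, 25),
  " "
)
length(one)               # list of length 1
nrow(as.data.frame(one))  # one row per space found

# Two strings: a list of two matrices with different row counts, which
# as.data.frame() cannot combine into a single data frame.
two <- str_locate_all(
  str_sub(c("apple peaches pumpkin pie for Thanksgiving",
            "the quick brown fox jumped over the lazy dog"), 1, 25),
  " "
)
length(two)        # list of length 2
sapply(two, nrow)  # rows per matrix: 3 and 4, as in the error message
```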

So, str_locate_all produces a list, with one matrix of positions per input string. Since you want to get the last row value from each item in the list, purrr::map_int is a good step:

s1 <- map_int(x1, ~ .x[nrow(.x), 1])

That will get you the vector that you’re after.
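
Putting that line back into the original function might look like the sketch below. The name splitdesc_vec is hypothetical, and the sketch assumes every string has at least one space in its first 25 characters; if one does not, map_int() would get a zero-length result for that element and fail.

```r
library(stringr)
library(purrr)
library(dplyr)
library(tibble)

# splitdesc_vec is a hypothetical name for a vectorised variant of splitdesc.
splitdesc_vec <- function(t) {
  t1 <- str_sub(t, 1, 25)
  x1 <- str_locate_all(t1, " ")         # list: one matrix of space positions per string
  s1 <- map_int(x1, ~ .x[nrow(.x), 1])  # start position of the last space in each matrix
  str_trim(str_sub(t, 1, s1))           # str_sub() is vectorised over its end argument
}

test2 <- tibble(
  CODE = c("a", "b"),
  DESC = c("apple peaches pumpkin pie for Thanksgiving",
           "the quick brown fox jumped over the lazy dog")
)

test3 <- mutate(test2, DESC3 = splitdesc_vec(DESC))
test3$DESC3  # "apple peaches pumpkin", "the quick brown fox"
```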

On the other hand, I tend to like regexes, so I would likely replace splitdesc with something like this:

splitdesc2 <- function(inp) {
  str_extract(inp, "^.{1,25} ") %>%
    str_trim()
}

The regex "^.{1,25} " matches the longest run of at most 25 characters from the start of the string that is followed by a space; str_trim() then removes that trailing space.
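
One detail worth checking on the thread's own data: with "^.{1,25} " the matched space may be character 26, whereas the original splitdesc only looks for spaces within the first 25 characters. The sketch below compares the pattern from this reply with an assumed variant, "^.{1,24} ", which keeps the space inside the first 25 characters. The variable names are illustrative only.

```r
library(stringr)

desc <- c("apple peaches pumpkin pie for Thanksgiving",
          "the quick brown fox jumped over the lazy dog")

# Pattern from the reply: up to 25 characters, then a space.
res25 <- str_trim(str_extract(desc, "^.{1,25} "))
res25  # "apple peaches pumpkin pie", "the quick brown fox"

# Assumed variant: up to 24 characters, then a space, so the space itself
# falls within the first 25 characters, as in the original splitdesc.
res24 <- str_trim(str_extract(desc, "^.{1,24} "))
res24  # "apple peaches pumpkin", "the quick brown fox"
```

Either pattern returns NA for a string with no space in range, since str_extract() gives NA when nothing matches, so no special handling is needed inside mutate.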


#3

Thanks Nick…this works perfectly! I appreciate both options to solve the problem. I think I knew it wasn’t really a problem in mutate, but that was where it appeared in my process. I’m going to keep learning more about purrr. This solves a problem that has racked my beginner brain for days.