function for looking up value pairs in data frame

mkr · October 20, 2019, 7:07pm

Hi all,

I recently got into a problem with tidying up data after scraping a table off the web.

The basic problem is illustrated in the test_df

test_df <- tibble(
  x = c(NA, NA, "foobar", "foobar"),
  y = c("foobar", "foobar", "bar","bar"),
  z = c("bar", "foo", NA, NA)
)

# A tibble: 4 x 3
  x      y      z    
  <chr>  <chr>  <chr>
1 NA     foobar bar  
2 NA     foobar foo  
3 foobar bar    NA   
4 foobar bar    NA

I basically needed to add a new column named "foobar" with the values which occur right next to the "foobar" entry.

The desired output is:

# A tibble: 4 x 4
  x      y      z     foobar
  <chr>  <chr>  <chr> <chr> 
1 NA     foobar bar   bar   
2 NA     foobar foo   foo   
3 foobar bar    NA    bar   
4 foobar bar    NA    bar

I managed to do this with the following function:

select_row <- function(df, term){
  ifelse(!is.na(df$x) & df$x == term, df$y, 
         ifelse(!is.na(df$y) & df$y == term, df$z, NA))}

and then using mutate from dplyr:

test_df %>% mutate(foobar = select_row(., "foobar"))

This gives the desired output, but the function select_row() is not pretty. Firstly, the column names are hard-coded. Secondly, if the data frame is large and we expect the value pair to be in any column, the ifelse statement in the function would become really large as well.

Does anybody have an idea how this function could be written in a more elegant and reusable way, i.e. without a large ifelse-statement and without having the column names hard-coded.

Any input would be highly appreciated.

cderv · October 20, 2019, 7:56pm

This is how I would do it.

library(tidyverse)
test_df <- tibble(
  x = c(NA, NA, "foobar", "foobar"),
  y = c("foobar", "foobar", "bar","bar"),
  z = c("bar", "foo", NA, NA)
)

# create a function that works on a vec
next_term <- function(vec, term) {
  vec[pmin(which(vec == term) + 1, length(vec))]
}

next_term(as_vector(test_df[1, ]), "foobar")
#>     z 
#> "bar"
next_term(as_vector(test_df[3, ]), "foobar")
#>     y 
#> "bar"

# apply on each row using pmap and use lift_vd to handle dots    
test_df %>%
  mutate(foobar = pmap_chr(., lift_vd(next_term, term = "foobar")))
#> # A tibble: 4 x 4
#>   x      y      z     foobar
#>   <chr>  <chr>  <chr> <chr> 
#> 1 <NA>   foobar bar   bar   
#> 2 <NA>   foobar foo   foo   
#> 3 foobar bar    <NA>  bar   
#> 4 foobar bar    <NA>  bar

^{Created on 2019-10-20 by the reprex package (v0.3.0)}

I think first trying to create a function that works on a vector then apply it on each row using pmap is often a good way to deal with this kind of problem

andresrcs · October 20, 2019, 10:02pm

Another option

library(tidyverse)

test_df <- tibble(
    x = c(NA, NA, "foobar", "foobar"),
    y = c("foobar", "foobar", "bar","bar"),
    z = c("bar", "foo", NA, NA)
)

test_df %>%
    unite(foobar, everything(), remove = FALSE, sep = " ", na.rm = TRUE) %>% 
    mutate(foobar = str_extract(foobar, "(?<=foobar\\s)[^\\s]+(?=\\b)"))
#> # A tibble: 4 x 4
#>   foobar x      y      z    
#>   <chr>  <chr>  <chr>  <chr>
#> 1 bar    <NA>   foobar bar  
#> 2 foo    <NA>   foobar foo  
#> 3 bar    foobar bar    <NA> 
#> 4 bar    foobar bar    <NA>

system · November 10, 2019, 10:02pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.