How do I assign a vector to a subset of rows of a dataframe using the tidyverse?

Andrea · March 27, 2020, 2:42pm

I'm very surprised that there isn't a simple way (or at least, I couldn't find one through the usual channels) to assign a vector to a noncontiguous subset of rows in a dataframe, using dplyr. Example:

RMSE_table <- data.frame(model = rep(c("exponential", "polynomial"), times = 4), 
                         type = rep(c("cumulative","new"), each = 4), 
                         data = rep(rep(c("cases","deaths"), each = 2), times = 2), 
                         RMSE = numeric(8))

x <- c(1,3)

This (w/o dplyr) works, and it's quite readable, even though I could do without the repetitions:

> RMSE_table[RMSE_table$model == "exponential" & RMSE_table$type == "cumulative", "RMSE"] <- x
> RMSE_table
        model       type   data RMSE
1 exponential cumulative  cases    1
2  polynomial cumulative  cases    0
3 exponential cumulative deaths    3
4  polynomial cumulative deaths    0
5 exponential        new  cases    0
6  polynomial        new  cases    0
7 exponential        new deaths    0
8  polynomial        new deaths    0

This doesn't:

> library(dplyr)
> library(magrittr)
> RMSE_table %<>%
+ mutate(RMSE = ifelse(model == "exponential" & type == "cumulative", x, RMSE))
> RMSE_table
        model       type   data RMSE
1 exponential cumulative  cases    1
2  polynomial cumulative  cases    0
3 exponential cumulative deaths    1
4  polynomial cumulative deaths    0
5 exponential        new  cases    0
6  polynomial        new  cases    0
7 exponential        new deaths    0
8  polynomial        new deaths    0

Am I missing something obvious?

Yarnabrina · March 27, 2020, 5:13pm

If I understand the question correctly, this is not a problem of dplyr.

While you use this: RMSE_table[RMSE_table$model == "exponential" & RMSE_table$type == "cumulative", "RMSE"] <- x, you are replacing a vector (subset of a larger vector) with another shorter vector x, which will be replicated as required to match the length. Since here its length matches exactly the length of the subset, it seems to work perfectly.

However, when you use mutate(RMSE = ifelse(model == "exponential" & type == "cumulative", x, RMSE)), you are replacing the existing vector with another one created by ifelse. The way ifelse works, it has replicated x to match the length of RMSE. So, in the positions where the evaluation turns out to be TRUE, it so happens that both are in odd position and hence the first element of x is used at the corresponding positions.

You can use replace in this case:

> library(dplyr)
> RMSE_table <- tibble(model = rep(x = c("exponential", "polynomial"), times = 4), 
+                      type = rep(x = c("cumulative","new"), each = 4), 
+                      data = rep(x = c("cases","deaths"), each = 2, times = 2), 
+                      RMSE = numeric(length = 8))
> RMSE_table %>%
+     mutate(RMSE = replace(x = RMSE,
+                           list = (model == "exponential" & type == "cumulative"),
+                           values = c(1, 3)))
# A tibble: 8 x 4
  model       type       data    RMSE
  <chr>       <chr>      <chr>  <dbl>
1 exponential cumulative cases      1
2 polynomial  cumulative cases      0
3 exponential cumulative deaths     3
4 polynomial  cumulative deaths     0
5 exponential new        cases      0
6 polynomial  new        cases      0
7 exponential new        deaths     0
8 polynomial  new        deaths     0

Hope this helps.

dromano · March 27, 2020, 5:23pm

Hi @Andrea: Here's another possibility:

RMSE_table %>% 
  group_by(model, type) %>% 
  mutate(
    RMSE =
      case_when(
        model == "exponential" & type == "cumulative" ~ x,
        TRUE ~ RMSE
      ))

Andrea · March 27, 2020, 8:32pm

This works! But could you explain me why group_by is needed? I actually tried case_when as a first attempt, but it didn't work by itself. It looks like you need to combine it with group_by in order to make this work.

Andrea · March 27, 2020, 8:38pm

This works flawlessly! So, why does replace work here? If I understand correctly, it takes the vector RMSE, it finds the indices in RMSE corresponding to the content of list (which in this case is a logical vector) and then it substitutes the corresponding values in RMSE with the values in values, recyclying values if needed. Correct?

dromano · March 28, 2020, 3:19am

Thanks for the interesting puzzle, @Andrea!

You can think of each ~ expression in case_when() as applying replace() to a logical vector:

x <- c(1, 3)
bool  <- sample(c(T, F), 8, replace = T)
bool
#> [1]  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE

replace(bool, bool, which(bool))
#> [1] 1 0 3 0 0 6 7 8
case_when(bool ~ 1:8)
#> [1]  1 NA  3 NA NA  6  7  8

with the difference that case_when() --- as a 'tidy' function --- requires that the replacement vector be of the same length as bool (or of length 1, and then recycled to the length of bool).

In the case of RMSE_table, since the replacement vector, x, is of length 2, the group_by() command is necessary so that the vector

model == "exponential" & type == "cumulative"

is also of length 2, since each subtable created by group_by(model, type) has exactly two rows:

library(tidyverse)
RMSE_table <- 
  data.frame(
    model = rep(c("exponential", "polynomial"), times = 4), 
    type = rep(c("cumulative","new"), each = 4), 
    data = rep(rep(c("cases","deaths"), each = 2), times = 2), 
    RMSE = numeric(8)
  )

RMSE_table %>% 
  count(model, type)
#> # A tibble: 4 x 3
#>   model       type           n
#>   <fct>       <fct>      <int>
#> 1 exponential cumulative     2
#> 2 exponential new            2
#> 3 polynomial  cumulative     2
#> 4 polynomial  new            2

@Yarnabrina's replace() solution is more general, but doesn't work if the logical vector contains an NA:

RMSE_table %>% 
  add_row(RMSE = 0)
#>         model       type   data RMSE
#> 1 exponential cumulative  cases    0
#> 2  polynomial cumulative  cases    0
#> 3 exponential cumulative deaths    0
#> 4  polynomial cumulative deaths    0
#> 5 exponential        new  cases    0
#> 6  polynomial        new  cases    0
#> 7 exponential        new deaths    0
#> 8  polynomial        new deaths    0
#> 9        <NA>       <NA>   <NA>    0

RMSE_table %>% 
  add_row(RMSE = 0) %>% 
  mutate(
    RMSE =
      replace(x = RMSE,                         
              list = model == "exponential" & type == "cumulative",
              values = c(1, 3)
      )
  )     
#> Error in x[list] <- values: NAs are not allowed in subscripted assignments

A more general version of the solution I posted is possible if the issues caused by the length of x can be circumvented:

RMSE_table %>% 
 # add_row(RMSE = 0) %>%
 group_by(model, type) %>%
 mutate(
   RMSE =
     case_when(
       model == "exponential" & type == "cumulative" ~ 
         x[1:n()],
       TRUE ~ RMSE
     )
 )
#> # A tibble: 8 x 4
#> # Groups:   model, type [4]
#>   model       type       data    RMSE
#>   <fct>       <fct>      <fct>  <dbl>
#> 1 exponential cumulative cases      1
#> 2 polynomial  cumulative cases      0
#> 3 exponential cumulative deaths     3
#> 4 polynomial  cumulative deaths     0
#> 5 exponential new        cases      0
#> 6 polynomial  new        cases      0
#> 7 exponential new        deaths     0
#> 8 polynomial  new        deaths     0

RMSE_table %>% 
 add_row(RMSE = 0) %>%
 group_by(model, type) %>%
 mutate(
   RMSE =
     case_when(
       model == "exponential" & type == "cumulative" ~ 
         x[1:n()],
       TRUE ~ RMSE
     )
 )
#> Warning: Factor `model` contains implicit NA, consider using
#> `forcats::fct_explicit_na`
#> # A tibble: 9 x 4
#> # Groups:   model, type [5]
#>   model       type       data    RMSE
#>   <fct>       <fct>      <fct>  <dbl>
#> 1 exponential cumulative cases      1
#> 2 polynomial  cumulative cases      0
#> 3 exponential cumulative deaths     3
#> 4 polynomial  cumulative deaths     0
#> 5 exponential new        cases      0
#> 6 polynomial  new        cases      0
#> 7 exponential new        deaths     0
#> 8 polynomial  new        deaths     0
#> 9 <NA>        <NA>       <NA>       0

^{Created on 2020-03-27 by the reprex package (v0.3.0)}

Andrea · April 4, 2020, 3:19am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.