Extracting value from a data frame where column name to extract from is dynamically determined by values in another column

preposterior · September 19, 2018, 3:24pm

I have a data frame where the last column is the name of a column within the data frame that I want to extract from. See this example:

library(tidyverse)
# Sample Data
df = tibble(x = c(1,2,3), y = c(4,5,6), columnIndicator = c("x","x","y"))

df %>%
  mutate(newValue = !!as.name(df$columnIndicator))

## desired value for df$newValue is c(1,2,6), but returns c(1,2,3)

The solution should be simple, but I am in what feels like a mental rut. Thanks for any advice

nwerth · September 19, 2018, 3:57pm

I think the easiest way is to make the data tidier. If each observation has either the x or y value, create observations for both and then filter to the ones you want.

df[["id"]] <- seq_len(nrow(df))

df %>%
  gather("column", "newValue", x, y) %>%
  filter(column == columnIndicator) %>%
  select(-columnIndicator, -column) %>%
  right_join(df, by = "id")
# # A tibble: 3 x 5
#      id newValue     x     y columnIndicator
#   <int>    <dbl> <dbl> <dbl> <chr>
# 1     1        1     1     4 x
# 2     2        2     2     5 x
# 3     3        6     3     6 y

JohnMount · September 19, 2018, 4:05pm

First- that is a brilliant question. Some sort of two-dimensional index is a fundamental question.

I'd be very interested to know if there is a direct (already in R, such as base::diag()) solution.

Here is one solution.

df = data.frame(x = c(1,2,3), 
                y = c(4,5,6), 
                columnIndicator = c("x","x","y"),
                stringsAsFactors = FALSE)

# direct solution
sapply( 
  seq_len(nrow(df)), 
  function(i) { 
    df[i, df$columnIndicator[[i]], drop = TRUE ]
  })
# [1] 1 2 6

# wrapped as a re-usable function solution
two_dim_index <- function(d, idx1, idx2) {
  sapply( 
    seq_len(nrow(d)), 
    function(i) { 
      d[idx1[[i]], idx2[[i]], drop = TRUE ]
    })
}

df$newValue <- two_dim_index(df, 
                             seq_len(nrow(df)), 
                             df$columnIndicator)
print(df)
#   x y columnIndicator newValue
# 1 1 4               x        1
# 2 2 5               x        2
# 3 3 6               y        6

@nwerth's idea of building a composite key to join is also quite sound.

JohnMount · September 19, 2018, 4:31pm

Somebody just taught me the R-idiomatic answer: a composite index using cbind()!

df = data.frame(x = c(1,2,3), 
                y = c(4,5,6), 
                columnIndicator = c("x","x","y"),
                stringsAsFactors = FALSE)

df$newValue <-
  df[cbind(
    seq_len(nrow(df)),
    match(df$columnIndicator, colnames(df))
  )]

print(df)

#   x y columnIndicator newValue
# 1 1 4               x        1
# 2 2 5               x        2
# 3 3 6               y        6

geospacedman · September 19, 2018, 4:36pm

Sadly that doesn't work if df is a tibble because all together now "tibbles are not a drop-in replacement for data frames". Also that code returns a character value for newValue because there's a character column in the data frame, but lop that out and everything is numeric again.

JohnMount · September 19, 2018, 4:48pm

Argh you are, of course, right. And also the one that taught me the solution (with the warning that I stupidly ignored).

Your drop-out method is the fix.

df = data.frame(x = c(1,2,3), 
                y = c(4,5,6), 
                columnIndicator = c("x","x","y"),
                stringsAsFactors = FALSE)

dtmp <- df[, unique(df$columnIndicator), drop = FALSE]
df$newValue <-
  dtmp[cbind(
    seq_len(nrow(dtmp)),
    match(df$columnIndicator, colnames(dtmp))
  )]

str(df)
# 'data.frame':	3 obs. of  4 variables:
# $ x              : num  1 2 3
# $ y              : num  4 5 6
# $ columnIndicator: chr  "x" "x" "y"
# $ newValue       : num  1 2 6

And yes, I think it is fair to consider data.frame and matrix the natural R data structures.

geospacedman · September 19, 2018, 5:14pm

I'm surprised a data.table wizard hasn't been along to contribute some magic yet, but maybe they don't hang out here.

JohnMount · September 19, 2018, 5:35pm

I found data.table does have its own notation for this. Instead of a two-dimensional index it uses a vector of choices.

library("data.table")
df = data.frame(x = c(1,2,3), 
                y = c(4,5,6), 
                columnIndicator = c("x","x","y"),
                stringsAsFactors = FALSE)

dt <- as.data.table(df)

dt[, newValue := dt[[columnIndicator]][.I], by=columnIndicator][]

nwerth · September 19, 2018, 6:12pm

You should be able to use .SD instead of dt and drop the [.I], unless there's some terrible corner case.

JohnMount · September 19, 2018, 6:27pm

There was some discussion on that in the thread. I was trying to figure out which one ended up being preferred. Notationally I like your suggestion.

library("data.table")
df = data.frame(x = c(1,2,3), 
                y = c(4,5,6), 
                columnIndicator = c("x","x","y"),
                stringsAsFactors = FALSE)

dt <- as.data.table(df)

dt[, newValue := .SD[[columnIndicator]], by=columnIndicator][]

preposterior · September 19, 2018, 7:47pm

You are all fantastic!!! I love all the options and I think many of them will find their way into solutions for others. For me, I am a tidyverse purist and hence, selected the solution by @nwerth.

I cannot be grateful enough for this wonderful community with these great contributions from @nwerth @JohnMount and @geospacedman. Thanks!

JohnMount · September 20, 2018, 10:45am

my partner came up with a very good dplyr solution.

Working through how multiple systems do this I think does help people reason about working with data.

library(dplyr)
df = data.frame(x = c(1,2,3), 
                y = c(4,5,6), 
                columnIndicator = c("x","x","y"),
                stringsAsFactors = FALSE)

df %>% group_by(columnIndicator) %>%
  mutate(newValue = .data[[ columnIndicator[[1]] ]]) %>%
  ungroup()

# # A tibble: 3 x 4
#       x     y columnIndicator newValue
#   <dbl> <dbl> <chr>              <dbl>
# 1     1     4 x                      1
# 2     2     5 x                      2
# 3     3     6 y                      6