Apply function to each row in a DF and create a new DF with the outputs

robertm · September 3, 2019, 8:32pm

I have the following dataframe:

df = data.frame(
  a = c(1, 2, 3),
  b = c(4, 5, 6),
  c = c(7, 8, 9)
)

which looks like:

##   a b c
## 1 1 4 7
## 2 2 5 8
## 3 3 6 9

Then I have the following function which expects a dataframe with only 1 row, and it basically returns a new dataframe with just 1 row, similar to the previous one, but with an extra column with the sum of the previous columns.

process_row = function(row) {
  row = row[1,]
  new_col_name = paste("[", paste(colnames(row), collapse = "+"), "]", sep = "")
  row[[new_col_name]] = sum(row[1,])
  return (row)
}

Please, assume that function cannot be changed and we don’t really know how it works inernally (like a black box). This is a simplification of another problem, so this is a requirement.

Let’s check how that function works:

row = df[1,]
row

##   a b c
## 1 1 4 7

row_processed = process_row(row)
row_processed

##   a b c [a+b+c]
## 1 1 4 7      12

Now, my goal is to apply that blackbox function to a dataframe with multiple rows, getting the same output as the following chunk of code:

# BEGIN OF BLOCK TO OPTIMIZE
row_processed = process_row(df[1,])
result = row_processed
for (i in 2:nrow(df)) {
  row_processed = process_row(df[i,])
  result = rbind(result, row_processed)
}
# END OF BLOCK TO OPTIMIZE

Let’s try it:

df

##   a b c
## 1 1 4 7
## 2 2 5 8
## 3 3 6 9

result

##   a b c [a+b+c]
## 1 1 4 7      12
## 2 2 5 8      15
## 3 3 6 9      18

I’m pretty sure we can get this in a more clear way, probably some function on the apply function familiy.

Thanks!

raytong · September 3, 2019, 11:04pm

Hi robertm. rowSums can do the sum of each row.

df[[paste0("[", paste(colnames(df), collapse = "+"), "]")]] <- rowSums(df)

robertm · September 4, 2019, 9:34am

@raytong you didn't use the function: process_row which was intended for you to use.

Thanks!

jcblum · September 4, 2019, 11:34am

Assuming your restrictions are exactly as strict as you have stated, it's good to bear in mind that this sort of operation is bound to be somewhat awkward and inefficient, since R's data frames are lists of columns, internally. That said, here are some examples of how to do this with a for loop, with lapply(), and with purrr::map_dfr().

df = data.frame(
  a = c(1, 2, 3),
  b = c(4, 5, 6),
  c = c(7, 8, 9)
)

process_row = function(row) {
  row = row[1,]
  new_col_name = paste("[", paste(colnames(row), collapse = "+"), "]", sep = "")
  row[[new_col_name]] = sum(row[1,])
  return (row)
}

# For Loop
df_list <- vector(mode = "list", length = nrow(df))
for (r_num in 1:nrow(df)) {
  df_list[[r_num]] <- process_row(df[r_num, ])
}
df_out_loop <- do.call(rbind, df_list)

# lapply
df_list <- lapply(1:nrow(df), function(r_num) { process_row(df[r_num, ]) })
df_out_lapply <- do.call(rbind, df_list)

# purrr::map_dfr
library(purrr)
df_out_map <- map_dfr(
  1:nrow(df),
  ~ process_row(df[.x, ])
)

df_out_loop
#>   a b c [a+b+c]
#> 1 1 4 7      12
#> 2 2 5 8      15
#> 3 3 6 9      18
df_out_lapply
#>   a b c [a+b+c]
#> 1 1 4 7      12
#> 2 2 5 8      15
#> 3 3 6 9      18
df_out_map
#>   a b c [a+b+c]
#> 1 1 4 7      12
#> 2 2 5 8      15
#> 3 3 6 9      18

identical(df_out_loop, df_out_lapply)
#> [1] TRUE
identical(df_out_lapply, df_out_map)
#> [1] TRUE

^{Created on 2019-09-04 by the reprex package (v0.3.0)}

Edited to add:

In the interest of knowing what you're signing up for, a small illustration of the unsatisfying performance of all of the above patterns…

options(width = 100)
library(microbenchmark)

set.seed(42)
df <- data.frame(
  a = sample.int(9L, size = 1000, replace = TRUE),
  b = sample.int(9L, size = 1000, replace = TRUE),
  c = sample.int(9L, size = 1000, replace = TRUE)
)

process_row = function(row) {
  row = row[1,]
  new_col_name = paste("[", paste(colnames(row), collapse = "+"), "]", sep = "")
  row[[new_col_name]] = sum(row[1,])
  return (row)
}

process_row.loop <- function(df) {
  df_list <- vector(mode = "list", length = nrow(df))
  for (r_num in 1:nrow(df)) {
    df_list[[r_num]] <- process_row(df[r_num, ])
  }
  df_out <- do.call(rbind, df_list)
}

process_row.lapply <- function(df) {
  df_list <- lapply(1:nrow(df), function(r_num) { process_row(df[r_num, ]) })
  df_out <- do.call(rbind, df_list)
}

process_row.map <- function(df) {
  df_out <- purrr::map_dfr(
    1:nrow(df),
    ~ process_row(df[.x, ])
  )
}

microbenchmark(process_row.loop(df), process_row.lapply(df), process_row.map(df))
#> Unit: milliseconds
#>                    expr      min       lq     mean   median       uq      max neval
#>    process_row.loop(df) 165.3589 177.6002 188.0373 183.7816 193.7821 237.3687   100
#>  process_row.lapply(df) 165.6829 181.6551 195.8358 191.7518 205.2548 266.0621   100
#>     process_row.map(df) 138.2754 150.5103 164.0445 158.1215 169.7743 295.5430   100

A more flexible process_row() makes a big difference in performance. If the function can operate on a vector instead of a single-row data frame, you gain the option of using apply(), which is dramatically faster than any option requiring row-binding single-row data frames.

process_row2 = function(row) {
  # Arbitrary operations on vector and its elements
  row["a"] * row["b"] + row["c"]^2 / sum(row)
}

process_row2.apply <- function(df) {
  df["processed"] <- apply(df, 1, process_row2)
  df
}
process_row2.apply(head(df, 3))
#>   a b c processed
#> 1 1 7 3  7.818182
#> 2 5 9 9 48.521739
#> 3 1 7 1  7.111111

microbenchmark(process_row2.apply(df))
#> Unit: milliseconds
#>                    expr      min      lq     mean   median       uq      max neval
#>  process_row2.apply(df) 3.857863 3.97732 4.384639 4.039512 4.159503 6.757526   100

^{Created on 2019-09-04 by the reprex package (v0.3.0)}

raytong · September 4, 2019, 11:35am

@robertm If the process_row must be use, try the following script

sapply(1:nrow(df), function(x){process_row(df[x,])}) %>%
  t()

The sapply will simplify the result to table by column and transpose it will do.

An alternative method with no simplify to table and do.call the resulting list by rbind

sapply(1:nrow(df), function(x){process_row(df[x,])}, simplify =  FALSE) %>%
{do.call(rbind, .)}

Scoco · September 11, 2019, 4:34pm

You're correct that the apply family is your friend.
If your data.frame is all numeric, then you can do it with apply on the matrix with a slightly modified version of process_row:

 process_row = function(row) {
     new_col_name = paste("[", paste(names(row), collapse = "+"), "]", sep = "")
     row[[new_col_name]] = sum(row)
     return (row)
   }
R> data.frame(t(apply(as.matrix(df), 1, process_row)), check.names = FALSE)
  a b c [a+b+c]
1 1 4 7      12
2 2 5 8      15
3 3 6 9      18

A similar formulation would work for any data.frame where all columns are the same type so as.matrix() works. Note that within apply each row comes in as a vector, not a 1xn matrix so we need to use names() instead of rownames() if you want to use them in the output.

Scoco · September 11, 2019, 4:39pm

The non-tidyverse version of @raytong's reply would be:

do.call(rbind, sapply(1:nrow(df), function(x) process_row(df[x,]), simplify = FALSE)

or with fewer keystrokes using lapply

do.call(rbind, lapply(1:nrow(df), function(x) process_row(df[x,])))

system · October 2, 2019, 4:39pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.