How to apply function on every 3 columns in dataframe ?

AbhishekHP · March 7, 2019, 2:15pm

How can we apply the function for every 3 columns in dataframe.
Example is as below.
I am not sure how dplyr or map or purrr could be used here to
our motto "Do not repeat yourself"

df <- data.frame(a = runif(10, 2.0, 7.5),
           b = runif(10, 5.0, 9.5),
           c = runif(10, 1.0, 9.5),
           d = runif(10, 2.0, 9.1),
           e = runif(10, 3.0, 9.2),
           f = runif(10, 4.0, 9.3),
           g = runif(10, 5.0, 9.4),
           h = runif(10, 6.0, 9.5),
           i = runif(10, 7.0, 9.6),
           j = runif(10, 8.0, 9.7),
           k = runif(10, 9.0, 9.8),
           l = runif(10, 1.0, 9.9)
           )
df[,1:3] = sweep(df[,1:3],1,rowSums(df[,1:3]),`/`)*100
df[,4:6] = sweep(df[,4:6],1,rowSums(df[,4:6]),`/`)*100
df[,7:9] = sweep(df[,7:9],1,rowSums(df[,7:9]),`/`)*100
df[,10:12] = sweep(df[,10:12],1,rowSums(df[,10:12]),`/`)*100

Can we have a single function to apply sweep function for every 3 columns instead of hardcoding with 3 extra lines as above ?
Thanks in advance

Yarnabrina · March 7, 2019, 3:18pm

Will you allow base R answers? Then, it's an ugly solution:

df <- data.frame(a = runif(10, 2.0, 7.5),
                 b = runif(10, 5.0, 9.5),
                 c = runif(10, 1.0, 9.5),
                 d = runif(10, 2.0, 9.1),
                 e = runif(10, 3.0, 9.2),
                 f = runif(10, 4.0, 9.3),
                 g = runif(10, 5.0, 9.4),
                 h = runif(10, 6.0, 9.5),
                 i = runif(10, 7.0, 9.6),
                 j = runif(10, 8.0, 9.7),
                 k = runif(10, 9.0, 9.8),
                 l = runif(10, 1.0, 9.9))

# main modification
df_mod <- sapply(X = 0:((ncol(x = df) / 3) - 1),
                 FUN = function(t)
                 {
                   temp <- df[, (3 * t) + 1:3]
                   (temp / rowSums(x = temp)) * 100
                 })

# Given in question
df[, 1:3] = sweep(df[, 1:3], 1, rowSums(df[, 1:3]), `/`) * 100
df[, 4:6] = sweep(df[, 4:6], 1, rowSums(df[, 4:6]), `/`) * 100
df[, 7:9] = sweep(df[, 7:9], 1, rowSums(df[, 7:9]), `/`) * 100
df[, 10:12] = sweep(df[, 10:12], 1, rowSums(df[, 10:12]), `/`) * 100

# method 1: using idea from https://stackoverflow.com/a/4227483/11117265
df_1 <- do.call(what = cbind.data.frame,
                args = df_mod)
names(x = df_1) <- names(x = df)
all.equal(target = df_1,
          current = df)
#> [1] TRUE

# method 2
df_2 <- cbind.data.frame(lapply(X = df_mod,
                                FUN = cbind))
names(x = df_2) <- names(x = df)
all.equal(target = df_2,
          current = df)
#> [1] TRUE

^{Created on 2019-03-07 by the reprex package (v0.2.1)}

AbhishekHP · March 7, 2019, 3:23pm

Thanks @Yarnabrina for solution.
Although I struggle to understand how base R functions work.
It would be much easier to understand the working if this was done via map/purrr/dplyr.

Probably, I wait for a day if there is non-base solution to accept an answer. Else, I will definitely accept your solution.

martinjhnhadley · March 7, 2019, 4:28pm

Here's a somewhat flexible way to do this with purrr and dplyr

col_windows_row_sums <- function(data, window_size = 3){
  
  col_indices <- 1:ncol(data)
  
  split(col_indices, ceiling(seq_along(col_indices)/window_size)) %>%
    map(~{
      data[,.x] %>%
        as_tibble() %>%
        mutate(row_sum = rowSums(.)) %>%
        mutate_all(vars(. / row_sum)) %>%
        select(-row_sum)
    }) %>%
    bind_cols() %>%
    set_names(colnames(data))
  
}

df %>%
  col_windows_row_sums()

This makes use of a lovely method for chunking vectors from here:

AbhishekHP · March 7, 2019, 4:38pm

Thanks for solution, @martinjhnhadley.
In order to display as percentage, I multiplied here by 100
but it seems the value obtained is truncated to 2 digit.
For example: First element is 27.17072 but even with this adaptation
27.2 is displayed.
I am not sure where does this floating point conversion could be added.

mutate_all(vars(. / row_sum * 100))

martinjhnhadley · March 7, 2019, 4:45pm

Ah, I did forget to * 100. The reason you're seeing only 27.1 is because tibble tries to be clever in what it shows you. Let's prove no rounding has been done as follows:

df %>%
  col_windows_row_sums(window_size = 3) %>%
  pluck(1)

The package that controls how tibbles are displayed is called pillar and is documented here https://pillar.r-lib.org/

AbhishekHP · March 7, 2019, 4:50pm

pluck(1) will work perfectly but only 1st row will be with floating point.
Is there a way, we can incorporate the pluck within function. So that, Complete dataframe is in floating point ?
tried:

  split(col_indices, ceiling(seq_along(col_indices)/window_size)) %>%
    map(~{
      df[,.x] %>%
        as_tibble() %>% **pluck(1)** %>%

But error

martinjhnhadley · March 7, 2019, 4:56pm

Ah, I understand the confusion here. pluck is a utility function for extracting columns from tibbles, it's useful because otherwise we need to use the ugly looking .[[1]]

df %>%
  col_windows_row_sums(window_size = 3) %>%
  select(a) %>%
  .[[1]]

df %>%
  col_windows_row_sums(window_size = 3) %>%
  pluck(1)

The thing that's shown in your console is the printed version of the tibble, which hides the decimals after 0.1 for your convenience. If we use View() you'll see the numbers haven't really been rounded:

df %>%
  col_windows_row_sums(window_size = 3) %>%
  View()

You could convert this into a data.frame which doesn't do this fancy business:

df %>%
  col_windows_row_sums(window_size = 3) %>%
  data.frame()

However, I would not recommend it. tibbles are better than data.frames.

AbhishekHP · March 7, 2019, 4:59pm

Perfect @martinjhnhadley
Excellent solution
Hat's off

martinjhnhadley · March 7, 2019, 5:00pm

You're very welcome! This was a fun thing for me to code up, and I'm pleased to help you with the quite confusing appearance of tibbles. Happy coding

AbhishekHP · March 11, 2019, 12:42pm

Typo in the above code:

  col_indices <- 1:ncol(data)
  
  split(col_indices, ceiling(seq_along(col_indices)/window_size)) %>%
    map(~{
      data[,.x] %>% # Not df instead it is data as it is the argument to the function

nathania · March 11, 2019, 8:45pm

I really like @martinjhnhadley's approach! I'll build on it to provide a solution that doesn't use bind_cols(), because my personal experience has been that it makes my code harder to maintain.

df <- mutate(df, row_id = row_number()) %>% 
  gather('col_name', 'value', -row_id) %>% 
  add_count(rep(groups, each = values_per_group), row_id,
            wt = value, 
            name = 'group_rowsum') %>% 
  transmute(row_id, 
            col_name, 
            value = 100 * value / group_rowsum) %>% 
  spread(col_name, value) %>% 
  select(-row_id)

The sum you're calculating involves two indices: one for the rows, and one for the column groups. This pipeline avoids explicit row-wise operations by creating those indices and using them as grouping variables in add_count(), which does the summation through wt = value.

Full reprex below. Cheers!

library(tidyverse)
set.seed(123)

n_rows <- 10
args <- list(c(2.0, 7.5), c(5.0, 9.5), c(1.0, 9.5), c(2.0, 9.1), c(3.0, 9.2), c(4.0, 9.3), c(5.0, 9.4), c(6.0, 9.5), c(7.0, 9.6), c(8.0, 9.7), c(9.0, 9.8), c(1.0, 9.9)) %>% 
  map(set_names, c('min', 'max'))
df <- map(args, ~ exec(runif, !!!.x, n = n_rows)) %>% 
  set_names(letters[1:length(args)]) %>% 
  as_tibble()

size <- 3
groups <- 1:ceiling(ncol(df)/size)
values_per_group <- n_rows*size

df <- mutate(df, row_id = row_number()) %>% 
  gather('col_name', 'value', -row_id) %>% 
  add_count(rep(groups, each = values_per_group), row_id,
            wt = value, 
            name = 'group_rowsum') %>% 
  transmute(row_id, 
            col_name, 
            value = 100 * value / group_rowsum) %>% 
  spread(col_name, value) %>% 
  select(-row_id)

^{Created on 2019-03-11 by the reprex package (v0.2.1)}

system · March 18, 2019, 8:45pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.