How to apply function on every 3 columns in dataframe ?

How can we apply the function for every 3 columns in dataframe.
Example is as below.
I am not sure how dplyr or map or purrr could be used here to
our motto "Do not repeat yourself"

df <- data.frame(a = runif(10, 2.0, 7.5),
           b = runif(10, 5.0, 9.5),
           c = runif(10, 1.0, 9.5),
           d = runif(10, 2.0, 9.1),
           e = runif(10, 3.0, 9.2),
           f = runif(10, 4.0, 9.3),
           g = runif(10, 5.0, 9.4),
           h = runif(10, 6.0, 9.5),
           i = runif(10, 7.0, 9.6),
           j = runif(10, 8.0, 9.7),
           k = runif(10, 9.0, 9.8),
           l = runif(10, 1.0, 9.9)
           )
df[,1:3] = sweep(df[,1:3],1,rowSums(df[,1:3]),`/`)*100
df[,4:6] = sweep(df[,4:6],1,rowSums(df[,4:6]),`/`)*100
df[,7:9] = sweep(df[,7:9],1,rowSums(df[,7:9]),`/`)*100
df[,10:12] = sweep(df[,10:12],1,rowSums(df[,10:12]),`/`)*100

Can we have a single function to apply sweep function for every 3 columns instead of hardcoding with 3 extra lines as above ?
Thanks in advance

1 Like

Will you allow base R answers? Then, it's an ugly solution:

df <- data.frame(a = runif(10, 2.0, 7.5),
                 b = runif(10, 5.0, 9.5),
                 c = runif(10, 1.0, 9.5),
                 d = runif(10, 2.0, 9.1),
                 e = runif(10, 3.0, 9.2),
                 f = runif(10, 4.0, 9.3),
                 g = runif(10, 5.0, 9.4),
                 h = runif(10, 6.0, 9.5),
                 i = runif(10, 7.0, 9.6),
                 j = runif(10, 8.0, 9.7),
                 k = runif(10, 9.0, 9.8),
                 l = runif(10, 1.0, 9.9))

# main modification
df_mod <- sapply(X = 0:((ncol(x = df) / 3) - 1),
                 FUN = function(t)
                 {
                   temp <- df[, (3 * t) + 1:3]
                   (temp / rowSums(x = temp)) * 100
                 })

# Given in question
df[, 1:3] = sweep(df[, 1:3], 1, rowSums(df[, 1:3]), `/`) * 100
df[, 4:6] = sweep(df[, 4:6], 1, rowSums(df[, 4:6]), `/`) * 100
df[, 7:9] = sweep(df[, 7:9], 1, rowSums(df[, 7:9]), `/`) * 100
df[, 10:12] = sweep(df[, 10:12], 1, rowSums(df[, 10:12]), `/`) * 100

# method 1: using idea from https://stackoverflow.com/a/4227483/11117265
df_1 <- do.call(what = cbind.data.frame,
                args = df_mod)
names(x = df_1) <- names(x = df)
all.equal(target = df_1,
          current = df)
#> [1] TRUE

# method 2
df_2 <- cbind.data.frame(lapply(X = df_mod,
                                FUN = cbind))
names(x = df_2) <- names(x = df)
all.equal(target = df_2,
          current = df)
#> [1] TRUE

Created on 2019-03-07 by the reprex package (v0.2.1)

Thanks @Yarnabrina for solution.
Although I struggle to understand how base R functions work.
It would be much easier to understand the working if this was done via map/purrr/dplyr.

Probably, I wait for a day if there is non-base solution to accept an answer. Else, I will definitely accept your solution.

Here's a somewhat flexible way to do this with purrr and dplyr

col_windows_row_sums <- function(data, window_size = 3){
  
  col_indices <- 1:ncol(data)
  
  split(col_indices, ceiling(seq_along(col_indices)/window_size)) %>%
    map(~{
      data[,.x] %>%
        as_tibble() %>%
        mutate(row_sum = rowSums(.)) %>%
        mutate_all(vars(. / row_sum)) %>%
        select(-row_sum)
    }) %>%
    bind_cols() %>%
    set_names(colnames(data))
  
}

df %>%
  col_windows_row_sums()

This makes use of a lovely method for chunking vectors from here:

2 Likes

Thanks for solution, @martinjhnhadley.
In order to display as percentage, I multiplied here by 100
but it seems the value obtained is truncated to 2 digit.
For example: First element is 27.17072 but even with this adaptation
27.2 is displayed.
I am not sure where does this floating point conversion could be added.

mutate_all(vars(. / row_sum * 100))

Ah, I did forget to * 100. The reason you're seeing only 27.1 is because tibble tries to be clever in what it shows you. Let's prove no rounding has been done as follows:

df %>%
  col_windows_row_sums(window_size = 3) %>%
  pluck(1)

The package that controls how tibbles are displayed is called pillar and is documented here https://pillar.r-lib.org/

2 Likes

pluck(1) will work perfectly but only 1st row will be with floating point.
Is there a way, we can incorporate the pluck within function. So that, Complete dataframe is in floating point ?
tried:

  split(col_indices, ceiling(seq_along(col_indices)/window_size)) %>%
    map(~{
      df[,.x] %>%
        as_tibble() %>% **pluck(1)** %>%

But error

Ah, I understand the confusion here. pluck is a utility function for extracting columns from tibbles, it's useful because otherwise we need to use the ugly looking .[[1]]

df %>%
  col_windows_row_sums(window_size = 3) %>%
  select(a) %>%
  .[[1]]

df %>%
  col_windows_row_sums(window_size = 3) %>%
  pluck(1)

The thing that's shown in your console is the printed version of the tibble, which hides the decimals after 0.1 for your convenience. If we use View() you'll see the numbers haven't really been rounded:

df %>%
  col_windows_row_sums(window_size = 3) %>%
  View()

You could convert this into a data.frame which doesn't do this fancy business:

df %>%
  col_windows_row_sums(window_size = 3) %>%
  data.frame()

However, I would not recommend it. tibbles are better than data.frames.

2 Likes

Perfect @martinjhnhadley
Excellent solution :slight_smile:
Hat's off

1 Like

You're very welcome! This was a fun thing for me to code up, and I'm pleased to help you with the quite confusing appearance of tibbles. Happy coding

1 Like

Typo in the above code:

  col_indices <- 1:ncol(data)
  
  split(col_indices, ceiling(seq_along(col_indices)/window_size)) %>%
    map(~{
      data[,.x] %>% # Not df instead it is data as it is the argument to the function
1 Like

I really like @martinjhnhadley's approach! I'll build on it to provide a solution that doesn't use bind_cols(), because my personal experience has been that it makes my code harder to maintain.

df <- mutate(df, row_id = row_number()) %>% 
  gather('col_name', 'value', -row_id) %>% 
  add_count(rep(groups, each = values_per_group), row_id,
            wt = value, 
            name = 'group_rowsum') %>% 
  transmute(row_id, 
            col_name, 
            value = 100 * value / group_rowsum) %>% 
  spread(col_name, value) %>% 
  select(-row_id)

The sum you're calculating involves two indices: one for the rows, and one for the column groups. This pipeline avoids explicit row-wise operations by creating those indices and using them as grouping variables in add_count(), which does the summation through wt = value.

Full reprex below. Cheers!

library(tidyverse)
set.seed(123)

n_rows <- 10
args <- list(c(2.0, 7.5), c(5.0, 9.5), c(1.0, 9.5), c(2.0, 9.1), c(3.0, 9.2), c(4.0, 9.3), c(5.0, 9.4), c(6.0, 9.5), c(7.0, 9.6), c(8.0, 9.7), c(9.0, 9.8), c(1.0, 9.9)) %>% 
  map(set_names, c('min', 'max'))
df <- map(args, ~ exec(runif, !!!.x, n = n_rows)) %>% 
  set_names(letters[1:length(args)]) %>% 
  as_tibble()

size <- 3
groups <- 1:ceiling(ncol(df)/size)
values_per_group <- n_rows*size

df <- mutate(df, row_id = row_number()) %>% 
  gather('col_name', 'value', -row_id) %>% 
  add_count(rep(groups, each = values_per_group), row_id,
            wt = value, 
            name = 'group_rowsum') %>% 
  transmute(row_id, 
            col_name, 
            value = 100 * value / group_rowsum) %>% 
  spread(col_name, value) %>% 
  select(-row_id)

Created on 2019-03-11 by the reprex package (v0.2.1)

3 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.