modelr::permute dataframe output

applesauce · April 5, 2019, 1:49am

I'm not sure whether modelr::permute was designed for this (if not, I'd be curious to hear about similar functions), but I was hoping to turn the entries of the 'perm' column into actual data frames. I'd like to compile these permuted data frames into a single data frame, rather than run statistics on each permutation separately. I had considered looping over the 'sample' function, but that could in theory produce duplicate permutations.

Thanks!

library(dplyr)

# An example of the current output. I'm hoping to extract the values from the 'perm' column one way or another.
mtcars %>%
  mutate(am = factor(am)) %>% # convert am to a factor
  modelr::permute(8, am)      # permute the am column 8 times

# A tibble: 8 x 2
#  perm              .id  
#  <list>            <chr>
# 1 <S3: permutation> 1    
# 2 <S3: permutation> 2    
# 3 <S3: permutation> 3    
# 4 <S3: permutation> 4    
# 5 <S3: permutation> 5    
# 6 <S3: permutation> 6    
# 7 <S3: permutation> 7    
# 8 <S3: permutation> 8
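
On the worry about duplicates when looping over sample(): one way to guard against them is to key each draw by its index vector and redraw on a collision. Below is a minimal base-R sketch of that idea. The names unique_perms and perm_id are made up for illustration and are not part of modelr.

```r
# Hypothetical helper (not part of modelr): draw n distinct permutations
# of the row indices 1..n_rows, redrawing whenever a duplicate appears.
unique_perms <- function(n_rows, n) {
  # Only factorial(n_rows) distinct permutations exist.
  stopifnot(n <= factorial(n_rows))
  seen <- character(0)
  perms <- vector("list", n)
  i <- 1
  while (i <= n) {
    idx <- sample.int(n_rows)
    key <- paste(idx, collapse = ",")
    if (!(key %in% seen)) {  # keep only permutations not drawn before
      seen <- c(seen, key)
      perms[[i]] <- idx
      i <- i + 1
    }
  }
  perms
}

set.seed(1)
idx_list <- unique_perms(nrow(mtcars), 8)

# Stack the permuted copies, tagging each with its permutation number.
big_df <- do.call(rbind, lapply(seq_along(idx_list), function(i) {
  out <- mtcars
  out$am <- mtcars$am[idx_list[[i]]]
  out$perm_id <- i
  out
}))
```

With 32 rows there are 32! possible orderings, so a collision among a few thousand draws is very unlikely in practice; the check mainly matters for tiny datasets. Note that distinct index permutations can still give identical am columns, since am only takes the values 0 and 1.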

gueyenono · April 5, 2019, 4:50am

Hey @applesauce,

Welcome to our wonderful community. It's good to have you here.

Would you please elaborate a bit more on why you would like to combine all the permutations into a single data frame? I am going to show you a hack below to achieve just that, but I would like to understand so I can advise you on whether your approach is ideal.

library(modelr)
library(purrr)
library(dplyr)

set.seed(123)

df <- mtcars %>%
  modelr::permute(8, am)

big_df <- df$perm %>% 
  map_df(function(x){
    # x$idx holds the permuted row indices, so use it to reorder am
    x$data %>%
      mutate(am = am[x$idx])
  })

head(big_df)

   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 21.0   6  160 110 3.90 2.620 16.46  0  0    4    4
2 21.0   6  160 110 3.90 2.875 17.02  0  0    4    4
3 22.8   4  108  93 3.85 2.320 18.61  1  0    4    1
4 21.4   6  258 110 3.08 3.215 19.44  1  1    3    1
5 18.7   8  360 175 3.15 3.440 17.02  0  1    3    2
6 18.1   6  225 105 2.76 3.460 20.22  1  1    3    1

tail(big_df)

     mpg cyl  disp  hp drat    wt qsec vs am gear carb
251 26.0   4 120.3  91 4.43 2.140 16.7  0  0    5    2
252 30.4   4  95.1 113 3.77 1.513 16.9  1  0    5    2
253 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
254 19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
255 15.0   8 301.0 335 3.54 3.570 14.6  0  0    5    8
256 21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2

Hope this helps.

applesauce · April 5, 2019, 5:02am

Haha, I knew I loved the R community. More specifically, I have some time series data. You can think of it as a set of 48 different intervals, multiplied by 13 different people, and multiplied again by 2 because each interval can have one of two types of stimuli (48 x 13 x 2 = 1248). It's hard to post example code because it's part of an ongoing study, though I can if necessary.

I've taken the FFT (fast Fourier transform) of this data, and now I'm trying to create several thousand of what are essentially null hypotheses: take my original data and scramble the relationship between accuracy (our dependent variable) and the specific interval in the time series. I then want to find the FFT for each of these several thousand null hypotheses, and finally calculate the mean amplitude for each of the 48 intervals across them, which I'll use as a comparison to my original data for statistical analysis.

Please do let me know if I should provide some actual code, because I know that's usually the most helpful. Thanks so much!
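
A rough base-R sketch of the pipeline described above, with simulated numbers standing in for the study data. Everything here is an assumption for illustration: the variable names, the single 48-point accuracy series, and the choice of 2000 permutations. The real design (13 people, two stimulus types) would presumably need the shuffle done separately within each person and stimulus type.

```r
set.seed(42)

# Made-up stand-in for the real data: one accuracy value per interval.
n_intervals <- 48
accuracy <- rnorm(n_intervals, mean = 0.75, sd = 0.05)

# Amplitude spectrum of a series (one value per FFT bin).
amplitude <- function(x) Mod(fft(x))

observed_amp <- amplitude(accuracy)

# Each column of null_amp is the amplitude spectrum of one shuffled series,
# i.e. accuracy with its link to the interval order scrambled.
n_perm <- 2000
null_amp <- replicate(n_perm, amplitude(sample(accuracy)))

# Mean null amplitude per bin, to compare against observed_amp.
null_mean_amp <- rowMeans(null_amp)

# Share of permutations whose amplitude reaches the observed one, per bin.
p_per_bin <- rowMeans(null_amp >= observed_amp)
```

Working on amplitude spectra directly avoids having to store thousands of permuted data frames: only a 48 x n_perm matrix is kept.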

gueyenono · April 5, 2019, 5:12am

Well, it looks like getting a single data frame is relevant to your study and that's what's important. Try to run the code I provided and let me know if that's what you were looking for.

applesauce · April 5, 2019, 5:12am

sounds like a plan, thanks again!

applesauce · April 5, 2019, 5:36am

To follow up (and I hope this doesn't throw a wrench in things): going back to the mtcars example, would it be possible to first group_by, say, gear and carb, and then shuffle am values only between rows with matching gear and carb values, as opposed to shuffling am values between all rows?

gueyenono · April 5, 2019, 6:19am

The shuffle() function below should help with your task. Its arguments:

.data: the dataset
n: the number of permutations to perform
perm_cols: (character vector) names of the columns to permute; they are shuffled together, row by row

library(purrr)

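shuffle <- function(.data, n, perm_cols){
  
  # positions of the columns to permute
  cols_ids <- match(perm_cols, colnames(.data))
  # row indices 1..nrow(.data)
  ids <- seq_len(nrow(.data))
  # n random permutations of the row indices
  # (note: rerun() is deprecated as of purrr 1.0.0)
  n_ids <- rerun(n, sample(ids))
  
  # for each permutation, reorder only the chosen columns,
  # then row-bind the n shuffled copies
  map_dfr(n_ids, function(x){
    .data[ids, cols_ids] <- .data[x, cols_ids]
    .data
  })
  
}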

set.seed(123)
df <- shuffle(.data = mtcars, n = 10, perm_cols = c("am", "gear", "carb"))

head(df)
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 21.0   6  160 110 3.90 2.620 16.46  0  0    4    4
2 21.0   6  160 110 3.90 2.875 17.02  0  0    3    2
3 22.8   4  108  93 3.85 2.320 18.61  1  0    3    3
4 21.4   6  258 110 3.08 3.215 19.44  1  1    4    1
5 18.7   8  360 175 3.15 3.440 17.02  0  1    5    2
6 18.1   6  225 105 2.76 3.460 20.22  1  1    4    4

tail(df)
     mpg cyl  disp  hp drat    wt qsec vs am gear carb
315 26.0   4 120.3  91 4.43 2.140 16.7  0  0    3    3
316 30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    6
317 15.8   8 351.0 264 4.22 3.170 14.5  0  0    4    4
318 19.7   6 145.0 175 3.62 2.770 15.5  0  0    3    4
319 15.0   8 301.0 335 3.54 3.570 14.6  0  1    4    1
320 21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2
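
For comparison, a different reading of the request is to permute am strictly within each gear/carb combination, leaving gear and carb where they are. shuffle() above does something else: it moves the listed columns together across all rows. Here is a base-R sketch of the within-group version. The names shuffle_within and perm_id are invented here and are not from any package.

```r
# Hypothetical helper: permute one column only among rows that share
# the same values in the grouping columns.
shuffle_within <- function(.data, n, col, group_cols) {
  # Row indices belonging to each distinct combination of the grouping columns
  groups <- split(seq_len(nrow(.data)),
                  interaction(.data[group_cols], drop = TRUE))
  perms <- lapply(seq_len(n), function(i) {
    out <- .data
    for (rows in groups) {
      # Reorder the column's values among the rows of this group only
      out[[col]][rows] <- .data[[col]][rows[sample.int(length(rows))]]
    }
    out$perm_id <- i  # which permutation this copy belongs to
    out
  })
  do.call(rbind, perms)
}

set.seed(123)
df_within <- shuffle_within(mtcars, n = 10, col = "am",
                            group_cols = c("gear", "carb"))
```

Groups with a single row (or with identical am values throughout) cannot change under this scheme, so small groups limit how much variation the permutations produce.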

applesauce · April 5, 2019, 6:45am

Just to make sure I follow: which part of the shuffle function indicates that "am" is the variable we want to shuffle, and that "gear" and "carb" are the grouping variables, so that "am" is shuffled only between rows with identical "gear" and "carb" values?

gueyenono · April 5, 2019, 7:10am

I don't know if you realize it, but what you are requesting is actually to shuffle all 3 columns (i.e. am, gear and carb) together "in the same way" and keep all other columns as they are. The function:

Creates a sequence of integers from 1 to the number of rows of the dataset. This is essentially a vector of row indices.

ids <- seq_len(nrow(.data))

Generates n permutations of the vector of indices.

n_ids <- rerun(n, sample(ids))

Shuffles the specified columns inside the dataset (without affecting the other columns), once per permutation, and row-binds the results.

map_dfr(n_ids, function(x){
  .data[ids, cols_ids] <- .data[x, cols_ids]
  .data
})

You could try it on a sample manually-made dataset to see how it works.
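
For instance, here is one permutation traced by hand on a tiny made-up data frame. The data frame toy and the fixed index vector x are illustrative only; in shuffle(), x would come from sample(ids).

```r
# Small hand-made data frame: a and b should travel together, id stays put.
toy <- data.frame(id = 1:4,
                  a  = c("w", "x", "y", "z"),
                  b  = c(10, 20, 30, 40))

ids <- seq_len(nrow(toy))                      # 1 2 3 4
cols_ids <- match(c("a", "b"), colnames(toy))  # positions of a and b

# One permutation of the row indices, fixed here instead of sample(ids)
x <- c(3, 1, 4, 2)

shuffled <- toy
shuffled[ids, cols_ids] <- toy[x, cols_ids]
shuffled
#   id a  b
# 1  1 y 30
# 2  2 w 10
# 3  3 z 40
# 4  4 x 20
```

The pairs (a, b) stay intact, e.g. "y" still goes with 30, while their alignment with id changes.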

applesauce · April 5, 2019, 7:31am

Oh my goodness you're completely right. And testing the code on my data works! (wow that makes so much sense actually) Thank you forever much for all your time and help. I guess liking your post is the closest I can get to giving you proper credit, I so appreciate it!

gueyenono · April 5, 2019, 7:43am

I'm happy I could help.
