Running a "function" across an array (through data.frames)

dplyr
tidyverse

#1

##################
EDIT
##################
I haven't been able to find a solution using arrays with the apply function.
apply(X = array(data = unlist(c(women, women, women, women)), dim = c(2,15,4)), MARGIN = 3, FUN = mean)
I think some kind of transformation to the array may let me use the apply function and then revert it back. I'll update this if I figure it out.

##################
ORIGINAL
##################
Hi RStudio Community,
I've run across an interesting problem (I'll preface this post with PEMI, please excuse my ignorance, which when I get sufficient privileges is going to be a tag on all my posts :slight_smile: ).
I know how to run functions across rows and columns and data frames.
Is there a nice tidy way to run a function through data.frames and across scalars?
Maybe by converting the list of data.frames to an array?

If I've been unclear so far, let me try to demonstrate what I'm talking about with an example:

women2 <- mice::ampute(data = women, prop = 0.3)$amp
women2$weight
Amelia::amelia(x = women2, m = 100)$imputations

I want to take the mean of the value in the same position across data.frames. Meaning if:

>df1 
a   b   c
d   e   f
g   h   i

>df2
j   k   l
m   n   o
p   q   r

>df3
s   t   u
v   w   x
y   z   aa

then what I want is:

df_1_3_FINAL
[mean(c(a,j,s))]   [mean(c(b,k,t))]   [mean(c(c,l,u))]
[mean(c(d,m,v))]   [mean(c(e,n,w))]   [mean(c(f,o,x))]
[mean(c(g,p,y))]   [mean(c(h,q,z))]   [mean(c(i,r,aa))]

I think this is a problem suited for an array, but I haven't dealt with arrays much, nor have I seen any examples of arrays being used with dplyr (IMO, arrays are the forgotten option in R, similar to the "Inverse Gaussian" distribution in GLM methods).

Does anyone have some alternative options for this kind of a problem?

PS
On a side note, is there a reason we don't use arrays more often? Aren't they supposed to have a smaller memory footprint? If the reason is the added complexity of manipulating them, I wonder whether the tidyverse has some ways of simplifying their usage.


#2

Create matrices rather than data frames. A data frame is always two-dimensional but, unlike a matrix, can contain columns of different modes; a matrix must be all one mode. data.matrix() and as.matrix() will convert a data frame to a matrix.
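
As a rough sketch of that conversion, using the built-in `women` data set (the names `m` and `mixed` are only illustrative): `as.matrix()` keeps an all-numeric data frame numeric, but a single non-numeric column turns the whole matrix into character, whereas `data.matrix()` converts every column to numbers.

```r
# Built-in data set: 15 rows, two numeric columns (height, weight)
m <- as.matrix(women)
dim(m)
mode(m)

# Illustrative mixed-mode data frame: add a factor column
mixed <- data.frame(women, grp = factor(rep(c("a", "b", "c"), 5)))

# as.matrix() falls back to a character matrix when a column is not numeric
mode(as.matrix(mixed))

# data.matrix() replaces the factor by its integer codes, so the result is numeric
mode(data.matrix(mixed))
```

For the stacking problem in this thread, the all-numeric case is the one that matters, since every data frame has to end up in a single numeric array.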


#3

The method that uses an array and apply is:

x <- array(1:27, dim = c(3, 3, 3))
x
#> , , 1
#> 
#>      [,1] [,2] [,3]
#> [1,]    1    4    7
#> [2,]    2    5    8
#> [3,]    3    6    9
#> 
#> , , 2
#> 
#>      [,1] [,2] [,3]
#> [1,]   10   13   16
#> [2,]   11   14   17
#> [3,]   12   15   18
#> 
#> , , 3
#> 
#>      [,1] [,2] [,3]
#> [1,]   19   22   25
#> [2,]   20   23   26
#> [3,]   21   24   27

apply(x, MARGIN = c(1, 2), FUN = sum)
#>      [,1] [,2] [,3]
#> [1,]   30   39   48
#> [2,]   33   42   51
#> [3,]   36   45   54
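
To connect this back to a list of data frames, here is a minimal sketch assuming every data frame is numeric and has the same shape. The data frames and their values are made up for illustration.

```r
# Three small, all-numeric data frames with the same shape (made-up values)
df1 <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
df2 <- data.frame(x = c(10, 20, 30), y = c(40, 50, 60))
df3 <- data.frame(x = c(100, 200, 300), y = c(400, 500, 600))
dfs <- list(df1, df2, df3)

# Flatten each data frame column by column and stack the results into a
# rows x columns x data-frame array
arr <- array(
  unlist(lapply(dfs, as.matrix)),
  dim = c(nrow(df1), ncol(df1), length(dfs))
)

# Mean of each cell position across the three data frames
cell_means <- apply(arr, MARGIN = c(1, 2), FUN = mean)
cell_means
```

The order of `dim` matters here: the values are laid out column by column within each data frame, so the array has to be declared as rows, then columns, then data frames.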

I think one reason you don't see arrays used more often is that the requirement that all the elements be of the same type doesn't suit very many data sets.

I'd quibble a bit with that taxonomy of data structures graphic — data frames are just rectangular lists with certain attributes, and therefore they can contain all the same things lists can, even other data frames. (And to get really picky, R doesn't have scalars, just vectors of length 1). Personally, I prefer the dimensionality–homogeneity taxonomy — but anyway, I don't think my quibbling really matters to this problem!


#4

Nice suggestion @jcblum, I have no idea why MARGIN = c(1,2) works but it is the apply solution I was looking for.

Good point on the elements being the same.

I love the quibbles. I've modified my graphic with your suggestions and inserted it below.

Do you have an alternative for approaching something like this in the tidyverse?


#5

@timothyslau, the MARGIN argument determines how the array gets sliced up for each call to FUN. For example, for MARGIN = c(1, 2), the array is sliced along the first and second dimensions, but not the third. So you're calculating:

#>          [, 1]                 [, 2]                 [, 3]
#> [1, ]    sum(data[1, 1, 1:n])  sum(data[1, 2, 1:n])  sum(data[1, 3, 1:n])
#> [2, ]    sum(data[2, 1, 1:n])  sum(data[2, 2, 1:n])  sum(data[2, 3, 1:n])
#> [3, ]    sum(data[3, 1, 1:n])  sum(data[3, 2, 1:n])  sum(data[3, 3, 1:n])
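
A small check of that description, reusing the 3 x 3 x 3 array from earlier in the thread (the name `by_cell` is just for illustration):

```r
x <- array(1:27, dim = c(3, 3, 3))
by_cell <- apply(x, MARGIN = c(1, 2), FUN = sum)

# Each cell of the result is the sum over the third dimension
# at that row/column position
by_cell[2, 3] == sum(x[2, 3, ])

# MARGIN = 3 does the opposite: it keeps only the third dimension,
# so FUN sees each whole 3 x 3 slice and returns one number per slice
apply(x, MARGIN = 3, FUN = sum)

# For plain sums (or means), rowSums()/rowMeans() with dims = 2 collapse
# the trailing dimension without an explicit apply()
all.equal(rowSums(x, dims = 2), by_cell)
```

So the dimensions named in MARGIN are the ones that survive in the result, and everything else is handed to FUN.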

#6

Then I have another quibble for you :grin:. "Atomic" in R doesn't mean what you expect it to mean — in R, it's the opposite of recursive. Atomic vectors are the ones that are flat no matter what. Lists are recursive vectors (they can contain other structures without flattening them).

An example...
x <- 1:10
is.atomic(x)
#> [1] TRUE

# atomic means the object will always be flat
c(
  1:10, 
  c(4, 6, 8), 
  c(20, c(35, 40)), 
  200
)
#>  [1]   1   2   3   4   5   6   7   8   9  10   4   6   8  20  35  40 200

# lists are recursive by definition, no matter what's in them
y <- list(1)
is.atomic(y)
#> [1] FALSE
is.recursive(y)
#> [1] TRUE

y <- list()
is.recursive(y)
#> [1] TRUE

# lists don't impose flatness
list(
  1:10, 
  c(4, 6, 8), 
  list(20, c(35, 40)),
  200
)
#> [[1]]
#>  [1]  1  2  3  4  5  6  7  8  9 10
#> 
#> [[2]]
#> [1] 4 6 8
#> 
#> [[3]]
#> [[3]][[1]]
#> [1] 20
#> 
#> [[3]][[2]]
#> [1] 35 40
#> 
#> 
#> [[4]]
#> [1] 200

I am a big fan of diagrams! You might enjoy taking a look at the in-progress 2nd edition of Advanced R, where Hadley has been working out his own data structure diagrams.

I think the main tidyverse toolkit for doing this is going to be purrr, but I haven't had time to noodle with it yet. Honestly it strikes me as kind of a weird thing to do in a data frame context. I'm having trouble imagining a circumstance where separate data frames would stack up by position like this. I'm curious — what is the task that led you to want to do this with data frames?
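
One way a reduce-style approach could look, sketched with base R's `Reduce()` so it runs without extra packages; `purrr::reduce()` with the same function should behave the same way. The data frames are made up, and this shortcut only covers the mean, assuming all data frames are numeric with identical shape and column names.

```r
# Made-up numeric data frames with identical shape and column names
dfs <- list(
  data.frame(x = c(1, 2), y = c(3, 4)),
  data.frame(x = c(5, 6), y = c(7, 8)),
  data.frame(x = c(9, 10), y = c(11, 12))
)

# Add the data frames cell by cell, then divide by how many there are.
# With purrr installed, purrr::reduce(dfs, `+`) could replace Reduce() here.
cell_means <- Reduce(`+`, dfs) / length(dfs)
cell_means
```

Unlike the array route, the result stays a data frame with its column names, but any statistic other than a sum or mean would need a different approach.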


#7

I have actually noticed tbl_cube, which looks like a tidyverse analogue of array that might be suitable for working with formats like NetCDF, HDF, GeoTIFF, etc. (in fact, I noticed it because tidync can export to it). It looks promising!


#8

When you do imputation, a number of packages return a list of imputed data.frames, and I've found this array method useful for doing descriptive analyses on the imputed values.

example:

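# knock out some values in the built-in women data, then impute 100 times
women2 <- mice::ampute(data = women, prop = 0.3)$amp
wom2.imp1 <- Amelia::amelia(x = women2, m = 100)$imputations
# stack the 100 imputed 15 x 2 data frames into an array and summarise each cell
wom2.arr <- array(data = unlist(wom2.imp1), dim = c(15, 2, 100))
apply(X = wom2.arr, MARGIN = c(1, 2), FUN = summary)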
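
The shape of that result can be surprising, so here is a sketch that runs without mice or Amelia. The list `imps` is a hypothetical stand-in for the `$imputations` list (copies of `women` with random noise added to weight), not real imputation output.

```r
set.seed(1)  # only so the made-up noise is reproducible
m <- 5       # number of pretend imputations

# Hypothetical stand-in for Amelia::amelia(...)$imputations:
# m copies of `women` with a little random noise added to weight
imps <- lapply(seq_len(m), function(i) {
  transform(women, weight = weight + rnorm(nrow(women)))
})

# unlist() walks each data frame column by column, so the array has to be
# declared as rows x columns x imputations
arr <- array(unlist(imps), dim = c(nrow(women), ncol(women), m))

# Slice k of the array is imputation k
all.equal(arr[, , 2], unname(as.matrix(imps[[2]])))

# summary() returns six numbers per cell, so the result is 6 x 15 x 2:
# statistic x row x column
res <- apply(arr, MARGIN = c(1, 2), FUN = summary)
dim(res)
res["Mean", 1:3, 2]  # mean pretend-imputed weight for the first three rows
```

Declaring the dimensions as `c(2, 15, m)` instead would still run, but the cells would no longer line up with the original rows and columns.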

I'm not sure how to incorporate your feedback into an update to my graphic. I do like @hadley's though. It is more compact, accurate, and probably useful. Thanks for sharing the feedback!


#9

@rensa, nice callout. I've never used that function before. I'll play around with this. Thank you.