How to build a matrix of 0, 1 and 2 according to the answers made by individuals of a survey

eugenio.alladio · October 25, 2018, 10:43am

I have a variable that takes values from x_1 to x_45 (i.e. 45 possible inputs).
In a survey, each respondent gives a double response by selecting 2 of the possible variables. The same variable can also be selected twice. Here is an example for 3 subjects:

A: 3,8
B: 11,15
C: 9,9

I'd like to create a matrix of 0, 1 and 2 according to the choices of the individuals, where 1 indicates a variable that the subject has selected once, 2 indicates a variable that has been selected twice, and 0 indicates a variable that has not been selected.
An example of the matrix I'd like to obtain, based on the values provided by individuals A, B and C, is as follows:

Subject,x_1,x_2,x_3,x_4,x_5,x_6,x_7,x_8,x_9,x_10,x_11,x_12,x_13,x_14,x_15
A,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0
B,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1
C,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0

How can I write code to do this for a large number of individuals (e.g. up to 1000 subjects)?
I'd be very grateful for any help with this!

mishabalyasin · October 25, 2018, 12:52pm

Hi, here is one approach you can take:

library(magrittr)
input <- tibble::tribble(
  ~subject, ~first, ~second,
  "A",        3,      8,
  "B",        11,     15,
  "C",        9,      9
)

max_cols <- 15
    
subjects <- input %>% dplyr::pull(subject)

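# Turn one column of choices into a 0/1 indicator matrix
# (one row per subject, one column per possible choice).
# Note: row names come from the global `subjects` defined above.
encode <- function(vector, max_cols){
  m <- matrix(rep(0, max_cols * length(vector)), 
              nrow = length(vector), ncol = max_cols,
              dimnames = list(subjects))
  # iwalk passes each value together with its index (the row number);
  # `<<-` updates `m` in the enclosing function environment
  purrr::iwalk(vector, function(position, row){
    m[row, position] <<- 1
  })
  m
}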

input %>%
  dplyr::select(-subject) %>%
  purrr::map_at(names(.), 
                encode, max_cols = max_cols) %>%
  purrr::reduce(`+`)
#>   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
#> A    0    0    1    0    0    0    0    1    0     0     0     0     0
#> B    0    0    0    0    0    0    0    0    0     0     1     0     0
#> C    0    0    0    0    0    0    0    0    2     0     0     0     0
#>   [,14] [,15]
#> A     0     0
#> B     0     1
#> C     0     0

Created on 2018-10-25 by the reprex package (v0.2.1)

The main function is encode, which converts a vector (in your case, e.g., 3, 11, 9) into a matrix with as many rows as there are subjects and as many columns as there are possible choices (you set this yourself with max_cols). This matrix has 0 everywhere except at the choices of your respondents, where it has 1.

The next step is to create one such matrix per choice column (that is done in the map_at line). Finally, you sum the matrices together with reduce. You can easily extend this approach to as many choices and as many subjects as you want. A good thing about matrices is that they are quite memory-efficient and fast, so even with millions of rows summing them will still be quite fast. In your case, with about 1000 respondents, it should not be a problem.
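To make the two steps concrete, here is a minimal base R sketch of the same idea: build one indicator matrix per choice column, then add them up. The helper name count_choices, its arguments, and the x_ column labels are hypothetical and not part of the code above. It uses matrix indexing in place of purrr, and assumes every choice is a whole number between 1 and n_vars.

```r
# Hypothetical helper (not from the code above): count how often each subject
# picked each of the n_vars variables, for any number of choice columns.
count_choices <- function(choices, subjects, n_vars) {
  choices <- as.matrix(choices)
  rows <- seq_len(nrow(choices))
  # One 0/1 indicator matrix per choice column (the "encode" step)
  per_column <- lapply(seq_len(ncol(choices)), function(j) {
    m <- matrix(0, nrow = nrow(choices), ncol = n_vars,
                dimnames = list(as.character(subjects),
                                paste0("x_", seq_len(n_vars))))
    # A two-column index matrix addresses one (row, column) cell per subject
    m[cbind(rows, choices[, j])] <- 1
    m
  })
  # Element-wise sum of the indicator matrices (the "reduce" step)
  Reduce(`+`, per_column)
}

input <- data.frame(
  subject = c("A", "B", "C"),
  first   = c(3, 11, 9),
  second  = c(8, 15, 9),
  stringsAsFactors = FALSE
)

result <- count_choices(input[, c("first", "second")], input$subject, n_vars = 15)
result["C", "x_9"]
```

Subject C chose 9 twice, so the two indicator matrices both have a 1 in that cell and the sum is 2. Adding a third choice column would only require passing one more column to the helper.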

eugenio.alladio · October 25, 2018, 1:54pm

Dear Misha,
thank you very much for your quick reply!
Your approach helps me a lot!
Do you think it would also work with a matrix imported from a .txt file?
Thank you so much again!