Going long to wide with multiple variables based on

dplyr

#1

Hello -

Trying to go from long to wide on a data frame with multiple variables. The data represent evaluation outcomes. There's an ID (for the case), the date of the eval, the eval score, the eval reader, and then a series of 8 characteristics with values of 1 or 0 (yes or no).

A truncated picture of the head(df) is below

# A tibble: 6 x 8
  id    datetime            eval  reader rigor learn
  <chr> <dttm>              <chr> <chr>  <dbl> <dbl>
  10001 2017-01-23 11:08:38 No    CM         0     0
  10001 2016-11-26 07:15:58 No    CL         0     0
  10002 2017-02-04 14:03:17 No    SM         1     0
  10002 2016-11-26 14:04:48 No    LM         1     1
  10003 2016-11-27 11:05:11 No    AH         0     0
  10003 2017-02-01 08:35:08 No    NT         0     0

(plus 6 more characteristics not shown)

I'd like to turn it from long to wide so that each ID is its own line, and time, reader, eval, and each characteristic get their own columns suffixed with 1 or 2 (for the 1st and 2nd evals).

I'm migrating my work from SAS to R… this is something I've done a ton of times in SAS, but I'm new to this level of data wrangling in R.

I tried spread with this code:

evaldf2 <- spread(evaldf, id, datetime, eval, reader, rigor, learn,  
leadership, persist)

and got this error:

Error in spread(read17reprex, cpid, form_datetime, form_evaluation, form_reader,  : 
  unused arguments (leadership, persist)

Then tried reshape on the entire set:

evaldf2 <- reshape(evaldf, idvar = "id", direction = "wide")

Which threw back this error:

Error: Column `time` not found

Not sure what I'm missing here. Instead of going wide, should I create an eval sequence number… 1 and 2 for the 1st and 2nd evals… so I can do my calculations (characteristics by eval, reader, etc.) based on eval number?

thanks,

-Greg


#2

You could try data.table

library(data.table)
dcast(setDT(DF), id ~ rowid(id), value.var = setdiff(names(DF), "id"))

      id     date_1     date_2   time_1   time_2 eval_1 eval_2 reader_1 reader_2 rigor_1 rigor_2 learn_1 learn_2
1: 10001 2017-01-23 2016-11-26 11:08:38 07:15:58     No     No       CM       CL       0       0       0       0
2: 10002 2017-02-04 2016-11-26 14:03:17 14:04:48     No     No       SM       LM       1       1       0       1
3: 10003 2016-11-27 2017-02-01 11:05:11 08:35:08     No     No       AH       NT       0       0       0       0

where reproducible data for this is

DF = structure(list(id = c(10001L, 10001L, 10002L, 10002L, 10003L, 
10003L), date = c("2017-01-23", "2016-11-26", "2017-02-04", "2016-11-26", 
"2016-11-27", "2017-02-01"), time = c("11:08:38", "07:15:58", 
"14:03:17", "14:04:48", "11:05:11", "08:35:08"), eval = c("No", 
"No", "No", "No", "No", "No"), reader = c("CM", "CL", "SM", "LM", 
"AH", "NT"), rigor = c(0L, 0L, 1L, 1L, 0L, 0L), learn = c(0L, 
0L, 0L, 1L, 0L, 0L)), .Names = c("id", "date", "time", "eval", 
"reader", "rigor", "learn"), row.names = c(NA, -6L), class = "data.frame")
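If you'd rather stay in the tidyverse, the same reshape can be done with tidyr's pivot_wider() (a newer function than spread(), assuming tidyr >= 1.0 is available). A sketch using a trimmed version of the reproducible DF above:

```r
library(dplyr)
library(tidyr)

# A trimmed version of the reproducible DF above
DF <- data.frame(
  id     = c(10001L, 10001L, 10002L, 10002L, 10003L, 10003L),
  date   = c("2017-01-23", "2016-11-26", "2017-02-04",
             "2016-11-26", "2016-11-27", "2017-02-01"),
  reader = c("CM", "CL", "SM", "LM", "AH", "NT"),
  rigor  = c(0L, 0L, 1L, 1L, 0L, 0L)
)

wide <- DF %>%
  group_by(id) %>%
  mutate(evalnum = row_number()) %>%   # 1st / 2nd eval within each id
  ungroup() %>%
  pivot_wider(
    names_from  = evalnum,
    values_from = c(date, reader, rigor)
  )
# Columns come out as date_1, date_2, reader_1, reader_2, rigor_1, rigor_2
```

Unlike the spread() route, pivot_wider() keeps each value column's original type.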

#3

Try reshape2::dcast or data.table::dcast.

There is a method using tidyr::spread, but it’s ugly and can convert data types, so is not satisfactory.
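For the curious, a sketch of that spread() route (using a trimmed toy version of the data from the previous reply): number the evals, gather everything into key/value pairs, glue the eval number onto the key, and spread back out. Because gather() stacks all the values into a single column, the numeric characteristics get coerced to character, which is the type-conversion problem mentioned above.

```r
library(dplyr)
library(tidyr)

DF <- data.frame(
  id     = c(10001L, 10001L, 10002L, 10002L),
  reader = c("CM", "CL", "SM", "LM"),
  rigor  = c(0L, 0L, 1L, 1L)
)

wide <- DF %>%
  group_by(id) %>%
  mutate(evalnum = row_number()) %>%
  ungroup() %>%
  gather(key, value, -id, -evalnum) %>%  # stacking coerces rigor to character
  unite(key, key, evalnum) %>%           # e.g. "reader_1", "rigor_2"
  spread(key, value)
```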


#4

How are you planning to analyze the data after this transformation? I like the last idea you suggested in your original post, in which you would add an eval index number. You could take that step very simply with the tidyverse:

result <- evaldf %>%
  group_by(id) %>%
  arrange(datetime) %>%
  mutate(eval_index = row_number()) %>%
  ungroup()

Leaving your data structured in a long format will make it simpler to create visualizations from the primary data set. If you need to start calculating metrics, you'll want to do some sort of summary anyway, so it seems to me that transforming the data to a wide format wouldn't be that helpful.
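To illustrate that point, a hypothetical sketch on toy data in the shape described in the thread (column names assumed): once eval_index exists, per-eval metrics fall out of a one-line summarise() on the long data, no widening required.

```r
library(dplyr)

# Toy long data in the shape described in the thread
evaldf <- data.frame(
  id       = c(10001L, 10001L, 10002L, 10002L),
  datetime = as.POSIXct(c("2017-01-23 11:08:38", "2016-11-26 07:15:58",
                          "2017-02-04 14:03:17", "2016-11-26 14:04:48")),
  rigor    = c(0L, 0L, 1L, 1L)
)

metrics <- evaldf %>%
  arrange(id, datetime) %>%              # chronological within each id
  group_by(id) %>%
  mutate(eval_index = row_number()) %>%  # 1 = 1st eval, 2 = 2nd eval
  ungroup() %>%
  group_by(eval_index) %>%               # metrics by 1st vs. 2nd eval
  summarise(pct_rigor = mean(rigor))
```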


#5

Yes, the more I thought about it, keeping the data as is but adding an event counter seems a better way to go.

I tried your code and got this error:

Error in rank(x, ties.method = "first", na.last = "keep") : 
  argument "x" is missing, with no default

Tried a couple of rank arguments (from base R) and kept getting the same error.

Then found this on SO:

evaldf %>%
  group_by(id, idx = cumsum(datetime == 1L)) %>%
  mutate(evalnum = row_number()) %>%
  ungroup %>%
  select(-idx)

Which threw the same error.

Then tried a combo of your code and the SO idea:

evaldf %>%
  group_by(id, idx = cumsum(id == 1L)) %>%
  arrange(datetime) %>%
  mutate(evalnum = row_number()) %>%
  ungroup %>%
  select(-idx)

…and at least got a different error:

Error in mutate_impl(.data, dots) : 
  Evaluation error: comparison (1) is possible only for atomic and list

#6

Those are some weird errors to see with that code. Maybe start a new R session and construct a reproducible example.


#7

Echoing @Frank here: one of the reasons to use the reprex package (or at least add the requisite code-chunk backticks) is that it keeps your quotation marks from being automatically reformatted (" " vs. " "), which can wreak all kinds of havoc when running code.

So, really, just another drop in the “help me help you” bucket!


#8

For @mara and @Frank - yes, I do understand about the reprex. I had issues trying to get one (it wouldn't copy to the clipboard), which is why I did an external copy-paste. I see now the error of that way. :slight_smile:

So… took @Frank's advice to restart the session, and that did the trick. @jasonparker's idea worked like a charm.

It's weird having to relearn in R all the things I could do so easily in SAS, but now I can do a counter in R. Hoorah! Cheers and thanks to all.


#9

Glad you've got it sorted! Not trying to brow-beat you over the lack of reprex, just a veteran of many, many moments of not being able to figure out why things weren't working because of good ol' " " and co! :slight_smile: