tidying data - reorganizing tibble

hklovs · December 30, 2019, 7:05pm

Hi,
I want to reorganize this tibble:

# A tibble: 10 x 3

```
 ID OP    DATE               *
```
*
1 1 A 2001-01-02 00:00:00*
2 1 B 2015-04-25 00:00:00*
3 2 A 2000-04-01 00:00:00*
4 3 C 2014-04-07 00:00:00*
5 4 C 2012-12-01 00:00:00*
6 4 A 2005-06-16 00:00:00*
7 4 D 2009-01-20 00:00:00*
8 5 A 2010-10-10 00:00:00*
9 5 B 2003-11-09 00:00:00*
10 6 B 2004-01-09 00:00:00

Into this format:

# A tibble: 6 x 11

ID OP1   DATE1               OP2   DATE2               OP3   DATE3               OP4   DATE4 OP5   DATE5*

*
*1 1 A 2000-04-01 00:00:00 A 2001-01-02 00:00:00 NA NA NA NA NA NA *
*2 2 A 2000-04-01 00:00:00 NA NA NA NA NA NA NA NA *
*3 3 C 2014-04-07 00:00:00 NA NA NA NA NA NA NA NA *
*4 4 A 2005-06-16 00:00:00 D 2009-01-20 00:00:00 C 2012-12-01 00:00:00 NA NA NA NA *
*5 5 B 2003-11-09 00:00:00 A 2010-10-10 00:00:00 NA NA NA NA NA NA *
*6 6 B 2004-01-09 00:00:00 NA NA NA NA NA NA NA NA *
>
Anyone eager to help?

hklovs · December 30, 2019, 7:13pm

Sorry - this is how I want it to be organized:

ID OP1 DATE1 OP2 DATE2 OP3 DATE3 OP4 DATE4 OP5 DATE5

1 1 A 2001-01-02 00:00:00 B 2015-04-25 00:00:00 NA NA NA NA NA NA
2 2 A 2000-04-01 00:00:00 NA NA NA NA NA NA NA NA
3 3 C 2014-04-07 00:00:00 NA NA NA NA NA NA NA NA
4 4 A 2005-06-16 00:00:00 D 2009-01-20 00:00:00 C 2012-12-01 00:00:00 NA NA NA NA
5 5 B 2003-11-09 00:00:00 A 2010-10-10 00:00:00 NA NA NA NA NA NA
6 6 B 2004-01-09 00:00:00 NA NA NA NA NA NA NA NA

JohnMount · December 30, 2019, 9:09pm

With the development version of cdata you can do this as follows (sorry the current CRAN version doesn't work on this example, it hit a corner-case I have now fixed in development).

I think the easy strategy with any method is going to be: add explicit rank information before trying to go to a wide form.

Also I think ID-group 4 is in order C A D in your example.

# remotes::install_github("WinVector/cdata")

library(cdata)
library(rqdatatable)
#> Loading required package: rquery

  packageVersion("cdata")
#> [1] '1.1.4'

# example data
d <- wrapr::build_frame(
  "ID"  , "OP", "DATE"                |
    1   , "A" , "2001-01-02 00:00:00" |
    1   , "B" , "2015-04-25 00:00:00" |
    2   , "A" , "2000-04-01 00:00:00" |
    3   , "C" , "2014-04-07 00:00:00" |
    4   , "C" , "2012-12-01 00:00:00" |
    4   , "A" , "2005-06-16 00:00:00" |
    4   , "D" , "2009-01-20 00:00:00" |
    5   , "A" , "2010-10-10 00:00:00" |
    5   , "B" , "2003-11-09 00:00:00" |
    6   , "B" , "2004-01-09 00:00:00" )

# get the order based ranking
d$rowid <- seq_len(nrow(d))
d <- rquery::extend(d, 
                    rank %:=% row_number(), 
                    partitionby = "ID", orderby = "rowid")

# draw a picture of the record format
diagram <- wrapr::build_frame(
  "rank", "DATE", "OP" |
    "1", "DATE1", "OP1" |
    "2", "DATE2", "OP2" |
    "3", "DATE3", "OP3" |
    "4", "DATE4", "OP4" |
    "5", "DATE5", "OP5" )

# perform the transform
res <- blocks_to_rowrecs(d, keyColumns = "ID", controlTable = diagram)
knitr::kable(res)

ID	DATE1	OP1	DATE2	OP2	DATE3	OP3	DATE4	OP4	DATE5	OP5
1	2001-01-02 00:00:00	A	2015-04-25 00:00:00	B	NA	NA	NA	NA	NA	NA
2	2000-04-01 00:00:00	A	NA	NA	NA	NA	NA	NA	NA	NA
3	2014-04-07 00:00:00	C	NA	NA	NA	NA	NA	NA	NA	NA
4	2012-12-01 00:00:00	C	2005-06-16 00:00:00	A	2009-01-20 00:00:00	D	NA	NA	NA	NA
5	2010-10-10 00:00:00	A	2003-11-09 00:00:00	B	NA	NA	NA	NA	NA	NA
6	2004-01-09 00:00:00	B	NA	NA	NA	NA	NA	NA	NA	NA

hklovs · December 31, 2019, 6:38am

Thanks alot. Quite advanced but I will try. Yes, the example i 4 C A D is an error and should be A D C. OP should be ranked by date.

Any other suggestions/methods?

JohnMount · December 31, 2019, 8:28am

Ah, I ranked by row-order. Ranking by date is just a matter of using orderby = "DATE".

hklovs · December 31, 2019, 9:15am

Thank you. Another problem is this:

get the order based ranking

Bok1$rowid <- seq_len(nrow(d))
Bok1 <- rquery::extend(Bok1,
rank %:=% row_number(),
partitionby = "ID", orderby = "rowid")

show(Bok1)
ID OP DATE rowid rank
1: 1 A 2001-01-02 1 1
2: 1 B 2015-04-25 2 2
3: 2 A 2000-04-01 3 1
4: 3 C 2014-04-07 4 1
5: 4 C 2012-12-01 5 1
6: 4 A 2005-06-16 6 2
7: 4 D 2009-01-20 7 3
8: 5 A 2010-10-10 8 1
9: 5 B 2003-11-09 9 2
10: 6 B 2004-01-09 10 1

draw a picture of the record format

show (diagram)
rank DATE OP
1 1 DATE1 OP1
2 2 DATE2 OP2
3 3 DATE3 OP3
4 4 DATE4 OP4
5 5 DATE5 OP5

perform the transform

res <- blocks_to_rowrecs(Bok1, keyColumns = "ID", controlTable = diagram)

Error in [[.default(vals, 1) : subscript out of bounds

JohnMount · December 31, 2019, 4:02pm

Yes, that is the fault in the CRAN version of cdata. To get the example to work you would need the 1.1.4 development version (which remotes can install, illustrated in the comment). Sorry about the trouble.

Presumably tidyr::pivot_wider() could also perform the transform (this time using dplyr::mutate() to add an explicit ranking column to make the transform easier to specify).

# remotes::install_github("WinVector/cdata")
packageVersion("cdata")
#> [1] '1.1.4'

library(cdata)
library(rqdatatable)
#> Loading required package: rquery
  

# example data
d <- wrapr::build_frame(
  "ID"  , "OP", "DATE"                |
    1   , "A" , "2001-01-02 00:00:00" |
    1   , "B" , "2015-04-25 00:00:00" |
    2   , "A" , "2000-04-01 00:00:00" |
    3   , "C" , "2014-04-07 00:00:00" |
    4   , "C" , "2012-12-01 00:00:00" |
    4   , "A" , "2005-06-16 00:00:00" |
    4   , "D" , "2009-01-20 00:00:00" |
    5   , "A" , "2010-10-10 00:00:00" |
    5   , "B" , "2003-11-09 00:00:00" |
    6   , "B" , "2004-01-09 00:00:00" )

# get the time based ranking
d <- rquery::extend(d, 
                    rank %:=% row_number(), 
                    partitionby = "ID", orderby = "DATE")

# draw a picture of the record format
diagram <- wrapr::build_frame(
  "rank", "DATE", "OP" |
    "1", "DATE1", "OP1" |
    "2", "DATE2", "OP2" |
    "3", "DATE3", "OP3" |
    "4", "DATE4", "OP4" |
    "5", "DATE5", "OP5" )

# perform the transform
res <- blocks_to_rowrecs(d, keyColumns = "ID", controlTable = diagram)
knitr::kable(res)

ID	DATE1	OP1	DATE2	OP2	DATE3	OP3	DATE4	OP4	DATE5	OP5
1	2001-01-02 00:00:00	A	2015-04-25 00:00:00	B	NA	NA	NA	NA	NA	NA
2	2000-04-01 00:00:00	A	NA	NA	NA	NA	NA	NA	NA	NA
3	2014-04-07 00:00:00	C	NA	NA	NA	NA	NA	NA	NA	NA
4	2005-06-16 00:00:00	A	2009-01-20 00:00:00	D	2012-12-01 00:00:00	C	NA	NA	NA	NA
5	2003-11-09 00:00:00	B	2010-10-10 00:00:00	A	NA	NA	NA	NA	NA	NA
6	2004-01-09 00:00:00	B	NA	NA	NA	NA	NA	NA	NA	NA

hklovs · December 31, 2019, 8:04pm

Thank you very much. Works like a charm

carolineL · December 31, 2019, 11:56pm

You can also use pivot_wider though I'd like a more efficient way to select the columns in the order at the end.

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date

dat <- tribble(~id, ~op, ~date, 
                 1, "A", "2001-01-02 00:00:00",
                 1, "B", "2015-04-25 00:00:00",
                 2, "A", "2000-04-01 00:00:00",
                 3, "C", "2014-04-07 00:00:00",
                 4, "C", "2012-12-01 00:00:00",
                 4, "A", "2005-06-16 00:00:00",
                 4, "D", "2009-01-20 00:00:00",
                 5, "A", "2010-10-10 00:00:00",
                 5, "B", "2003-11-09 00:00:00",
                 6, "B", "2004-01-09 00:00:00"
               ) %>% 
  mutate(date = ymd_hms(date))
dat
#> # A tibble: 10 x 3
#>       id op    date               
#>    <dbl> <chr> <dttm>             
#>  1     1 A     2001-01-02 00:00:00
#>  2     1 B     2015-04-25 00:00:00
#>  3     2 A     2000-04-01 00:00:00
#>  4     3 C     2014-04-07 00:00:00
#>  5     4 C     2012-12-01 00:00:00
#>  6     4 A     2005-06-16 00:00:00
#>  7     4 D     2009-01-20 00:00:00
#>  8     5 A     2010-10-10 00:00:00
#>  9     5 B     2003-11-09 00:00:00
#> 10     6 B     2004-01-09 00:00:00

dat %>% 
  group_by(id) %>% 
  arrange(date) %>% 
  mutate(rank = row_number()) %>% 
  pivot_wider(values_from = c(op, date), names_from = rank) %>% 
  ungroup() %>% 
  select(id, op_1, date_1, op_2, date_2, op_3, date_3)
#> # A tibble: 6 x 7
#>      id op_1  date_1              op_2  date_2              op_3 
#>   <dbl> <chr> <dttm>              <chr> <dttm>              <chr>
#> 1     2 A     2000-04-01 00:00:00 <NA>  NA                  <NA> 
#> 2     1 A     2001-01-02 00:00:00 B     2015-04-25 00:00:00 <NA> 
#> 3     5 B     2003-11-09 00:00:00 A     2010-10-10 00:00:00 <NA> 
#> 4     6 B     2004-01-09 00:00:00 <NA>  NA                  <NA> 
#> 5     4 A     2005-06-16 00:00:00 D     2009-01-20 00:00:00 C    
#> 6     3 C     2014-04-07 00:00:00 <NA>  NA                  <NA> 
#> # … with 1 more variable: date_3 <dttm>

^{Created on 2019-12-31 by the reprex package (v0.3.0)}

hklovs · January 8, 2020, 1:05pm

Thanks for both solutions. However, I recently found out that my working environment is restricted due to sensitive data, which means its not possible for me to use/download development versions of packages. My questions is is there an alternative solution to my problem using existing non dev methods/functions?
Best regards,
Hans

martin.R · January 8, 2020, 1:15pm

tidyr should work fine straight from CRAN.

hklovs · January 8, 2020, 2:47pm

Well i have problems using pivot_wider, it doesn’t seem to be installed in that version

cderv · January 8, 2020, 2:49pm

pivot_wider was added in tidyr 1.0.0. You should check you have this version

https://tidyr.tidyverse.org/news/index.html#pivoting

JohnMount · January 8, 2020, 5:07pm

cdata has an update fixing the issue in the CRAN queue. So it should be installable in a couple of days. As others have noted, tidyr should work if it is at least version 1.0.0 or newer. packageVersion('tidyr') will tell you what version you have.

hklovs · January 10, 2020, 6:42pm

Thanks again. Because of some stupid reason I’m unable to connect to CRAN and are therefore forced to use tidyr v 0.8.5.

Does anyone possibly know how to solve my problem using spread/gather function in tidyr?

hklovs · January 11, 2020, 7:06pm

Anyone eager to help?

hklovs · January 13, 2020, 6:24pm

Hi again, trying to bump this subject. Everything sorted out with the correct version of tidyr 1.0.0. So that problem solved. However a new one arose: I fell upon the problem with several OP on the same DATE and I would prefer to list them separated with "," instead of new OP.

Listed like this:

ID OP DATE
1 B,C 2008-01-01
2 A 2009-01-01

Is that doable?

JohnMount · January 13, 2020, 6:40pm

I would say you do the pasting with commas as a summarization set while the data is still in tall/long form (one record per row). And then you transform the row shapes after that.

hklovs · January 13, 2020, 8:05pm

Thanks again, you´ve been very helpful I´ll try.

hklovs · January 13, 2020, 8:35pm

I manage this far:

d<-ddply(d, .(ID,DATE), summarise, OP = list(OP))

d
ID DATE OP
1 1 2001-01-02 00:00:00 A, B
2 2 2000-04-01 00:00:00 A
3 3 2014-04-07 00:00:00 C
4 4 2005-06-16 00:00:00 A, D
5 4 2012-12-01 00:00:00 C
6 5 2003-11-09 00:00:00 B
7 5 2010-10-10 00:00:00 A
8 6 2004-01-09 00:00:00 B

But, when I´m trying to apply:
res <- blocks_to_rowrecs(d, keyColumns = "ID", controlTable = diagram)

It still gets like this:
ID DATE1 OP1 DATE2 OP2 DATE3 OP3 DATE4 OP4 DATE5 OP5
1: 1 2001-01-02 00:00:00 A 2001-01-02 00:00:00 B
2: 2 2000-04-01 00:00:00 A
3: 3 2014-04-07 00:00:00 C
4: 4 2012-12-01 00:00:00 C 2005-06-16 00:00:00 A 2005-06-16 00:00:00 D
5: 5 2010-10-10 00:00:00 A 2003-11-09 00:00:00 B
6: 6 2004-01-09 00:00:00 B