tidying data - reorganizing tibble

Hi,
I want to reorganize this tibble:

# A tibble: 10 x 3

  •  ID OP    DATE               *
    
  • *
  • 1 1 A 2001-01-02 00:00:00*
  • 2 1 B 2015-04-25 00:00:00*
  • 3 2 A 2000-04-01 00:00:00*
  • 4 3 C 2014-04-07 00:00:00*
  • 5 4 C 2012-12-01 00:00:00*
  • 6 4 A 2005-06-16 00:00:00*
  • 7 4 D 2009-01-20 00:00:00*
  • 8 5 A 2010-10-10 00:00:00*
  • 9 5 B 2003-11-09 00:00:00*
    10 6 B 2004-01-09 00:00:00

Into this format:

# A tibble: 6 x 11

  • ID OP1   DATE1               OP2   DATE2               OP3   DATE3               OP4   DATE4 OP5   DATE5*
    
  • *
    *1 1 A 2000-04-01 00:00:00 A 2001-01-02 00:00:00 NA NA NA NA NA NA *
    *2 2 A 2000-04-01 00:00:00 NA NA NA NA NA NA NA NA *
    *3 3 C 2014-04-07 00:00:00 NA NA NA NA NA NA NA NA *
    *4 4 A 2005-06-16 00:00:00 D 2009-01-20 00:00:00 C 2012-12-01 00:00:00 NA NA NA NA *
    *5 5 B 2003-11-09 00:00:00 A 2010-10-10 00:00:00 NA NA NA NA NA NA *
    *6 6 B 2004-01-09 00:00:00 NA NA NA NA NA NA NA NA *
    >
    Anyone eager to help? :slight_smile:

Sorry - this is how I want it to be organized:

ID OP1 DATE1 OP2 DATE2 OP3 DATE3 OP4 DATE4 OP5 DATE5

1 1 A 2001-01-02 00:00:00 B 2015-04-25 00:00:00 NA NA NA NA NA NA
2 2 A 2000-04-01 00:00:00 NA NA NA NA NA NA NA NA
3 3 C 2014-04-07 00:00:00 NA NA NA NA NA NA NA NA
4 4 A 2005-06-16 00:00:00 D 2009-01-20 00:00:00 C 2012-12-01 00:00:00 NA NA NA NA
5 5 B 2003-11-09 00:00:00 A 2010-10-10 00:00:00 NA NA NA NA NA NA
6 6 B 2004-01-09 00:00:00 NA NA NA NA NA NA NA NA

With the development version of cdata you can do this as follows (sorry the current CRAN version doesn't work on this example, it hit a corner-case I have now fixed in development).

I think the easy strategy with any method is going to be: add explicit rank information before trying to go to a wide form.

Also I think ID-group 4 is in order C A D in your example.

# remotes::install_github("WinVector/cdata")

library(cdata)
library(rqdatatable)
#> Loading required package: rquery

  packageVersion("cdata")
#> [1] '1.1.4'

# example data
d <- wrapr::build_frame(
  "ID"  , "OP", "DATE"                |
    1   , "A" , "2001-01-02 00:00:00" |
    1   , "B" , "2015-04-25 00:00:00" |
    2   , "A" , "2000-04-01 00:00:00" |
    3   , "C" , "2014-04-07 00:00:00" |
    4   , "C" , "2012-12-01 00:00:00" |
    4   , "A" , "2005-06-16 00:00:00" |
    4   , "D" , "2009-01-20 00:00:00" |
    5   , "A" , "2010-10-10 00:00:00" |
    5   , "B" , "2003-11-09 00:00:00" |
    6   , "B" , "2004-01-09 00:00:00" )

# get the order based ranking
d$rowid <- seq_len(nrow(d))
d <- rquery::extend(d, 
                    rank %:=% row_number(), 
                    partitionby = "ID", orderby = "rowid")

# draw a picture of the record format
diagram <- wrapr::build_frame(
  "rank", "DATE", "OP" |
    "1", "DATE1", "OP1" |
    "2", "DATE2", "OP2" |
    "3", "DATE3", "OP3" |
    "4", "DATE4", "OP4" |
    "5", "DATE5", "OP5" )

# perform the transform
res <- blocks_to_rowrecs(d, keyColumns = "ID", controlTable = diagram)
knitr::kable(res)
ID DATE1 OP1 DATE2 OP2 DATE3 OP3 DATE4 OP4 DATE5 OP5
1 2001-01-02 00:00:00 A 2015-04-25 00:00:00 B NA NA NA NA NA NA
2 2000-04-01 00:00:00 A NA NA NA NA NA NA NA NA
3 2014-04-07 00:00:00 C NA NA NA NA NA NA NA NA
4 2012-12-01 00:00:00 C 2005-06-16 00:00:00 A 2009-01-20 00:00:00 D NA NA NA NA
5 2010-10-10 00:00:00 A 2003-11-09 00:00:00 B NA NA NA NA NA NA
6 2004-01-09 00:00:00 B NA NA NA NA NA NA NA NA

Thanks alot. Quite advanced but I will try. Yes, the example i 4 C A D is an error and should be A D C. OP should be ranked by date.

Any other suggestions/methods?

Ah, I ranked by row-order. Ranking by date is just a matter of using orderby = "DATE".

Thank you. Another problem is this:

get the order based ranking

Bok1$rowid <- seq_len(nrow(d))
Bok1 <- rquery::extend(Bok1,
rank %:=% row_number(),
partitionby = "ID", orderby = "rowid")

show(Bok1)
ID OP DATE rowid rank
1: 1 A 2001-01-02 1 1
2: 1 B 2015-04-25 2 2
3: 2 A 2000-04-01 3 1
4: 3 C 2014-04-07 4 1
5: 4 C 2012-12-01 5 1
6: 4 A 2005-06-16 6 2
7: 4 D 2009-01-20 7 3
8: 5 A 2010-10-10 8 1
9: 5 B 2003-11-09 9 2
10: 6 B 2004-01-09 10 1

draw a picture of the record format

diagram <- wrapr::build_frame(
"rank", "DATE", "OP" |
"1", "DATE1", "OP1" |
"2", "DATE2", "OP2" |
"3", "DATE3", "OP3" |
"4", "DATE4", "OP4" |
"5", "DATE5", "OP5" )

show (diagram)
rank DATE OP
1 1 DATE1 OP1
2 2 DATE2 OP2
3 3 DATE3 OP3
4 4 DATE4 OP4
5 5 DATE5 OP5

perform the transform

res <- blocks_to_rowrecs(Bok1, keyColumns = "ID", controlTable = diagram)

Error in [[.default(vals, 1) : subscript out of bounds

Yes, that is the fault in the CRAN version of cdata. To get the example to work you would need the 1.1.4 development version (which remotes can install, illustrated in the comment). Sorry about the trouble.

Presumably tidyr::pivot_wider() could also perform the transform (this time using dplyr::mutate() to add an explicit ranking column to make the transform easier to specify).

# remotes::install_github("WinVector/cdata")
packageVersion("cdata")
#> [1] '1.1.4'

library(cdata)
library(rqdatatable)
#> Loading required package: rquery
  

# example data
d <- wrapr::build_frame(
  "ID"  , "OP", "DATE"                |
    1   , "A" , "2001-01-02 00:00:00" |
    1   , "B" , "2015-04-25 00:00:00" |
    2   , "A" , "2000-04-01 00:00:00" |
    3   , "C" , "2014-04-07 00:00:00" |
    4   , "C" , "2012-12-01 00:00:00" |
    4   , "A" , "2005-06-16 00:00:00" |
    4   , "D" , "2009-01-20 00:00:00" |
    5   , "A" , "2010-10-10 00:00:00" |
    5   , "B" , "2003-11-09 00:00:00" |
    6   , "B" , "2004-01-09 00:00:00" )

# get the time based ranking
d <- rquery::extend(d, 
                    rank %:=% row_number(), 
                    partitionby = "ID", orderby = "DATE")

# draw a picture of the record format
diagram <- wrapr::build_frame(
  "rank", "DATE", "OP" |
    "1", "DATE1", "OP1" |
    "2", "DATE2", "OP2" |
    "3", "DATE3", "OP3" |
    "4", "DATE4", "OP4" |
    "5", "DATE5", "OP5" )

# perform the transform
res <- blocks_to_rowrecs(d, keyColumns = "ID", controlTable = diagram)
knitr::kable(res)
ID DATE1 OP1 DATE2 OP2 DATE3 OP3 DATE4 OP4 DATE5 OP5
1 2001-01-02 00:00:00 A 2015-04-25 00:00:00 B NA NA NA NA NA NA
2 2000-04-01 00:00:00 A NA NA NA NA NA NA NA NA
3 2014-04-07 00:00:00 C NA NA NA NA NA NA NA NA
4 2005-06-16 00:00:00 A 2009-01-20 00:00:00 D 2012-12-01 00:00:00 C NA NA NA NA
5 2003-11-09 00:00:00 B 2010-10-10 00:00:00 A NA NA NA NA NA NA
6 2004-01-09 00:00:00 B NA NA NA NA NA NA NA NA

Thank you very much. Works like a charm

1 Like

You can also use pivot_wider though I'd like a more efficient way to select the columns in the order at the end.

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date

dat <- tribble(~id, ~op, ~date, 
                 1, "A", "2001-01-02 00:00:00",
                 1, "B", "2015-04-25 00:00:00",
                 2, "A", "2000-04-01 00:00:00",
                 3, "C", "2014-04-07 00:00:00",
                 4, "C", "2012-12-01 00:00:00",
                 4, "A", "2005-06-16 00:00:00",
                 4, "D", "2009-01-20 00:00:00",
                 5, "A", "2010-10-10 00:00:00",
                 5, "B", "2003-11-09 00:00:00",
                 6, "B", "2004-01-09 00:00:00"
               ) %>% 
  mutate(date = ymd_hms(date))
dat
#> # A tibble: 10 x 3
#>       id op    date               
#>    <dbl> <chr> <dttm>             
#>  1     1 A     2001-01-02 00:00:00
#>  2     1 B     2015-04-25 00:00:00
#>  3     2 A     2000-04-01 00:00:00
#>  4     3 C     2014-04-07 00:00:00
#>  5     4 C     2012-12-01 00:00:00
#>  6     4 A     2005-06-16 00:00:00
#>  7     4 D     2009-01-20 00:00:00
#>  8     5 A     2010-10-10 00:00:00
#>  9     5 B     2003-11-09 00:00:00
#> 10     6 B     2004-01-09 00:00:00

dat %>% 
  group_by(id) %>% 
  arrange(date) %>% 
  mutate(rank = row_number()) %>% 
  pivot_wider(values_from = c(op, date), names_from = rank) %>% 
  ungroup() %>% 
  select(id, op_1, date_1, op_2, date_2, op_3, date_3)
#> # A tibble: 6 x 7
#>      id op_1  date_1              op_2  date_2              op_3 
#>   <dbl> <chr> <dttm>              <chr> <dttm>              <chr>
#> 1     2 A     2000-04-01 00:00:00 <NA>  NA                  <NA> 
#> 2     1 A     2001-01-02 00:00:00 B     2015-04-25 00:00:00 <NA> 
#> 3     5 B     2003-11-09 00:00:00 A     2010-10-10 00:00:00 <NA> 
#> 4     6 B     2004-01-09 00:00:00 <NA>  NA                  <NA> 
#> 5     4 A     2005-06-16 00:00:00 D     2009-01-20 00:00:00 C    
#> 6     3 C     2014-04-07 00:00:00 <NA>  NA                  <NA> 
#> # … with 1 more variable: date_3 <dttm>

Created on 2019-12-31 by the reprex package (v0.3.0)

1 Like

Thanks for both solutions. However, I recently found out that my working environment is restricted due to sensitive data, which means its not possible for me to use/download development versions of packages. My questions is is there an alternative solution to my problem using existing non dev methods/functions?
Best regards,
Hans

tidyr should work fine straight from CRAN.

Well i have problems using pivot_wider, it doesn’t seem to be installed in that version

pivot_wider was added in tidyr 1.0.0. You should check you have this version

https://tidyr.tidyverse.org/news/index.html#pivoting

cdata has an update fixing the issue in the CRAN queue. So it should be installable in a couple of days. As others have noted, tidyr should work if it is at least version 1.0.0 or newer. packageVersion('tidyr') will tell you what version you have.

2 Likes

Thanks again. Because of some stupid reason I’m unable to connect to CRAN and are therefore forced to use tidyr v 0.8.5.

Does anyone possibly know how to solve my problem using spread/gather function in tidyr?

Anyone eager to help? :wink:

Hi again, trying to bump this subject. Everything sorted out with the correct version of tidyr 1.0.0. So that problem solved. However a new one arose: I fell upon the problem with several OP on the same DATE and I would prefer to list them separated with "," instead of new OP.

Listed like this:

ID OP DATE
1 B,C 2008-01-01
2 A 2009-01-01

Is that doable? :slight_smile:

I would say you do the pasting with commas as a summarization set while the data is still in tall/long form (one record per row). And then you transform the row shapes after that.

1 Like

Thanks again, you´ve been very helpful :slight_smile: I´ll try.

I manage this far:

d<-ddply(d, .(ID,DATE), summarise, OP = list(OP))

d
ID DATE OP
1 1 2001-01-02 00:00:00 A, B
2 2 2000-04-01 00:00:00 A
3 3 2014-04-07 00:00:00 C
4 4 2005-06-16 00:00:00 A, D
5 4 2012-12-01 00:00:00 C
6 5 2003-11-09 00:00:00 B
7 5 2010-10-10 00:00:00 A
8 6 2004-01-09 00:00:00 B

But, when I´m trying to apply:
res <- blocks_to_rowrecs(d, keyColumns = "ID", controlTable = diagram)

It still gets like this:
ID DATE1 OP1 DATE2 OP2 DATE3 OP3 DATE4 OP4 DATE5 OP5
1: 1 2001-01-02 00:00:00 A 2001-01-02 00:00:00 B
2: 2 2000-04-01 00:00:00 A
3: 3 2014-04-07 00:00:00 C
4: 4 2012-12-01 00:00:00 C 2005-06-16 00:00:00 A 2005-06-16 00:00:00 D
5: 5 2010-10-10 00:00:00 A 2003-11-09 00:00:00 B
6: 6 2004-01-09 00:00:00 B