tibble and Rcpp

I like writing code in Rcpp, and I like the tidyverse, especially tibbles and group_by. How do I work with tibbles in Rcpp? It's easy to treat one as a plain data.frame, but then I lose the group_by information.

There is no specific API for accessing grouping information from the C++ side, but it is all stored as attributes of the data frame.

The attributes used to be messy, but as part of this PR we've made them much cleaner: all the information is now stored in a tibble, e.g.

library(dplyr, warn.conflicts = FALSE)

d <- group_by(iris, Species)

# 1-based indices of rows of each group
group_rows(d)
#> [[1]]
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
#> [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
#> [47] 47 48 49 50
#> 
#> [[2]]
#>  [1]  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67
#> [18]  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84
#> [35]  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
#> 
#> [[3]]
#>  [1] 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117
#> [18] 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134
#> [35] 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150

# keys, or "representatives", of each group
group_keys(d)
#> # A tibble: 3 x 1
#>   Species   
#>   <fct>     
#> 1 setosa    
#> 2 versicolor
#> 3 virginica

# both
group_data(d)
#> # A tibble: 3 x 2
#>   Species    .rows     
#>   <fct>      <list>    
#> 1 setosa     <int [50]>
#> 2 versicolor <int [50]>
#> 3 virginica  <int [50]>

# it's all stored in the "groups" attribute
# its last column is a list column of indices
attr(d, "groups", exact = TRUE)
#> # A tibble: 3 x 2
#>   Species    .rows     
#>   <fct>      <list>    
#> 1 setosa     <int [50]>
#> 2 versicolor <int [50]>
#> 3 virginica  <int [50]>

# we can use that information internally to 
# e.g. get the size of each group
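Rcpp::cppFunction('IntegerVector counts(DataFrame df) {
  // the "groups" attribute is itself a data frame (a tibble)
  DataFrame groups(df.attr("groups"));
  // its last column, .rows, is a list of integer vectors of row indices
  List rows = groups[groups.size() - 1];
  int n = groups.nrow();
  IntegerVector res(n);

  for (int i = 0; i < n; i++) {
    IntegerVector index = rows[i];
    res[i] = index.size();
  }

  return res;
}')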
counts(d)
#> [1] 50 50 50

Created on 2019-02-12 by the reprex package (v0.2.1.9000)
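
One thing to watch for if you go beyond counting: the indices in `.rows` are 1-based (R convention), so you need to subtract 1 before using them to index from C++. Here is a minimal sketch of that loop in plain C++, with `std::vector` standing in for the Rcpp types so it compiles on its own. `grouped_sum` is a made-up name for illustration, not something from dplyr or Rcpp.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of dplyr or Rcpp): sums x within each group.
// `rows` mirrors the .rows list column: one vector of 1-based row indices
// per group.
std::vector<double> grouped_sum(const std::vector<double>& x,
                                const std::vector<std::vector<int>>& rows) {
  std::vector<double> res(rows.size(), 0.0);
  for (std::size_t g = 0; g < rows.size(); ++g) {
    for (int r : rows[g]) {
      // R indices start at 1, C++ indices start at 0
      res[g] += x[static_cast<std::size_t>(r - 1)];
    }
  }
  return res;
}
```

In an Rcpp function the same loop should carry over with a `NumericVector` for `x` and the `IntegerVector`s pulled out of `.rows`, as in `counts()` above.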


Thanks, Romain, you saved me some time! That's exactly what I was looking for.

