Sorting tibble with Rcpp

Hello,

I am trying to integrate two pieces of functionality within an Rcpp framework. The first component is to uniquely subset a tibble/data frame, which I was able to do within Rcpp.

The second component is to sort the subsetted tibble on a date column before proceeding with calculations. Would anyone have a code snippet or a resource I can study to sort/order a tibble using Rcpp?

Regards

That's an interesting use case. Usually the use case is integrating a function written in C++ into R. You want to use R functionality in C++? Without a reproducible example, I'm at a loss for providing direction with code.

You can take a look at the implementation of dplyr's filter. I expect you could include dplyr.h and use what the dplyr C++ namespace provides as helpers to the R functions directly.

Thank you for your kind reply. Let me add some more substance to the problem at hand. My workflow is entirely in R/dplyr. One piece of the logic needs extensive look-back and iteration; to speed it up, I am trying to write that piece in Rcpp. Here's a hack attempt with a sample data set and a bit of pseudocode to give you a slightly better idea.

library(readr);
library(dplyr);

Data frame df with Symbol, Date & VWAP:

df <- structure(list(Symbol = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), Date = structure(c(9L, 7L, 5L, 6L, 8L,
4L, 3L, 1L, 2L, 9L, 8L, 7L, 5L, 6L, 4L, 3L, 1L, 2L), .Label = c("2020-01-21",
"2020-01-22", "2020-01-23", "2020-01-24", "2020-01-28", "2020-01-29",
"2020-01-30", "2020-01-31", "2020-02-03"), class = "factor"),
VWAP = c(38.25, 39, 39.1, 39.34, 39.4, 40.45, 41.08, 41.2,
41.21, 96.86, 98.77, 99.19, 99.99, 100.43, 103.18, 106.19,
106.45, 107.38)), class = "data.frame", row.names = c(NA,
-18L))

Rcpp function with logic to:

        1. Uniquely subset the data frame by Symbol A & B (already solved).
        2. Sort the subsetted data frame by Date (still attempting to solve; the main question of this post).
        3. Loop through VWAP to calculate the variables of interest for various date ranges.
        4. Return the aggregate data frame back to R.
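
For readers following along, the four steps above can be sketched in plain standard C++ (no Rcpp headers, so it compiles on its own). This is only an illustration under stated assumptions: the Row and Summary types and the summarise_by_symbol function are hypothetical names, dates are assumed to arrive as ISO "YYYY-MM-DD" strings (which sort chronologically as plain strings), and the "first to last VWAP change" is a placeholder, since the post does not say what the actual variables of interest are.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// One row of the sample data frame (Symbol, Date, VWAP).
struct Row {
    std::string symbol;
    std::string date;  // ISO "YYYY-MM-DD", so string order equals date order
    double vwap;
};

// Hypothetical per-symbol result; stands in for the real aggregate.
struct Summary {
    std::string symbol;
    std::string first_date;
    std::string last_date;
    double vwap_change;  // last VWAP minus first VWAP, after sorting by date
};

std::vector<Summary> summarise_by_symbol(const std::vector<Row>& rows) {
    // Step 1: split the rows by symbol (std::map iterates symbols in sorted order).
    std::map<std::string, std::vector<Row>> groups;
    for (const Row& r : rows) {
        groups[r.symbol].push_back(r);
    }

    std::vector<Summary> out;
    for (auto& kv : groups) {
        std::vector<Row>& g = kv.second;

        // Step 2: sort this symbol's rows by date, keeping ties in input order.
        std::stable_sort(g.begin(), g.end(),
                         [](const Row& a, const Row& b) { return a.date < b.date; });

        // Step 3: placeholder calculation over the date-ordered rows.
        double change = g.back().vwap - g.front().vwap;

        // Step 4: collect one aggregate row per symbol.
        out.push_back(Summary{kv.first, g.front().date, g.back().date, change});
    }
    return out;
}
```

In an Rcpp version, the input would presumably come from the columns of a DataFrame and the result would be assembled with DataFrame::create; the grouping and sorting logic would stay the same.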

Regards

Clarification of the issue

Is the issue that you have a fully functioning workflow using R and dplyr, but it's too slow? If that's the case, do you have any benchmarks for the time it takes, and what time would you expect from a sufficiently fast implementation?

Parallelize

Sounds like this is a compute-bound problem. You could try splitting the problem up and computing in parallel. The parallel or foreach packages can help with that. You could also take a look at the experimental multidplyr package.

Less elegant parallel processing

Alternatively, I've found a great deal of use from just writing a dumb bash script to orchestrate smaller R scripts to chop up the big data set as required, spawn child processes to read in the smaller portions and do their thing, then a script to aggregate the results.


For speed, check data.table first (it's C under the hood).

If you need to retain the dplyr syntax, there's a dplyr interface to it via dtplyr.


I was able to hack a solution using the following piece of Rcpp code:


#include <RcppArmadillo.h>

using namespace Rcpp;
using namespace arma;

// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
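DataFrame sort_df(arma::vec dt, arma::vec ret) {
    // Indices that would put dt in ascending order.
    uvec sorted_dt_idx = sort_index(dt);

    // Sort the dates, then reorder ret with the same indices so rows stay paired.
    dt  = arma::sort(dt);
    ret = ret(sorted_dt_idx);

    // Assemble the sorted columns into a data frame to return to R.
    Rcpp::DataFrame df = Rcpp::DataFrame::create(Named("Dt") = dt, Named("Ret") = ret);

    return df;
}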
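
For anyone who wants the same idea without Armadillo, here is a minimal sketch in plain standard C++ of what sort_index plus index-based subsetting does: compute the ordering of the key column once, then apply it to every column. The names sort_permutation and apply_permutation are hypothetical, not part of Rcpp or Armadillo. It assumes the dates arrive as numbers (an R Date is stored as days since 1970-01-01); in the sample data Date is a factor, so it would presumably need converting first, e.g. with as.numeric(as.Date(as.character(df$Date))).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Return the permutation that sorts `keys` ascending.
// std::stable_sort keeps tied keys in their original relative order.
std::vector<std::size_t> sort_permutation(const std::vector<double>& keys) {
    std::vector<std::size_t> idx(keys.size());
    std::iota(idx.begin(), idx.end(), std::size_t(0));  // 0, 1, 2, ...
    std::stable_sort(idx.begin(), idx.end(),
                     [&keys](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });
    return idx;
}

// Reorder one column according to a precomputed permutation.
std::vector<double> apply_permutation(const std::vector<double>& col,
                                      const std::vector<std::size_t>& idx) {
    assert(col.size() == idx.size());  // every column must match the key length
    std::vector<double> out;
    out.reserve(col.size());
    for (std::size_t i : idx) {
        out.push_back(col[i]);
    }
    return out;
}
```

Calling sort_permutation on the date column once and then apply_permutation on each of the other columns keeps all rows aligned, which extends the two-column function above to a data frame of any width.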

Thanks for all your help.

Regards
