Convert list of row indices by group into row-matched vector of group indices (in Tableau R SCRIPT_*)

asd · May 15, 2018, 2:05pm

Motivation: I would like to use TDAmapper (which is an algorithm that runs in R) from Tableau (which is a platform where my data is currently located) - in particular, I would like to use TDAmapper to cluster/group certain data points together.

Background on TDAmapper:
What TDAmapper does is to group certain data points into "vertices" (which are like mini-clusters), and then it relates closeby vertices to each other via something called an "edge". In other words there are two kinds of relations - the first relation relates data points that are "close" to each other by grouping them into a vertex, the second relation relates vertices that are close to each other by associating an edge between them.

The way this is done is that we start with a dataset of points, and we define a particular function (known as the "filter function") to assign a value to these points. Once we have done this, we cover these datapoints with a finite number of intervals - furthermore, for the algorithm we also need to specify: (a) the number of intervals we use; (b) the percentage overlap between these intervals.

To illustrate:

library(TDAmapper)

 

library(locfit)

data(chemdiab)

    

normdiab<- chemdiab

 

normdiab[,1:5]<-scale(normdiab[,1:5],center=FALSE)

normdiab.dist=dist(normdiab[,1:5])

    
library(ks)                   

    
filter.kde<-kde(normdiab[,1:5],H=diag(1,nrow = 5),eval.points =normdiab[,1:5])$estimate

 

## filter.kde is defined to be our filter function

 ## In this case, we assign values to the data points based on the kernel density.

    

 

 diab.mapper<-mapper(

   dist_object = normdiab.dist, filter_values = filter.kde,

    num_intervals =4,

    percent_overlap=50,

    num_bins_when_clustering=18)

 

    ## Here, the mapper() algorithm accepts as input the
    ## distance matrix of the data points we want, the filter function,
    ## the number of intervals, the percentage overlap.

    ## We also have another parameter (which affects the clustering algorithm 
    ## that is implicitly used in the TDAmapper algorithm), 
    ## which can be any integer value we like.

Background on Tableau: Tableau is primarily used for generating pretty visuals from the data. The platform also has specific commands that allow us to pass computed computations into R. If we want to use R to do kmeans clustering, we can use the SCRIPT_INT command, which is the command used when you expect an integer result from our computation (more info in the link below).

I've asked the Tableau community for advice here: https://community.tableau.com/message/761790?et=watches.email.thread#761790

Statement of Problem:

The most helpful reply I've gotten from the Tableau community is that the TDAmapper algorithm in R does not return a vector that assigns each point to a cluster like $cluster does for Kmeans in R. More precisely, I was told:

the one missing piece for passing the data to the TDAmapper algorithm is that you'll need to create a dataframe of the values you are passing to R from Tableau. The issue then is returning a result. The TDAmapper algorithm does not return a vector that assigns each point to a cluster like $cluster does for Kmeans. This means you'll need to write a loop or other function to assign the points from the $points_in_vertex section of the results of the mapper() function with the vertex that they fall into. The final result will need to be an ordered vector of vertex assignments ie c(1,1,5,4,3,2,3,2).

In other words, I believe the reason why one can do kmeans clustering via R from Tableau is because of the $cluster component, which gives us a vector that indicates the cluster to which each point of data is allocated. And this information (somehow) is intelligible to the SCRIPT_INT function, so we can pass this result back to Tableau from R.

In our case, we would like to do the same thing for TDAmapper. Now, since TDAmapper imposes two relations on the data (vertices and edges), I would like to find a way to write some kind of loop function that (i) assigns the data points to each vertex, and (ii) assign an edge to the appropriate vertices.

I'm still a novice in R so I'm a little lost as to how I should go about doing this. I only know enough R to be sort of literate of the R coding, and to run some basic commands - I'm not sure how to go about designing the required loop function that I more or less know what I want it to do.

Can somebody give me some concrete steps to which how I could go about solving this problem? Any help/hints on how I can get out of this jam would be very much appreciated!

More info on the R coding of TDAmapper is available here:

jcblum · May 19, 2018, 8:28pm

Thanks for providing useful context and background info for your question! There are still a couple of pieces missing that will make it much easier for people here to be able to help (and increase the likelihood that you get more answers!):

The R code that you are using (so far) inside SCRIPT_INT
Some sample data, as R would receive it from SCRIPT_INT
Any restrictions on the R side
- Are you able/willing to install arbitrary R packages for use in this task, or do you need the R code to work using only certain packages (e.g., base, tidyverse, etc)
- Are there speed/performance requirements?

#2 is very important, since as I understand it, the shape of the data passed to R depends on how you have set up partitioning and addressing on the Tableau side. And as a tip, it's probably better to include this info in your post here, rather than hoping people will find it by following links.

That being said, here are a few thoughts (some of this you may know, but I’m including it partly as breadcrumbs for other R-focused helpers to follow):

The documentation for TDAmapper::mapper (accessible by typing ?TDAmapper::mapper in R), describes how its output is structured:

An object of class TDAmapper which is a list of items named adjacency (adjacency matrix for the edges), num_vertices (integer number of vertices), level_of_vertex (vector with level_of_vertex[i] = index of the level set for vertex i), points_in_vertex (list with points_in_vertex[[i]] = vector of indices of points in vertex i), points_in_level_set (list with points_in_level_set[[i]] = vector of indices of points in level set i, and vertices_in_level_set (list with vertices_in_level_set[[i]] = vector of indices of vertices in level set i.

(typos corrected)

There are a couple of typos in the documentation for the latest development version, which I corrected above for the sake of clarity.

points_in_level points_in_level_set
vertices_in_level vertices_in_level_set

Meanwhile, the documentation for stats::kmeans also explains the structure of its output, which is similarly a list with several components. The relevant component is:

cluster
A vector of integers (from 1:k) indicating the cluster to which each point is allocated.

So yes, the basic problem is that kmeans()$cluster returns a cluster index corresponding to each point, whereas (for instance) mapper()$points_in_vertex returns a vector of point indices corresponding to each vertex.

It would be relatively straightforward to convert the vector of point indices in mapper()$points_in_vertex into vertex indices for each point, and to do the same for the level sets — assuming that a given point is only assigned to a single vertex. This is the case for kmeans clusters, which is why it's easy to map them back onto a data visualization.

However, this assumption is clearly false for the TDAmapper algorithm. Using your example:

library(purrr)

intersect_groups <- function(x, group_list) {
  group_combos <- combn(seq_along(group_list), 2)
  intersect(
    group_list[[group_combos[1, x]]],
    group_list[[group_combos[2, x]]]
  )
}

# Row indices of normdiab that have been assigned to
# more than one vertex
map(
  1:(choose(length(diab.mapper$points_in_vertex), 2)), 
  intersect_groups,
  group_list = diab.mapper$points_in_vertex
) %>% unlist() %>% sort()
#>  [1]  44  45  57  58  71  84  87  95  97  98  99 100 101 102 106 115 117
#> [18] 121 123 127 130 135 136 138

Even if you swap the output into a per-point format, more than one result is going to be returned for some rows of data. This makes it impossible to return a single vector of groupings, comparable to kmeans()$cluster.

You'll need to decide how you want to handle rows of data assigned to more than one vertex before going any further.

From my cursory reading about Tableau's R integration, it sounds like a common pattern when an R function returns multiple values is to concatenate them into a string on the R side, then split them back out using Tableau tools — so that could possibly be an option here. Is that feasible for you?

jcblum · May 19, 2018, 10:07pm

(I usually try to avoid replying to myself, but that last answer got long! )

One more tip: you might consider editing the title of your topic to focus more on the core problem from an R perspective. Here’s one suggestion: “Convert list of row indices by group into row-matched vector of group indices (in Tableau R SCRIPT_*)”

There are several approaches for doing this in R, and I suspect a title that zeroes in on the technical challenge might attract a wider range of possible solutions.

asd · May 21, 2018, 3:12pm

Thank you very much for your reply - I appreciate you taking the time!

As far as the three questions you asked go:

I'm not using any R code in SCRIPT_INT thus far - I'm just trying to use TDAmapper as as clustering algorithm from Tableau.
Ah, so I'm interested in clustering groups of customers together based on whether they behave similarly or not. The database isn't complete yet, but ideally we would like to cut the data by variables (also called "dimensions") such as frequency of purchase, average order value, recency of purchase etc. Essentially, we know that there are general "types" of customers that our company serves, and we would like to use TDAmapper to help us identify the relevant clusters from the dataset as a way of rigorously defining these types of customers.
I'm perfectly happy to install any R package for use in the task, and I don't have any specific speed requirements.

Yes, you're quite right, there are certain data points which are associated with more than one vertex. The TDAmapper algorithm works like this: if there exists data points which are assigned more than one vertex, then we draw an edge between them. To visualise this, see the picture below:

As we can see on the right hand side, we have a diagram containing the round circles (i.e. vertices), which are our mini-clusters. If a data point has been assigned to two vertices, then we draw a line between them, known as an "edge". In other words, we have two notions of closeness according to the TDAmapper - data points which are "close" to each other are associated to a vertex (or vertices); vertices which are close to each other (i.e. if they overlap) are connected by an "edge".

With this in mind, I am interested in figuring out how to design a function that is analogous to $cluster insofar that I get some kind of integer result in the end (which should definitely be possible - there are only finite number of vertices and edges). Your suggestion about concatenating the multiple values into a string before splitting them back in Tableau sounds quite promising - I'll have a dig deeper on what can be done on the Tableau front!