Motivation: I would like to use TDAmapper (which is an algorithm that runs in R) from Tableau (which is a platform where my data is currently located) - in particular, I would like to use TDAmapper to cluster/group certain data points together.
Background on TDAmapper:
What TDAmapper does is to group certain data points into "vertices" (which are like mini-clusters), and then it relates closeby vertices to each other via something called an "edge". In other words there are two kinds of relations - the first relation relates data points that are "close" to each other by grouping them into a vertex, the second relation relates vertices that are close to each other by associating an edge between them.
The way this is done is that we start with a dataset of points, and we define a particular function (known as the "filter function") to assign a value to these points. Once we have done this, we cover these datapoints with a finite number of intervals - furthermore, for the algorithm we also need to specify: (a) the number of intervals we use; (b) the percentage overlap between these intervals.
To illustrate:
library(TDAmapper)
library(locfit)
data(chemdiab)
normdiab<- chemdiab
normdiab[,1:5]<-scale(normdiab[,1:5],center=FALSE)
normdiab.dist=dist(normdiab[,1:5])
library(ks)
filter.kde<-kde(normdiab[,1:5],H=diag(1,nrow = 5),eval.points =normdiab[,1:5])$estimate
## filter.kde is defined to be our filter function
## In this case, we assign values to the data points based on the kernel density.
diab.mapper<-mapper(
dist_object = normdiab.dist, filter_values = filter.kde,
num_intervals =4,
percent_overlap=50,
num_bins_when_clustering=18)
## Here, the mapper() algorithm accepts as input the
## distance matrix of the data points we want, the filter function,
## the number of intervals, the percentage overlap.
## We also have another parameter (which affects the clustering algorithm
## that is implicitly used in the TDAmapper algorithm),
## which can be any integer value we like.
Background on Tableau: Tableau is primarily used for generating pretty visuals from the data. The platform also has specific commands that allow us to pass computed computations into R. If we want to use R to do kmeans clustering, we can use the SCRIPT_INT command, which is the command used when you expect an integer result from our computation (more info in the link below).
I've asked the Tableau community for advice here: https://community.tableau.com/message/761790?et=watches.email.thread#761790
Statement of Problem:
The most helpful reply I've gotten from the Tableau community is that the TDAmapper algorithm in R does not return a vector that assigns each point to a cluster like $cluster does for Kmeans in R. More precisely, I was told:
the one missing piece for passing the data to the TDAmapper algorithm is that you'll need to create a dataframe of the values you are passing to R from Tableau. The issue then is returning a result. The TDAmapper algorithm does not return a vector that assigns each point to a cluster like $cluster does for Kmeans. This means you'll need to write a loop or other function to assign the points from the $points_in_vertex section of the results of the mapper() function with the vertex that they fall into. The final result will need to be an ordered vector of vertex assignments ie c(1,1,5,4,3,2,3,2).
In other words, I believe the reason why one can do kmeans clustering via R from Tableau is because of the $cluster component, which gives us a vector that indicates the cluster to which each point of data is allocated. And this information (somehow) is intelligible to the SCRIPT_INT function, so we can pass this result back to Tableau from R.
In our case, we would like to do the same thing for TDAmapper. Now, since TDAmapper imposes two relations on the data (vertices and edges), I would like to find a way to write some kind of loop function that (i) assigns the data points to each vertex, and (ii) assign an edge to the appropriate vertices.
I'm still a novice in R so I'm a little lost as to how I should go about doing this. I only know enough R to be sort of literate of the R coding, and to run some basic commands - I'm not sure how to go about designing the required loop function that I more or less know what I want it to do.
Can somebody give me some concrete steps to which how I could go about solving this problem? Any help/hints on how I can get out of this jam would be very much appreciated!
More info on the R coding of TDAmapper is available here: