Hi,
The request from @Yarnabrina set me on a quest to create a more versatile system for summarizing curves and plotting them using ggplot ...
There's so much I could explain about it, but I think the post would become too long lol. I am thinking of writing it up because I do think it can be handy for others to use, but for now I'll just give the code and a summary.
SUMMARY CURVE FUNCTION
library("tibble")
library("dplyr")
summaryCurve = function(datasets, columnInfo, summaryFunction = mean,
                        interpolationMethod = "linear", onlyReturnSummary = TRUE,
                        longFormat = FALSE){

  #Prepare datasets
  #----------------
  if(is.data.frame(datasets)){
    #The user provided one data frame with one x-column and multiple y-columns
    if(anyNA(datasets[[columnInfo]])){
      stop("The x-column cannot have missing values")
    }
    combinedData = bind_cols(
      tibble(id = 1:nrow(datasets), x = datasets[[columnInfo]]),
      datasets %>% select(-all_of(columnInfo))
    )
  } else { #The user provided a list of data frames
    nSets = length(datasets)

    #Get all possible x-values
    x = lapply(1:nSets, function(i){
      datasets[[i]][[columnInfo[[i]][1]]]
    }) %>% unlist %>% unique %>% sort

    #Build data frame with column for x value
    combinedData = tibble(id = 1:length(x), x = x)

    # ... and add the y-columns of every set
    for(i in 1:nSets){
      xColName = columnInfo[[i]][1]
      nYcols = length(columnInfo[[i]]) - 1
      if(nYcols > 0){ #Only keep the y-columns that were listed
        combinedData = combinedData %>%
          left_join(datasets[[i]] %>% select(all_of(columnInfo[[i]])),
                    by = c(x = xColName))
      } else { #Treat every other column as a y-column
        combinedData = combinedData %>%
          left_join(datasets[[i]], by = c(x = xColName))
      }
    }
  }

  #Interpolate curves
  #-------------------
  #Apply an interpolation function to every y-column to fill in missing values
  yCols = setdiff(names(combinedData), c("id", "x"))
  combinedData[yCols] = lapply(combinedData[yCols], function(y){
    approx(combinedData$x, y, xout = combinedData$x, method = interpolationMethod)$y
  })

  #Now add the summaryCurve (values that are still NA are ignored)
  summaryValues = apply(combinedData[yCols], 1, function(y){
    summaryFunction(y[!is.na(y)])
  })

  if(onlyReturnSummary){
    result = data.frame(x = combinedData$x, summary = summaryValues)
  } else {
    result = cbind(as.data.frame(combinedData), data.frame(summary = summaryValues))
  }

  if(longFormat){ #One y-column plus a factor telling the curves apart
    curveNames = setdiff(names(result), c("id", "x"))
    result = do.call(rbind, lapply(curveNames, function(curve){
      data.frame(x = result$x, y = result[[curve]], curve = curve)
    }))
    result$curve = factor(result$curve, levels = curveNames)
  }

  return(result)
}
The summaryCurve function takes several arguments:
- datasets: a list of data frames that hold the info for all curves
  - Each dataset must have one x-column and can have multiple y-columns (multiple curves)
  - Different datasets can be of different length (i.e. x-values can have different ranges)
- columnInfo: list of column name mappings of x and y values per dataset
  - If only one name is provided, it is assumed to be the column name of the x-values and all other columns are treated as y-values (1 or more). NA is allowed in y-values
  - If multiple names are provided per dataset, the first refers to the x-values and all other names to the specific columns to be treated as y-values (the others will be ignored)
- summaryFunction: the function to be applied to all curves. Default is 'mean' but it can be anything like min, max, sum, ... even a custom function, as long as it outputs one value for all the y-values at a certain x.
- interpolationMethod: defaults to 'linear'. All curves are interpolated (but not extended) to provide the best summary between curves of different detail and to fill in missing values. The other option is "constant", where points are carried forward instead.
- onlyReturnSummary: defaults to TRUE, in which case the x and y-values of the summary curve are returned. If FALSE, one dataset with all interpolated curves plus the summary curve will be returned
- longFormat: defaults to FALSE. If TRUE, there is only one y-column and an extra column curve holds a factor denoting which curve each point belongs to (can aid in plotting with ggplot)
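To make the interpolationMethod options a bit more concrete, here is a small sketch that calls approx() from base R directly on a made-up toy curve (the vectors xKnown, yKnown and xOut are just for illustration). It shows the difference between "linear" and "constant", and that points outside the known x-range stay NA instead of being extended.

```r
# Toy curve with known points at x = 1, 3 and 5 (values made up for illustration)
xKnown = c(1, 3, 5)
yKnown = c(10, 30, 20)

# Evaluate on a finer grid that also reaches outside the known range
xOut = 0:6

# "linear" draws straight lines between known points,
# "constant" carries the last known value forward
linear = approx(xKnown, yKnown, xout = xOut, method = "linear")$y
constant = approx(xKnown, yKnown, xout = xOut, method = "constant")$y

print(data.frame(x = xOut, linear = linear, constant = constant))
```

At x = 0 and x = 6 both columns are NA, which is why a summary function only sees the curves that actually cover a given x.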
EXAMPLE APPLYING THE FUNCTION AND PLOTTING (GGPLOT)
Let's start by creating 3 different curves
library("ggplot2")
dataset1 = data.frame(x = 50:6, y1 = runif(45))
dataset2 = data.frame(theX = seq(1, 55, 4), result1 = runif(14), result2 = LETTERS[1:14])
dataset3 = data.frame(xVal = c(0, 50), yVal = c(0, 1))
ggplot() +
geom_point(data = dataset1, aes(x = x, y = y1), colour = "darkgreen") +
geom_line(data = dataset1, aes(x = x, y = y1), colour = "darkgreen") +
geom_point(data = dataset2, aes(x = theX, y = result1), colour = "red") +
geom_line(data = dataset2, aes(x = theX, y = result1), colour = "red") +
geom_point(data = dataset3, aes(x = xVal, y = yVal), colour = "blue") +
geom_line(data = dataset3, aes(x = xVal, y = yVal), colour = "blue") +
theme_minimal()
You can see that the curves have different starting and ending points and that their x-values do not line up (some curves have many more points than others).
Now run the summaryCurve function with the appropriate arguments and plot the results using ggplot:
mySummarycurve = summaryCurve(datasets = list(dataset1, dataset2, dataset3),
                              columnInfo = list("x", c("theX", "result1"), "xVal"),
                              summaryFunction = sum, onlyReturnSummary = FALSE,
                              longFormat = TRUE)
ggplot(mySummarycurve %>% filter(curve != "summary"), aes(x = x, y = y, group = curve)) +
geom_point(aes(colour = curve)) +
geom_line(aes(colour = curve), linetype = 2) +
geom_line(data = mySummarycurve %>% filter(curve == "summary"), colour = "orange") +
geom_area(data = mySummarycurve %>% filter(curve == "summary"),
fill = "gray", alpha = 0.3) +
theme_minimal() + theme(legend.position = "none")
As you can see, the summary function interpolated all curves so they all have matching x-values over which the summary function of choice (in this case sum) is applied. The resulting area is shaded, but that's just done because it was in the initial example in this post.
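If you want to see the idea without the full function, here is a stripped-down sketch in base R with two made-up toy curves (curveA and curveB are purely illustrative, and this is a simplification rather than the function itself): build a shared grid of all x-values, interpolate every curve onto it, then apply the summary row by row while ignoring NA.

```r
# Two toy curves with different x-values (made up for illustration)
curveA = data.frame(x = c(0, 2, 4), y = c(0, 2, 4))
curveB = data.frame(x = c(1, 3), y = c(10, 30))

# Shared grid: every x-value that occurs in any curve
grid = sort(unique(c(curveA$x, curveB$x)))

# Interpolate each curve onto the grid; points outside a curve's range stay NA
yA = approx(curveA$x, curveA$y, xout = grid)$y
yB = approx(curveB$x, curveB$y, xout = grid)$y

# Summarise row by row, ignoring the NA values (here with sum)
summaryValues = apply(cbind(yA, yB), 1, function(y) sum(y[!is.na(y)]))

print(data.frame(x = grid, curveA = yA, curveB = yB, summary = summaryValues))
```

Note that with sum the summary drops at x = 0 and x = 4, because curveB does not cover those points. That is something to keep in mind when the curves have different ranges.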
There you go! Hope you like it and find it useful. I think I might tinker with it a bit more and maybe get it onto GitHub or something.
Looking forward to your feedback
PJ