Faster Linear Model On Big Data

tidyverse

#1

Hi All,

I have a data set with 10K rows and 500 columns. I am performing outlier analysis on it, and the code I wrote does not scale to data of this size. Basically, I want to compute the residuals for each pair of columns and then plug them into the outlier analysis formula. The latter part is fast; however, the residual calculation is very slow, as shown in the code below.

# Data frame of all combinations excluding same columns 
modelcols <- subset(expand.grid(c1=names(Data), c2=names(Data), 
                    stringsAsFactors = FALSE), c1!=c2)

# Function
analysisFunc <- function(x,y) {        
      # Fetch the two columns on which to perform analysis
      c1 <- Data[[x]]
      c2 <- Data[[y]]

      # Create linear model
      linearModel <- lm(c1 ~ c2)

      # Return the residuals directly; summary() is not needed to get them
      residuals(linearModel)
}

# Apply function to return matrix of residuals
linearData <- mapply(analysisFunc, modelcols$c1, modelcols$c2)
# re-naming matrix columns
colnames(linearData) <- paste(modelcols$c1, modelcols$c2, sep="_")

Is there any way I can speed this up using some other R optimization technique I might not be aware of?

Since the data I am using is very large, I cannot share it as is for a reproducible example, but here is a much smaller sample:

Data <- read.table(text="  data1       data2       data3      data4
-0.710003   -0.714271   -0.709946   -0.713645
-0.710458   -0.715011   -0.710117   -0.714157
-0.71071    -0.714048   -0.710235   -0.713515
-0.710255   -0.713991   -0.709722   -0.71397
-0.710585   -0.714491   -0.710223   -0.713885
-0.710414   -0.714092   -0.710166   -0.71434
-0.711255   -0.714116   -0.70945    -0.714173
-0.71097    -0.714059   -0.70928    -0.714059
-0.710343   -0.714576   -0.709338   -0.713644", header=TRUE)

Thanks.


#2

Could you say more about what you're trying to accomplish? Your data frame isn't particularly big by modern standards, but there are 249,500 ordered pairs of columns, each returning 10,000 model residuals, for a total residual matrix of about 2.5 billion elements. I don't doubt there are ways to speed up the specific calculations in your example, but we may be able to suggest a more efficient or effective approach if you can tell us more about your goal.
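
To make the size estimate concrete, here is the arithmetic as a short R sketch. The 8 bytes per value is an assumption that the residuals are stored as ordinary doubles; the variable names are only illustrative.

```r
n_cols <- 500      # columns in the data set
n_rows <- 10000    # rows in the data set

# Ordered pairs of distinct columns, as produced by expand.grid with c1 != c2
n_pairs <- n_cols * (n_cols - 1)    # 249500

# One residual per row for every pair
n_values <- n_pairs * n_rows        # 2.495e9, i.e. about 2.5 billion

# Assumption: each residual is a double taking 8 bytes
size_gb <- n_values * 8 / 1e9       # roughly 20 GB

cat(n_pairs, "pairs,", n_values, "residuals, about", round(size_gb, 1), "GB\n")
```

So beyond the run time, holding the full residual matrix in memory at once is likely to be a problem on most machines.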


#3

I am trying to do outlier analysis based on the three-sigma rule. I first want to find the residuals for every column combination and then apply the three-sigma rule to them.

It's not so much about the size of the data as about the time it takes. Currently, I am looking at around 30-40 minutes, and the data size is only going to grow for me.

This is all I am trying to achieve. Please let me know if you need more information.
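
For reference, this is roughly what the three-sigma step looks like for a single pair of columns. It is a minimal sketch on made-up data, not my real data: `x`, `y`, the planted outlier in row 50, and the helper name `flag_three_sigma` are all illustrative assumptions.

```r
set.seed(42)

# Made-up stand-in for one pair of columns
x <- rnorm(200)
y <- 2 * x + rnorm(200, sd = 0.1)
y[50] <- y[50] + 5   # plant one obvious outlier in row 50

# Residuals of the simple linear model for this pair
res <- residuals(lm(y ~ x))

# Three-sigma rule: flag residuals more than k standard deviations
# away from the mean of the residuals (hypothetical helper)
flag_three_sigma <- function(res, k = 3) {
  abs(res - mean(res)) > k * sd(res)
}

outlier_rows <- which(flag_three_sigma(res))
print(outlier_rows)
```

This part is cheap; the cost is in fitting the model for every pair before it.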


#4

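For a quick and easy speed boost, change mapply to future_map2 from the furrr package, and set a parallel plan first (for example plan(multisession)), otherwise it still runs sequentially. This should be 3-5 times faster, depending on your number of cores. Note that future_map2 returns a list rather than a matrix, so you need to bind the results together afterwards.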
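
A minimal sketch of that change, assuming the future and furrr packages are installed. The random `Data`, the `workers = 2` setting, and the extra `data` argument on `analysisFunc` are assumptions made so the example runs on its own; adjust them to your setup.

```r
library(future)
library(furrr)

# Made-up stand-in data: 200 rows, 6 columns
set.seed(1)
Data <- as.data.frame(matrix(rnorm(200 * 6), ncol = 6))

# All ordered pairs of distinct columns, as in the original post
modelcols <- subset(expand.grid(c1 = names(Data), c2 = names(Data),
                                stringsAsFactors = FALSE), c1 != c2)

# Same model as before; the data is passed in explicitly so it is
# shipped to the workers without relying on global lookup
analysisFunc <- function(x, y, data) {
  residuals(lm(data[[x]] ~ data[[y]]))
}

# Run on background R sessions; pick workers to match your cores
plan(multisession, workers = 2)

# future_map2 returns a list with one residual vector per pair
residualList <- future_map2(modelcols$c1, modelcols$c2, analysisFunc,
                            data = Data)

# Bind the list into a matrix, one column per pair
linearData <- do.call(cbind, residualList)
colnames(linearData) <- paste(modelcols$c1, modelcols$c2, sep = "_")

# Shut the workers down again
plan(sequential)
```

The speed-up depends on the number of cores and on how long each model fit takes relative to the overhead of sending data to the workers, so it is worth timing on a subset first.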

For anything more than that, I think you would have to rethink the problem.
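
As one possible sketch of such a rethink, assuming every model is a one-predictor regression with an intercept (as in the original code) and there are no missing values: the residuals then have a closed form, so all responses for a given predictor can be handled with one matrix operation instead of one lm call per pair. The random `Data` and the helper name `residuals_on` are illustrative assumptions.

```r
# Made-up stand-in data: 50 rows, 4 columns named V1..V4
set.seed(1)
Data <- as.data.frame(matrix(rnorm(50 * 4), ncol = 4))

X  <- as.matrix(Data)
Xc <- sweep(X, 2, colMeans(X))   # center every column
ss <- colSums(Xc^2)              # sum of squares of each centered column

# Residuals of every other column regressed on predictor column j.
# For y ~ x with an intercept: residual = (y - mean(y)) - b * (x - mean(x)),
# where b = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
residuals_on <- function(j) {
  slopes <- drop(crossprod(Xc[, j], Xc)) / ss[j]   # one slope per response
  res <- Xc - Xc[, j] %o% slopes                   # n x p residual matrix
  res[, -j, drop = FALSE]                          # drop the predictor itself
}

# Check one pair against lm
closed_form <- residuals_on(1)[, "V2"]
from_lm     <- residuals(lm(Data$V2 ~ Data$V1))
print(all.equal(unname(closed_form), unname(from_lm)))
```

With this layout you loop over the 500 predictors rather than the 249,500 pairs, and you can apply the outlier rule to each block of residuals and discard it before moving on, instead of keeping the full residual matrix.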