Hi All,

I have a data set with 10K rows and 500 columns. I am performing outlier analysis on it, and the code I wrote is not suitable for data of this size. Basically, I want to find the `residuals` for each column combination and then plug them into the outlier analysis formula. The latter part is fast; however, the `residuals` calculation is very slow, as shown in the code below.

```r
# Data frame of all column combinations, excluding self-pairs
modelcols <- subset(expand.grid(c1 = names(Data), c2 = names(Data),
                                stringsAsFactors = FALSE), c1 != c2)

# Function: fit a simple linear model for one pair and return its residuals
analysisFunc <- function(x, y) {
  # Fetch the two columns on which to perform the analysis
  c1 <- Data[[x]]
  c2 <- Data[[y]]
  # Create the linear model
  linearModel <- lm(c1 ~ c2)
  # Capture model data from the summary
  modelData <- summary(linearModel)
  # Return the residuals
  modelData$residuals
}

# Apply the function to get a matrix of residuals (one column per pair)
linearData <- mapply(analysisFunc, modelcols$c1, modelcols$c2)

# Rename the matrix columns after the pairs
colnames(linearData) <- paste(modelcols$c1, modelcols$c2, sep = "_")
```
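For a single pair, one thing I noticed is that the residuals don't actually require building the whole `summary()` object: for simple regression they can be computed from the closed form (slope `b = cov(x, y) / var(x)`, intercept `a = mean(y) - b * mean(x)`). A minimal sketch of this idea (the function name `fastResiduals` is just my own, and it assumes complete cases and nonzero variance in the predictor):

```r
# Residuals of lm(y ~ x) computed directly from the simple-regression
# closed form, skipping lm() and summary() entirely
fastResiduals <- function(y, x) {
  b <- cov(x, y) / var(x)        # OLS slope
  a <- mean(y) - b * mean(x)     # OLS intercept
  y - (a + b * x)                # residuals = observed - fitted
}

# Sanity check against lm() on toy data
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
all.equal(unname(resid(lm(y ~ x))), fastResiduals(y, x))  # TRUE
```

This matches `resid(lm(y ~ x))` up to floating-point tolerance, but I don't know whether replacing `lm()` with arithmetic like this is the idiomatic way to scale it to all 500 * 499 pairs.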

Is there any way to speed this up using some `R` optimization technique I might not be aware of?

Since the actual data set is very large, I can't post it as a reproducible example, but here is a much smaller sample:

```r
Data <- read.table(text=" data1 data2 data3 data4
-0.710003 -0.714271 -0.709946 -0.713645
-0.710458 -0.715011 -0.710117 -0.714157
-0.71071 -0.714048 -0.710235 -0.713515
-0.710255 -0.713991 -0.709722 -0.71397
-0.710585 -0.714491 -0.710223 -0.713885
-0.710414 -0.714092 -0.710166 -0.71434
-0.711255 -0.714116 -0.70945 -0.714173
-0.71097 -0.714059 -0.70928 -0.714059
-0.710343 -0.714576 -0.709338 -0.713644", header=TRUE)
```
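For reference, on data of this shape (9 rows, 4 columns) the code above produces a 9 x 12 residual matrix, since 4 columns give 4 * 3 = 12 ordered pairs. A quick check with a random stand-in of the same dimensions:

```r
# Stand-in data frame with the same shape as the sample above
set.seed(42)
Data <- as.data.frame(matrix(rnorm(36), nrow = 9,
                             dimnames = list(NULL, paste0("data", 1:4))))

# Same pair enumeration and residual computation as in the question
modelcols <- subset(expand.grid(c1 = names(Data), c2 = names(Data),
                                stringsAsFactors = FALSE), c1 != c2)
linearData <- mapply(function(x, y) resid(lm(Data[[x]] ~ Data[[y]])),
                     modelcols$c1, modelcols$c2)
colnames(linearData) <- paste(modelcols$c1, modelcols$c2, sep = "_")

dim(linearData)  # 9 12  -> one residual column per ordered pair
```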

Thanks.