R (regression) model size gets larger when being saved within a function

I want to fit a regression model inside another function, but when I save the model the file becomes very large because other data in the function's environment is saved along with it. Thus, I think the solution might be to work with separate environments; this helped me understand the issue better. Below I explain the problem in a few steps.

# Helper function just to quickly assess how big the object becomes when being saved.
saveSize <- function (object) {
  tf <- tempfile(fileext = ".RData")
  on.exit(unlink(tf))
  save(object, file = tf)
  file.size(tf)
}

# Subset of rows (observations) to be used
subset <- 1:4

# Model size to compare with; i.e., not created within a function
model1 <- lm(Sepal.Length ~ Sepal.Width, data = iris, subset = subset)
saveSize(model1)
# Size = 965

# Function where there are other data that should NOT be saved. 
Function2 <- function (subset){
  data_not_to_be_saved <- 1:1e+15
  model2 <- lm(Sepal.Length ~ Sepal.Width, data = iris, subset = subset)
}
model2 <- Function2(subset)
saveSize(model2) 
# Size = 1148; problematic that the size is larger than for model1.

# Solution to above is to create a new environment
Function3 <- function (subset){
  data_not_to_be_saved <- 1:1e+15
  # New environment
  env <- new.env(parent = globalenv())
  env$subset <- subset
  with(env, lm(Sepal.Length ~ Sepal.Width, data = iris, subset = subset))
}
model3 <- Function3(subset)
saveSize(model3) 
# 1002 # Success: considerably smaller than in Function 2. 



# PROBLEM: Getting solution in Function 3 to work within another function. 

# This function runs but results in a large object again.
# Also note that I do not want to refer to the iris dataset directly within the lm call.
Function5 <- function (subset){
  
  data_not_to_be_saved <- 1:1e+15
  
  Function5 <- function (subset) {
    
    env <- new.env(parent = globalenv())
    env$subset <- subset
    env$datainenvorment <- iris
    
    with(env, lm(Sepal.Length ~ Sepal.Width, data = datainenvorment, subset = subset))
  }
  model5 <- Function5(subset)
}

model5 <- Function5(subset)
saveSize(model5) 

Thanks in advance

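This is likely the function's environment being captured along with the model: the formula keeps a reference to the environment it was created in, so everything in that environment gets saved as well. Special rules prevent this for the global and base environments, which are saved by reference rather than by content. Nina Zumel wrote about this: https://win-vector.com/2014/05/30/trimming-the-fat-from-glm-models-in-r/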
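
To make the capture visible, here is a minimal sketch; the function name `fit_in_function` and the variable `junk` are made up for illustration and do not come from the thread. It only inspects which environment the fitted model's terms point to.

```r
# Hypothetical helper (not from the thread): fit a model inside a function
fit_in_function <- function() {
  junk <- rnorm(1e5)  # local object that the model never uses
  lm(Sepal.Length ~ Sepal.Width, data = iris)
}

m <- fit_in_function()

# The terms object keeps a reference to the frame the formula was created in
env <- environment(m$terms)
ls(env)                      # lists "junk"
identical(env, globalenv())  # FALSE: it is the function's own frame
```

Because `junk` lives in that frame, `save()` has to write it out together with the model.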


I can imagine the following causing more exotic uses of lm to fail, but it seems to work for your use case:

saveSize <- function (object) {
  tf <- tempfile(fileext = ".RData")
  on.exit(unlink(tf))
  # Drop the environment references held by the model frame's terms and the model's terms
  attr(attr(object$model, which = "terms"), which = ".Environment") <- NULL
  attr(object$terms, which = ".Environment") <- NULL
  save(object, file = tf)
  file.size(tf)
}
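
If you would rather keep the size helper unchanged and trim the model itself, the same two resets could be moved into their own function. This is only a sketch: `strip_env` and `fit_leaky` are hypothetical names, and it does not check whether `predict()`, `update()`, or other functions that rely on the formula's environment still behave as expected afterwards.

```r
# Size helper as in the original question
saveSize <- function(object) {
  tf <- tempfile(fileext = ".RData")
  on.exit(unlink(tf))
  save(object, file = tf)
  file.size(tf)
}

# Hypothetical helper: applies the two attribute resets shown above
# and returns the trimmed model
strip_env <- function(object) {
  attr(attr(object$model, which = "terms"), which = ".Environment") <- NULL
  attr(object$terms, which = ".Environment") <- NULL
  object
}

# Hypothetical example: a model fitted next to a large, unrelated local vector
fit_leaky <- function() {
  junk <- rnorm(1e5)
  lm(Sepal.Length ~ Sepal.Width, data = iris)
}

m <- fit_leaky()
size_before <- saveSize(m)
size_after  <- saveSize(strip_env(m))
size_after < size_before  # the trimmed model no longer carries junk
```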


On Stack Overflow it was pointed out that my solution actually worked; the difference just could not be seen because the junk variable in my example held too little data. To show the difference more clearly, it was proposed to use:

data_not_to_be_saved <- rnorm(10**5)
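
For reference, here is a self-contained sketch of that comparison. The names `fit_leaky` and `fit_clean` are made up here; they correspond roughly to Function2 and Function3 from the question with the junk variable swapped for `rnorm(10**5)`. Exact byte counts will vary between runs and R versions, so none are quoted.

```r
# Size helper as in the original question
saveSize <- function(object) {
  tf <- tempfile(fileext = ".RData")
  on.exit(unlink(tf))
  save(object, file = tf)
  file.size(tf)
}

# Fit in the function's own frame: the junk vector is captured with the model
fit_leaky <- function(subset) {
  data_not_to_be_saved <- rnorm(10**5)
  lm(Sepal.Length ~ Sepal.Width, data = iris, subset = subset)
}

# Fit in a fresh environment whose parent is the global environment
fit_clean <- function(subset) {
  data_not_to_be_saved <- rnorm(10**5)
  env <- new.env(parent = globalenv())
  env$subset <- subset
  with(env, lm(Sepal.Length ~ Sepal.Width, data = iris, subset = subset))
}

size_leaky <- saveSize(fit_leaky(1:4))
size_clean <- saveSize(fit_clean(1:4))
size_leaky > size_clean  # the leaky fit drags the random vector along
```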
