Error in eval(predvars, data, env) : object 'x' not found when creating a function

I'm trying to write a function as a wrapper for boxcox() from the MASS library to be used with mutate() at a later stage. The goal is to replace the original variable by one which is approximately normally distributed.
As you see, I pass a vector 'x ' to my function, which I would have expected to be passed to the function lm(x ~ 1), but instead I get the message Error in eval(predvars, data, env) : object 'x' not found .
The same code without the wrapper function works fine. Where do I go wrong?

library(MASS)

# Box-Cox-Normalization of a numerical vector
boxcox.transform <- function(x) {
  # Note that x~1 counts as a linear model in R, so you can use the
  # boxcox function from MASS
  bc <- boxcox(lm(x ~ 1))

  # lambda that maximises the likelihood
  lambda <- bc$x[bc$y == max(bc$y)]
  
  # print lambda and 95% confidence limits
  # lambda
  # range(out$x[out$y > max(out$y)-qchisq(0.95,1)/2])

  # Box-Cox-transformed variable
  result <- (x^lambda-1)/lambda

  # histogram
  # histogram(result)
  
  # return transformed variable
  return(result)
}

boxcox.transform(mtcars$disp)

First, let's simplify a bit the example:

This works:

init_dta <- mtcars$disp
fit <- lm(init_dta ~ 1) 
bc <- boxcox(fit)
lambda <- bc$x[bc$y == max(bc$y)]
result <- (init_dta^lambda-1)/lambda

But this doesn't work:

my_lm <- function(.x){
  lm(.x)
}

init_dta <- mtcars$disp
fit2 <- my_lm(init_dta ~ 1)
bc <- boxcox(fit2)
#> Error in stats::model.frame(formula = .x, drop.unused.levels = TRUE): object '.x' not found

We can look at the traceback:

10. stats::model.frame(formula = .x, drop.unused.levels = TRUE)
9. eval(mf, parent.frame())
8. eval(mf, parent.frame())
7. lm(formula = .x, y = TRUE, qr = TRUE)
6. eval(call, parent.frame())
5. eval(call, parent.frame())
4. update.default(object, y = TRUE, qr = TRUE, ...)
3. update(object, y = TRUE, qr = TRUE, ...)
2. boxcox.lm(fit2)
1. boxcox(fit2)

So, the problem comes from the call to update(), when running the boxcox function, it is trying to re-run the initial fit, but can not find the variables. You can check the source code of boxcox.lm() to see that call. And we can look at our lm object:

fit2
#> 
#> Call:
#> lm(formula = .x)
#> 
#> Coefficients:
#> (Intercept)  
#>       230.7  

Here the important part is the Call: upon running, lm() captured the exact expression that was used; but since we called it within a function, the formula argument is meaningless. Hence, the later call to update() fails.

So, let's go back to the source code of boxcox.lm() (you can display it with View(boxcox), or ctrl+click on function name, or F2 while on function name, use the menu on top to select boxcox.lm). It looks like this:

function (object, lambda = seq(-2, 2, 1/10), plotit = TRUE, 
  interp = (plotit && (m < 100)), eps = 1/50, xlab = expression(lambda), 
  ylab = "log-Likelihood", ...) 
{
  m <- length(lambda)
  if (is.null(object$y) || is.null(object$qr)) 
    object <- update(object, y = TRUE, qr = TRUE, ...)
  result <- NextMethod()
  if (plotit) 
    invisible(result)
  else result
}

So, the call to update() only happens when y and qr are not present in the lm object. Let's look back at the arguments of the lm() function:

lm(formula, data, subset, weights, na.action,
   method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
   singular.ok = TRUE, contrasts = NULL, offset, ...)

By default, qr = TRUE, but y = FALSE, so when calling boxcox, it can't find the data for y, and thus tries to recompute the lm fit. But since the call contains references to variable names instead of the names in the data, it fails to rerun. We finally get our working solution:

library(MASS)

# Box-Cox-Normalization of a numerical vector
boxcox_transform <- function(x) {
  # Note that x~1 counts as a linear model in R, so you can use the
  # boxcox function from MASS
  bc <- boxcox(lm(x ~ 1, y = TRUE))
  
  # lambda that maximises the likelihood
  lambda <- bc$x[bc$y == max(bc$y)]
  
  # print lambda and 95% confidence limits
  # lambda
  # range(out$x[out$y > max(out$y)-qchisq(0.95,1)/2])
  
  # Box-Cox-transformed variable
  result <- (x^lambda-1)/lambda
  
  # histogram
  # histogram(result)
  
  # return transformed variable
  return(result)
}

res1 <- boxcox_transform(mtcars$disp)

init_dta <- mtcars$disp
fit <- lm(init_dta ~ 1) 
bc <- boxcox(fit)


lambda <- bc$x[bc$y == max(bc$y)]
res2 <- (init_dta^lambda-1)/lambda

waldo::compare(res1, res2)
#> v No differences
1 Like

Thank you so much, AlexisW, for this comprehensive and didactic explanation! I learned something today. Feels great!

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.