calling the lsmeans() function within a user-defined function does not work with substitute()

simRock · February 29, 2020, 12:29am

Hi, I am trying to call a second function, lsmeans(), within my user-defined function. The first version, with just glm(), works, but the second one, with lsmeans(), does not.
@raytong and @technocrat, or could anyone else help?

set.seed(234)
sex <- sample(c("M", "F"), size=100, replace=TRUE)
age <- rnorm(n=100, mean=20 + 4*(sex=="F"), sd=0.1)
dsn <- data.frame(sex, age)
rm(sex, age) #remove sex and age from the global environment for reproducibility

to_analyze <- function(dep, indep, data){
  glm(substitute(dep ~ factor(indep)), data=data)
}

to_analyze(dep=age, indep=sex, data=dsn)
#> 
#> Call:  glm(formula = substitute(dep ~ factor(indep)), data = data)
#> 
#> Coefficients:
#>  (Intercept)  factor(sex)M  
#>       24.006        -4.034  
#> 
#> Degrees of Freedom: 99 Total (i.e. Null);  98 Residual
#> Null Deviance:       397.3 
#> Residual Deviance: 0.8152    AIC: -191.2

However, I am stuck again: I am trying to pass the output of this model to lsmeans::lsmeans() to predict marginal means and return the result, but it gives me an error. Any help would be greatly appreciated.

to_predict_lsmeans <- function(dep, indep, data){
  model <- glm(substitute(dep ~ factor(indep)), data=data)
  pred <- lsmeans::lsmeans(model, substitute(~ factor(indep)), offset=substitute(data)$log(age), type ="response" )
  return(pred)
}

pred <- to_predict_lsmeans(dep=age, indep=sex, data=dsn)
#> Error in ref_grid(object, ...): We are unable to reconstruct the data.
#> The variables needed are:
#>  sex
#> Are any of these actually constants? (specify via 'params = ')
#> The dataset name is:
#>  data
#> Does the data still exist? Or you can specify a dataset via 'data = '
pred
#> Error in eval(expr, envir, enclos): object 'pred' not found

technocrat · February 29, 2020, 1:26am

Judging from the FAQ for emmeans (to which lsmeans is a gateway), it doesn't like independent variables that are continuous:

But before saying more, I have a question for you: Are you sure your model is meaningful?

If your question concerns only two-level predictors such as sex (coded 1 for female, 2 for male), no problem. The model will produce the same predictions as you’d get if you’d used these as factors.
If any of the predictors has 3 or more levels, you may have fitted a nonsense model, in which case you need to fit a different model that does make sense before doing any kind of post hoc analysis. For instance, the model contains a covariate brand (coded 1 for Acme, 2 for Ajax, and 3 for Al’s), this model is implying that the difference between Acme and Ajax is exactly equal to the difference between Ajax and Al’s, owing to the fact that a linear trend in brand has been fitted. If you had instead coded 1 for Ajax, 2 for Al’s, and 3 for Acme, the model would produce different fitted values.
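
To make the FAQ's point concrete, here is a minimal sketch in base R. The `brand` codes and `y` values are made up for illustration and are not from this thread. It compares fitting a numerically coded predictor as a covariate with fitting it as a factor:

```r
# Made-up data: three brands coded 1, 2, 3, four observations each
brand <- rep(1:3, each = 4)
y <- c(10, 11, 9, 10, 15, 14, 16, 15, 11, 12, 10, 11)

# Numeric coding fits a single slope, i.e. a linear trend across the codes
fit_numeric <- lm(y ~ brand)

# factor() fits a separate mean for each brand
fit_factor <- lm(y ~ factor(brand))

length(coef(fit_numeric))  # intercept + one slope
length(coef(fit_factor))   # intercept + two contrasts

# Under the numeric coding, the predicted step from brand 1 to 2
# is forced to equal the step from brand 2 to 3
p <- predict(fit_numeric, newdata = data.frame(brand = 1:3))
diff(p)
```

With only two levels, as with sex in the original question, the two codings give the same predictions, which is the case the FAQ calls "no problem".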

raytong · February 29, 2020, 2:03pm

Hi @simRock. The error occurs because the data is not available in the environment where lsmeans evaluates the model. You can attach the data to the search path so that lsmeans can access it. Remember to detach the data after use.

set.seed(234)
sex <- sample(c("M", "F"), size=100, replace=TRUE)
age <- rnorm(n=100, mean=20 + 4*(sex=="F"), sd=0.1)
dsn <- data.frame(sex, age)
rm(sex, age) #remove sex and age from the global environment for reproducibility

to_predict_lsmeans <- function(dep, indep, data){
  model <- glm(as.formula(sprintf("%s ~ factor(%s)", dep, indep)), data=data)
  attach(data)
  res <- eval(parse(text = sprintf("lsmeans::lsmeans(model, ~ factor(%s), offset = log(%s), type = 'response')", indep, dep)))
  detach(data)
  res
}

to_predict_lsmeans(dep="age", indep="sex", data=dsn)
#>  sex lsmean      SE  df asymp.LCL asymp.UCL
#>  F    26.98 0.01269 Inf     26.96     27.01
#>  M    22.99 0.01348 Inf     22.97     23.02
#> 
#> Confidence level used: 0.95

Created on 2020-02-29 by the reprex package (v0.3.0)
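
As a possible alternative to attach()/detach(), the error message in the original question itself suggests specifying the dataset via 'data = '. The sketch below is untested against the thread's package versions and rests on some assumptions: the helper name `fit_by_name` is hypothetical, the formula is built with base R's reformulate(), and emmeans::emmeans() is called directly (rather than lsmeans::lsmeans()) with its `data` argument. The offset from the original code is left out to keep the example small.

```r
set.seed(234)
sex <- sample(c("M", "F"), size = 100, replace = TRUE)
age <- rnorm(n = 100, mean = 20 + 4 * (sex == "F"), sd = 0.1)
dsn <- data.frame(sex, age)
rm(sex, age)

# Hypothetical helper: build dep ~ factor(indep) from strings and fit it
fit_by_name <- function(dep, indep, data) {
  f <- reformulate(sprintf("factor(%s)", indep), response = dep)
  glm(f, data = data)
}

model <- fit_by_name("age", "sex", dsn)

# The stored call refers to the argument name `data`, not to `dsn`,
# which is why the data cannot be reconstructed later
model$call$data

# Assumption: passing the data frame explicitly sidesteps the lookup
if (requireNamespace("emmeans", quietly = TRUE)) {
  print(emmeans::emmeans(model, ~ sex, data = dsn))
}
```

If this works in your setup, nothing is attached to the search path, so there is nothing to forget to detach.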

dromano · February 29, 2020, 7:55pm

Here's a solution based on this section of Hadley Wickham's 'Advanced R' -- to_analyze() is modeled on the lm3() function from the subsection called 'Wrapping modelling functions', and to_predict_lsmeans() is modeled on the first example in the following section, called 'Evaluation environment':

set.seed(234)
sex <- sample(c("M", "F"), size=100, replace=TRUE)
age <- rnorm(n=100, mean=20 + 4*(sex=="F"), sd=0.1)
dsn <- data.frame(sex, age)
rm(sex, age) #remove sex and age from the global environment for reproducibility

library(tidyverse)
library(rlang)
#> 
#> Attaching package: 'rlang'
#> The following objects are masked from 'package:purrr':
#> 
#>     %@%, as_function, flatten, flatten_chr, flatten_dbl,
#>     flatten_int, flatten_lgl, flatten_raw, invoke, list_along,
#>     modify, prepend, splice
to_analyze <- function(dep, indep, data, env = caller_env()){
  dep <- enexpr(dep)
  indep <- enexpr(indep)
  data <- enexpr(data)
  formula <- expr(!!dep ~ factor(!!indep))
  glm_call <- expr(glm(!!formula, data = !!data))
  expr_print(glm_call)
  eval(glm_call, env)
}

to_predict_lsmeans <- function(dep, indep, data, env = caller_env()){
  dep <- enexpr(dep)
  dep_string <- as_string(dep)
  indep <- enexpr(indep)
  data <- enexpr(data)
  formula <- expr(!!dep ~ factor(!!indep))
  ta_call <- expr(to_analyze(!!dep, !!indep, !!data))
  model <- eval(ta_call, env)
  # need to store current environment so objects created within
  #  function can be accessed, too:
  lsm_env <- env(env, model = model, dep_string = dep_string)
  lsm_call <- 
    expr(
      lsmeans::lsmeans(model, ~ factor(!!indep), offset= log((!!data)[[!!dep_string]]), type ="response" )
    )
  eval(lsm_call, lsm_env)
}

to_predict_lsmeans(dep=age, indep=sex, data=dsn)
#> glm(age ~ factor(sex), data = dsn)
#>  sex lsmean      SE  df asymp.LCL asymp.UCL
#>  F    26.98 0.01269 Inf     26.96     27.01
#>  M    22.99 0.01348 Inf     22.97     23.02
#> 
#> Confidence level used: 0.95

Created on 2020-02-29 by the reprex package (v0.3.0)
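
For readers who would rather not depend on rlang, the same idea for the model-fitting step can be sketched in base R: capture the arguments with substitute(), splice them into a call with bquote(), and evaluate that call in the caller's environment. The name `to_analyze_base` is hypothetical, and this is only a rough analogue of to_analyze() above, not a drop-in replacement.

```r
set.seed(234)
sex <- sample(c("M", "F"), size = 100, replace = TRUE)
age <- rnorm(n = 100, mean = 20 + 4 * (sex == "F"), sd = 0.1)
dsn <- data.frame(sex, age)
rm(sex, age)

to_analyze_base <- function(dep, indep, data, env = parent.frame()) {
  # Capture the unevaluated arguments (the symbols age, sex, dsn)
  dep <- substitute(dep)
  indep <- substitute(indep)
  data <- substitute(data)
  # Splice them into a call, so the model records real names
  glm_call <- bquote(glm(.(dep) ~ factor(.(indep)), data = .(data)))
  # Evaluate where the user called us, so `dsn` can be found
  eval(glm_call, env)
}

model <- to_analyze_base(dep = age, indep = sex, data = dsn)
model$call
#> glm(formula = age ~ factor(sex), data = dsn)
```

Because the stored call names `dsn` rather than a function-local `data`, later code that re-evaluates the call from the same environment can find the dataset.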

simRock · March 3, 2020, 4:33am

Thank you technocrat, raytong and dromano for your helpful insights. Thanks to them, I was able to come up with a shortcut that seems to work well. In any case, I need to read up on Hadley Wickham's Advanced R book to better understand rlang, tidy evaluation, and quasiquotation.

to_analyze <- function (dep, indep, data) {
  dep <- data[,dep]
  indep <- data[,indep]
  
  m0 <- glm(dep ~ factor(indep), data=data)
  lsm1 <- lsmeans::lsmeans(m0, ~ factor(indep), offset=log(data$age), type ="response")
  return(lsm1)
}
to_analyze(dep = "age", indep = "sex", data=dsn) 
#>  indep lsmean      SE  df asymp.LCL asymp.UCL
#>  F      26.97 0.01395 Inf     26.95     27.00
#>  M      23.19 0.01313 Inf     23.16     23.21
#> 
#> Confidence level used: 0.95

Thank you so much

technocrat · March 3, 2020, 4:52am

Great. Please mark the solution for the benefit of those to follow. (No false modesty!)

dromano · March 3, 2020, 11:44am

I just noticed that you hardcoded the offset argument to lsmeans() by using data$age -- is that part of the shortcut, or was it unintended? Just thought I'd check to make sure

simRock · March 4, 2020, 6:01pm

You are correct, dromano. It should be offset=log(data$dep). Thankfully it still works. Thanks.
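
One caution for later readers about `data$dep`: in base R, `$` does not evaluate its right-hand side, so it looks for a column literally named "dep". A small sketch with a made-up three-row data frame (not the thread's data) shows the difference between `$` and `[[`:

```r
# Made-up data frame for illustration
dsn <- data.frame(sex = c("M", "F", "F"), age = c(20, 24, 24.1))
dep <- "age"

# `$` looks for a column called "dep"; there is none, so this is NULL
is.null(dsn$dep)

# `[[` uses the value of `dep`, so this returns the age column
dsn[[dep]]

# log(NULL) is an error, so an offset written this way would fail
result <- try(log(dsn$dep), silent = TRUE)
inherits(result, "try-error")
```

In the shortcut function above, `dep` has already been reassigned to the column itself, so `log(dep)` would presumably be the non-hardcoded form; that reading is an assumption and is not confirmed in the thread.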

system · March 11, 2020, 6:01pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.