Multilevel linear regression

Psychchick · November 3, 2022, 1:03am

Hi I was wondering if someone could help me fix a maximal multilevel regression model (it doesn’t need to converge). We want to look at what sport has the best effect on kids wellbeing.

Let’s say we have different schools and children, we ask them to rate there activity level and age at the start. We then split the kids up either in outdoor football, indoor basketball and tennis. They do this for 10 weeks and report a well-being score each week.

How would I create a maximal multilevel regression model for this that has fixed and random effects. I then want to simplify this model by looking at Varcorr

Thanks for your help

technocrat · November 3, 2022, 8:11am

The best place to start is with this CRAN Taskview to get an idea of the range of tools available. From there, I suggest reading the lmer4 vignette. Discrete Data Analysis with R or a similar text should also be consulted if well-being is a discrete (categorical) variable. Another resource is the dagR package for directed acyclic graphs to analyze potential mediating and confounding effects. See this review of the topic in Judea Pearl.

A preliminary analysis can help in selecting a modeling approach. Consider f(x) = y, where
f, x and y are R objects, which may, and often are composites of other objects.

For the study, x might be a data.table or data.frame object containing rows representing subjects and columns representing observations of the variables. Let Y be the response variable well being. Although well being possibly varies continuously from barely alive to healthy and happy across a variety of measures—BMI, performance on standardized tests, connectedness and centrality in social network graphs, etc., Y might be assignment to one of a few ordinal categories. The X—predictor—variables, might be school, age cohort, sport, minutes.

y—the return value of f— some measure of the association of X and Y in x. That is *given observed Y what can we infer about unobserved X and what is the relative contribution of Y_i \dots Y_n to what we can infer? (Or the interest may be in the change, from start to finish or from week to week, of Y.)

What choice of a function object—f—is appropriate to the need? That is, what model is appropriate. Taking the naive approach, we might think that a linear combination of the terms of X would tell something. For simplicity, assume Y is binary, rather than categorical and let's apply linear regression using one continuous variable, X1

fit <- lm(Y ~ X1, data = our_data)

We would get a model, but the diagnostics would prove problematic

because Y can take only one of two values—it's binary. This can be seen clearly by a simple scatterplot.

In other words, ordinary least squares regression is an inappropriate tool for a problem involving a binary response variable. For that we want an f in the form of

fit <- glm(Y ~ X1, data = our_data)

and after some further work, we get a profile log-likelihood result.

(Examples taken from here.)

However unless well being represented by Y is simply good or poor, the usual logistic regression model does not fit, either, because there may be poor, subpar, neutral, healthy and excellent to set some. For that case there is ordinal logistic regression. It comes in two flavors: proportional odds and forward continuation ratio. See Harrell ch 13.

There are two further complications.

The ordinality assumption must be satisfied—Y behaves in an ordinal fashion with respect to each X.

The time series nature of the data, which may introduce make the value of Y at t_k dependent upon the value at t_j. (See the case study by Harrell at §14.3).

Often time what is more of a challenge than how.

system · December 15, 2022, 8:11am

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.