Hi. I have a dataset with various independent variables, and scores similar to credit scores as my dependent variable. The complication is that the scores are only given in quintiles: for example, a score between 0 and 19 is given as a 1, 20 to 39 as a 2, etc. Do I treat these scores as numeric or as factors? Thank you.

My suggestion: an ordered factor. Use `factor` with `ordered` set to `TRUE`, and specify `levels` if the default sort order is not the order you want.
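A minimal sketch of what that could look like, using made-up band values (`score_band` and `band_f` are hypothetical names, not from the original data):

```r
# Hypothetical banded scores: 1 = 0-19, 2 = 20-39, ..., 5 = 80-99
score_band <- c(3, 1, 5, 2, 4, 3, 1)

# ordered = TRUE is what makes the factor ordinal;
# levels fixes the order explicitly (1 < 2 < 3 < 4 < 5)
band_f <- factor(score_band, levels = 1:5, ordered = TRUE)

is.ordered(band_f)       # check that R treats the factor as ordinal
levels(band_f)           # levels are stored as character strings
band_f[1] > band_f[2]    # comparisons are only defined for ordered factors
```

Passing `levels` without `ordered = TRUE` only controls the order in which the levels are stored; the result is still a nominal factor.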

I would rephrase the question: Is credit score as the dependent/response variable better treated as continuous or categorical? Is the goal prediction or classification?

My goal is prediction.

As I understand it, regression will create dummy variables of a variable that is declared as a factor.

What is the effect of ordering the levels? Why does that matter in a regression? Does it matter in other algorithms?
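One concrete effect, at least for a factor used as a predictor: with R's default contrast settings, an unordered factor is expanded into treatment (dummy) columns, while an ordered factor is expanded into polynomial contrast columns. A small sketch with made-up data (`d`, `band`, and `band_ord` are hypothetical names):

```r
# Toy data: a numeric response and a three-level grouping variable
d <- data.frame(
  y    = c(1.2, 2.3, 3.1, 4.8, 5.0, 6.2),
  band = factor(c("1", "2", "3", "1", "2", "3"))
)

# Unordered factor: one 0/1 dummy column per non-reference level
X_unordered <- model.matrix(y ~ band, data = d)
colnames(X_unordered)

# Ordered factor: linear (.L) and quadratic (.Q) polynomial contrasts
d$band_ord <- factor(d$band, levels = c("1", "2", "3"), ordered = TRUE)
X_ordered <- model.matrix(y ~ band_ord, data = d)
colnames(X_ordered)
```

Both design matrices span the same column space, so the fitted values from `lm` are identical; only the interpretation of the individual coefficients changes. Other algorithms may use the ordering differently, so it is worth checking the documentation of whatever method you use.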

This is analogous to categorizing the scores into bins. The assumption of normality of residuals is violated.
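To make the binning concrete, here is a rough sketch of how an underlying score could be collapsed into the five bands described in the question. It assumes the raw scores run from 0 to 99; `raw_score` and `band` are hypothetical names.

```r
# Hypothetical raw scores on an assumed 0-99 scale
raw_score <- c(0, 19, 20, 39, 99)

# right = FALSE gives intervals [0,20), [20,40), ..., [80,100),
# so 0-19 maps to band 1, 20-39 to band 2, and so on.
# labels = FALSE returns the integer band codes instead of a factor.
band <- cut(raw_score, breaks = seq(0, 100, by = 20),
            right = FALSE, labels = FALSE)
band
```

Once the scores are binned like this, the response can only take five distinct values, which is why the residuals from a linear fit cannot be normally distributed.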

```
# Load libraries
library(haven)
# Read in data
odata <- read_dta("https://stats.idre.ucla.edu/stat/data/ologit.dta")
# ols model with the single quantitative predictor
misfit <- lm(apply ~ gpa, data = odata)
summary(misfit)
#>
#> Call:
#> lm(formula = apply ~ gpa, data = odata)
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -0.7917 -0.5554 -0.3962  0.4786  1.6012
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.22016    0.25224  -0.873  0.38329
#> gpa          0.25681    0.08338   3.080  0.00221 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.6628 on 398 degrees of freedom
#> Multiple R-squared: 0.02328, Adjusted R-squared: 0.02083
#> F-statistic: 9.486 on 1 and 398 DF, p-value: 0.002214
# normal Q-Q plot of the residuals
plot(misfit, 2)
```

```
# changing the response variable to a factor produces warnings, and lm() cannot fit it meaningfully
odata$apply <- factor(odata$apply)
misfit <- lm(apply ~ gpa, data = odata)
#> Warning in model.response(mf, "numeric"): using type = "numeric" with a factor
#> response will be ignored
#> Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
```

Why would normality of the errors matter with 400 observations?

Linear models assume that the response is continuous and the error has a normal distribution.

Linear models do not necessarily assume the error has a normal distribution. For example, the Gauss-Markov theorem does not depend on the error distribution being normal.

Having a normal distribution matters for some things, although very few things if there are a large number of observations.

You’re right on both counts, and I shouldn’t have overgeneralized. For purposes of queuing up a predictive analysis of an ordinal response I wanted to steer thinking away from OLS where the problems seem obvious.

Yeah, getting the equation right is the most important thing. Couldn't agree more.

I was actually suggesting an ordinal logistic model, not OLS. But maybe there are other, possibly better, alternatives as well.
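For anyone who wants to try this, here is a minimal sketch of a proportional-odds ordinal logistic model using `polr` from the MASS package. That is one common implementation, not necessarily the one intended above. The data are simulated so the example runs without a download; `sim`, `band`, and the simulated `gpa` are assumptions for illustration only.

```r
library(MASS)  # polr() fits proportional-odds ordinal logistic models

set.seed(42)
n <- 400

# Simulated predictor and a latent continuous score with logistic noise
gpa    <- runif(n, min = 2, max = 4)
latent <- 1.0 * gpa + rlogis(n)

# Collapse the latent score into five ordered bands, mimicking the
# situation where only the band is observed
band <- cut(latent,
            breaks = quantile(latent, probs = seq(0, 1, by = 0.2)),
            include.lowest = TRUE,
            labels = as.character(1:5),
            ordered_result = TRUE)
sim <- data.frame(gpa = gpa, band = band)

# Hess = TRUE keeps the Hessian so summary() can report standard errors
fit <- polr(band ~ gpa, data = sim, Hess = TRUE)
summary(fit)

# Predictions for two hypothetical new cases: the most likely band,
# and the full set of band probabilities
new_cases  <- data.frame(gpa = c(2.5, 3.5))
pred_class <- predict(fit, newdata = new_cases, type = "class")
pred_prob  <- predict(fit, newdata = new_cases, type = "probs")
pred_class
pred_prob
```

Since the stated goal is prediction, the `type = "probs"` output is probably the more useful of the two: it gives a probability for each band rather than a single point guess.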
