Linear regression - count

sharon_r · May 14, 2021, 7:03pm

Hello everyone,

I hope that you do not mind helping me with this likely quite simple problem I am having.

Can linear regression be used for count data?

If not - why do universities, as well as online courses, teach linear regression while presenting use cases with count variables?

Your help is much appreciated.

Thank you.

technocrat · May 14, 2021, 7:18pm

Yes. See this UCLA page

startz · May 14, 2021, 7:36pm

Following up on the answer from @technocrat :

There are forms of linear regression, such as Poisson, specifically built for count data. But it is also true that in many instances the "standard" ordinary least squares regression works well.

sharon_r · May 14, 2021, 9:04pm

Thank you very much for your reply

1. Is Poisson a form of linear regression?
2. In which instances does the "standard" ordinary least squares regression work well, and in which does it not?

Thanks again!

startz · May 14, 2021, 10:54pm

It's certainly a regression, but one could argue about the "linear" part: the model is linear in the log of the expected count rather than in the count itself. Basically, a Poisson regression is estimated by maximum likelihood.
If one knows that the data is generated by a Poisson process, a Poisson regression would be better. But an OLS regression should correctly tell you the marginal effect of the independent variables on the mean of the counts. The regression will likely have heteroskedasticity issues, because the variance of a count tends to grow with its mean, so the standard errors should be corrected (for example, with heteroskedasticity-robust standard errors).
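
A minimal sketch of the standard-error point, under stated assumptions: the data are simulated (not from this thread), the true mean of the count is taken to be linear in a single predictor, `1 + 2*x`, and the correction shown is the plain HC0 "sandwich" estimator. The helper names `poisson_draw` and `ols_slope_with_se` are made up for illustration. It only uses the Python standard library.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def ols_slope_with_se(x, y):
    """Simple OLS of y on x; returns intercept, slope, classical SE and HC0 SE of the slope."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # Classical SE assumes one common error variance for every observation.
    sigma2 = sum(e * e for e in resid) / (n - 2)
    se_classical = math.sqrt(sigma2 / sxx)
    # HC0 SE lets each observation keep its own squared residual.
    se_robust = math.sqrt(sum(((xi - xbar) * e) ** 2 for xi, e in zip(x, resid))) / sxx
    return intercept, slope, se_classical, se_robust


# Assumed data-generating process: skewed predictor, Poisson counts with mean 1 + 2x.
rng = random.Random(42)
n = 20000
x = [rng.expovariate(1.0) for _ in range(n)]
y = [poisson_draw(rng, 1.0 + 2.0 * xi) for xi in x]

intercept, slope, se_classical, se_robust = ols_slope_with_se(x, y)
print("intercept:", round(intercept, 3), "slope:", round(slope, 3))
print("classical SE:", round(se_classical, 4), "robust (HC0) SE:", round(se_robust, 4))
```

In this setup OLS should recover a slope close to 2, while the classical standard error is expected to come out smaller than the robust one, because the variance of the counts rises with `x`.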

sharon_r · May 14, 2021, 11:54pm

Thank you very much for your detailed answer

Just to make sure, is there any “classic” example of variables that can (almost) always be used in linear regression?

Can, for example, linear regression be used for predicting ‘time until completing a short process’ (a non-negative variable)?

Thank you very much. Your help is much appreciated.

technocrat · May 15, 2021, 12:01am

In the context of time series

If the minimum number of customers is at least 100, then the difference between a continuous sample space [100,∞) and the discrete sample space {100,101,102,…} has no perceivable effect on our forecasts. However, if our data contains small counts (0,1,2,…), then we need to use forecasting methods that are more appropriate for a sample space of non-negative integers.

Hyndman

startz · May 15, 2021, 4:34pm

@technocrat could you expand on this a little? Certainly, if the set of possible discrete integers is dense then it isn't much different from a continuous distribution. I don't see the difference though between 100,101,102... and 0, 1, 2...

In a least squares regression all that would happen would be the intercept would be 100 higher in the former than in the latter. I'm probably missing something.

nirgrahamuk · May 15, 2021, 7:05pm

I believe the difference is that while 98 and 99 might be rare but acceptable values, -2 and -1 are impossible values for a count.
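
A small sketch of that point, using made-up numbers (the ten counts below are hypothetical, not from any dataset in this thread). The same straight-line fit is applied to small counts and to the same counts shifted up by 100; `fit_line` is an illustrative helper name.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - slope * xbar, slope


# Hypothetical data: counts near zero, and the same counts shifted by 100.
x = list(range(10))
small_counts = [0, 0, 0, 1, 0, 2, 1, 3, 4, 6]
large_counts = [c + 100 for c in small_counts]

b0_small, b1_small = fit_line(x, small_counts)
b0_large, b1_large = fit_line(x, large_counts)

pred_small = [b0_small + b1_small * xi for xi in x]
pred_large = [b0_large + b1_large * xi for xi in x]

# Fitted values below zero are impossible for a count.
impossible = [round(p, 2) for p in pred_small if p < 0]
print("small counts: intercept", round(b0_small, 2), "slope", round(b1_small, 2))
print("negative fitted values:", impossible)
print("smallest fitted value for shifted counts:", round(min(pred_large), 2))
```

The slope is the same in both fits and only the intercept moves by 100, as noted above. The difference is that the fit to the small counts produces negative fitted values at the low end of `x`, while the shifted fit stays near 99, which is a perfectly plausible count.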

startz · May 15, 2021, 7:20pm

Truncation! Didn't think of that. Thanks very much.

technocrat · May 15, 2021, 7:27pm

Hyndman's point is that if n is sufficiently large, the departure from the continuous-variable assumptions underlying regression is not so large as to matter. Following the quoted passage, he describes Croston's method for dealing with count forecasts and cites Vasiliki Christou & Konstantinos Fokianos (2015) On count time series prediction, Journal of Statistical Computation and Simulation, 85:2, 357-373, DOI: 10.1080/00949655.2013.823612 for their use of the Poisson distribution and the negative binomial distribution.
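
One rough way to see why the size of the counts matters, as a sketch under an assumption that is mine rather than Hyndman's: suppose the count is Poisson with mean `lam`, and approximate it by a normal distribution with the same mean and variance. The probability that the normal approximation places on impossible negative values is then Phi(-sqrt(lam)). The function name `normal_mass_below_zero` is made up for illustration.

```python
import math


def normal_mass_below_zero(lam):
    """P(Z < 0) for Z ~ Normal(mean=lam, variance=lam), i.e. Phi(-sqrt(lam))."""
    return 0.5 * math.erfc(math.sqrt(lam) / math.sqrt(2.0))


# Assumed means, from small counts up to counts around 100.
for lam in (1, 5, 25, 100):
    print(lam, normal_mass_below_zero(lam))
```

For a mean of 1 the approximation puts roughly 16% of its probability below zero, whereas for a mean of 100 that share is negligible.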

startz · May 15, 2021, 7:49pm

Yes. I was forgetting that these are counts rather than just discrete integers. So n does matter.

sharon_r · May 25, 2021, 7:35am

Thank you!
Seeing this article, I'm a bit confused and would like your input, if possible.
Thanks again.

nirgrahamuk · May 25, 2021, 8:36am

You need to analyse your data to know what you are working with, and what is reasonable or unreasonable to do with it in further analysis such as model building.

You write that you are confused and post an extract of an article, but you don't ask a question related to it, so I do wonder how technocrat or anyone else might respond to you.

What specifically is your confusion?

sharon_r · May 25, 2021, 10:19am

Thank you.
I am confused about the idea of using linear regression for count data.
In my understanding, Poisson regression can be used. However, I am not sure about using 'classical' linear regression for count data. For example, can a variable such as 'number of children' be used as a response variable in 'classical' linear regression? May it sometimes be used?
Thanks.

startz · May 25, 2021, 1:21pm

People sometimes use a classical linear regression. It usually gives a not terrible approximation. But a regression designed for count data, such as a Poisson or zero-inflated Poisson regression, is generally better.
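
For illustration only, here is a sketch of what a Poisson regression does under the hood, assuming a single predictor and a log link. The data are simulated with assumed true coefficients 0.2 and 0.8, the fit uses plain Newton-Raphson on the Poisson log-likelihood, and the helper names `poisson_draw` and `fit_poisson` are made up. In practice one would use an established routine (for example `glm(..., family = poisson)` in R) rather than hand-rolled code like this.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def fit_poisson(x, y, max_iter=50, tol=1e-10):
    """Maximum likelihood fit of log(E[y]) = b0 + b1 * x by Newton-Raphson."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start from the intercept-only model
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Negative Hessian (Fisher information), a 2x2 matrix.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Assumed data-generating process: log of the mean count is 0.2 + 0.8 * x.
rng = random.Random(7)
x = [rng.uniform(0.0, 2.0) for _ in range(5000)]
y = [poisson_draw(rng, math.exp(0.2 + 0.8 * xi)) for xi in x]

b0_hat, b1_hat = fit_poisson(x, y)
fitted = [math.exp(b0_hat + b1_hat * xi) for xi in x]
print("estimated coefficients:", round(b0_hat, 3), round(b1_hat, 3))
print("smallest fitted mean:", round(min(fitted), 3))
```

Because the fitted mean is `exp(b0 + b1 * x)`, it can never be negative, which is one practical reason a count model tends to behave better than a straight line when the counts are small.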

sharon_r · May 25, 2021, 4:56pm

Thank you
Just trying to examine the data using the best fit (if possible).
Thanks again.

technocrat · May 27, 2021, 6:49am

That's a fair explanation of the problems with count data in the context of ordinary least squares linear regression: the underlying assumptions for the validity of the test are hard to satisfy. Because those assumptions too often go unexamined in all types of applications of the test, that's not surprising.
