# Linear regression - count

Hello everyone,

I hope that you do not mind helping me with this likely quite simple problem I am having.

Can linear regression be used for count data?

If not - why do universities, as well as online courses, teach linear regression while presenting use cases with count variables?

Thank you.

Yes. See this UCLA page

Following up on the answer from @technocrat :

There are forms of linear regression, such as Poisson, specifically built for count data. But it is also true that in many instances the "standard" ordinary least squares regression works well.

1. Is Poisson a form of linear regression?
2. In which instances does "standard" ordinary least squares regression work well, and in which does it not?

Thanks again!

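1. It's certainly a regression, but one could argue about the "linear" part: the predictor is linear in the coefficients, but it models the log of the expected count rather than the count itself. Basically, a Poisson regression is a generalized linear model estimated by maximum likelihood rather than by least squares.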

2. If one knows that the data are generated by a Poisson process, a Poisson regression would be better. But an OLS regression should still give a reasonable estimate of the average marginal effect of the independent variables on the mean of the counts. The regression will likely have heteroskedasticity issues, because the variance of a count tends to grow with its mean, so the standard errors should be corrected, for example with heteroskedasticity-robust standard errors.
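A minimal sketch of the second point, using only the Python standard library. Everything here is an assumption made for illustration and not something from this thread: the simulated data, the coefficients 0.5 and 1.0, the helper names `draw_poisson` and `ols_with_robust_se`, and the choice of the HC0 (White) estimator as the correction.

```python
import math
import random


def draw_poisson(mean, rng):
    """Draw one Poisson variate with Knuth's multiplication method (fine for small means)."""
    limit = math.exp(-mean)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def ols_with_robust_se(x, y):
    """One-predictor OLS; returns the slope, its classical SE and its HC0 robust SE."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Classical SE: assumes a single common error variance.
    sigma2 = sum(e * e for e in resid) / (n - 2)
    se_classical = math.sqrt(sigma2 / sxx)
    # HC0 SE: each observation contributes its own squared residual.
    meat = sum(((xi - x_bar) ** 2) * e * e for xi, e in zip(x, resid))
    se_robust = math.sqrt(meat) / sxx
    return slope, se_classical, se_robust


# Hypothetical simulated counts whose mean (and so variance) grows with x.
rng = random.Random(42)
x = [rng.uniform(0.0, 2.0) for _ in range(2000)]
y = [draw_poisson(math.exp(0.5 + 1.0 * xi), rng) for xi in x]

slope, se_classical, se_robust = ols_with_robust_se(x, y)
print(f"slope={slope:.3f}  classical SE={se_classical:.4f}  robust SE={se_robust:.4f}")
```

The slope is the average change in the expected count per unit of `x`. On heteroskedastic data the two standard errors need not agree, and the robust one is the safer one to report.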

Just to make sure, is there any “classic” example of variables that can (almost) always be used in linear regression?

Can, for example, linear regression be used for predicting ‘time until completing a short process’ (a non-negative variable)?

Thank you very much. Your help is much appreciated.

In the context of time series

If the minimum number of customers is at least 100, then the difference between a continuous sample space [100,∞) and the discrete sample space {100,101,102,…} has no perceivable effect on our forecasts. However, if our data contains small counts (0,1,2,…), then we need to use forecasting methods that are more appropriate for a sample space of non-negative integers.

Hyndman

@technocrat could you expand on this a little? Certainly, if the set of possible discrete integers is dense then it isn't much different from a continuous distribution. I don't see the difference though between 100,101,102... and 0, 1, 2...

In a least squares regression, all that would happen is that the intercept would be 100 higher in the former than in the latter. I'm probably missing something.

I believe the difference is that, while 98 and 99 might be rare but acceptable values, -2 and -1 may be impossible.
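A small sketch of that difference, with made-up numbers (the `weeks` and count values and the helper `fit_line` are hypothetical, chosen only to illustrate the point). The same least squares line is fitted to small counts and to the same counts shifted up by 100:

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - slope * x_bar, slope


weeks = [0, 1, 2, 3, 4, 5]
small_counts = [6, 3, 2, 1, 0, 0]                 # counts that sit right at the zero boundary
large_counts = [c + 100 for c in small_counts]    # same shape, far from zero

a_small, b_small = fit_line(weeks, small_counts)
a_large, b_large = fit_line(weeks, large_counts)

# Fitted value at the last observed week for each data set.
pred_small = a_small + b_small * 5
pred_large = a_large + b_large * 5
print(f"small counts: fitted value at week 5 = {pred_small:.2f}")
print(f"large counts: fitted value at week 5 = {pred_large:.2f}")
```

The slope is identical and the intercept is exactly 100 higher, as expected. The difference is that the fitted value for the small counts drops below zero, which is impossible for a count, while the fitted value for the large counts is an unremarkable number just above 99.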

Truncation! Didn't think of that. Thanks very much.

Hyndman's point is that if n is sufficiently large, the departure from the continuous-variable assumption underlying regression is not so large as to matter. Following the quoted passage, he describes Croston's method for dealing with count forecasts and cites Vasiliki Christou & Konstantinos Fokianos (2015), "On count time series prediction", Journal of Statistical Computation and Simulation, 85:2, 357-373, DOI: 10.1080/00949655.2013.823612, for their use of the Poisson distribution and the negative binomial distribution.

Yes. I was forgetting that these are counts rather than just discrete integers. So n does matter.

Thank you!
Thanks again.

You need to analyse your data to know what you are working with, and what is reasonable or unreasonable to do with it in further analysis such as model building.

You write that you are confused and post an extract of an article, but you don't ask a question related to it, so I do wonder how technocrat or anyone else might respond to you.

What specifically is your confusion?

Thank you.
I am confused about the idea of using linear regression for count data.
In my understanding, Poisson regression can be used. However, I am not sure about using 'classical' linear regression for count data. For example, can a variable such as 'number of children' be used as a response variable in 'classical' linear regression? May it sometimes be used?
Thanks.

People sometimes use a classical linear regression. It usually gives a not-terrible approximation. But a regression designed for count data, such as a Poisson or zero-inflated Poisson regression, is generally better.
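For anyone who wants to see what the Poisson alternative does, here is a rough sketch of a one-predictor Poisson regression with a log link, fitted by Newton-Raphson on the log-likelihood. The data, the helper name `fit_poisson` and the fixed iteration count are assumptions for illustration; in practice one would use a statistics package rather than hand-rolled code.

```python
import math


def fit_poisson(x, y, iterations=25):
    """Fit log(mean) = b0 + b1 * x by Newton-Raphson on the Poisson log-likelihood."""
    b0 = math.log(sum(y) / len(y))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: X'(y - mu)
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Fisher information: X' diag(mu) X
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 linear system directly.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical small counts observed at x = 0..5.
x_values = [0, 1, 2, 3, 4, 5]
counts = [6, 3, 2, 1, 0, 0]

b0, b1 = fit_poisson(x_values, counts)
fitted = [math.exp(b0 + b1 * xi) for xi in x_values]
print("fitted means:", [round(m, 2) for m in fitted])
print("predicted mean at x = 8:", round(math.exp(b0 + b1 * 8), 3))
```

Because the model works on the log of the mean, every fitted or predicted value is positive, however far one extrapolates. A straight line fitted to the same counts has no such guarantee.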

Thank you
Just trying to examine the data using the best fit (if possible).
Thanks again.

That's a fair explanation of the problems with count data in the context of ordinary least squares linear regression: the underlying assumptions for the validity of the test are hard to satisfy. Because those assumptions too often go unexamined in all types of applications of the test, that's not surprising.
