I'm a lawyer, so to begin you get a lawyerly answer: it depends on how you frame the question.
I think of R as like school algebra: $f(x) = y$, where the three objects (everything in R is an object) are
$x$, what is in hand
$y$, what is desired
$f$, a function object to transform $x \rightarrow y$
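As a minimal sketch of those three objects, here is a toy example. The firms and the numbers are made up, and the current ratio is only one hypothetical choice of what y might be.

```r
# x: what is in hand (hypothetical balance sheet items, in $ millions)
x <- data.frame(
  firm = c("A", "B", "C"),
  current_assets = c(120, 45, 300),
  current_liabilities = c(80, 50, 150)
)

# f: a function object that transforms x into y
f <- function(d) d$current_assets / d$current_liabilities

# y: what is desired (here, one current ratio per firm)
y <- f(x)
print(y)
```

Swapping in a different f, or a different x, is the whole game; the shape of the problem stays the same.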
Sometimes, the analyst has a good idea of x & y but is unsure of f (how to get there). Sometimes the analyst has a good idea of y & f but is finding it hard to understand what has to go into x.
The hardest part is often y: what, exactly, is the question? There's a lot to go into that process. The biostatisticians have a different suite of concerns than the econometricians. Theses at undergraduate, master's and doctoral levels, regardless of domain, have differing expectations and conventions. Then there's the question of how much time is left until the due date. And the unexpected. Scroll to the bottom for a war story in the unlikely event of curiosity into ancient history.
I'll think aloud in a thought experiment based on my experience and an idea of what you might be thinking.
For x I have what's called convenience data. Out of thousands of companies for which data might be available, I didn't give each of them an equal chance to make it into my data. First, I limited my source to companies that publish their financial results in a standard format on a regular schedule in a publicly accessible repository and have 10 years of history to report. So, order of magnitude, out of 20 million firms in the US, I'm only looking at 100K or fewer. Excluded are sole proprietorships, general partnerships, family businesses, many businesses being systematically bled dry by private equity investors, firms with initial public offerings after 2012. Within the small group, I'm probably discarding the firms with non-calendar year reporting periods that end in May, for example. That still leaves tens of thousands.
I don't have the resources to look at them all, so I have to choose. I collect CIK keys from the Securities and Exchange Commission's EDGAR system, fire up the random number generator and come up with a pull list of 30. Snap! Some of these are REITs (real estate investment trusts) that don't report the way that manufacturers or hospitality or hotel operators do. The next try will yield other oddball situations. After much flailing, I sit down and ask myself:
How do I define the population of firms in such a way as to come up with a bushel of apples rather than a shopping bag from the farmer's market?
I think about it and come up with some criteria. What inferences do I want to derive beyond the specific companies? About the "economy," say. At last I come up with a definition of the population of firms, which is the set of companies in the S&P 500 index for both 2012 and 2021. Let's assume that is 375 firms, which is still too many to process. I can only do 30. But I'd still like to be able to say something about the big slice of the economy represented by that group.
So, like Noah, I start lining up the 30 companies to go into my Ark by deciding on a balance of types of firms, some of this, a little of that and a heap of the other. Or maybe a buddy has already come up with a list and I piggyback. These 30 firms are the population that I will be analyzing and I am content to say nothing about the other 345 that could have been among the elect.
However, if I take a random sample, the analysis has standing to represent the entire population, courtesy of the central limit theorem, which more or less guarantees that the sample mean of the liquidity ratios will be approximately normally distributed. That opens up the world of parametric statistics, which is what usually comes to mind when non-statisticians think about the subject.
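To make the random-sample idea concrete, here is a rough sketch in base R. Everything in it is an assumption for illustration: the 375 firm IDs are placeholders, and the liquidity ratios are simulated from a skewed distribution rather than taken from any filing.

```r
set.seed(42)  # arbitrary seed so the draw is reproducible

# Hypothetical population: 375 firms with made-up, right-skewed liquidity ratios
population <- data.frame(
  firm_id = sprintf("FIRM%03d", 1:375),
  liquidity_ratio = rlnorm(375, meanlog = 0.4, sdlog = 0.5)
)

# Simple random sample: every firm has the same chance of making the pull list
pull_list <- population[sample(nrow(population), size = 30), ]

# Repeat the draw many times to see how the sample mean behaves
sample_means <- replicate(
  5000,
  mean(sample(population$liquidity_ratio, size = 30))
)

# The sample means centre on the population mean
c(population_mean = mean(population$liquidity_ratio),
  mean_of_sample_means = mean(sample_means))
```

A histogram of `sample_means` should look a good deal more bell-shaped than a histogram of the individual ratios, which is the practical payoff of sampling at random instead of hand-picking.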
The next big hurdle is to unpack "effect." Do I mean that profitability was affected by COVID for the last two years in the series? Or do I mean that it changed at the same time? Or do I mean that the changes were caused by COVID? Those are questions of causation and the mantra is
Correlation does not imply causation
This is usually reinforced by drawing an example of a computationally correct but utterly bonkers result from one of the vast collections of cautionary tales of spurious correlation. Detecting and dealing with the less colorful cases is gnarly.
There is a good theoretical foundation for causal inference, but it requires being able to control (hold constant) several measures. There is a popular account in The Book of Why by Judea Pearl, which expands on this academic discussion.
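Here is a small simulated sketch of why holding something constant matters. The variables are invented: z stands in for some common driver (overall demand, say), and x and y each depend on z but not on each other.

```r
set.seed(1)  # arbitrary seed for reproducibility
n <- 500

# Hypothetical confounder that drives both variables
z <- rnorm(n)
x <- 2 * z + rnorm(n)   # x depends on z, not on y
y <- 3 * z + rnorm(n)   # y depends on z, not on x

# x and y are strongly correlated even though neither causes the other
cor(x, y)

# Naive regression attributes the shared dependence on z to x
naive <- lm(y ~ x)
coef(naive)["x"]

# Holding z constant removes most of the apparent effect of x
adjusted <- lm(y ~ x + z)
coef(adjusted)["x"]
```

In real data the catch is that z has to be known and measured before it can go into the model, which is where the gnarly part comes in.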
Moving on, there is the selection of $Y$, the response variable to the predictor variables $X_1, \dots, X_n$. A dimensionless $Y$ such as those you've chosen adjusts for the size effect. And the choice of models is made broader by the fact that $Y$ is a continuous, rather than binary or categorical, variable.
There is still an aspect, however, unaccounted for: autocorrelation of time series data. If you are looking at quarterly data, for example, quarter-over-quarter can be much more consistent than year-over-year when looking at financial data. The reason is that companies have fair latitude over the recognition of gains and losses. For example, a liquidation of an asset such as an underperforming subsidiary can be timed to coincide with a quarter having unusually strong operating results to net out a tax liability. Or things may be done or may not be done in anticipation of the date of pricing scheduled stock options. Management is unlikely to make any optional choice that will result in a higher valuation before that date if they can avoid it. Likewise, for options becoming vested, goosing stock price by recognizing some gain is all good if it can be done strictly within the letter of the law on the advice of one of the leading law firms.
Shocking, I know. But there are also many non-nefarious reasons that periodic patterns can arise. The consequence is that the OLS assumptions about the residuals, independence and often normality, may well be violated. The standard qqplot allows detection of non-normality when it shows up, and an autocorrelation plot of the residuals does the same for serial dependence. Fortunately, there are tools to deal with this, including time series autoregressive models.
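A rough sketch of those checks in base R follows. The data are simulated, not real: 40 quarters of a made-up ratio with a slight trend and AR(1) noise, and the AR(1) error model at the end is just one assumed alternative among several.

```r
set.seed(7)  # arbitrary seed for reproducibility

# Hypothetical data: 40 quarters of a ratio with a trend and AR(1) noise
n <- 40
quarter <- 1:n
noise <- as.numeric(arima.sim(model = list(ar = 0.7), n = n, sd = 0.1))
ratio <- 1.5 + 0.01 * quarter + noise

# Ordinary least squares ignores the serial dependence
ols <- lm(ratio ~ quarter)
res <- residuals(ols)

# Q-Q plot of the residuals against a normal distribution
qqnorm(res)
qqline(res)

# Lag-1 autocorrelation of the residuals; values far from 0 are a warning sign
acf(res, plot = FALSE)$acf[2]

# Ljung-Box test of the null hypothesis that the residuals are uncorrelated
Box.test(res, lag = 4, type = "Ljung-Box")

# One alternative: regression on quarter with AR(1) errors
ar_fit <- arima(ratio, order = c(1, 0, 0), xreg = quarter, method = "ML")
coef(ar_fit)
```

With real quarterly filings the lag structure may be seasonal rather than a simple AR(1), so the order here should be read as a starting point.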
I could go on, but go to the dance with who you want to go home with.
So, the war story. Before I became a lawyer, I was a grad student in geology. I had spent two summers strolling through the San Gabriel range in smog and heat, amidst rattlesnakes and thorny bushes and in danger of perishing from wildfire. I was collecting samples of a rock, call it Plaggy, on a kilometer grid. Then into the lab to cut them into thin sections glued to glass slides. The goal was to see how the mineral makeup varied by location, and therefore the chemical composition, to see how this very oddball rock formed.
That was the wrong question to put to the mountain. I'd have been better off doing traditional descriptive geology field work because, as it happened, I was chasing an illusion. The project rested on someone else's advice that classifying the samples could best be done by a simple measure of refractive index. All I had to do was to melt the rocks in a platinum crucible and measure the refraction of light through the resulting glass. This is an illustration of the Principal Investigator's Principle: give it to a grad student, who won't know it's impossible. The platinum turned out to be a non-starter without the kind of funding that doesn't come to a grad student easily. But there was an alternative: drill little holes in a block of carbon, sprinkle in some rock powder, zap it with an electric arc torch and presto. So, the third summer was over in the materials engineering labs where they had the equipment, and I'm beginning to cook when a guy asks me what I'm doing. I tell him and he says, "You know those refraction curves are garbage, right?"
I had fallen into the trap of pursuing a thesis that was defined by the methodology, not the actual question. And when the methodology came a cropper, so did the topic. This was 1970, before the internet, let alone the web or Google. But still. I learned the "measure twice, cut once" principle the hard way.