While waiting for the dentist this weekend, I came across this article in the Selangor Times (excerpt):
The Personal and The Professional
Tricia Yeoh

YET another Malaysian incident has made it into international news.
The Wall Street Journal, amongst other newspapers since, has reported on Bank Islam’s suspension of Azrul Azwar Ahmad Tajudin after his analysis of a possible Opposition win at Federal Parliament was presented at a Regional Outlook Forum in Singapore last week.
Azrul’s presentation on Malaysia’s economic and political outlook of 2013 included a section on the domestic political landscape, which outlined three possible scenarios as a result of the upcoming 13th General Election…
Setting aside Tricia’s main point (do go ahead and read the whole thing – I don’t necessarily disagree with it), I want to focus on one very interesting passage in the article, specifically the last part:
For what it’s worth, Azrul did precede his analyses with the acknowledgement that he was no political analyst, and that the prediction was based on a set of factors including: analysis of voter profile, past voting trends in the 2008 election and consequent 16 by-elections, Sarawak state elections, ground visits, voting patterns with identified election issues, as well as assumed conditions under which these would take place.
I could never resist a teaching moment, and this one’s quite interesting. Before going further though, you might want to read what I wrote last year about forecasting, or the blog post that motivated it (written by someone a lot smarter than I am). The comments in the latter are incredibly educational.
But instead of looking at long-period uncertainty, I want to look at the problem from another angle: Is more data really better for forecasting? Or even modelling in general?
The gist of Tricia’s comment is that a lot of data helps in analysis. Largely, that’s true – the more data you have, the better. But in certain circumstances more data is a hindrance, and occasionally even a game-breaker. This is especially true if the “lots” of data you have are in the wrong dimension.
Warning: I’ve tried to keep the following discussion as non-technical as possible, but this is an inherently technical subject. If this is hard to follow, skip to the end.
To explain why, let’s look at the nature of data. I tend to view a dataset in three dimensions, two of them across time. First is the span of the data – the period of real time over which the observations lie. Second is the frequency of the data – how many observations there are in a given period. And third is the breadth of the data – how many different variables you have. Of these, I would rate frequency as the most critical, with span second and breadth third (assuming the data I already have is adequate).
The essential point here is that to “evaluate” (i.e. solve) a model – forecasting or otherwise – you need a certain number of observations. For example: let’s say you’re trying to establish a hypothetical relationship between variable “y” and variable “x” at time “t”. The normal approach is to run a regression of the two variables using standard ordinary least squares (there are other estimation methods, but that’s not relevant to this discussion):
y_t = a + b(x_t) + ε
…which you can picture graphically as “a” being the intercept on the y-axis (the value of y when x = 0), and “b” being the slope of the line relating x to y. ε is the error term – the difference between your estimated line and the values actually observed.
The problem here is that if you have one value of “x” and one value of “y” (mathematically, n=1, where n is the number of observations), that regression equation is not solvable – there is no unique solution set, and there are any number of values of “a” and “b” that are theoretically possible. You cannot make any inferences regarding the relationship between x and y, because you can plug almost anything you want in there and it will work.
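You can see this in a few lines of Python (a minimal sketch, with made-up numbers): given a single (x, y) pair, any slope you try can be matched with an intercept that fits the point exactly.

# One made-up observation: x = 2, y = 5 (numbers are purely illustrative)
x, y = 2.0, 5.0

# For ANY slope b, the intercept a = y - b*x fits the single point exactly
for b in [-10.0, 0.0, 0.5, 100.0]:
    a = y - b * x
    print(f"a = {a:7.1f}, b = {b:6.1f}, fitted y = {a + b * x}")  # always prints 5.0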
So at the bare minimum, you need two observations to “solve” the equation. But with two observations, the fitted line passes exactly through both points: you get estimates of “a” and “b”, but nothing left over from which to estimate the variance of the error term – which means you have no idea what level of uncertainty you’re dealing with.
So you need three observations to actually solve a model with results that make sense. In geometric terms, that’s “the minimum number of independent coordinates, which can specify the position of the system completely” (don’t look at me, I’m just quoting the Wiki).
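Here’s the same toy exercise pushed to n = 2 and n = 3 (a sketch with invented numbers, using numpy’s least-squares solver): with two points the fitted line passes exactly through both, leaving zero residuals and hence no estimate of the error variance; with three points there is finally something left over to work with.

import numpy as np

def fit_line(x, y):
    """OLS fit of y = a + b*x; returns coefficients and residuals."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, y - X @ coef

# n = 2: the line goes exactly through both points, so residuals are zero
coef2, resid2 = fit_line(np.array([1.0, 2.0]), np.array([3.0, 5.0]))
print(coef2, resid2)                               # residuals ~ 0: no variance estimate

# n = 3: now something is left over to estimate the error variance from
x3, y3 = np.array([1.0, 2.0, 3.0]), np.array([3.0, 5.0, 6.5])
coef3, resid3 = fit_line(x3, y3)
print(coef3, (resid3 ** 2).sum() / (len(x3) - 2))  # divide by d.o.f. = n - 2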
There’s a quotation – I can’t remember by whom – that goes something like this: “Once is an accident, twice is a coincidence, three times it is enemy action.” It seems there’s an underlying mathematical truth behind it.
In statistical terms, this idea – the number of observations left over after the model’s parameters have been estimated – is called “degrees of freedom”. I never realised just how important the concept was until thinking about estimation with very small samples. It’s also the reason why researchers are always attempting to “conserve degrees of freedom”, and why parsimonious models (striving for a minimum number of variables) are preferred. It’s not just a mathematical conceit.
In the example above, the degrees of freedom are:

d.o.f. = n – 2

…that is, n observations minus the two parameters being estimated (a and b) – which means you can only evaluate the regression if n > 2.
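A fitted regression will report this count for you. Here’s a sketch with statsmodels and a deliberately tiny, randomly generated sample (all numbers are just for illustration):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5                                     # tiny on purpose
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

res = sm.OLS(y, sm.add_constant(x)).fit()
print(res.df_resid)                       # 3.0, i.e. n - 2
print(res.params)                         # estimates of a and b
print(res.bse)                            # their (very wide) standard errors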
But your problems aren’t over at n=3. With low sample sizes, hypothesis testing on the coefficients (a and b in this case) is usually done against the T-distribution, not a Z-normal distribution. At d.o.f.=1, the T-stat for rejection (two-tail, 95% confidence) is almost three times larger than at d.o.f.=2, and more than six times larger than the equivalent in a Z-normal distribution.
Translation: the smaller the sample size, the larger the range of values that “a” and “b” can take before being statistically rejected as unlikely. Your uncertainty level is substantially higher, and your ability to make inferences from the estimated relationship is heavily compromised.
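The critical values behind that claim are easy to check with scipy (two-tailed test, 95% confidence):

from scipy import stats

t_df1 = stats.t.ppf(0.975, df=1)   # ~12.71
t_df2 = stats.t.ppf(0.975, df=2)   # ~4.30
z     = stats.norm.ppf(0.975)      # ~1.96

print(t_df1 / t_df2)               # ~3: the d.o.f.=1 cutoff is almost 3x the d.o.f.=2 one
print(t_df1 / z)                   # ~6.5: and more than 6x the normal cutoff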
Now, does adding more variables (more “data”) help? Say I have an additional series, let’s call it “w”:
y_t = a + b(x_t) + c(w_t) + ε
Now the d.o.f. = n – 3: the extra regressor uses up a degree of freedom rather than adding one, so the same restrictions apply with even more force. The additional information does not improve your chances of finding a solution or making inferences in very small samples. There would (or should) be an improvement as n gets larger – my hazy recollection is that 25-30 observations is the general rule of thumb for a reasonable comfort level. Under those circumstances, whether w actually adds anything becomes amenable to testing – notice that it’s not always a given.
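Easy to verify (same kind of toy data as before, with a made-up extra series w): the additional regressor consumes a degree of freedom instead of contributing one.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5
x = rng.normal(size=n)
w = rng.normal(size=n)                    # the extra, made-up series
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

res_x  = sm.OLS(y, sm.add_constant(x)).fit()
res_xw = sm.OLS(y, sm.add_constant(np.column_stack([x, w]))).fit()
print(res_x.df_resid, res_xw.df_resid)    # 3.0 vs 2.0: n - 2 vs n - 3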
Let’s take another scenario – say I have a decent number of observations of both y and x, but substantially fewer observations of w. Does the addition of w add anything to the solution? In all respects the answer is that it makes things worse, because now the whole model is constrained by the number of observations of w.
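In code terms (a pandas/statsmodels sketch with hypothetical series), the standard fix of dropping the periods where w is missing means the whole regression is estimated only on w’s shorter sample:

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.normal(size=30), "w": rng.normal(size=30)})
df["y"] = 1.0 + 2.0 * df["x"] + 0.5 * df["w"] + rng.normal(scale=0.5, size=30)
df.loc[10:, "w"] = np.nan                        # pretend w exists for only 10 periods

X = sm.add_constant(df[["x", "w"]])
res = sm.OLS(df["y"], X, missing="drop").fit()   # rows with missing w get dropped
print(res.nobs, res.df_resid)                    # 10 observations, 7 d.o.f. -- not 30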
In terms of forecasting, I'll add another wrinkle and repeat Nick Rowe’s very astute question – if you have a model of y and x at time t, what does that tell you about the value of y at time t + 1? And the correct answer to that would be “nothing”, because you don’t have the values of x at time t + 1. You can certainly assume what values x_{t+1} might take, but that’s a guess (guesstimate?) with an unknown probability distribution, and not something derived from an evaluated model.
I’m not as restrictive on the subject as Nick is, so I accept y_t = f(x_{t-1}) as a valid forecasting approach, if not a very useful model for studying structural relationships. But the implication is that because you’re regressing against lagged values of x, you need at least one more observation than in y_t = f(x_t) to solve the model – the first observation has no lagged value of x, so you lose a degree of freedom.
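A sketch of what that costs in practice (toy series again, names hypothetical): shifting x back one period leaves the first observation without a lag, so the lagged regression runs on one fewer observation than the contemporaneous one.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
df = pd.DataFrame({"x": rng.normal(size=20)})
# Toy y driven by last period's x (first lag filled with 0 so y itself has no gaps)
df["y"] = 1.0 + 0.8 * df["x"].shift(1).fillna(0.0) + rng.normal(scale=0.5, size=20)
df["x_lag"] = df["x"].shift(1)                   # x at t-1: first row has no lag -> NaN

same_period = sm.OLS(df["y"], sm.add_constant(df["x"])).fit()
lagged      = sm.OLS(df["y"], sm.add_constant(df["x_lag"]), missing="drop").fit()
print(int(same_period.nobs), int(same_period.df_resid))  # 20 observations, 18 d.o.f.
print(int(lagged.nobs), int(lagged.df_resid))            # 19 observations, 17 d.o.f.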
Generalising is always a dangerous business, but I hope the gist of my argument here is clear.
Men and women make subjective judgements and take action based on those judgements all the time – if you want to be mathematical about it, they make a subconscious assessment of the data available to them, and subjectively evaluate a probability distribution of the available actions and their consequences. Experience and expertise help – we’re now looking at a conditional probability distribution (which is generally narrower i.e. less likely to be in error), and not an independent one.
Sometimes this assessment and decision process involves jumping to conclusions or leaps of faith – making judgements or decisions based on little hard data, or even erroneous data. The weights (or probabilities) that we individually attach to particular actions or outcomes might have little relation to the “true” underlying data or what someone else would consider reasonable. What seems “rational” to me might be completely “irrational” to you.
There’s nothing inherently wrong with this; it’s part of what makes us human and what makes life (and people) so interesting. But it does make any such judgements and actions more prone to error. The bottom line is that trying to make sense of the world from events at a single point in time is likely to be a pretty fruitless exercise, especially when you’re trying to peer into the future. Even then, I doubt people will stop trying.
Comments:

Thank goodness for partial equilibrium models. *simple and can throw as many assumptions/variables at it. Can you imagine explaining a general equilibrium one.

Reply: Oh god, I'm trying not to. That and identification.

so.... its not abt azrul but ms yeoh.

Reply: Actually its about both...and neither. I'm generalising a common human trait - extrapolating from incomplete information.

I don't know if anyone can succinctly explain degrees of freedom to the mathematically-disinclined.

Reply: Too right...the word count on this post is 1,608, which is twice the length of my usual post. And I'm still not sure I got the essence of it across.