Thursday, January 30, 2014

overfitting and regularization

I'm trying to think through and recast some of the ideas around regularization from fields that do mostly atheoretic modeling of largish data sets.  The general setup is that we have a set of models ℋ — e.g. {y_i = m·x_i + b + σ·ε_i | m, b, σ real numbers with σ positive}, where ε follows some given distribution, though typically we're imagining a set of models that requires far more than 3 real numbers to naturally parameterize — and we're looking for the one* that best describes the population from which the data are sampled.  Now this really is key; if you mistake your problem for "find the one that best describes the data", that's when you're going to get overfitting — if you have 1000 data points that basically follow y = x² + ε and you try to fit a 100th-degree polynomial to them, your fitted model is going to depend on the noise in that particular data set and will do less well at fitting "out of sample" — i.e. at describing data from the population that aren't in your sample — than if you had used a simpler model.
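To make that concrete, here's a minimal sketch of the effect. The sample sizes, the noise level, and the use of degree 30 rather than 100 (just to keep the fit numerically tame) are my own illustrative choices: data are drawn from y = x² + ε, a quadratic and a high-degree polynomial are each fit by least squares to one sample, and both are scored on a second sample from the same population.

```python
# Minimal sketch of the overfitting effect described above: data follow
# y = x^2 + noise; a degree-2 and a degree-30 polynomial are fit to one
# sample and then evaluated on a second sample from the same population.
# (Sample sizes, noise level, and the degree 30 are illustrative choices.)
import numpy as np

rng = np.random.default_rng(0)

def draw_sample(n):
    x = rng.uniform(-1.0, 1.0, n)
    y = x**2 + 0.1 * rng.standard_normal(n)
    return x, y

x_in, y_in = draw_sample(50)       # the sample we get to see
x_out, y_out = draw_sample(1000)   # stand-in for the rest of the population

for degree in (2, 30):
    fit = np.polynomial.Chebyshev.fit(x_in, y_in, degree)
    mse_in = np.mean((fit(x_in) - y_in) ** 2)
    mse_out = np.mean((fit(x_out) - y_out) ** 2)
    print(f"degree {degree:2d}: in-sample MSE {mse_in:.4f}, "
          f"out-of-sample MSE {mse_out:.4f}")
```

The high-degree fit wins in sample and loses out of sample, which is the whole complaint.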

On some level it might seem hopeless to account for the data you can't see, but regularization can work quite well, and even makes a certain amount of intuitive sense.  The way it's usually done, I have a set of subsets of ℋ that is much smaller than ℋ (in some sense — typically the set of subsets is of much smaller dimension than ℋ itself, i.e. I can specify a subset with only a couple of parameters even if specifying a particular point in ℋ requires many parameters).  Now I ask, for each subset H: if I randomly select (say) 80% of my data sample and pick the model h in H that best describes that 80% of the data, how well will it tend to fit the other 20% of the data?  Often some of the subsets turn out to do much, much better than others at this.  It seems reasonable to think that if H does a poor job at this exercise, then even if you pick a model in H that fits all of the data you have, it's going to be hard to trust that that model is a good description of the data you don't see; there's perhaps something about H that makes it prone to pay too much attention to "noise", i.e. to the things about the sample that are not representative of the population.  So you try instead to restrict yourself to subsets of ℋ that seem to do well out-of-sample in-sample (i.e. on held-out portions of the sample you actually have), and hope that this implies that they're likely to do well out-of-sample out-of-sample (on data you never see) as well.
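Here is a sketch of that exercise, with the subsets H taken to be "polynomials of a given degree"; that choice of families, the number of splits, the sample size, and the use of least squares as the "best describes" criterion are all mine, purely for illustration. For each H we repeatedly pick the best-fitting member of H on a random 80% of the sample and ask how well it fits the held-out 20%.

```python
# Sketch of the 80/20 exercise above.  Each subset H is "polynomials of a
# given degree"; for each H we repeatedly fit the best member of H to a
# random 80% of the sample and score it on the held-out 20%.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-1.0, 1.0, n)
y = x**2 + 0.1 * rng.standard_normal(n)

def held_out_mse(degree, n_splits=100):
    errors = []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        train, test = idx[: int(0.8 * n)], idx[int(0.8 * n):]
        best_in_H = np.polynomial.Chebyshev.fit(x[train], y[train], degree)
        errors.append(np.mean((best_in_H(x[test]) - y[test]) ** 2))
    return np.mean(errors)

for degree in (1, 2, 5, 15, 30):
    print(f"H = degree-{degree} polynomials: "
          f"average held-out MSE {held_out_mse(degree):.4f}")
```

The degree-2 family does best at the exercise here, and that is the family you'd then trust with the full sample.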

I've already perhaps recast it slightly from its usual presentation, but I'm trying to recast it further, and look for a way of doing something like regularization but without resort to this set of subsets.  To get there, though, I want to remain focussed on the effect of a single observation on the choice of model within each H.  To some extent, we can take a point x in the population and break down the extent to which it will tend to be poorly fit "out of sample" into two parts:
  • how poorly does it typically fit when x is included in the sample? I.e., for a sample that includes x, if we look at the model in H that best fits that sample, how badly does it fit x?
  • how much worse does it fit when x is not in the sample than when it is? (A way of writing this breakdown out follows the list.)
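One way to write that two-part breakdown, in notation of my own choosing: S is a sample containing x, ĥ_S is the model in H that best fits S, and ℓ(x; h) is the badness of fit (say the negative log-likelihood) of h at x.

```latex
% Decomposition of the expected out-of-sample badness of fit at x
% (notation mine, not the post's).
\mathbb{E}\!\left[\ell\!\left(x;\,\hat h_{S\setminus\{x\}}\right)\right]
  \;=\;
  \underbrace{\mathbb{E}\!\left[\ell\!\left(x;\,\hat h_{S}\right)\right]}_{\text{how poorly $x$ fits when it is in the sample}}
  \;+\;
  \underbrace{\mathbb{E}\!\left[\ell\!\left(x;\,\hat h_{S\setminus\{x\}}\right)-\ell\!\left(x;\,\hat h_{S}\right)\right]}_{\text{how much worse it fits when left out}}
```

The identity itself is just linearity of expectation; the point is that the first term is what in-sample optimization sees, while the second is the part overfitting lives in.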
I wish to emphasize at this point that, even if this depends to some extent on x — i.e. if some points have a greater tendency to be hard to fit out-of-sample than other points — it will still also tend to depend on H, i.e. some sets of models will be more prone to producing bad out-of-sample fits than other sets of models. Standard optimization techniques allow us to minimize (and observe) how poorly a model fits in sample; I'm looking, then, for an indication of how poorly a model tends to fit out of sample.

Well, here's one potential little gem: if the log-likelihood function of a sample is additively separable in its data points and we can parameterize H such that the log-likelihood functions are continuously differentiable and at least prone toward concavity, then the optimization procedure is fairly straightforward: take derivatives and set them to zero.  I think that if, further, the log-likelihood functions associated with different potential observations all have more or less the same second derivatives — in particular, if it is very uncommon for an observation to have a second derivative that is a large (positive or negative) multiple of its average value at that point of H and at points somewhat near it — then there shouldn't be much of an overfitting problem; how much worse a point tends to fit when it's not in the sample than when it is in the sample is constrained by those second derivatives.
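As a toy check of that claim, here is the simplest case I could set up; the model (Gaussian likelihoods with unknown mean and known variance) and all of the specific numbers are my own choices. Every observation's log-likelihood has the same second derivative in the parameter, so leaving a point out can only move the fitted model by an amount of order 1/n, and the point fits only slightly worse out of sample than in.

```python
# Toy check of the second-derivative argument above.  H = Gaussian likelihoods
# with unknown mean mu and known sigma = 1, so each observation contributes
# l_i(mu) = -0.5 * (x_i - mu)^2 (plus a constant), and l_i''(mu) = -1 for
# every observation: the "matching second derivatives" condition holds exactly.
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.standard_normal(n) + 3.0        # a sample from the population

def loglik(point, mu):
    return -0.5 * (point - mu) ** 2     # per-observation log-likelihood, constant dropped

mu_full = x.mean()                      # MLE on the full sample
for i in range(3):                      # a few illustrative observations
    mu_loo = np.delete(x, i).mean()     # exact MLE with x_i left out
    # One Newton step toward the leave-one-out MLE, starting from mu_full:
    # the leave-out objective has gradient -(x_i - mu_full) and curvature
    # -(n - 1) there (every point contributes curvature -1), so the step is
    # -(x_i - mu_full)/(n - 1); for this quadratic log-likelihood it is exact.
    mu_loo_newton = mu_full - (x[i] - mu_full) / (n - 1)
    # How much worse x_i fits when it is left out of the sample:
    penalty = loglik(x[i], mu_full) - loglik(x[i], mu_loo)
    print(f"x_{i}: exact LOO mean {mu_loo:.5f}, Newton step {mu_loo_newton:.5f}, "
          f"out-of-sample log-likelihood penalty {penalty:.6f}")
```

Because the per-observation curvature never deviates from its typical value, the out-of-sample penalty stays small for every point; the interesting question is how far this survives when the curvatures only roughly match.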

I don't know whether this goes anywhere, but if I can find a reasonable way of looking at an ℋ and constructing a reasonably rich subset that satisfies the second-derivative constraint on a reasonable "support" in the space of potential observations, then that would appeal to me as somewhat less arbitrary than imposing a system of subsets at the beginning.

Insofar as the matching of the second derivatives is exact, this would mean the log-likelihood functions would differ from each other, or from some common standard, only by functions that are affine (linear plus a constant) in the parameters.  Particularly where ℋ lacks a natural parameterization, but even where it does not, this tempts me to try to use these deviations themselves as a parameterization.  Along manifolds in ℋ that are not robust in this way to overfitting, this parameterization won't work; it might be that this could be put in terms of this parameterization itself, allowing us essentially to carve out a subset of ℋ on the basis of what we can coherently parameterize in terms of the differences between log-likelihood functions at different potential data points.
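Spelled out, in notation of my own choosing: if the log-likelihood functions ℓ_x and ℓ_x′ attached to two potential observations have identical second derivatives (Hessians in the parameter θ) everywhere on a connected parameter set, then their difference has zero Hessian and is therefore affine in θ.

```latex
% Matching second derivatives force the difference of log-likelihood
% functions to be affine in the parameter (notation mine).
\nabla^2_\theta\, \ell_x(\theta) \;=\; \nabla^2_\theta\, \ell_{x'}(\theta)
  \quad \text{for all } \theta
\qquad \Longrightarrow \qquad
\ell_x(\theta) \;-\; \ell_{x'}(\theta) \;=\; a(x,x')^{\top}\theta + c(x,x')
```

Here a(x, x′) and c(x, x′) depend only on the pair of observations, and it is these affine deviations that the paragraph above proposes to use as coordinates.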

* This is probably the usual setup. On some level I'd prefer to work with sets or posterior probability distributions or some such, but I think the ideas are best worked out with "point estimates" anyway.

I don't know whether this will be useful or confusing, but I record here that an element of ℋ can, for the most part, be viewed as a map from the space of potential observations (say X) to log-likelihood values; equivalently, each potential observation determines a log-likelihood function on ℋ. This can get confusing insofar as a log-likelihood function is itself a map from ℋ to the reals, which is a bit self-referential.
