Thursday, January 30, 2014

overfitting and regularization

I'm trying to think through and recast some of the ideas around regularization from fields that do mostly atheoretic modeling of largish data sets.  The general setup is that we have a set of models ℋ — e.g. {yi=mx+b+σεi|m,b,σ are real numbers with σ positive} where ε follows some distribution, though typically we're imagining a set of models that requires far more than 3 real numbers to naturally parameterize it — and we're looking for the one* that best describes the population from which the data are sampled.  Now this really is kind of key; if you mistake your problem for "find the one that best describes the data", that's when you're going to get overfitting — if you have 1000 data that basically follow y=x2+ε and you try to fit a 100-order polynomial to the data, your model is going to depend on the noise in that particular data set and will do less well at fitting "out of sample" — i.e. at describing data from the population that aren't in your sample — than if you had used a simpler model.

On some level it might seem hopeless to account for the data you can't see, but regularization can work quite well, and even makes a certain amount of intuitive sense.  The way it's usually done, I have a set of subsets of ℋ that is much smaller than ℋ (in some sense — typically the set of subsets is of much smaller dimension than ℋ itself, i.e. I can specify a subset with only a couple of parameters even if specifying a particular point in ℋ requires many parameters).  Now I ask, for each subset H, if I randomly select (say) 80% of my data sample and pick the model h in H that best describes that 80% of the data, how well will it tend to fit the other 20% of the data?  Often some of the subsets turn out to do much, much better than others.  It seems reasonable to think that if H does a poor job at this exercise, then even if you pick a model in H that fits all of the data you have, it's going to be hard to trust that that model is a good description of the data you don't see; there's perhaps something about H that makes it prone to pay too much attention to "noise", i.e. to the things about the sample that are not representative of the population.  So you try instead to restrict yourself to subsets of ℋ that seem to do well out-of-sample in-sample, and hope that this implies that they're likely to do well out-of-sample out-of-sample as well.

I've already perhaps recast it slightly from its usual presentation, but I'm trying to recast it further, and look for a way of doing something like regularization but without resort to this set of subsets.  To get there, though, I want to remain focussed on the effect of a single observation on the choice of model within each H.  To some extent, we can take a point x in the population and break down the extent to which it will tend to be poorly fit "out of sample" into two parts:
  • how poorly does it typically fit when x is included in the sample? I.e., for a sample that includes x, if we look at the model in H that best fits that sample, how badly does it fit x?
  • how much worse does it fit when x is not in the sample than when it is?
I wish to emphasize at this point that, even if this depends to some extent on x — i.e. if some points have a greater tendency to be hard to fit out-of-sample than other points — it will still also tend to depend on H, i.e.—some sets of models will be more prone to producing bad out-of-sample fits than other sets of models. Standard optimization techniques allow us to minimize (and observe) how poorly a model fits in sample; I'm looking, then, for an indication of how poorly a model tends to fit out of sample.

Well, here's one potential little gem: if the log-likelihood function of a sample is additively separable in its data points and we can parameterize H such that the log-likelihood functions are continuously differentiable and at least prone toward concavity, then the optimization procedure is fairly straightforward: take derivatives and set to 0.  Well, I think that if, further, the log-likelihood functions associated with different potential observations all have more or less the same second derivatives — in particular, if it is very uncommon that an observation would have a second derivative that was a large (positive or negative) multiple of its average value at that point in H and points somewhat near that point in H — then there shouldn't be much of an overfitting problem; the amount worse that a point tends to fit when it's not in the sample than when it's in the sample is going to be constrained by those second derivatives.

I don't know whether this goes anywhere, but if I can find a reasonable way of looking at an ℋ and constructing a reasonably rich subset that satisfies the second-derivative constraint on a reasonable "support" in the space of potential observations, then that would appeal to me as somewhat less arbitrary than imposing a system of subsets at the beginning.

Insofar as the matching of the second derivatives is exact, this would mean the likelihood functions would only differ from each other or some common standard by functions that are linear in the parameters.  Particularly where ℋ lacks a natural parameterization, but even where it does not, this tempts me to try to use these deviations themselves as a parameterization.  Along manifolds in ℋ that are not robust in this way to overfitting, this parameterization won't work; it might be that this could be put in terms of this parameterization itself, allowing us essentially to carve out a subset of ℋ on the basis of what we can coherently parameterize in terms of the differences between likelihood functions at different potential data points.

* This is probably the usual setup. On some level I'd prefer to work with sets or posterior probability distributions or some such, but I think the ideas are best worked out with "point estimates" anyway.

I don't know whether this will be useful or confusing, but I record here that an element of ℋ can, for the most part, be viewed as a map from the space of potential observations (say X) to the space of log-likelihood functions; this can get confusing insofar as a log-likelihood function is itself a map from measurable subsets of X to ℋ, which is a bit self-referential.

Sunday, January 26, 2014

coordination and liquidity

Effortless "coordination" requires coincidence — Jevons's "double coincidence of wants", for example. Even Jevons understated the problem in the real world, where typically the details will need to be coordinated between buyer and seller; I may most prefer a red 1.3 cubic foot 1000 watt microwave, and find a black 1.2 cubic foot 1100 watt microwave that is good enough; sellers will accommodate what buyers want to varying degrees, but the costs associated with providing each buyer exactly what that buyer wants (frequently giving up economies of scale in the process) often dwarf the value the buyer would put on that.

I want to record some examples:
  • men's suits can be bought off the rack in various sizes but may also, much more expensively, be custom tailored
  • an employer may find it useful to have an employee who can be called in at a moment's notice to deal with an unexpected issue, while an employee may well prefer to have work hours that are limited to certain hours in the day or certain days in the week
  • a buyer might incur a sudden desire to buy something at a different time from that at which the seller finds most convenient to produce it; one or both will have to change the timing of the transaction to suit the other
In some ways the second example here is a special case of a more general subgroup of details on which to coordinate, namely that one party or the other will take the brunt in various ways of unexpected developments; each side would like to be able to plan ahead, and one or both of them will to some extent give up that ability, but presumably in such a way that the deal as a whole is still worth doing to both parties. The third example is similar, although it's a little bit hard to construe it exactly in that set, as the "unexpected event" is the desire of the other agent to do the deal in the first place. I should probably think a bit more about that, probably in the context of repeated/ongoing relationships between buyers and sellers.

I've been thinking about "market liquidity" as essentially a coordination issue, but in the context of the ideas presented here, it seems that what in finance is referred to as "liquidity providing" generally amounts to "conceding flexibility/choice/the ability to plan", and "liquidity taking" generally amounts to "taking advantage of the flexibility offered by someone else". In the context of financial market microstructure it's pretty clear that liquidity provision is in a lot of ways like writing an option that is, at least most obviously, being given away for free; to some extent what I'm trying to do here is to note that the phenomenon is somewhat more general — though especially outside of finance, where heterogeneity is less pervasive than in most markets — and that the final decision to execute a deal (and perhaps to choose its timing) is only one way in which the specifics of a deal will only be fully coordinated when at least one of the parties is willing to go along with something other than what might have been that party's first choice.

Wednesday, January 22, 2014

high-frequency trading and combinatorial auctions

I tend to support a lot of proposals that would somewhat rein-in certain high-frequency trading behaviors; one proposal with some prominent proponents is a system of frequent batch auctions, though I prefer one of my professors' suggestions that orders not be eligible to be cancelled until one second after they are placed. I would also like to note in this context that the discussion of this as a policy question seems usually to conflate regulators with the private institutions running exchanges; a first step by regulators would be to weaken Reg NMS such that privately-run exchanges are clearly allowed to implement these sorts of rules in a way that is not obviously self-defeating; if that proved "insufficient" (in someone's judgment), I hope it would only be later that the costs and benefits of requiring new rules in every market be imposed.

Some of the defenders of high-frequency trading argue that it provides liquidity, and indeed I think it is likely the case that they provide liquidity to somewhat less-high-frequency traders, who in turn provide liquidity to even-lower-frequency traders, and that perhaps that feeds through to a net advantage to savers looking for borrowers and borrowers looking for savers to convert, finally, real goods and services now into real goods and services in the future in an efficient, pro-social way. I think the final ends, though, are very little impeded by gumming up the works at a one-second time scale, and I do believe that a lot of the real investments of real resources into reducing latency and trying to out-strategize the other high-frequency players probably amounts more to wasteful rent-seeking than to socially beneficial creation of infrastructure to improve capital allocation. Some of the opponents of high-frequency traders like to highlight the existence of flash-crashes — which, at some scale, are fairly common, though usually smaller than the most famous such event. I'm not particularly concerned by that, though, and, if that seemed to me like the biggest problem with high-frequency trading, my prescription would probably start and end with the requirement that "market" orders for retail customers be submitted as IOC limit orders at a price that is some multiple (nearish 1) of the price 15 minutes before the order is entered — caveat venditor beyond that.

The summary of the set-up, then, is that I support moderate efforts to curtail high-frequency trading, though I don't think much of some of the arguments that one hears for and against such efforts. What follows, then, is ... well, it's not exactly Devil's Advocacy, but it's an argument that seems plausibly to be stronger than any of these other arguments, and, if someone could convince me that the details I can't work out would point in that direction, I would change my position, or at least modify what sorts of changes I would support.

What I want to note, first, is that markets are incomplete, and that the values of different traded securities are related to each other in possibly occasionally complicated ways. What this means is that, given some (potential) saver's information and risk preferences, in principle that agent could exhibit a fairly complicated demand curve for securities — the price of one security might affect one's demand for another security in an important way, especially if that price is taken to represent the result of some quasi-equilibrium process involving other agents with other information. If markets were complete, this wouldn't be true in any important way, but markets (obviously) are not, and I suspect that this is somewhat important. In some ideal setting, then, we wouldn't run the exchanges for different securities independently of each other; we would run them all together as one big combinatorial auction.

There are at least two problems with big combinatorial auctions, at least of the sort a naive but brilliant mathematician might construct. One is that they are computationally very intense; the complexity is exponential in the number of items being "auctioned", and with thousands of listed securities, we really don't have the computing power to pull this off. The other is that the agents being called upon to submit "demand curves" would themselves face a problem that is way beyond their own ken; they would be asked to specify some possibly complex rule for converting systems of prices into systems of quantities. Even to the extent that people have an intuitive sense of what they think they would like to buy, they tend to be bad at working with more than the simplest of hypotheticals, and it's hard to imagine a language that would allow them to easily but flexibly express what they want to the central computer.

Modern combinatorial auctions address both of these problems, possibly giving up some of the pure perfection that might in principle be available from the impossible straightforward approach by generating a practical and, based on substantial evidence, very good iterative process instead. One of the most popular modern combinatorial auctions is the combinatorial clock auction, sometimes (as at the link) with a finalizing stage, which presents each agent with a much more tractible problem at each step (a single set of hypothetical prices, basically, rather than all possible sets of hypothetical prices), and leaves the computer with a much more tractible problem at each step as well. Details aside, even though the overall algorithm is designed explicitly to handle the fact that agents' values for combinations of items will not simply be a sum of values for each item separately, the iterative process sort of looks as though it's treating each item separately; it's the dynamic nature of the procedure that ultimately takes account of the "complementarity".

Practical combinatorial auctions usually go fewer than 100 rounds (I think), but I have seen results in the matching literature in which an attempt to decentralize the "matching" procedure can require huge number of rounds — typically on the order of the square of the number of agents. What these models have in common with the situations in which combinatorial auctions have found use is that the preferences, while they (in principle) exist in high-dimensional space, occupy a low-dimensional manifold in that space; they are reasonably tightly constrained, even if the constraint isn't "independent values". It's easy for me to imagine that a system with thousands of items and truly complex relationships between them that one might require billions or trillions of iterations to get good convergence to a value near "equilibrium" — and that, short of that, doubling the number of iterations might improve the result by a substantial amount.

If someone can convince me that the incompleteness of the financial markets is ameliorated by something like the current market structure, viewed as a combinatorial auction with many rounds per second, and that reducing the frequency of the "rounds" would significantly reduce the extent to which prices are "correct", then I would at the very least want to be careful about changing the rules.

Monday, January 6, 2014

ASSA conference

The big annual convention of the Allied Social Sciences Associations (I think) took place this past weekend; I attended sessions on Friday and Sunday (spending Saturday with my family instead), and want to record some somewhat random thoughts I had before I lose them elsewhere.
  • One of the sessions I attended yesterday was on "robustness", in which people attempted to say as much as they could about a particular theoretical model while leaving some important part of the model unspecified; often this took the form of bounds, i.e. "without knowing anything about [X], we can still determine that [Y] will be at least [Y0]". In one paper, the author supposed that two agents in the model have the same "information structure", but that the observer (the "we" in the above quotation) doesn't know what it is. The "worst" possible information structure looks a lot like the "worst" possible correlated equilibrium in a related game, a point he never made, and that I still haven't made precise enough to be worth noting to him. I'm not particularly enamored of his paper, but I do like correlated equilibrium, so I might come back and try to figure out what I mean.
  • The godfather of "robustness" is Stephen Morris, who has spent a lot of the last two decades thinking about higher order beliefs (what do I think you think Alice thinks Bob thinks?) and the past decade thinking about what properties of strategic situations are, to varying degrees, insensitive to them, especially in the context of mechanism design. A lot of the 1980's style mechanism design and auction theory says things like "if buyers have valuations that follow probability distribution F, here's the auction that will, in terms of expected value, generate the most revenue". If you don't know F, though, you're kind of lost. So to some extent much of the project is about making things independent of F (and other assumptions, some of them tacit). On a seemingly unrelated note, my brother and I have frequently discussed the idea that some "behavioral" phenomena — people behaving in ways that have internal inconsistencies, or seem clearly in a certain decision problem/strategic situation to be "mistakes" — may result from people's having developed heuristics for doing well in certain common, realistic situations, and carrying them over naively to other scenarios (especially artificial ones in experiments) where they don't work nearly as well. During the conference it occurred to me that this is similar to using a mechanism that has been designed for a particular F. It is also, to some extent, related to the "machine learning" concept of "overfitting" — people adapt so well to some set of experiences that they are overly specialized, and do poorly in more general situations — where "robustness" is related to "regularization", which amounts to restricting yourself to a set of models that is less sensitive to which subset of your data you use, and is hopefully therefore more applicable to data you haven't seen yet.
  • The last set of ideas, and that most closely related to my current main project, is related to a "backward induction" problem I'm having. Traditional game theory solution concepts involve "common knowledge of rationality" — defining recursively, all players are "rational", and all players know that there is common knowledge of rationality. In particular, if everyone knows that a certain action on your part can't be profitable to you, then everyone can analyze the strategic situation with the assurance that you won't take that action. If some action on my part would only be good for me if you do take that action, then everyone can rule out the possibility that I would do that — I won't do it, because I know that you won't make it a good idea. Where this becomes "backward induction" is usually in a multi-stage game; if Alice has to decide on an action, and then I respond to her action, and then you respond to my action, she figures out what you're going to do — in particular, ruling out something that could never be good for you — and, supposing I will do the same analysis, figures out what I am going to do. This is normally the way people should behave in dynamic strategic situations. It turns out that people are terrible at it.
    In my current project, the behavior I'm hoping/expecting to elicit in the lab is ruled out in more or less this way; it's a multi-period situation in which everyone is provided enough information to be able to rule out the possibility that [A] could happen for all of the next 20 periods, and if everyone knows (for sure) that [A] won't happen in the next period, then it's fairly easy to figure out that they shouldn't act in a way to cause [A] to happen in this period. Perfectly rational agents should be able to work their way back, and [A] should never happen. I think it will. I want to be able to formalize it, though. So I'm trying to think about higher-order beliefs, and how one might describe the situation in which [A] happens.
    • One threat to the idea of backward induction is that it requires "common knowledge of rationality", even where "common knowledge of rationality" has been refuted by observed evidence. Suppose you and I are engaged in a repeated interaction with a single "rational" pattern of behavior — you know I will always choose B instead of A because we are both presumed to be "rational" and B is the only choice consistent with backward induction. Typically this last clause means that, if I were to choose A, I know that you would respond in a particular way because you're rational, and because you know how I would respond to that since I'm rational, and that whole chain of events would be worse for me than if I choose B. Having completed the analysis, we decide that if everyone is rational (and knows that everyone else is, etc.), I should choose A. But then, if I choose A, you should presumably infer that I'm not rational — or that I'm not sure you're rational, or not sure you're sure I'm rational, or something. But this seems to blow a hole in the entire foregoing strategic analysis.Now, if this is really the only problem with backward induction, then if nobody acts more than once, we could still get backward induction; if you do the wrong thing, well, that's weird, but maybe it's still reasonable for me to expect everyone whose actions I still have to worry about to be rational. Or maybe it isn't; in any case, I doubt human subjects in such a situation would reliably perform backward induction. It might be interesting to check some day, though, if it happens to fit conveniently into something else.
    • While I'm thinking in terms of which systems of beliefs I think are "reasonable", I should probably look at self-confirming equilibrium; this is the idea that "beliefs" can only evolve in a certain way along a given "path", which would at least constrain how an initial set of beliefs would affect behavior.
    • That might be more compelling if I try to think about normal form. This is kind of an old idea of mine that I've not pursued much, in part because it's not very interesting with rational agents using Bayesian updating, but there was a remark at one of the sessions yesterday that the difference between a "dynamic game" and a "normal-form game" is that in the former one can learn along the way. If you "normalize" the dynamic game and "learn" by Bayesian updating, it turns out that, well, no, there really isn't a difference; if you start with a given set of beliefs in either the dynamic game or its normal form and work out the normal form proper* equilibria or the sequential equilibria of the dynamic game, they're the same. If learning isn't Bayesian, though, then, depending on what it is, different "dynamic" games with the same normal form might result in different outcomes. How this looks in normal form might be interesting, and might be expressible in terms of restrictions on higher order beliefs.

* I think. To get the backward induction right, you need some refinement of Nash equilibrium that at least eliminates equilibria with (weakly) dominated strategies, and I think Myerson's "proper" equilibrium is at least close to the right concept. Actually, there's a paper by Mertens from 1995 that I should probably think about harder. In any case, I think there's a straightforward normal-form equilibrium concept that will encapsulate any "learning" that goes on, and that's really my point.