## Saturday, March 26, 2016

### regressions on selected populations

Suppose there are two different sets of factors that affect some kind of outcome, and each set moves toward optimizing the outcome, but in a noisy way.  I'm imagining two gradual processes, each affecting a subset of the factors, drifting generally toward values of those factors that increase some real-valued function of them; the function itself may move, somewhat gradually and smoothly, requiring the processes to chase it.  We might imagine the factor sets as multi-dimensional, but the language is easier and more concrete if each is simply a real number: x and y follow stochastic processes that drift in the direction of higher f(x,y).

Suppose, though, that x tends to move more quickly than y, so that at any given time it tends to be closer to its optimal value.  Imagine, now, an ensemble of these systems, each trying to optimize the same function but otherwise moving independently of the others.  Cross-sectional differences in y will tend to capture differences in the extent to which the y values have adapted to the most recent function.  If the optimal value of y has been increasing, there will be a pronounced tendency for higher values of y in the ensemble to be closer to the optimum, with perhaps some overshooting; but if y adjusts slowly compared to the rate at which its optimum changes, relatively few points will have too high a value of y.  Values of x, on the other hand, will be scattered more or less randomly around the optimal value of x; if the noise in x now exceeds the residual mismatch between the average value of x and its current optimum, then the highest values of f will be associated with values of x near the middle of the distribution, rather than near one of its ends.
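This contrast between a fast and a slow adjuster is easy to check in a toy simulation.  Everything concrete here is an assumption chosen for illustration: a shared optimum that drifts upward at a constant rate, an adjustment rate of 0.5 per step for x and 0.02 for y, and identical Gaussian noise in both.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameters: 2000 independent systems, a shared optimum
# drifting upward by 0.05 per step, fast adjustment for x, slow for y.
n, steps, drift = 2000, 400, 0.05
x = np.zeros(n)
y = np.zeros(n)
opt = 0.0
for _ in range(steps):
    opt += drift                                   # the optimum keeps moving
    x += 0.5 * (opt - x) + rng.normal(0, 0.3, n)   # x adjusts quickly
    y += 0.02 * (opt - y) + rng.normal(0, 0.3, n)  # y lags far behind

ux, uy = x - opt, y - opt  # mismatch with the current optimum
print(f"x: mean lag {ux.mean():+.2f}, spread {ux.std():.2f}")
print(f"y: mean lag {uy.mean():+.2f}, spread {uy.std():.2f}")
print(f"share of y values above the optimum: {(uy > 0).mean():.1%}")
```

With these rates, x's noise swamps its tiny average lag, while the cross-section of y sits well below its optimum and is mostly one-sided, with few overshooters — the asymmetry described above.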

This all assumes more or less the right ratios — that adjustment of x isn't too slow compared to changes in its optimum, that adjustment of y is fairly slow compared to changes in its optimum, and that in some practical sense the two are intrinsically of similar importance to f(x,y) relative to the noise in each variable.  The upshot, though, is that a standard linear regression of f on x and y within the observed cross-section finds that x has little to no linear effect, while y has a substantial one.  Even if we add nonlinear terms to the regression, x contributes only second-order terms, while y has both first- and second-order terms; in general, if you could know only one of x or y for a datapoint sampled from the population, knowing y would let you predict f(x,y) with much more accuracy than knowing x would.
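That upshot can be sketched with a regression on the same kind of simulated ensemble.  The setup is again hypothetical: a concrete quadratic objective f(x,y) = -(x - opt)² - (y - opt)² in which x and y matter identically, a shared drifting optimum, and adjustment rates of 0.5 for x and 0.02 for y.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the ensemble (all parameters hypothetical, as above).
n, steps, drift = 2000, 400, 0.05
x = np.zeros(n)
y = np.zeros(n)
opt = 0.0
for _ in range(steps):
    opt += drift
    x += 0.5 * (opt - x) + rng.normal(0, 0.3, n)   # fast adjuster
    y += 0.02 * (opt - y) + rng.normal(0, 0.3, n)  # slow adjuster

# An objective to which x and y are, by construction, equally important.
f = -(x - opt) ** 2 - (y - opt) ** 2

# Cross-sectional regression of f on demeaned x and y, with quadratic terms.
xc, yc = x - x.mean(), y - y.mean()
X = np.column_stack([np.ones(n), xc, yc, xc**2, yc**2])
beta, *_ = np.linalg.lstsq(X, f, rcond=None)
print(f"linear terms:    x {beta[1]:+.2f}   y {beta[2]:+.2f}")
print(f"quadratic terms: x {beta[3]:+.2f}   y {beta[4]:+.2f}")
```

The linear coefficient on y comes out an order of magnitude larger than the one on x — the regression rewards y values that have caught up with the moving optimum — while the two quadratic coefficients are identical, reflecting that x and y matter equally to f itself.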

There's a literature suggesting that parenting is less important to a child's outcomes than genes are.  This is, however, conditional on most parents' trying to be decent parents; environment certainly can matter, as we see with the victims of Romanian orphanages and other cases of violence and neglect.  It seems likely to me that parenting isn't, in some sense, "less important" than genes; it's just that most parents have done a decent job of adjusting their parenting to modern times, so that the observed variation tends to have second-order effects, while our genes haven't caught up in the same way, and differences there tend to be first-order.