Jones’s Hive Mind: How Your Nation’s IQ Matters So Much More Than Your Own

Garett Jones has built much of his excellent Hive Mind: How Your Nation’s IQ Matters So Much More Than Your Own on foundations that, while relatively well established, are likely surprising (or even uncomfortable) for some people. Here’s a quick list off the top of my head:

  • High scores in one area of IQ tests tend to show up in others – be that visual, maths, vocabulary etc. The “g factor” can capture almost half of the variation in performance across the different tests.
  • IQ is as good as the best types of interviews at predicting employee performance (and most interviews aren’t the “best type”).
  • IQ is the best single predictor of executive performance, and for performance in the middle to high-end range of the workforce.
  • IQ predicts practical social skills. If you know someone’s IQ and are trying to predict job or school performance, there is little benefit in learning their EQ score. Conversely, if you know their EQ score, their IQ score has valuable information.
  • IQ scores in poor countries predict earning power, just as they do in developed countries.
  • Test scores such as the PISA test are better predictors of a country’s economic performance than years of education.
  • Corruption correlates strongly (negatively) with IQ.
  • IQ scores are good predictors of cooperative behaviour.

And so on.

On that last point, there was one element that I had not fully appreciated. Jones reports an experiment in which players were paired in a cooperative game. High-IQ pairs were five times more cooperative than high-IQ individuals. The link between IQ and cooperation came from smart pairs of players, not smart individual players.

Once you put all those pieces together, you reach the punchline of the book, which is an attempt to understand why the link between income and IQ, while positive both across and within countries, is of a larger magnitude across countries.

Jones’s argument builds on that of Michael Kremer’s classic paper, The O-Ring Theory of Economic Development. Kremer’s insight was that if production in an economy consists of many discrete tasks and failure in any one of those tasks can ruin the final output (such as an O-ring failure on a space shuttle), small differences in skills can drive large differences in output between firms. This can lead to high levels of inequality as the high-skilled work together in the same firm, leading them to be disproportionately more productive.
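To make the multiplication concrete, here is a minimal sketch of the O-ring idea. It assumes output is simply the product of per-task success rates. The 20 tasks, the 0.99 and 0.95 skill levels, and the additive comparison process are all hypothetical numbers and names chosen for illustration, not figures from Kremer or Jones.

```python
import math

def o_ring_output(skills):
    """Fragile chain: output is the product of each task's success rate."""
    return math.prod(skills)

def additive_output(skills):
    """Comparison case: each worker's contribution simply adds up."""
    return sum(skills)

N_TASKS = 20                      # hypothetical number of tasks in the chain
high_team = [0.99] * N_TASKS      # hypothetical high-skill team
low_team = [0.95] * N_TASKS       # hypothetical slightly lower-skill team

o_ring_ratio = o_ring_output(high_team) / o_ring_output(low_team)
additive_ratio = additive_output(high_team) / additive_output(low_team)

print(f"O-ring output ratio (high/low):   {o_ring_ratio:.2f}")
print(f"Additive output ratio (high/low): {additive_ratio:.2f}")
```

A skill gap of about 4% per task leaves the additive process almost unchanged, but more than doubles output in the multiplicative chain. The gap widens as the number of tasks grows.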

Jones extended Kremer’s argument by contemplating what the world would look like if it comprised a combination of what he calls an O-ring sector and a foolproof sector. Here’s what I wrote about Jones’s argument previously based on an article he wrote:

The foolproof sector is not as fragile as the more complex O-ring sector and includes jobs such as cleaning, gardening and other low-skill occupations. The key feature of the foolproof sector is that being of low skill (which Jones suggests relates more to IQ than quantity of education) does not necessarily destroy the final product. It only reduces the efficiency with which it is produced. A couple of low-skill workers can substitute for a high-skill worker in the foolproof sector, but they cannot effectively fill the place of a high-skill O-ring sector worker, no matter how many low-skill workers are supplied.

In this economy, low-skill workers will work in the foolproof sector as these firms will pay them more than an O-ring sector firm. High-skill workers are found in both sectors, with their level of participation in each sector such that high-skill workers are paid the same regardless of which sector they work in (the law of one price).

Thus, within a country, firms will pay high-skill workers more than their low-skill counterparts, but not dramatically so. Their wage differential is determined by the difference in their outputs in the foolproof sector.

Across countries, however, things are considerably different. The highest skill workers in a country provide labour for the O-Ring sector. If they are low skilled relative to the high-skilled in other countries, their output in that fragile sector will be much lower. This occurs even for relatively small skill differences. Their income will reflect their low output, with wages also lower in the foolproof sector as high-skill workers apportion themselves between sectors such that the law of one price holds. The net result is much lower wages for workers in comparison to another country with a higher-skill elite.

The picture is a bit more subtle than that, depending on the mix of skills in the economy (which Jones describes in more detail in both the paper and book). But the basic pattern of large income gaps between countries and small gaps within is relatively robust.

One thing I would have liked to have seen more of in the book – although I suspect this might have somewhat been counter to Jones’s objective – would have been for Jones to challenge some of the research. At times it feels like Jones is tiptoeing through a minefield – the book is peppered with distracting qualifications that you feel he has to make to broaden the audience of the book.

But that said, I’m likely not the target audience. And I like the thought of that new audience hearing what Jones has to say.

Mandelbrot (and Hudson’s) The (mis)Behaviour of Markets: A Fractal View of Risk, Ruin, and Reward

If you have read Nassim Taleb’s The Black Swan you will have come across some of Benoit Mandelbrot’s ideas. However, Mandelbrot and Hudson’s The (mis)Behaviour of Markets: A Fractal View of Risk, Ruin, and Reward offers a much clearer critique of the underpinnings of modern financial theory (there are many parts of The Black Swan where I’m still not sure I understand what Taleb is saying). Mandelbrot describes and pulls apart the contributions of Markowitz, Sharpe, Black, Scholes and friends in a way likely understandable to the intelligent lay reader. I expect that might flow from science journalist Richard Hudson’s involvement in writing the book.

Mandelbrot’s critique rests on two main pillars. The first is that – seemingly stating the obvious – markets are risky. Less obviously, Mandelbrot’s point is that market changes are more violent than often assumed. Second, trouble runs in streaks.

While Mandelbrot’s critique is compelling, it’s much harder to construct plausible alternatives. Mandelbrot offers two new metrics – α (a measure of how wildly prices vary) and H (a measure of the dependence of price changes upon past changes) – but as he notes, the method used to calculate each can result in wild variation in those measures themselves. On H, he states that “If you look across all the studies to date, you find a perplexing range of H values and no clear pattern among them.”

I’ll close this short note with a brief excerpt from near the end of the book painting a picture of what it is like to live in the world Mandelbrot describes (which just happens to be our world):

What does it feel like, to live through a fractal market? To explain, I like to put it in terms of a parable:

Once upon a time, there was a country called the Land of Ten Thousand Lakes. Its first and largest lake was a veritable sea 1,600 miles wide. The next biggest lake was 919 miles across; the third, 614; and so on down to the last and smallest at one mile across. An esteemed mathematician for the government, the Kingdom of Inference and Probable Value, noticed that the diameters scaled downwards according to a tidy, power-law formula.

Now, just beyond this peculiar land lay the Foggy Bottoms, a largely uninhabited country shrouded in dense, confusing mists and fogs through which one could barely see a mile. The Kingdom resolved to chart its neighbour; and so the surveyors and cartographers set out. Soon, they arrived at a lake. The mists barred their sight of the far shore. How broad was it? Before embarking on it, should they provision for a day or a month? Like most people, they worked out what they knew: They assumed this new land was much like their own and that the size of lakes followed the same distribution. So, as they set off blindly in their boats, they assumed they had at least a mile to go and, on average, five miles.

But they rowed and rowed and found no shore. Five miles passed, and they recalculated the odds of how far they had to travel. Again, the probability suggested: five miles to go. So they rowed further – and still no shore in sight. They despaired. Had they embarked upon a sea, without enough provisions for the journey? Had the spirits of these fogs moved the shore?

An odd story, but one with a familiar ring, perhaps, to a professional stock trader. Consider: The lake diameters vary according to a power law, from largest to smallest. Once you have crossed five miles of water, odds are you have another five to go. If you are still afloat after ten miles, the odds remain the same: another ten miles to go. And so on. Of course, you will hit shore at some point; yet at any moment, the probability is stretched but otherwise unchanged.
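The scaling in the parable can be sketched with a Pareto (power-law) distribution. For a tail exponent alpha greater than 1, the expected distance still to go after surviving x miles is x / (alpha - 1). The choice alpha = 2, which makes "five miles crossed, five to go" hold exactly, is my assumption; the excerpt does not state the exponent. An exponential distribution is included only as a contrast.

```python
def pareto_expected_remaining(travelled, alpha):
    """Expected further distance given `travelled` miles so far, for a
    Pareto tail P(X > x) proportional to x**(-alpha). Requires alpha > 1."""
    if alpha <= 1:
        raise ValueError("the conditional mean is infinite when alpha <= 1")
    return travelled / (alpha - 1)

def exponential_expected_remaining(travelled, mean):
    """Contrast case: an exponential distribution is memoryless, so the
    expected remaining distance never changes with distance travelled."""
    return mean

ALPHA = 2.0  # assumed tail exponent, chosen to match the parable

for miles in (5, 10, 20):
    power_law = pareto_expected_remaining(miles, ALPHA)
    thin_tail = exponential_expected_remaining(miles, mean=5)
    print(f"after {miles:>2} miles: power law expects {power_law:.0f} more, "
          f"exponential expects {thin_tail:.0f} more")
```

Under the power law, every mile rowed without sighting land pushes the expected shoreline further away, which is the trader's predicament in the story.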

Why prediction is pointless

One of my favourite parts of Philip Tetlock’s Expert Political Judgment is his chapter examining the reasons for “radical skepticism” about forecasting. Radical skeptics believe that Tetlock’s mission to improve forecasting of political and economic events is doomed as the world is inherently unpredictable (beyond conceding that no expertise was required to know that war would not erupt in Scandinavia in the 1990s). Before reading Expert Political Judgment, I largely fell into this radical skeptic camp (and much of me still resides in it).

Tetlock suggests skeptics have two lines of intellectual descent – ontological skeptics who argue that the properties of the world make prediction impossible, and psychological skeptics who point to the human mind as being unsuited to teasing out any predictability that might exist. Below are excerpts of Tetlock’s examinations of each (together with the occasional rejoinder by Tetlock).

Ontological skeptics

Path dependency and punctuated equilibria

Path-dependency theorists argue that many historical processes should be modeled as quirky path-dependent games with the potential to yield increasing returns. They maintain that history has repeatedly demonstrated that a technology can achieve a decisive advantage over competitors even if it is not the best long-run alternative. …

Not everyone, however, is sold on the wide applicability of increasing-returns, path-dependency views of history. Traditionalists subscribe to decreasing-returns approaches that portray both past and future as deducible from assumptions about how farsighted economic actors, working within material and political constraints, converge on unique equilibria. For example, Daniel Yergin notes how some oil industry observers in the early 1980s used a decreasing-returns framework to predict, thus far correctly, that OPEC’s greatest triumphs were behind it. They expected the sharp rises in oil prices in the late 1970s to stimulate conservation, exploration, and exploitation of other sources of energy, which would put downward pressure on oil prices. Each step from the equilibrium is harder than the last. Negative feedback stabilizes social systems because major changes in one direction are offset by counterreactions. Good judges appreciate that forecasts of prolonged radical shifts from the status quo are generally a bad bet.

Complexity theorists

Embracing complexity theory, they argue that history is a succession of chaotic shocks reverberating through incomprehensibly intricate networks. To back up this claim, they point to computer simulations of physical systems that show that, when investigators link well-established nonlinear relationships into positive feedback loops, tiny variations in inputs begin to have astonishingly large effects. …

McCloskey illustrates the point with a textbook problem of ecology: predicting how the population of a species next year will vary as a function of this year’s population. The model is x_{t+1} = f(x_t), a one-period-back nonlinear differential equation. The simplest equation is the hump: x_{t+1} = βx_t [1 – x_t], where the tuning parameter, β, determines the hump’s shape by specifying how the population of deer at t + 1 depends on the population in the preceding period. More deer mean more reproductive opportunities, but more deer also exhaust the food supply and attract wolves. The higher β is, the steeper the hump and the more precipitous the shift from growth to decline. McCloskey shows how a tiny shift in β from 3.94 to 3.935 can alter history. The plots of populations remain almost identical for several years but, for mysterious tipping-point reasons, the hypothetical populations decisively part ways twenty-five years into the simulation.
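The hump equation is easy to iterate yourself. The sketch below runs it for the two values of β quoted above. The starting population of 0.4 and the 50-period horizon are my assumptions, so the point at which the paths separate will not necessarily match McCloskey's plots.

```python
def logistic_trajectory(beta, x0, steps):
    """Iterate x_{t+1} = beta * x_t * (1 - x_t) and return the whole path."""
    path = [x0]
    for _ in range(steps):
        current = path[-1]
        path.append(beta * current * (1 - current))
    return path

X0 = 0.4      # assumed starting population (as a share of the maximum)
STEPS = 50    # assumed horizon

path_a = logistic_trajectory(3.94, X0, STEPS)
path_b = logistic_trajectory(3.935, X0, STEPS)

# Absolute gap between the two hypothetical histories in each period.
gaps = [abs(a - b) for a, b in zip(path_a, path_b)]

for t in (1, 2, 5, 10, 25, 50):
    print(f"t={t:>2}  beta=3.94: {path_a[t]:.3f}  "
          f"beta=3.935: {path_b[t]:.3f}  gap: {gaps[t]:.3f}")
```

The two paths begin almost on top of each other, and the gap then grows until the histories bear no resemblance to one another.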

We could endlessly multiply these examples of great oaks sprouting from little acorns. For radical skeptics, though, there is a deeper lesson: the impossibility of picking the influential acorns before the fact. Joel Mokyr compares searching for the seeds of the Industrial Revolution to “studying the history of Jewish dissenters between 50 A.D. and 50 B.C.”

Game theorists

Radical skeptics can counter, however, that many games have inherently indeterminate multiple or mixed strategy equilibria. They can also note that one does not need to buy into a hyperrational model of human nature to recognize that, when the stakes are high, players will try to second-guess each other to the point where political outcomes, like financial markets, resemble random walks. Indeed, radical skeptics delight in pointing to the warehouse of evidence that now attests to the unpredictability of the stock market.

Probability theorists

If a statistician were to conduct a prospective study of how well retrospectively identified causes, either singly or in combination, predict plane crashes, our measure of predictability—say, a squared multiple correlation coefficient—would reveal gross unpredictability. Radical skeptics tell us to expect the same fate for our quantitative models of wars, revolutions, elections, and currency crises. Retrodiction is enormously easier than prediction.

Psychological skeptics

Preference for simplicity

However cognitively well equipped human beings were to survive on the savannah plains of Africa, we have met our match in the modern world. Picking up useful cues from noisy data requires identifying fragile associations between subtle combinations of antecedents and consequences. This is exactly the sort of task that work on probabilistic-cue learning indicates people do poorly. Even with lots of practice, plenty of motivation, and minimal distractions, intelligent people have enormous difficulty tracking complex patterns of covariation such as “effect y1 rises in likelihood when x1 is falling, x2 is rising, and x3 takes on an intermediate set of values.”

Psychological skeptics argue that such results bode ill for our ability to distill predictive patterns from the hurly-burly of current events. …

We know—from many case studies—that overfitting the most superficially applicable analogy to current problems is a common source of error.

Aversion to ambiguity and dissonance

People for the most part dislike ambiguity—and we shall discover in chapter 3 that this is especially true of the hedgehogs among us. History, however, heaps ambiguity on us. It not only requires us to keep track of many things; it also offers few clues as to which things made critical differences. If we want to make causal inferences, we have to guess what would have happened in counterfactual worlds that exist—if “exist” is the right word—only in our imaginative reenactments of what-if scenarios. We know from experimental work that people find it hard to resist filling in the missing data points with ideologically scripted event sequences.

People for the most part also dislike dissonance … Unfortunately, the world can be a morally messy place in which policies that one is predisposed to detest sometimes have positive effects and policies that one embraces sometimes have noxious ones. … Dominant options—that beat the alternatives on all possible dimensions—are rare.

Need for control

[P]eople will generally welcome evidence that fate is not capricious, that there is an underlying order to what happens. The core function of political belief systems is not prediction; it is to promote the comforting illusion of predictability.

The unbearable lightness of our understanding of randomness

Our reluctance to acknowledge unpredictability keeps us looking for predictive cues well beyond the point of diminishing returns. I witnessed a demonstration thirty years ago that pitted the predictive abilities of a classroom of Yale undergraduates against those of a single Norwegian rat. The task was predicting on which side of a T-maze food would appear, with appearances determined—unbeknownst to both the humans and the rat—by a random binomial process (60 percent left and 40 percent right). The demonstration replicated the classic studies by Edwards and by Estes: the rat went for the more frequently rewarded side (getting it right roughly 60 percent of the time), whereas the humans looked hard for patterns and wound up choosing the left or the right side in roughly the proportion they were rewarded (getting it right roughly 52 percent of the time). Human performance suffers because we are, deep down, deterministic thinkers with an aversion to probabilistic strategies that accept the inevitability of error. … This determination to ferret out order from chaos has served our species well. We are all beneficiaries of our great collective successes in the pursuit of deterministic regularities in messy phenomena: agriculture, antibiotics, and countless other inventions that make our comfortable lives possible. But there are occasions when the refusal to accept the inevitability of error—to acknowledge that some phenomena are irreducibly probabilistic—can be harmful.

Political observers run the same risk when they look for patterns in random concatenations of events. They would do better by thinking less. When we know the base rates of possible outcomes—say, the incumbent wins 80 percent of the time—and not much else, we should simply predict the more common outcome.
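The 60 and 52 percent figures in the T-maze story fall straight out of the arithmetic. A minimal sketch, assuming each trial is independent and that the humans' guesses are unrelated to the actual outcome on any given trial (the function name is mine):

```python
def expected_accuracy(p_left, guess_left_rate):
    """Chance of a correct guess when food appears on the left with
    probability p_left and the guesser independently picks left with
    probability guess_left_rate."""
    return p_left * guess_left_rate + (1 - p_left) * (1 - guess_left_rate)

P_LEFT = 0.6

# The rat's strategy: always go to the more frequently rewarded side.
maximising = expected_accuracy(P_LEFT, guess_left_rate=1.0)

# The undergraduates' strategy: choose left about 60 percent of the time.
matching = expected_accuracy(P_LEFT, guess_left_rate=P_LEFT)

print(f"always pick the common side: {maximising:.2f}")
print(f"probability matching:        {matching:.2f}")
```

Matching yields 0.6 × 0.6 + 0.4 × 0.4 = 0.52, so every guess spent on the rarer side costs accuracy on average.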

Rosenzweig’s The Halo Effect … and the Eight Other Business Delusions That Deceive Managers

Phil Rosenzweig’s The Halo Effect … and the Eight Other Business Delusions That Deceive Managers is largely an exercise of shooting fish in a barrel, but is an entertaining read regardless.

The central premise of the book is that most blockbuster business books (think Good to Great), for all the claims of scientific rigour, are largely exercises in storytelling.

The problem starts because it is difficult to understand company performance, even as it unfolds before our eyes. Most people don’t know what good leadership looks like. It is hard to know what makes good communication or optimal cohesion or good customer service. The result of this difficulty is that people tend to allow good performance in the areas that they can measure (such as profits) to contaminate their assessment of other company attributes. They endow the company with a halo.

So when a researcher asks people to rate company attributes when they know the business outcome, those ratings are contaminated. If profits are up, people will assign positive attributes to that company. If times are bad, they will assign negative attributes. We exaggerate strength during booms, and faults during falls. All the factors responsible for a company’s rise might suddenly become the reasons for the fall, or be claimed to have never existed in the first place.

As an example, Apple currently has a clean sweep of all nine attributes in Fortune’s “World’s Most Admired Companies” poll – everything from social responsibility to long-term investment value. Is there not a single company in the world that is better than Apple on any of these nine? As Rosenzweig notes, when asked nine questions, people don’t have nine different opinions. They just give their general impression nine times.

Rosenzweig points to one nice experiment by Barry Staw (replicated?), in which Staw asked groups to project sales and earnings based on financial data. These groups were then given random feedback on their performance. Those with better feedback described their groups as cohesive, motivated and open to change, while those whom the experimenter told they had performed poorly said there was a lack of communication, poor motivation and so on.

Many of the other delusions in the book are likely familiar to someone who knows a bit about stats or experimental design. Don’t confuse correlation and causation. Rigour is not defined by quantity of data. Do not use samples comprising only successes. Social science isn’t physics.

Other delusions are less often stated. Don’t be deluded into believing single explanations. If you add up the explained variance across the various single-explanation business studies, you’ll explain 100% of the variance many times over. The explanations are likely correlated. And following a simple formula won’t necessarily work for a business as performance is relative. What if all your competitors also follow the same formula?

The book closes with Rosenzweig’s spin on what leads to company success, which seems out-of-place after the preceding chapters. Some of it makes sense, such as when Rosenzweig points to the need to acknowledge the role of chance, which is almost never threaded into stories of business success. But Rosenzweig’s punchline of the need for strategy and execution feels just like the type of storytelling that he critiques.

Further, when Rosenzweig assesses the performance of three models of good managers – who he approvingly notes share a probabilistic view of the world, realise the role of luck, can make deals under uncertainty, and recognise the need to be vigilant on the changing competitive landscape – it is hard to even agree that all of their actions were successes. Robert Rubin was one of the three. Rosenzweig classes Rubin’s support of the decision to bail out Mexico during the 1995 peso crisis (or more like the exposed US banks – moral hazard anyone?) as a good decision based on the outcome. What is the objective fact not contaminated by a halo? Rosenzweig ends by defending Rubin – who supported deregulation of derivatives trading and was on the board of Citigroup when it was bailed out during the financial crisis – as being more often right than wrong. If nothing else, the strange close to an otherwise good book did give me one more book for the reading pile – Rubin’s In an Uncertain World.

Tetlock and Gardner’s Superforecasting: The Art and Science of Prediction

Philip Tetlock and Dan Gardner’s Superforecasting: The Art and Science of Prediction doesn’t quite measure up to Tetlock’s superb Expert Political Judgment (read EPJ first), but it contains more than enough interesting material to make it worth the read.

The book emerged from a tournament conducted by the Intelligence Advanced Research Projects Activity (IARPA), designed to pit teams of forecasters against each other in predicting political and economic events. These teams included Tetlock’s Good Judgment Project (also run by Barbara Mellers and Don Moore), a team from George Mason University (for which I was a limited participant), and teams from MIT and the University of Michigan.

The result of the tournament was such a decisive victory by the Good Judgment Project during the first 2 years that IARPA dropped the other teams for later years. (It wasn’t a completely open fight – prediction markets could not use real money. Still, Tetlock concedes that the money-free prediction markets did pretty well, and there is scope to test them further in the future.)

Tetlock’s formula for a successful team is fairly simple. Get lots of forecasts, calculate the average of the forecasts, and give extra weight to the top forecasters – a version of wisdom of the crowds. Then extremise the forecast. If the forecast is a 70% probability, bump up to 85%. If 30%, cut it to 15%.

The idea behind extremising is quite clever. No one in the group has access to all the dispersed information. If everyone had all the available information, this would tend to raise their confidence, which would result in a more extreme forecast. Since we can’t give everyone all the information, extremising is an attempt to simulate what would happen if you did. To get the benefits of this extremising, however, requires diversity. If everyone holds the same information there is no sharing of information to be simulated.
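Here is a minimal sketch of the recipe: a weighted average followed by an extremising step. The power transform used here, and the exponent of 2, are my assumptions about one common way to push a probability away from 50%; this post does not record the exact formula the Good Judgment Project used. The forecasts and weights are made up.

```python
def weighted_mean(forecasts, weights):
    """Pool individual probability forecasts, weighting better forecasters more."""
    total_weight = sum(weights)
    return sum(f * w for f, w in zip(forecasts, weights)) / total_weight

def extremise(p, exponent):
    """Push a probability away from 0.5 (for exponent > 1) using
    p**a / (p**a + (1 - p)**a). An exponent of 1 leaves p unchanged."""
    return p ** exponent / (p ** exponent + (1 - p) ** exponent)

# Hypothetical crowd: the third forecaster has the best track record.
forecasts = [0.60, 0.65, 0.80]
weights = [1.0, 1.0, 2.0]

pooled = weighted_mean(forecasts, weights)
final = extremise(pooled, exponent=2.0)

print(f"pooled forecast:     {pooled:.3f}")
print(f"extremised forecast: {final:.3f}")
print(f"0.70 becomes {extremise(0.70, 2.0):.3f}, 0.30 becomes {extremise(0.30, 2.0):.3f}")
```

With an exponent of 2, a 70% forecast moves to roughly 84% and a 30% forecast to roughly 16%, close to the 85% and 15% in the example above. A forecast of exactly 50% does not move, since there is no direction to push it in.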

But the book is not so much about why the Good Judgment Project was superior to the other teams. Mostly it is about the characteristics of the top 2% of the Good Judgment Project forecasters – a group that Tetlock calls superforecasters.

Importantly, “superforecaster” is not a label given on the basis of blind luck. The correlation in forecasting accuracy for Good Judgment Project members between one year and next was around 0.65. 70% of superforecasters stay in the top 2% the following year.

Some of the characteristics of superforecasters are to be expected. Whereas the average Good Judgment participant scored better than 70% of the population on IQ, superforecasters were better than about 80%. They were smarter, but not markedly so.

Tetlock argues much of the difference lies in technique, and this is where he focused. When faced with a complex question, superforecasters tended to first break it into manageable pieces. For the question of whether French or Swiss inquiries would discover elevated levels of polonium in Yasser Arafat’s body (had he been poisoned?), they might ask whether polonium (which decays) could be found in a man dead for years, what ways could polonium have made its way into his body etc. They don’t jump straight to the implied question of whether Israel poisoned Arafat (which the question was technically not about).

Superforecasters also tended to take the outside view for each of these sub-questions. What is the base rate of this event? (Not so easy for this Arafat question.) It is only then that they take the “inside view” by looking for information idiosyncratic to that particular question.

The most surprising finding (to me) was that superforecasters were highly granular in their probability forecasts and granularity predicts accuracy. People who stick to tens (10%, 20%, 30% etc) are less accurate than those who stick to fives (5%, 10%, 15% etc), who are less accurate than those who use ones (35%, 36%, 37% etc). Rounding superforecaster estimates reduces their accuracy, although this has little effect on regular forecasters. A superforecaster will distinguish between 63% and 65%, and this makes them more accurate.
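To see why rounding can cost accuracy, consider a forecaster whose fine-grained probabilities are exactly right. This is a sketch under that strong assumption, with accuracy measured by the expected Brier (squared-error) score. The three probabilities are made up.

```python
def expected_brier(forecast, true_prob):
    """Expected squared error of `forecast` when the event really occurs
    with probability `true_prob`. Lower is better."""
    return true_prob * (1 - forecast) ** 2 + (1 - true_prob) * forecast ** 2

def round_to(p, step):
    """Round a probability to the nearest multiple of `step`."""
    return round(p / step) * step

def mean_rounding_penalty(true_probs, step):
    """Average increase in expected Brier score caused by rounding forecasts
    that would otherwise equal the true probabilities."""
    penalties = [
        expected_brier(round_to(p, step), p) - expected_brier(p, p)
        for p in true_probs
    ]
    return sum(penalties) / len(penalties)

true_probs = [0.63, 0.37, 0.92]  # hypothetical well-judged forecasts

tens = mean_rounding_penalty(true_probs, 0.10)
fives = mean_rounding_penalty(true_probs, 0.05)
ones = mean_rounding_penalty(true_probs, 0.01)

print(f"penalty when rounding to tens:  {tens:.6f}")
print(f"penalty when rounding to fives: {fives:.6f}")
print(f"penalty when rounding to ones:  {ones:.6f}")
```

The penalty is the squared rounding error, so coarser rounding costs more. If a forecaster's errors are already much larger than the rounding step, the extra loss is lost in the noise, which fits the finding that rounding barely affects regular forecasters.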

Partly this granularity is reflected in the updates they make when new information is obtained (although they are also more accurate on their initial estimate). Being a superforecaster requires monitoring the news, and reacting the right amount. There are occasional big updates – which Tetlock suggests superforecasters can make because they are not tied to their forecasts like a professional pundit – but most of the time the tweaks represent an iteration toward an answer.

Tetlock suggests such fine-grained distinctions would not come to people naturally, as making them would not have been evolutionarily favourable. If there is a lion in grass, there are three likely responses – yes, no, maybe – not 100 shades of grey. But the reality is there needs to be a threshold for each, and evolution can act on fine distinctions. A gene that leads people to apply “run” with 1% greater accuracy over many generations will spread.

Superforecasters also suffer less from scope insensitivity. People will pay roughly the same amount to save 2,000 or 200,000 migrating birds. Similarly, when asked whether an event will occur in the next 6 or 12 months, regular forecasters would predict approximately the same probability of the event occurring. Conversely, superforecasters tend to spot the difference in timeframes and adjust their probabilities accordingly, although they did not exhibit perfect scope sensitivity. I expect an explicit examination of base rates would help in reducing scope insensitivity, as a base rate will tend to relate to a timeframe.

A couple of the characteristics Tetlock gives to the superforecasters seem a bit fluffy. Tetlock describes them as having a “growth mindset”, although the evidence presented simply suggests that they work hard and try to improve.

Similarly, Tetlock labels the superforecasters as having “grit”. I’ll just call them conscientious.

Beyond the characteristics of superforecasters, Tetlock revisits a couple of themes from Expert Political Judgment. As a start, there is a need to apply numbers to forecasts, or else they are fluff. Tetlock relates the story of Sherman Kent asking intelligence officers what they took the words “serious possibility” in a National Intelligence Estimate to mean (this wording relating to the possibility of a Soviet invasion of Yugoslavia in 1951). The answer turned out to be anything between a 20% and an 80% probability.

Then there is a need for scoring against appropriate benchmarks – such as no change or the base rate. As Tetlock points out, lauding Nate Silver for picking 50 of 50 states in the 2012 Presidential election is a “tad overwrought” if compared to the no-change prediction of 48.

One contrast with the private Expert Political Judgment project was that forecasters in the public IARPA tournament were better calibrated. While the nature of the questions may have been a factor – the tournament questions related to shorter timeframes to allow the tournament to deliver results in a useful time – Tetlock suggests that publicity creates a form of accountability. There was also less difference between foxes and hedgehogs in the public environment.

One interesting point buried in the notes is where Tetlock acknowledges the various schools of thought around how accurate people are, such as the work by Gerd Gigerenzer and friends on the accuracy of our gut instincts and simple heuristics. Without going into a lot of detail, Tetlock declares the “heuristics and biases” program is the best approach to bring error rates in forecasting down. The short training guidelines – contained in the Appendix to the book and targeted to typical biases – improved accuracy by 10%. While Tetlock doesn’t really put his claim to the test by comparing all approaches (What would a Gigerenzer-led team do?), the evidence of the success of the Good Judgment team makes it hard, at least for the moment, to argue with.

Tetlock’s Expert Political Judgment: How Good Is It? How Can We Know?

A common summary of Philip Tetlock’s Expert Political Judgment: How Good Is It? How Can We Know? (2006) is that “experts” are terrible forecasters. There is some truth in that summary, but I took a few different lessons from the book. While experts are bad, others are worse. Simple algorithms and more complex models outperform experts. And importantly, forecasting itself is not a completely pointless task.

Tetlock’s book reports on what must be one of the grander undertakings in social science. Cushioned by his recently gained tenure, Tetlock asked a range of experts to predict future events. With the need to see how the forecasts panned out, the project ran for almost 20 years.

The basic methodology was to ask each participant to rate three possible outcomes for a political or economic event on a scale of 0 to 10 on how likely each outcome is (with, assuming some basic mathematical literacy, the sum allocated to the three options being 10). An example question might be whether a government will retain, lose or strengthen its position after the next election. Or whether GDP growth will be below 1.75 per cent, between 1.75 per cent and 3.25 per cent, or above 3.25 per cent.

Once the results were in, Tetlock scored the participants on two dimensions – calibration and discrimination. To get a high calibration score, the frequency with which events are predicted needs to correspond with their actual frequency. For instance, events predicted to occur with a 10 per cent probability need to occur around 10 per cent of the time, and so on. Given experts made many judgments, these types of calculations could be made.

To score highly on discrimination, the participant needs to assign a score of 1.0 to things that happen and 0 to things that don’t. The closer to the ends of the scale for predictions, the higher the discrimination score. It is possible to be perfectly calibrated but a poor discriminator (fence sitter) through to a perfect discriminator (only using the extreme values correctly).
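To make the two scores concrete, here is a minimal sketch in Python. It assumes the calibration and discrimination indices from the standard Murphy decomposition of the Brier score, which may differ in detail from the scoring Tetlock actually used. The function name and the toy forecasts are my own illustration, not anything from the book.

```python
from collections import defaultdict


def calibration_and_discrimination(forecasts, outcomes):
    """Return (calibration_index, discrimination_index).

    forecasts: stated probabilities between 0 and 1
    outcomes:  1 if the event happened, 0 if it did not
    Calibration index: 0 is perfect, larger is worse.
    Discrimination index: 0 is uninformative, larger is better.
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n

    # Group outcomes by the probability that was stated for them.
    groups = defaultdict(list)
    for p, happened in zip(forecasts, outcomes):
        groups[p].append(happened)

    calibration = 0.0
    discrimination = 0.0
    for p, results in groups.items():
        observed = sum(results) / len(results)  # how often it really happened
        weight = len(results) / n
        calibration += weight * (p - observed) ** 2
        discrimination += weight * (observed - base_rate) ** 2
    return calibration, discrimination


# Hypothetical data: four events, half of which occurred.
outcomes = [1, 0, 1, 0]
fence_sitter = calibration_and_discrimination([0.5, 0.5, 0.5, 0.5], outcomes)
sharp = calibration_and_discrimination([1.0, 0.0, 1.0, 0.0], outcomes)
```

Both toy forecasters are perfectly calibrated (index 0), but only the second discriminates: the fence sitter scores 0 and the sharp forecaster 0.25, which is the maximum available when the base rate is 50 per cent.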

From Tetlock’s analysis of these scores come the headline findings of the book. I take them as:

  • Experts, who typically have a doctorate and average 12 years’ experience in their field, barely outperform “chimps” – the chimps being allocation of equal probability of 33 per cent to each potential outcome.
  • However – and this point is one you rarely hear in commentary about the book – the experts outperform unsophisticated forecasters (a role filled by Berkeley undergrads), whose performance is truly woeful. So, when people lament about experts after reading this book, be even more afraid of the forecasts of the general population.
  • The experts were not differentiated on a range of dimensions, such as years of experience or whether they are forecasting on their area of expertise. Subject matter expertise translates into confidence more than it translates into forecasting accuracy.
  • The one dimension where forecast accuracy was differentiated is on what Tetlock calls the fox-hedgehog continuum (borrowing from Isaiah Berlin). Hedgehogs know one big thing and aggressively expand that idea into all domains, whereas foxes know many small things, are skeptical of grand ideas and stitch together diverse, sometimes conflicting information. Foxes are more willing to change their minds in response to the unexpected, more likely to remember past mistakes, and more likely to see the case for opposing outcomes. And foxes outperformed on both measures of calibration and discrimination.
  • Experts are outperformed by simple algorithms that predict the continuation of the recent past into the future and vastly so by more sophisticated models (generalised autoregressive distributed lag). Political observers would be better off thinking less, and if they know the base rates of possible outcomes, they should simply predict the most common.

As Bryan Caplan argues, Tetlock gives experts a harder time than they might deserve. The “chimps” are helped by a combination of hard questions and constrained answer fields. There is no option to predict one million per cent growth in GDP next year. We might expect experts to shine more if there were “dumb” questions. Further, the mentions of the horrible performance of the Berkeley undergrads, the proxy for unsophisticates, are rare. On the flipside, a baseline for assessment should not be the chimp or these undergrads, but the simple extrapolation algorithms – and there experts measure poorly.

The expected behaviour of the experts may provide a partial defence. They are filling out a survey, and are unlikely to generate a model for every question. Many judgements were likely off the top of the head, with no serious stakes (including no public shaming). This does, however, raise the question of why they were so hopeless in their own fields of expertise where they might have some of these models available.

So what is it about foxes and hedgehogs that leads to differences in performance?

As a start, the approach of foxes lines up with the existing literature on forecasting. This literature shows that average predictions of forecasters are generally more accurate than the majority of forecasters for whom the averages are computed, trimming outliers further enhances accuracy, and there is opportunity for further improvement through the Delphi technique. In line with this, Tetlock suggests foxes factor in conflicting considerations in a flexible weighted-averaging fashion into their judgements.
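The averaging result is easy to see in a toy example. The sketch below uses five made-up forecasts for an event that did occur, scores each by squared error, and compares them with the simple average and a trimmed average. The numbers and function names are my own assumptions for illustration, not data from the book.

```python
def squared_error(forecast, outcome):
    """Brier-style error for a single forecast (0 is perfect)."""
    return (forecast - outcome) ** 2


def trimmed_mean(values, trim=1):
    """Mean after dropping the `trim` lowest and `trim` highest values."""
    ordered = sorted(values)
    kept = ordered[trim:len(ordered) - trim]
    return sum(kept) / len(kept)


# Hypothetical probabilities five forecasters gave to an event that happened.
forecasts = [0.2, 0.5, 0.6, 0.9, 0.95]
outcome = 1

individual_errors = [squared_error(f, outcome) for f in forecasts]
mean_forecast = sum(forecasts) / len(forecasts)
consensus_error = squared_error(mean_forecast, outcome)

# How many individual forecasters did the averaged forecast beat?
beaten = sum(e > consensus_error for e in individual_errors)

trimmed_error = squared_error(trimmed_mean(forecasts), outcome)
```

Here the averaged forecast beats three of the five forecasters, and in this particular example trimming the two extremes lowers the error a little further. The only part that holds in general is weaker: because squared error is convex, the averaged forecast can never score worse than the average of the individual scores. Whether it beats a majority depends on the spread of forecasts.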

Next, foxes are better Bayesians in that they update their beliefs in response to new evidence and in proportion to the extremity of the odds they placed on possible outcomes. They weren’t perfect Bayesians, however – when surprised by a result, Tetlock calculated that foxes moved around 59 per cent of the prescribed amount compared to 19 per cent for hedgehogs. In some of the exercises, hedgehogs moved in the opposite direction.
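As a rough illustration of what moving 59 or 19 per cent of the prescribed amount means, the sketch below computes a full Bayesian update and then a partial one. The prior and likelihoods are invented, and treating the shortfall as a linear fraction of the move in probability is my assumption; I have not tried to reproduce how Tetlock measured it.

```python
def bayes_posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Posterior probability of a hypothesis after seeing the evidence."""
    numerator = prior * p_evidence_if_true
    return numerator / (numerator + (1 - prior) * p_evidence_if_false)


def partial_update(prior, posterior, fraction):
    """Move only `fraction` of the way from the prior to the posterior."""
    return prior + fraction * (posterior - prior)


# Hypothetical case: 70 per cent confident in a worldview, then surprised by
# an outcome that was four times as likely if the worldview is wrong.
prior = 0.7
posterior = bayes_posterior(prior, p_evidence_if_true=0.2, p_evidence_if_false=0.8)

fox = partial_update(prior, posterior, 0.59)       # moves 59% of the way
hedgehog = partial_update(prior, posterior, 0.19)  # moves 19% of the way
```

With these inputs a full Bayesian would drop from 0.70 to roughly 0.37. The fox ends up near 0.50 and the hedgehog near 0.64, still close to where they started.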

There was a lot of evidence that both foxes and hedgehogs were more egocentric than natural Bayesians. A natural Bayesian would consider the probability of the event occurring if their view of the world is correct (which also has a probability attached to it) and the probability of the event occurring if their understanding of the world was wrong. But few spontaneously factored other views into their assessment of probabilities. When Tetlock broke down his experts’ predictions, the odds were almost always calculated based on their interpretation of the world being correct.
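The calculation a natural Bayesian would make is just the law of total probability. A minimal sketch, with made-up numbers and a function name of my own choosing:

```python
def event_probability(p_view_correct, p_event_if_correct, p_event_if_wrong):
    """Probability of the event, weighted across both views of the world."""
    return (p_view_correct * p_event_if_correct
            + (1 - p_view_correct) * p_event_if_wrong)


# Egocentric forecaster: acts as if their own view is certainly right.
egocentric = event_probability(1.0, p_event_if_correct=0.9, p_event_if_wrong=0.3)

# Natural Bayesian: allows a 30 per cent chance that their view is wrong.
hedged = event_probability(0.7, p_event_if_correct=0.9, p_event_if_wrong=0.3)
```

The egocentric forecast is simply the probability under the forecaster’s own view, while the hedged one is pulled toward what the rival view implies.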

Foxes were also less prone to hindsight effects. Many experts claimed that they assigned higher probabilities to outcomes that materialised than they did. As Tetlock notes, it is hard to say someone got it wrong if they think they got it right. (Is hindsight bias, as suggested by one hedgehog, an adaptive mechanism that unclutters the mind?)

The chapter of the book where the hedgehogs wheel out the defences against their poor performance is somewhat amusing. As Tetlock points out, forecasters who thought they were good at the beginning sounded like radical skeptics about the value of forecasting by the end.

The experts commonly pointed out that their prediction was a near miss, so the result shouldn’t be held against them. But almost no-one said don’t hold the non-occurrence of an event against others who predicted it.

They also tended to claim that “I made the right mistake”, as it is better to be safe than sorry. But all of Tetlock’s attempts to adjust the scoring to help hedgehogs in these cases failed to close the gap.

Some hedgehogs claimed that the questions were not over a long enough time period. There are irreversible trends at work in the world today, and while specific events might be hard to predict, the shape of the world in the long-term is clear. But the problem is that hedgehogs were ideologically diverse, and only a few could be right about any long-term trends that exist.

One thing that might be said in favour of the hedgehogs is that the accuracy of the average of hedgehog forecasts was similar to the average of fox forecasts. The average fox forecast beats about 70% of foxes, but the average hedgehog forecast beats 95% of hedgehogs. The hedgehogs benefit in that the more extreme mistakes are balanced out. The result is that a team of hedgehogs might curtail each other’s excesses.

A better angle of defence is that the real goal of forecasting is political impact or reputation, where only the confident survive. Hedgehogs are also good at avoiding distraction in high noise environments, which becomes apparent when examining the major weakness of foxes.

Tetlock put some of his experts through a scenario exercise. In this exercise, the high-level forecasts were branched into a large number of sub-scenarios, each of which had to be allocated a probability. For example, when given the question of whether Canada would break up (this was around the time of the Quebec separatist referendum), combinations of outcomes involving separatist party success at elections, referendum results, economic downturns and levels of acrimony were presented, rather than the simple question of whether Quebec would secede or not.

As has been shown in the behavioural literature, when this type of task is undertaken, the likelihood of the components often sums to more than one. For the Quebec question, the initial probabilities added up to 1.0 for the basic question – as expected – but to an average of 1.58 for the branched scenarios. Foxes, however, suffered the most in this exercise, producing estimates that summed to 2.09.

To constrain this problem, it is common to end the branching exercise with a requirement to adjust the probabilities such that they add to one. But the foxes tended not to end up where they started for the simple question, with the branching followed by adjustment reducing their forecasting accuracy down to the level of hedgehogs.
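The adjustment step can be sketched as a simple rescaling. The scenario labels and individual probabilities below are hypothetical; I have only chosen them so that they sum to the 2.09 reported for foxes. Proportional rescaling is one obvious way to force the total back to one, and I am not claiming it is how Tetlock’s participants did it.

```python
def normalise(probabilities):
    """Rescale probabilities proportionally so that they sum to one."""
    total = sum(probabilities.values())
    return {name: p / total for name, p in probabilities.items()}


# Hypothetical branched scenarios for a single yes/no question.
branched = {
    "scenario A": 0.55,
    "scenario B": 0.60,
    "scenario C": 0.50,
    "scenario D": 0.44,
}

raw_total = sum(branched.values())  # more than one: the forecasts are incoherent
adjusted = normalise(branched)
adjusted_total = sum(adjusted.values())
```

Rescaling keeps the relative ranking of the scenarios but nothing ties the adjusted numbers back to the answer the forecaster originally gave to the simple question, which is consistent with foxes not ending up where they started.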

Given the net result of the scenario exercise was to confuse foxes and fail to open the minds of hedgehogs, it could be argued to be a low-value exercise. For people advocating scenario development, pre-mortems and red teaming, the possibly deleterious effects on some forecasters need to be considered.

In sum, it’s a grand book. There are some points where deeper analysis would have been handy – such as when he suggests there is disagreement from “psychologists who subscribe to the argument that fast-and-frugal heuristics – simple rules of thumb – perform as well as, or better than, more complex, effort demanding algorithms” without actually examining whether they are at odds with his findings of the forecasting superiority of foxes. But that’s a small niggle in a fine piece of work.

Bias in the World Bank

Last year’s World Development Report 2015: Mind, Society and Behaviour from the World Bank documents many of what seem to be successful behavioural interventions. Many of the interventions are quite interesting and build a case that a behavioural approach can add something to development economics.

The report also rightly received some praise for including a chapter which explored the biases of development professionals. World Bank staff were shown to subjectively interpret data differently depending on the frame, to suffer from the sunk cost bias and to have little idea about the opinions of the poor people they might help. Interestingly, in the brief discussion about what can be done to counteract these biases, there is little discussion about whether it might be better to simply not conduct certain projects.

On a more critical front, Andreas Ortmann sent me a copy of his review of the report that was published in the Journal of Economic Psychology. Ortmann has already put a lot of my reaction into words, so here is an excerpt (a longer excerpt is here):

What the Report does not do, unfortunately, is the kind of red teaming that it advocates as “one way to overcome the natural limitations on judgement among development professionals … In red teaming, an outside group has the role of challenging the plans, procedures, capabilities, and assumptions of an operational design, with the goal of taking the perspective of potential partners or adversaries. …” …

Overall, and notwithstanding the occasional claim of systematic reviewing (p. 155 fn 6), the sampling of the evidence seems often haphazard and partisan. Take as another example, in chapter 7, the discussion of reference points and daily income targeting that was started by Camerer, Babcock, Loewenstein, and Thaler (1997) and brought about studies such as Fehr and Goette (2007). These studies suggested that taxi drivers and bike messengers in high-income settings have target earnings or target hours and do not intertemporally maximize allocation of labor and leisure. The problem with the argument is that several follow-up studies (prominently, the studies by Farber (2005, 2008)) questioned the earlier results. Here no mention is made of these critical studies. Instead the authors argue that the failure to maximize intertemporally can also be found in low-income settings. They cite an unpublished working paper investigating bicycle taxi drivers in Kenya and another unpublished working paper citing fishermen in India. Tellingly, the authors (and the scores of commentators that gave them feedback) did not come across a paper, now forthcoming in Journal of Labor Economics, that has been circulating for a couple of years (see Stafford, in press) and that shows, and shows with an unusually rich data set for Florida lobster fishermen, that both participation decisions and hours spent on sea are consistent with a neoclassical model of labor supply. …

There are dozens of other examples of review of the literature that I find troublingly deficient on the basis of articles I know. … But, as mentioned and as I have illustrated with examples above, there is little red teaming on display here. Not that that is a particularly new development. Behavioural Economics, not just in my view, has since the beginning been oversold and much of that over-selling was done by ignoring the considerable controversies that have swirled around it for decades (Gigerenzer, 1996; Kahneman & Tversky, 1996 anyone? …).

The troubling omission of contrarian evidence and critical voices on display in the Report is deplorable because there are important insights that have come out of these debates and the emerging policy implications would be based on less shifty ground if these insights would be taken into account in systematic ways. If you make the case for costly and policy interventions that might affect literally billions of people, you ought to make sure that the evidence on which you base your policy implications is robust.

In sum, it seems to me that the resources that went into the Report would have been better spent had there been adversarial collaborations (Mellers, Hertwig, & Kahneman, 2001) and/or had reviews gone through a standard review process which hopefully would have forced some clear-cut and documented review criteria. A long list of people that gave feedback is not a good substitute for institutional quality control.

Kaufmann’s Shall the Religious Inherit the Earth?: Demography and Politics in the Twenty-First Century

While I suggested in my post on Jonathan Last’s What to Expect When No One’s Expecting that reading about demographics in developed countries was not uplifting, the consequences described by Last could be considered pretty minor.

A slight tightening of government budgets could be dealt with by raising pension ages by a few years. Incomes may be lower than otherwise, but as Last states, “A decline in lifestyle for a middle-class American retiree might mean canceling cable, moving to a smaller apartment, and not eating out.” Not exactly disastrous – although of more consequence than the subject of almost every other economic debate.

I found it harder to generate the same blasé reaction to Eric Kaufmann’s Shall the Religious Inherit the Earth?: Demography and Politics in the Twenty-First Century. I don’t have a lot of confidence in most long-term projections of fertility, population, religious retention and social opinions, but even if the world described by Kaufmann has only a 10 per cent chance of occurring, it is worth thinking about.

Kaufmann’s basic argument is that the higher fertility of fundamentalist religious groups, together with their high rates of retention, is going to shift the make-up of populations in the West over the next century, profoundly affecting our politics and freedoms.

The important word in the above sentence is fundamentalist. This is not a case of religious groups breeding faster than the irreligious. Fertility levels for many groups are rapidly converging in the West. Muslim family sizes are shrinking. Catholic families are no larger than those of Protestants.

Where the action lies is within each faith. There the fundamentalists have markedly higher fertility than both the moderates and seculars. And, if anything, that gap is widening.

To give a sense of the power of this higher fertility, the Old Order Amish in the United States have increased from 5,000 people in 1900 to almost a quarter of a million members. In the United Kingdom, Orthodox Jews make up 17 per cent of the Jewish population but three-quarters of Jewish births.

At one point Kaufmann likens the process to the development by insects of resistance to DDT (although he spends little time on the heritability of religiosity). The growth of secularism has produced new resistant strains of religion, with the middle ground between fundamentalism and irreligion hemorrhaging people, revealing a fundamentalist core.

Kaufmann labels these high fertility religious groups as endogenous growth sects. They grow their own rather than convert – mainstream fundamentalists recognise this is where their advantage lies – and they have high rates of retention for their home-grown. As an example, three-quarters of the relative growth in conservative Protestant denominations in the United States in the 20th century was due to fertility differences, not conversion.

So what does this change mean? Kaufmann argues that we may have reached the peak of secular liberalism. The growth of these fundamentalist religious groups is going to start influencing policy and leading to less liberal outcomes.

As a start, fundamentalist Christian, Muslim and Jewish groups have elevated the most illiberal aspects of their traditions to the status of sacred symbols – be that outlandish dress requirements (often of quite recent origin) or positions on women’s roles and family size. This has helped inoculate them against secular trends.

For the United States, those who believe homosexuality or abortion is always wrong have a growing fertility advantage and they are becoming a larger part of the population. Combined with the tendency of children to adopt the positions of their parents, Kaufmann projects a slight increase in those who oppose abortion by mid-century, whereas opposition to homosexuality will decline only marginally. By the end of the century, however, opposition to abortion could increase from 60 to 75 per cent, and increases in opposition to homosexuality will reverse changes in opinion of the last few decades.

Kaufmann projects similar trends will occur in Europe, and he argues that you can’t speak of secular Europe and religious immigrant minorities. In the future the children of the religious minorities will be Europe. Most large European countries will have between 10 and 15 per cent Muslim population in 2050 (up from mid-single digits today; Sweden will be more like 20 to 25 per cent). Depending on whether fertility converges, that proportion will grow through to 2100. And importantly for Kaufmann’s thesis, this growth will largely relate to the fundamentalist core.

Kaufmann goes on to suggest that the growth of these fundamentalist groups points to a contradiction in liberalism. The combination of tolerance of fundamentalism with a choice not to reproduce may well be the agent that destroys it. To do other than tolerate would be against liberal principles.

Kaufmann also discusses the implications for world politics. One starting point – hard to perceive in the West – is that the world is becoming more religious and is projected to become more so. While rich nations are still tending more secular (for the moment), poorer religious regions are growing faster.

With nation-state boundaries generally well-defined, demographic changes within states are the main cause of change in relative size – and superpowers tend to be demographic heavyweights (although to what extent this holds through the 21st century will be interesting to see). Kaufmann quotes Jackson and Howe that it is “[D]ifficult to find any major instance of a state whose regional or global stature has risen while its share of the regional or global population has declined.”

Thus, if you are someone who worries about international geopolitics, trends aren’t going in the right direction – although China and Russia are running into a demographic wall. Kaufmann asks whether the short-term choice is inter-ethnic migration to increase population or accepting a decline in international power?

Put together, Kaufmann’s case worries me more than tales of government deficits due to demographic change. Even if you assign a low probability to Kaufmann’s projections, it provides another strand to the case that low fertility in the secular West is not without costs.

Three podcasts

Here are three I recently enjoyed:

  1. Econtalk: Brian Nosek on the Reproducibility Project – Contains a lot of interesting context about the reproducibility crisis (of which you can get a flavour from my presentation Bad Behavioural Science: Failures, bias and fairy tales).
  2. Econtalk: Phil Rosenzweig on Leadership, Decisions, and Behavioral Economics – The problems with taking behavioural economics findings out of the lab and applying them to business decision making.
  3. Radiolab: The Rhino Hunter – I listen to Radiolab less than when it had more of a science focus, but this podcast on hunting endangered species to save them is excellent.

Last’s What to Expect When No One’s Expecting: America’s Coming Demographic Disaster

I’ve recently read a couple of books on demographic trends, and there don’t seem to be a lot of silver linings in current fertility patterns in the developed world. The demographic boat takes a long time to turn around, so many short-term outcomes are already baked in.

Despite the less than uplifting subject, Jonathan Last’s What to Expect When No One’s Expecting: America’s Coming Demographic Disaster is entertaining – in some ways it is a data-filled rant.

Last doesn’t see much upside to the low fertility in most of the developed world. Depopulation is generally associated with economic decline. He sees China’s One Child Policy – rather than saving them – as leading them down the path to demographic disaster. Poland needs a 300% increase in fertility just to hold population stable to 2100. The Russians are driving toward demographic suicide. In Germany they are converting prostitutes into elderly care nurses. Parts of Japan are now depopulated marginal land.

And Last sees little hope of a future increase (I have some views on that). He rightly lampoons the United Nations as having no idea. At the time of writing the book, the United Nations optimistically assumed all developed countries would have their fertility rate increase to the replacement level of 2.1 children per woman (although the United Nations has somewhat – but not completely – tempered this optimism via its latest methodology). There was no basis for this assumption, and the United Nations is effectively forecasting blind.

So why the decline? Last is careful to point out that the world is so complicated that it is not clear what happens if you try to change one factor. But he points to several causes.

First, children used to be an insurance policy. If you wanted care in your old age, your children provided it. With government now doing the caring, having children is consumption. Last points to one estimate that Social Security and Medicare in the United States suppress the fertility rate by 0.5 children per woman (following the citation trail, here’s one source for that claim).

Then there is the pill, which Last classifies as a major backfire for Margaret Sanger. She willed it into existence to stop the middle classes shouldering the burden of the poor, but the middle class have used it more.

Next is government policy. As one example, Last goes on a rant about child car seat requirements (which I feel acutely). It is impossible to fit more than 2 car seats in a car, meaning that transporting a family of five requires an upgrade. This is one of many subtle but real barriers to large family size.

Finally (at least of those factors I’ll mention), there is the cost of children today. Last considers that poorer families are poorer because they chose to have more children, or as Last puts it, “Children have gone from being a marker of economic success to a barrier to economic success.” Talk about maladaptation. (In the preface to the version I read, Last asked why feminists were expending so much effort demanding the right to be child-free and not railing against the free market for failing women who want children.)

The fertility decline isn’t just a case of people wanting fewer children, as – on average – people fall short of their ideal number of kids. In the UK, the ideal is 2.5, expected is 2.3, actual 1.9. If people could just realise their target number of children, fertility would be higher.

But this average hides some skew – less educated people end up with more than is ideal, educated people end up with way less. By helping the more educated reach their ideal, the dividend could be large.

So what should government do? Last dedicates a good part of the book to the massive catalogue of failures of government policy to boost birth rates. The Soviet Union’s motherhood medals and lump sum payments didn’t stop the decline. Japan’s monthly per child subsidies, daycare centres and paternal leave (plus another half dozen pro-natalist policies Last lists) had little effect. Singapore initially encouraged the decline, but when they changed their minds and started offering tax breaks and other perks for larger families, fertility kept on declining.

This suggests that you cannot bribe people into having babies. As Last points out, having kids is no fun and people aren’t stupid.

Then there is the impossibility of using migration to fill the gap. To keep the United States support ratio (retirees per worker) where it currently is (assuming you wanted to do this), the US would need to add 45 million immigrants between 2025 and 2035. The US would need 10.8 million a year until 2050 to get the ratio somewhere near what it was in 1960. Immigration is not as good for the demographic profile as baby making and comes with other problems. Plus the sources of immigrants are going through their own transition, so at some point that supply of young immigrants will dry up.

So, if government can’t make people have children they don’t want and can’t simply ship them in, Last asks if they could help people get the children they do want. As children go on to be taxpayers, Last argues government could cut social security taxes for those with more children and make people without children pay for what they’re not supporting. (Although you’d want to make sure there was no net burden of those children across their lives, as they’ll be old people one day too. There are limits to how far you could take that Ponzi scheme.)

Last also suggests eliminating the need for college, one of the major expenses of children. Allowing IQ testing for jobs would be one small step toward this.

Put together, I’m not optimistic much can be done, but Last is right in that there should be some exploration of removing unnecessary barriers (let’s start with those car seat rules).

I’ll close this post where Last closes the book. In a world where the goal is taken to be pleasure, children will never be attractive. So how much of the fertility decline is because modernity has turned us into unserious people?