Tetlock’s Expert Political Judgment: How Good Is It? How Can We Know?

A common summary of Philip Tetlock’s Expert Political Judgment: How Good Is It? How Can We Know? (2006) is that “experts” are terrible forecasters. There is some truth in that summary, but I took a few different lessons from the book. While experts are bad forecasters, others are worse. Simple algorithms and more complex models outperform experts. And importantly, forecasting itself is not a completely pointless task.

Tetlock’s book reports on what must be one of the grander undertakings in social science. Cushioned by his recently gained tenure, Tetlock asked a range of experts to predict future events. Because the forecasts needed time to pan out, the project ran for almost 20 years.

The basic methodology was to ask each participant to rate, on a scale of 0 to 10, how likely each of three possible outcomes for a political or economic event was (with, assuming some basic mathematical literacy, the sum allocated to the three options being 10). An example question might be whether a government will retain, lose or strengthen its position after the next election. Or whether GDP growth will be below 1.75 per cent, between 1.75 per cent and 3.25 per cent, or above 3.25 per cent.

Once the results were in, Tetlock scored the participants on two dimensions – calibration and discrimination. To get a high calibration score, the frequency with which events are predicted needs to correspond with their actual frequency. For instance, events predicted to occur with a 10 per cent probability need to occur around 10 per cent of the time, and so on. Because the experts made many judgments, there were enough observations for these calculations to be made.

To score highly on discrimination, the participant needs to assign a probability of 1.0 to things that happen and 0 to things that don’t. The closer predictions sit to the ends of the scale, the higher the discrimination score. It is possible to be perfectly calibrated anywhere along this spectrum, from the fence-sitter who is a poor discriminator through to the perfect discriminator who only uses the extreme values, and uses them correctly.
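To make the two scores concrete, here is a minimal sketch of how they could be computed for a set of binary forecasts, using the standard Murphy decomposition of the Brier score (Tetlock’s exact scoring rules differ in detail, and the function and bin choices here are my own):

```python
import numpy as np

def calibration_and_discrimination(forecasts, outcomes, n_bins=11):
    """Murphy decomposition of the Brier score.

    forecasts: predicted probabilities in [0, 1]
    outcomes:  0/1 indicators of whether each event occurred
    Returns (calibration, discrimination). Lower calibration (reliability)
    is better; higher discrimination (resolution) is better.
    """
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    base_rate = outcomes.mean()

    # Group forecasts into probability bins (0.0, 0.1, ..., 1.0)
    bins = np.round(forecasts * (n_bins - 1)).astype(int)

    calibration = discrimination = 0.0
    for b in np.unique(bins):
        in_bin = bins == b
        mean_forecast = forecasts[in_bin].mean()
        observed_freq = outcomes[in_bin].mean()
        # Calibration: squared gap between stated probability and reality
        calibration += in_bin.sum() * (mean_forecast - observed_freq) ** 2
        # Discrimination: how far bin outcomes stray from the base rate
        discrimination += in_bin.sum() * (observed_freq - base_rate) ** 2

    return calibration / len(outcomes), discrimination / len(outcomes)
```

A fence-sitter who always says 50 per cent in a 50/50 world scores perfectly on calibration but zero on discrimination; a forecaster who says 0 or 1 and is always right maximises both.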

From Tetlock’s analysis of these scores come the headline findings of the book. I take them as:

  • Experts, who typically have a doctorate and average 12 years’ experience in their field, barely outperform “chimps” – the chimp strategy being to allocate an equal probability of 33 per cent to each potential outcome.
  • However – and this point is one you rarely hear in commentary about the book – the experts outperform unsophisticated forecasters (a role filled by Berkeley undergrads), whose performance is truly woeful. So, when people lament the performance of experts after reading this book, be even more afraid of the forecasts of the general population.
  • The experts were not differentiated on a range of dimensions, such as years of experience or whether they were forecasting in their area of expertise. Subject matter expertise translates less into forecasting accuracy than into confidence.
  • The one dimension on which forecast accuracy was differentiated is what Tetlock calls the fox-hedgehog continuum (borrowing from Isaiah Berlin). Hedgehogs know one big thing and aggressively expand that idea into all domains, whereas foxes know many small things, are skeptical of grand ideas and stitch together diverse, sometimes conflicting information. Foxes are more willing to change their minds in response to the unexpected, more likely to remember past mistakes, and more likely to see the case for opposing outcomes. And foxes outperformed on both calibration and discrimination.
  • Experts are outperformed by simple algorithms that predict the continuation of the recent past into the future, and vastly outperformed by more sophisticated models (generalised autoregressive distributed lag). Political observers would be better off thinking less: if they know the base rates of possible outcomes, they should simply predict the most common (a comparison of these baseline strategies is sketched below).
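Here is that comparison – a minimal sketch with hypothetical base rates, using a multiclass Brier score rather than Tetlock’s exact calibration and discrimination measures – of why even crude base-rate forecasting beats the chimp’s uniform guess:

```python
import numpy as np

def brier(forecast, outcome_index):
    """Multiclass Brier score for one question: lower is better."""
    outcome = np.zeros(len(forecast))
    outcome[outcome_index] = 1.0
    return float(np.sum((np.asarray(forecast) - outcome) ** 2))

rng = np.random.default_rng(0)
base_rates = np.array([0.6, 0.3, 0.1])  # hypothetical outcome frequencies
outcomes = rng.choice(3, size=1000, p=base_rates)

chimp = np.array([1/3, 1/3, 1/3])       # equal probability to each option
for name, forecast in [("chimp", chimp), ("base rates", base_rates)]:
    mean_score = np.mean([brier(forecast, o) for o in outcomes])
    print(f"{name}: {mean_score:.3f}")   # base rates score better (lower)
```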

As Bryan Caplan argues, Tetlock gives experts a harder time than they might deserve. The “chimps” are helped by a combination of hard questions and constrained answer fields. There is no option to predict one million per cent growth in GDP next year. We might expect experts to shine more if there were “dumb” questions. Further, the horrible performance of the Berkeley undergrads, the proxy for unsophisticated forecasters, rarely rates a mention. On the flipside, the baseline for assessment should be not the chimp or the undergrads but the simple extrapolation algorithms – and against those the experts measure poorly.

The expected behaviour of the experts may provide a partial defence. They were filling out a survey, and were unlikely to generate a model for every question. Many judgements were likely off the top of the head, with no serious stakes (including no public shaming). This does, however, raise the question of why they were so hopeless in their own fields of expertise, where they might have some of these models available.

So what is it about foxes and hedgehogs that leads to differences in performance?

As a start, the approach of foxes lines up with the existing literature on forecasting. This literature shows that the average prediction of a group of forecasters is generally more accurate than the majority of the forecasters from whom the average is computed, that trimming outliers further enhances accuracy, and that there is opportunity for further improvement through the Delphi technique. In line with this, Tetlock suggests foxes factor conflicting considerations into their judgements in a flexible, weighted-averaging fashion.
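A minimal sketch of that aggregation logic, assuming probability forecasts over the same outcomes (the trimming fraction and names are mine):

```python
import numpy as np
from scipy.stats import trim_mean

def aggregate_forecasts(predictions, trim=0.2):
    """Average individual probability forecasts, trimming outliers.

    predictions: one row per forecaster, one column per outcome.
    trim: fraction of extreme values dropped from each end per outcome
          before averaging (trim=0 gives the plain mean).
    """
    combined = trim_mean(np.asarray(predictions, dtype=float), trim, axis=0)
    return combined / combined.sum()  # renormalise so probabilities sum to 1

# Five forecasters rating three outcomes; the last is an outlier whose
# influence the trimmed mean discounts.
forecasts = [[0.20, 0.50, 0.30],
             [0.25, 0.45, 0.30],
             [0.30, 0.40, 0.30],
             [0.20, 0.55, 0.25],
             [0.90, 0.05, 0.05]]
print(aggregate_forecasts(forecasts))
```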

Next, foxes are better Bayesians in that they update their beliefs in response to new evidence and in proportion to the extremity of the odds they placed on possible outcomes. They weren’t perfect Bayesians, however – when surprised by a result, Tetlock calculated that foxes moved around 59 per cent of the prescribed amount, compared to 19 per cent for hedgehogs. In some of the exercises, hedgehogs moved in the opposite direction.
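To see what “the prescribed amount” means, here is a minimal worked sketch of a Bayesian update and the partial movement Tetlock measured (all numbers besides the 59 and 19 per cent are hypothetical, and treating the percentages as linear movement from prior to posterior is my simplification):

```python
def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' rule: the prescribed belief in a hypothesis after evidence."""
    numerator = prior * p_evidence_if_true
    return numerator / (numerator + (1 - prior) * p_evidence_if_false)

def partial_update(prior, prescribed, fraction):
    """Move only part of the way from the prior to the prescribed posterior."""
    return prior + fraction * (prescribed - prior)

prior = 0.8                              # a confident forecast
prescribed = posterior(prior, 0.2, 0.9)  # then a surprising result: ~0.47
print(partial_update(prior, prescribed, 0.59))  # fox-like update: ~0.61
print(partial_update(prior, prescribed, 0.19))  # hedgehog-like update: ~0.74
```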

There was a lot of evidence that both foxes and hedgehogs were more egocentric than natural Bayesians. A natural Bayesian would consider the probability of the event occurring if their view of the world is correct (a view which itself has a probability attached) and the probability of the event occurring if their understanding of the world is wrong. But few spontaneously factored other views into their assessment of probabilities. When Tetlock broke down his experts’ predictions, the odds were almost always calculated as if their interpretation of the world were correct.

Foxes were also less prone to hindsight effects. Many experts claimed to have assigned higher probabilities to outcomes that materialised than they actually had. As Tetlock notes, it is hard to say someone got it wrong if they think they got it right. (Is hindsight bias, as suggested by one hedgehog, an adaptive mechanism that unclutters the mind?)

The chapter of the book where the hedgehogs wheel out the defences against their poor performance is somewhat amusing. As Tetlock points out, forecasters who thought they were good at the beginning sounded like radical skeptics about the value of forecasting by the end.

The experts commonly pointed out that their prediction was a near miss, so the result shouldn’t be held against them. But almost no-one argued that the non-occurrence of an event shouldn’t be held against others who had predicted it.

They also tended to claim that “I made the right mistake”, as it is better to be safe than sorry. But all of Tetlock’s attempts to adjust the scoring to help hedgehogs in these cases failed to close the gap.

Some hedgehogs claimed that the questions did not cover a long enough time period. There are irreversible trends at work in the world today, and while specific events might be hard to predict, the shape of the world in the long term is clear. But the problem is that hedgehogs were ideologically diverse, and only a few could be right about any long-term trends that exist.

One thing that might be said in favour of the hedgehogs is that the accuracy of the average hedgehog forecast was similar to that of the average fox forecast. The average fox forecast beats about 70 per cent of foxes, but the average hedgehog forecast beats 95 per cent of hedgehogs. The hedgehogs benefit in that their more extreme mistakes are balanced out. The result is that a team of hedgehogs might curtail each other’s excesses.

A better angle of defence is that the real goal of forecasting is political impact or reputation, where only the confident survive. Hedgehogs are also good at avoiding distraction in high-noise environments, which becomes apparent when examining the major weakness of foxes.

Tetlock put some of his experts through a scenario exercise. In this exercise, the high-level forecasts were branched into a large number of sub-scenarios, each of which had to be allocated a probability. For example, when given the question of whether Canada would break up (this was around the time of the Quebec separatist referendum), combinations of outcomes involving separatist party success at elections, referendum results, economic downturns and levels of acrimony were presented, rather than the simple question of whether Quebec would secede or not.

As has been shown in the behavioural literature, when this type of task is undertaken, the likelihoods of the components often sum to more than one. For the Quebec question, the initial probabilities added up to 1.0 for the basic question – as expected – but to an average of 1.58 for the branched scenarios. Foxes, however, suffered the most in this exercise, producing estimates that summed to 2.09.

To constrain this problem, it is common to end the branching exercise with a requirement to adjust the probabilities so that they add to one. But the foxes tended not to end up where they had started on the simple question, with the branching followed by adjustment reducing their forecasting accuracy to the level of hedgehogs.
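A minimal sketch of the mechanics (the individual sub-scenario values are hypothetical, chosen to match the 2.09 average reported for foxes, and the branch grouping is invented for illustration):

```python
import numpy as np

# Sub-scenario probabilities elicited one at a time. As mutually exclusive
# scenarios they should sum to 1, but instead they sum to 2.09.
sub_scenarios = np.array([0.55, 0.50, 0.40, 0.35, 0.29])
print(sub_scenarios.sum())        # 2.09: incoherent

# The usual fix is to renormalise so the probabilities sum to 1 ...
coherent = sub_scenarios / sub_scenarios.sum()

# ... but if, say, the first three branches are the ones where Quebec
# separates, the implied top-level probability after renormalising can
# land far from the forecaster's original answer to the simple question.
print(coherent[:3].sum())         # ~0.69
```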

Given the net result of the scenario exercise was to confuse foxes and fail to open the minds of hedgehogs, it could be argued to be a low-value exercise. For people advocating scenario development, pre-mortems and red teaming, the possibly deleterious effects on some forecasters need to be considered.

In sum, it’s a grand book. There are some points where deeper analysis would have been handy – such as when he suggests there is disagreement from “psychologists who subscribe to the argument that fast-and-frugal heuristics – simple rules of thumb – perform as well as, or better than, more complex, effort demanding algorithms” without actually examining whether those psychologists are at odds with his findings of the forecasting superiority of foxes. But that’s a small niggle in a fine piece of work.

Bias in the World Bank

Last year’s World Development Report 2015: Mind, Society and Behaviour from the World Bank documents many of what seem to be successful behavioural interventions. Many are quite interesting and build a case that a behavioural approach can add something to development economics.

The report also rightly received some praise for including a chapter that explored the biases of development professionals. World Bank staff were shown to subjectively interpret data differently depending on the frame, to suffer from the sunk cost bias and to have little idea about the opinions of the poor people they might help. Interestingly, in the brief discussion about what can be done to counteract these biases, there is little consideration of whether it might be better simply not to conduct certain projects.

On a more critical front, Andreas Ortmann sent me a copy of his review of the report that was published in the Journal of Economic Psychology. Ortmann has already put a lot of my reaction into words, so here is an excerpt (a longer excerpt is here):

What the Report does not do, unfortunately, is the kind of red teaming that it advocates as “one way to overcome the natural limitations on judgement among development professionals … In red teaming, an outside group has the role of challenging the plans, procedures, capabilities, and assumptions of an operational design, with the goal of taking the perspective of potential partners or adversaries. …” …

Overall, and notwithstanding the occasional claim of systematic reviewing (p. 155 fn 6), the sampling of the evidence seems often haphazard and partisan. Take as another example, in chapter 7, the discussion of reference points and daily income targeting that was started by Camerer, Babcock, Loewenstein, and Thaler (1997) and brought about studies such as Fehr and Goette (2007). These studies suggested that taxi drivers and bike messengers in high-income settings have target earnings or target hours and do not intertemporally maximize allocation of labor and leisure. The problem with the argument is that several follow-up studies (prominently, the studies by Farber (2005, 2008)) questioned the earlier results. Here no mention is made of these critical studies. Instead the authors argue that the failure to maximize intertemporally can also be found in low-income settings. They cite an unpublished working paper investigating bicycle taxi drivers in Kenya and another unpublished working paper citing fishermen in India. Tellingly, the authors (and the scores of commentators who gave them feedback) did not come across a paper, now forthcoming in the Journal of Labor Economics, that has been circulating for a couple of years (see Stafford, in press) and that shows, and shows with an unusually rich data set for Florida lobster fishermen, that both participation decisions and hours spent on sea are consistent with a neoclassical model of labor supply. …

There are dozens of other examples of review of the literature that I find troublingly deficient on the basis of articles I know. … But, as mentioned and as I have illustrated with examples above, there is little red teaming on display here. Not that that is a particularly new development. Behavioural Economics, not just in my view, has since the beginning been oversold and much of that over-selling was done by ignoring the considerable controversies that have swirled around it for decades (Gigerenzer, 1996; Kahneman & Tversky, 1996 anyone? …).

The troubling omission of contrarian evidence and critical voices on display in the Report is deplorable because there are important insights that have come out of these debates and the emerging policy implications would be based on less shifty ground if these insights were taken into account in systematic ways. If you make the case for costly policy interventions that might affect literally billions of people, you ought to make sure that the evidence on which you base your policy implications is robust.

In sum, it seems to me that the resources that went into the Report would have been better spent had there been adversarial collaborations (Mellers, Hertwig, & Kahneman, 2001) and/or had reviews gone through a standard review process which hopefully would have forced some clear-cut and documented review criteria. A long list of people that gave feedback is not a good substitute for institutional quality control.

Kaufmann’s Shall the Religious Inherit the Earth?: Demography and Politics in the Twenty-First Century

While I suggested in my post on Jonathan Last’s What to Expect When No One’s Expecting that reading about demographics in developed countries was not uplifting, the consequences described by Last could be considered pretty minor.

A slight tightening of government budgets could be dealt with by raising pension ages by a few years. Incomes may be lower than otherwise, but as Last states, “A decline in lifestyle for a middle-class American retiree might mean canceling cable, moving to a smaller apartment, and not eating out.” Not exactly disastrous – although of more consequence than the subject of almost every other economic debate.

I found it harder to generate the same blasé reaction to Eric Kaufmann’s Shall the Religious Inherit the Earth?: Demography and Politics in the Twenty-First Century. I don’t have a lot of confidence in most long-term projections of fertility, population, religious retention and social opinions, but even if the world described by Kaufmann has only a 10 per cent chance of occurring, it is worth thinking about.

Kaufmann’s basic argument is that the higher fertility of fundamentalist religious groups, together with their high rates of retention, is going to shift the make-up of the populations of the West over the next century, profoundly affecting our politics and freedoms.

The important word in that sentence is fundamentalist. This is not a case of religious groups generally breeding faster than the irreligious. Fertility levels for many groups are rapidly converging in the West. Muslim family sizes are shrinking. Catholic families are no larger than those of Protestants.

Where the action lies is within each faith. There the fundamentalists have markedly higher fertility than both the moderates and seculars. And, if anything, that gap is widening.

To give a sense of the power of this higher fertility, the Old Order Amish in the United States have increased from 5,000 people in 1900 to almost a quarter of a million members. In the United Kingdom, Orthodox Jews make up 17 per cent of the Jewish population but three-quarters of Jewish births.
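The arithmetic of that Amish expansion is worth making explicit; a quick back-of-the-envelope calculation (treating “today” as roughly 2010 and the population figures as round numbers):

```python
# Implied average annual growth rate: ~5,000 members in 1900 to
# ~250,000 a century or so later.
years = 2010 - 1900
rate = (250_000 / 5_000) ** (1 / years) - 1
print(f"{rate:.1%} per year")   # ~3.6%, a doubling roughly every 20 years

# Compounding at that rate, a small high-fertility, high-retention group
# does not stay small for long.
for horizon in (25, 50, 100):
    print(horizon, "years:", round(250_000 * (1 + rate) ** horizon))
```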

At one point Kaufmann likens the process to the development by insects of resistance to DDT (although he spends little time on the heritability of religiosity). The growth of secularism has produced new resistant strains of religion, with the middle ground between fundamentalism and irreligion hemorrhaging people, revealing a fundamentalist core.

Kaufmann labels these high fertility religious groups as endogenous growth sects. They grow their own rather than convert – mainstream fundamentalists recognise this is where their advantage lies – and they have high rates of retention for their home-grown. As an example, three-quarters of the relative growth in conservative Protestant denominations in the United States in the 20th century was due to fertility differences, not conversion.

So what does this change mean? Kaufmann argues that we may have reached the peak of secular liberalism. The growth of these fundamentalist religious groups is going to start influencing policy and leading to less liberal outcomes.

As a start, fundamentalist Christian, Muslim and Jewish groups have elevated the most illiberal aspects of their traditions to the status of sacred symbols – be it outlandish dress requirements (often of quite recent origin) or positions on women’s roles and family size. This has helped inoculate them against secular trends.

For the United States, those who believe homosexuality or abortion is always wrong have a growing fertility advantage and they are becoming a larger part of the population. Combined with the tendency of children to adopt the positions of their parents, Kaufmann projects a slight increase in those who oppose abortion by mid-century, whereas opposition to homosexuality will decline only marginally. By the end of the century, however, opposition to abortion could increase from 60 to 75 per cent, and increases in opposition to homosexuality will reverse changes in opinion of the last few decades.

Kaufmann projects similar trends for Europe, and he argues that you can’t speak of secular Europe and religious immigrant minorities: in the future, the children of the religious minorities will be Europe. Most large European countries will have a Muslim population of between 10 and 15 per cent in 2050 (from mid-single digits today; Sweden will be more like 20 to 25 per cent). Depending on whether fertility converges, that proportion will keep growing through to 2100. And importantly for Kaufmann’s thesis, this growth will largely relate to the fundamentalist core.

Kaufmann goes on to suggest that the growth of these fundamentalist groups points to a contradiction in liberalism. The combination of tolerance of fundamentalism with a choice not to reproduce may well be the agent that destroys it. To do other than tolerate would be against liberal principles.

Kaufmann also discusses the implications for world politics. One starting point – hard to perceive in the West – is that the world is becoming more religious and is projected to become more so. While rich nations are still trending more secular (for the moment), poorer religious regions are growing faster.

With nation-state boundaries generally well defined, demographic changes within states are the main cause of change in relative size – and superpowers tend to be demographic heavyweights (although to what extent this holds through the 21st century will be interesting to see). Kaufmann quotes Jackson and Howe that it is “[D]ifficult to find any major instance of a state whose regional or global stature has risen while its share of the regional or global population has declined.”

Thus, if you are someone who worries about international geopolitics, the trends aren’t going in the right direction – although China and Russia are running into a demographic wall. Kaufmann asks whether the short-term choice is between inter-ethnic migration to increase population and accepting a decline in international power.

Put together, Kaufmann’s case worries me more than tales of government deficits due to demographic change. Even if you assign a low probability to Kaufmann’s projections, it provides another strand to the case that low fertility in the secular West is not without costs.

Three podcasts

Here are three I recently enjoyed:

  1. Econtalk: Brian Nosek on the Reproducibility Project – Contains a lot of interesting context about the reproducibility crisis (of which you can get a flavour from my presentation Bad Behavioural Science: Failures, bias and fairy tales).
  2. Econtalk: Phil Rosenzweig on Leadership, Decisions, and Behavioral Economics – The problems with taking behavioural economics findings out of the lab and applying them to business decision making.
  3. Radiolab: The Rhino Hunter – I listen to Radiolab less than when it had more of a science focus, but this podcast on hunting endangered species to save them is excellent.

Last’s What to Expect When No One’s Expecting: America’s Coming Demographic Disaster

I’ve recently read a couple of books on demographic trends, and there don’t seem to be a lot of silver linings in current fertility patterns in the developed world. The demographic boat takes a long time to turn around, so many short-term outcomes are already baked in.

Despite the less than uplifting subject, Jonathan Last’s What to Expect When No One’s Expecting: America’s Coming Demographic Disaster is entertaining – in some ways it is a data filled rant.

Last doesn’t see much upside to the low fertility in most of the developed world. Depopulation is generally associated with economic decline. He sees China’s One Child Policy – rather than saving China – as leading it down the path to demographic disaster. Poland needs a 300 per cent increase in fertility just to hold its population stable to 2100. The Russians are driving toward demographic suicide. In Germany they are converting prostitutes into elderly care nurses. Parts of Japan are now depopulated marginal land.

And Last sees little hope of a future increase (I have some views on that). He rightly lampoons the United Nations as having no idea. At the time of writing the book, the United Nations optimistically assumed all developed countries would have their fertility rate increase to the replacement level of 2.1 children per woman (although the United Nations has somewhat – but not completely – tempered this optimism via its latest methodology). There was no basis for this assumption, and the United Nations is effectively forecasting blind.

So why the decline? Last is careful to point out that the world is so complicated that it is not clear what happens if you try to change one factor. But he points to several causes.

First, children used to be an insurance policy. If you wanted care in your old age, your children provided it. With government now doing the caring, having children is consumption. Last points to one estimate that social security and medicare in the United States suppresses the fertility rate by 0.5 children per woman (following the citation trail, here’s one source for that claim).

Then there is the pill, which Last classifies as a major backfire for Margaret Sanger. She willed it into existence to stop the middle classes shouldering the burden of the poor, but the middle class have used it more.

Next is government policy. As one example, Last goes on a rant about child car seat requirements (which I feel acutely). It is impossible to fit more than two car seats in most cars, meaning that transporting a family of five requires an upgrade. This is one of many subtle but real barriers to large family size.

Finally (at least of those factors I’ll mention), there is the cost of children today. Last considers that poorer families are poorer because they chose to have more children, or as Last puts it, “Children have gone from being a marker of economic success to a barrier to economic success.” Talk about maladaptation. (In the preface to the version I read, Last asked why feminists were expending so much effort demanding the right to be child free rather than railing against the free market for failing women who want children.)

The fertility decline isn’t just a case of people wanting fewer children, as – on average – people fall short of their ideal number of kids. In the UK, the ideal is 2.5, expected is 2.3, actual 1.9. If people could just realise their target number of children, fertility would be higher.

But this average hides some skew – less educated people end up with more children than their ideal, while educated people end up with far fewer. Helping the more educated reach their ideal could therefore yield a large dividend.

So what should government do? Last dedicates a good part of the book to the massive catalogue of failures of government policy to boost birth rates. The Soviet Union’s motherhood medals and lump sum payments didn’t stop the decline. Japan’s monthly per child subsidies, daycare centres and paternal leave (plus another half dozen pro-natalist policies Last lists) had little effect. Singapore initially encouraged the decline, but when they changed their minds and started offering tax breaks and other perks for larger families, fertility kept on declining.

This suggests that you cannot bribe people into having babies. As Last points out, having kids is no fun and people aren’t stupid.

Then there is the impossibility of using migration to fill the gap. To keep the United States support ratio (workers per retiree) where it currently is (assuming you wanted to do this), the US would need to add 45 million immigrants between 2025 and 2035. The US would need 10.8 million a year until 2050 to get the ratio somewhere near what it was in 1960. Immigration is not as good for the demographic profile as baby making and comes with other problems. Plus the source countries for immigrants are going through their own demographic transition, so at some point that supply of young immigrants will dry up.

So, if government can’t make people have children they don’t want and can’t simply ship them in, Last asks if they could help people get the children they do want. As children go on to be taxpayers, Last argues government could cut social security taxes for those with more children and make people without children pay for what they’re not supporting. (Although you’d want to make sure there was no net burden of those children across their lives, as they’ll be old people one day too. There are limits to how far you could take that Ponzi scheme.)

Last also suggests eliminating the need for college, one of the major expenses of children. Allowing IQ testing for jobs would be one small step toward this.

Put together, I’m not optimistic much can be done, but Last is right in that there should be some exploration of removing unnecessary barriers (let’s start with those car seat rules).

I’ll close this post where Last closes the book. In a world where the goal is taken to be pleasure, children will never be attractive. So how much of the fertility decline is because modernity has turned us into unserious people?

Baumeister and Tierney’s Willpower: Rediscovering the Greatest Human Strength

After the recent hullabaloo about whether ego depletion is a real phenomenon, I decided to finally read Roy Baumeister and John Tierney’s Willpower cover to cover (I had only flicked through it before).

My hope was that I’d find some interesting additions to my understanding of the debate, but the book tended toward the pop science/self-help genre and there was rarely enough depth to add anything to the current debates (see Scott Alexander on that point). That said, it was an easy read and pointed me to a few studies that seem worth checking out.

One area that I have been interested in is the failure of the mathematics around glucose consumption to add up. Baumeister’s argument is that glucose is the scarce resource in the ego depletion equation. Exercising self control depletes our glucose, making us more likely to succumb to later temptations. Replenishing glucose restores our ego.

As plenty of people have pointed out – Robert Kurzban is the critic I am most familiar with – the maths on glucose simply does not add up. The brain does not burn appreciably more calories when making a quick decision. Even if it did (say, doubling its energy use while making a decision), the short time in which the decision is made means the additional energy expenditure would be minuscule.
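A rough sanity check on that arithmetic, with illustrative figures (a resting brain runs at roughly 20 watts; the decision time and the doubling are the hypotheticals from the paragraph above):

```python
brain_power_watts = 20           # approximate resting power draw of the brain
decision_seconds = 5             # a quick decision
extra_watts = brain_power_watts  # "doubling" adds another ~20 W

extra_joules = extra_watts * decision_seconds
extra_kcal = extra_joules / 4184             # joules per dietary calorie
sugar_equivalent_mg = extra_kcal / 4 * 1000  # sugar is ~4 kcal per gram

print(f"{extra_kcal:.4f} kcal")                  # ~0.024 kcal
print(f"{sugar_equivalent_mg:.0f} mg of sugar")  # ~6 mg: a few grains
```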

Baumeister and Tierney indirectly dealt with the criticism, writing:

Despite all these findings, the growing community of brain researchers still had some reservations about the glucose connection. Some skeptics pointed out that the brain’s overall use of energy remains about the same regardless of what one is doing, which doesn’t square easily with the notion of depleted energy. Among the skeptics was Todd Heatherton….

Heatherton decided on an ambitious test of the theory. He and his colleagues recruited dieters and measured their reactions to pictures of food. Then ego depletion was induced by asking everyone to refrain from laughing while watching a comedy video. After that, the researchers again tested how their brains reacted to pictures of food (as compared with nonfood pictures). Earlier work by Heatherton and Kate Demos had shown that these pictures produce various reactions in key brain sites, such as increased activity in the nucleus accumbens and a corresponding decrease in the amygdala. The crucial change in the experiment involved a manipulation of glucose. Some people drank lemonade sweetened with sugar, which sent glucose flooding through the bloodstream and presumably into the brain.

Dramatically, Heatherton announced his results during his speech accepting leadership of the Society for Personality and Social Psychology … Heatherton reported that the glucose reversed the brain changes wrought by depletion, a finding he said, that thoroughly surprised him. … Heatherton’s results did much more than provide additional confirmation that glucose is a vital part of willpower. They helped resolve the puzzle over how glucose could work without global changes in the brain’s total energy use. Apparently ego depletion shifts activity from one part of the brain to another. Your brain does not stop working when glucose is low. It stops doing some things and starts doing others.

In an hour of searching, I couldn’t find a publication arising from this particular study – happy for any pointers. (Interestingly, Demos is an author of a paper on a failed replication of an ego depletion experiment.) I’m guessing that the initial findings didn’t hold up.

Given the challenges to ego depletion theory, it seems Baumeister is considering tweaking the theory (I found an ungated copy here). If you want a more recent, although not necessarily balanced view on where the theory is at, skip Willpower and start there.

For another perspective on Willpower, see also Steven Pinker’s review.

The Behavioural Economics Guide 2016 (with an intro by Gerd Gigerenzer)

The Behavioural Economics Guide 2016 is out (including a couple of references to yours truly), with the introduction by Gerd Gigerenzer. It’s nice to see some of the debate in the area making an appearance.

Here are a few snippets from Gigerenzer’s piece. First, on heuristics:

To rethink behavioral economics, we need to bury the negative rhetoric about heuristics and the false assumption that complexity is always better. The point I want to make here is not that heuristics are always better than complex methods. Instead, I encourage researchers to help work out the exact conditions under which a heuristic is likely to perform better or worse than some fine-tuned optimization method. First, we need to identify and study in detail the repertoire of heuristics that individuals and institutions rely on, which can be thought of as a box of cognitive tools. This program is called the analysis of the adaptive toolbox and is descriptive in its nature. Second, we need to analyze the environment or conditions under which a given heuristic (or complex model) is likely to succeed and fail. This second program, known as the study of the ecological rationality of heuristics (or complex models), is prescriptive in nature. For instance, relying on one good reason, as the hiatus rule does [If a customer has not made a purchase for nine months or longer, classify him/her as inactive, otherwise as active], is likely to be ecologically rational if the other reasons have comparatively small weights, if the sample size is small, and if customer behavior is unstable.
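As an aside, the hiatus rule Gigerenzer describes is simple enough to state in a few lines of code; a minimal sketch (the nine-month cutoff is from the quote, everything else is my own framing):

```python
from dataclasses import dataclass
from datetime import date, timedelta

HIATUS = timedelta(days=9 * 30)  # roughly nine months

@dataclass
class Customer:
    name: str
    last_purchase: date

def is_active(customer: Customer, today: date) -> bool:
    """Hiatus rule: active iff the customer purchased within nine months."""
    return today - customer.last_purchase <= HIATUS

today = date(2016, 6, 1)
print(is_active(Customer("A", date(2016, 1, 15)), today))  # True
print(is_active(Customer("B", date(2015, 5, 1)), today))   # False
```

One reason, no weighting of other attributes – exactly the behaviour the “bias bias” discussion below defends.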

And the “bias bias”:

The bias bias is the tendency to diagnose biases in others without seriously examining whether a problem actually exists. In decision research, a bias is defined as a systematic deviation from (what is believed to be) rational choice, which typically means that people are expected to add and weigh all information before making a decision. In the absence of an empirical analysis, the managers who rely on the hiatus heuristic would be diagnosed as having committed a number of biases: they pay no attention to customers’ other attributes, let alone to the weight of these attributes and their dependency. Their stubborn refusal to perform extensive calculations might be labeled the “hiatus fallacy” – and provide entry number 176 in the list on Wikipedia. Yet many, including experts, don’t add and weigh most of the time, and their behavior is not inevitably irrational. As the bias–variance dilemma shows, ignoring some information can help to reduce error from variance – the error that arises from fine-tuned estimates that produce mostly noise. Thus, a certain amount of bias can assist in making better decisions.

The bias bias blinds us to the benefits of simplicity and also prevents us from carefully analyzing what the rational behavior in a given situation actually is. I, along with others, have shown that more than a few of the items in the Wikipedia list have been deemed reasoning errors on the basis of a narrow idea of rationality and that they can instead be easily justified as intelligent actions (Gigerenzer et al., 2012).

————–

A recent Spectator article on an interview with Richard Thaler – a contributor to the 2016 Guide – opened with the following:

‘For ten years or so, my name was “that jerk”,’ says Professor Richard Thaler, president of the American Economics Association and principal architect of the behavioural economics movement. ‘But that was a promotion. Before, I was “Who’s he?”’

On hearing that Gigerenzer had written the introduction to the Guide, Thaler tweeted his reaction.

I suppose Thaler is now the establishment and Gigerenzer is “that jerk”.

Re-reading Kahneman’s Thinking, Fast and Slow

A bit over four years ago I wrote a glowing review of Daniel Kahneman’s Thinking, Fast and Slow. I described it as a “magnificent book” and “one of the best books I have read”. I praised the way Kahneman threaded his story around the System 1/System 2 dichotomy, and the coherence provided by prospect theory.

What a difference four years makes. I will still describe Thinking, Fast and Slow as an excellent book – possibly the best behavioural science book available. But during that time a combination of my learning path and additional research in the behavioural sciences has led me to see Thinking, Fast and Slow as a book with many flaws.

First, there is the list of studies that simply haven’t held up through the “replication crisis” of the last few years. The first substantive chapter of Thinking, Fast and Slow is on priming, so many of these studies are right up the front. These include the Florida effect, money priming, the idea that making a test harder to read can increase test results, and ego depletion (I touch on each of these in my recent talk at the Sydney Behavioural Economics and Behavioural Science Meetup).

It’s understandable that Kahneman was somewhat caught out by the replication crisis that has enveloped this literature. But what does not sit so well is the confidence with which Kahneman made his claims. For example, he wrote:

When I describe priming studies to audiences, the reaction is often disbelief . . . The idea you should focus on, however, is that disbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true.

I am surprised at the blind spot I had when first reading it – Kahneman’s overconfidence didn’t register with me.

Kahneman is also a fan of the hot hand studies, as I once was. Someone who believes in the hot hand believes that a sportsperson such as a basketball player is more likely to make a shot if they made their previous one. Kahneman wrote:

The hot hand is entirely in the eye of the beholders, who are consistently too quick to perceive order and causality in randomness. The hot hand is a massive and widespread cognitive illusion. [Could the same be said about much of the priming literature?]

The public reaction to this research is part of the story. The finding was picked up by the press because of its surprising conclusion, and the general response was disbelief. When the celebrated coach of the Boston Celtics, Red Auerbach, heard of Gilovich and his study, he responded, “Who is this guy? So he makes a study. I couldn’t care less.” The tendency to see patterns in randomness is overwhelming – certainly more impressive than a guy making a study.

And now it seems there is a hot hand. The finding that there was no hot hand was the consequence of a statistical error (also covered in my recent talk). The disbelief was appropriate, and Auerbach did himself a favour by ignoring the study.

As I’ve picked on Dan Ariely for the way he talks about organ donation rates, here’s Kahneman on that same point:

A directive about organ donation in case of accidental death is noted on an individual’s driver licence in many countries. The formulation of that directive is another case in which one frame is clearly superior to the other. Few people would argue that the decision of whether or not to donate one’s organs is unimportant, but there is strong evidence that most people make their choice thoughtlessly. The evidence comes from a comparison of organ donation rates in European countries, which reveals startling differences between neighbouring and culturally similar countries. An article published in 2003 noted that the organ donation rate was closer to 100% in Austria but only 12% in Germany, 86% in Sweden but only 4% in Denmark.

These enormous differences are a framing effect, which is caused by the format of the critical question. The high-donation countries have an opt-out form, where individuals who wish not to donate must check an appropriate box. Unless they take this simple action, they are considered willing donors. The low-contribution countries have an opt-in form: you must check a box to become a donor. That is all. The best single predictor of whether or not people will donate their organs is the designation of the default option that will be adopted without having to check a box. …

When the role of formulation is acknowledged, a policy question arises: Which formulation should be adopted? In this case, the answer is straightforward. If you believe that a large supply of donated organs is good for society, you will not be neutral between a formulation that yields almost 100% donations and another formulation that elicits donations from 4% of drivers.

As Ariely does, Kahneman describes the difference between European countries as being due to differences in form design, when in fact those European countries with high “donor rates” never ask their citizens whether they wish to be donors. The form described does not exist in the high-donation countries. They are simply presumed to consent to donation. (The paper that these numbers come from, Do Defaults Save Lives?, might have been better titled “Does not asking if you can take people’s organs save lives?”. That could have saved some confusion.)

Further, Kahneman talks about the gap between 100 per cent and 4 per cent as a gap in donation rates, when these numbers are consent rates – with consent merely presumed in the high-donation countries. Actual donation rates, and the gap between the two types of countries, are much lower.

All the above points are minor in themselves. But together the shaky science, overconfidence and lazy storytelling add up to something substantial.

What I also find less satisfying now is the attempt to construct a framework around the disparate findings in behavioural science. I once saw prospect theory as a great framework for thinking about many of the findings, but it is as unrealistic a decision-making model as that of the perfectly rational man – the maths involved is even more complicated. It might be a useful descriptive or predictive model (if you could work out what the reference point actually is), but no one makes decisions in that way. (One day I will write a post on this.)
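To give a sense of the maths being referred to, here is a minimal sketch of the prospect theory valuation of a simple gamble, using the Tversky and Kahneman (1992) functional forms and parameter estimates (and ignoring the rank-dependent cumulative weighting of the full theory, as well as the reference point problem raised above):

```python
def value(x, alpha=0.88, beta=0.88, lam=2.25):
    """Value function: concave for gains, convex and steeper for losses."""
    return x ** alpha if x >= 0 else -lam * (-x) ** beta

def weight(p, gamma=0.61):
    """Inverse-S probability weighting: small probabilities overweighted."""
    return p ** gamma / (p ** gamma + (1 - p) ** gamma) ** (1 / gamma)

def prospect_value(prospect):
    """Value of a prospect given as (outcome, probability) pairs."""
    return sum(weight(p) * value(x) for x, p in prospect)

# A 50/50 gamble to win or lose $100: loss aversion makes it unattractive.
print(prospect_value([(100, 0.5), (-100, 0.5)]))  # negative: reject
```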

It will be interesting to see how Thinking, Fast and Slow stands up after another five years.

Levine’s Is Behavioural Economics Doomed?

David Levine’s Is Behavioural Economics Doomed? is a good but slightly frustrating read. I agree with Levine’s central argument that rationality is underweighted in many applications of behavioural economics, and he provides many good examples of the power of traditional economic thinking. For someone unfamiliar with game theory, this book is in some ways a good introduction (or, more particularly, an introduction to the concept of Nash equilibrium). And for some of the points, Levine shows a richness in the literature that you don’t often hear about if you only consume pop behavioural economics books.

But the book is also littered with straw man arguments. Levine often attributes views to behavioural economists that I am not sure they generally hold, and he often picks strange examples. And when it comes to explaining away behaviour that doesn’t fit so neatly with the rational actor model, Levine is not always convincing.

As an example, Levine provides an overview of the prisoner’s dilemma, a classic game demonstrating why two people might not cooperate, even though cooperation leads to a better outcome than both players defecting. Levine uses it to argue against those who suffer from the fallacy of composition and who wonder why we can have war, crime and poverty if people are so rational. But who are these people that Levine is arguing against? I presume not the majority of the behavioural economics profession who are more than familiar with the prisoner’s dilemma game.

Levine’s introduction to the prisoner’s dilemma is good when he discusses what happens with different strategies or game designs. But when it comes to the players in experiments who don’t conform to the Nash equilibrium – such as those who don’t defect in every period if there is a defined end to the game – he hand-waves away their play as “rational and altruistic” rather than seriously exploring whether they made systematic errors.

Similarly, when discussing the ultimatum game, Levine simply describes the failure to maximise income as “modest”. He does make the important point that it is rational for first movers to offer more than the minimum if there is a possibility of rejection (and since they don’t have the opportunity to learn, they will get this wrong sometimes). But he seems less concerned about the behaviour of player 2, who rejects a material sum. Yes, it might be a Nash equilibrium, but the behavioural view might shed some light on why we end up at that particular Nash equilibrium.

Levine is similarly dismissive of the situations where people do make errors in markets. “Behavioural economics focuses on the irrationality of a few people or with people faced with extraordinary circumstances. Given time economists expect these same people will rationally adjust their behaviour to account for new understandings of reality and not simply repeat the same mistakes over and over again.” But given how many major decisions are one-shot decisions with major consequences (purchasing cars, retirement decisions and so on), surely they are worth exploring.

One of the more bizarre examples is where Levine addresses the question of why people vote despite having almost no chance of changing the outcome. Levine gives an example of a voting participation game conducted in the lab where he found that participants acted according to the predicted Nash equilibrium, reflecting their costs of voting, the benefits of winning and the probability of their vote swinging the result. But he doesn’t then grapple with the clear problem that this limited experiment doesn’t translate to the real world. Funnily enough, only pages later he cautions “[B]eware also of social scientist [sic] bearing only laboratory results.”

Levine also brings out the now classic question of why economics couldn’t predict the economic crisis. He points out that crises must be inherently unpredictable, as there is an inherent connection between the forecaster and the forecast. If a model that people believed in predicted a 20 per cent collapse in the market next week, the crash would happen today. (Let’s ignore for the moment that there seems to be an economist predicting a crash almost every day.)

In defence of the economists, Levine pulls out a series of (well-cited) papers that he believes already explained the crisis, such as those providing for the possibility of sharp crashes and the effect of fools in the market. Look, the shape of the curve in this random paper is the same! But was that actually what happened? Was that the dominant theory? Levine seems to believe the mere existence of literature in which crises are present indicates that the profession is fine, even if that literature wasn’t a dominant or even widely believed model.

Having spent most of this post complaining about Levine’s angle of attack, I should say there are many good points. His discussion of learning theory is interesting – people don’t know all the information before they undertake an action, and learn along the way. Selfish rationality with imperfect learning does a pretty good job of explaining much behaviour. Some of his throwaway lines also make important points. For example, if a task is unpleasant, it can be rational to leave it to the last moment. Uncertainty can make the procrastination even more rational.

Some of Levine’s critiques of the experimental evidence are also interesting. One I was not aware of was whether the appearance of the endowment effect in some experiments was due to people misunderstanding the Becker-DeGroot-Marschak elicitation procedure. (People state their willingness to pay or accept, and a random draw of the price is made. If the drawn price is lower than their willingness to pay, they pay that price.) Levine points to experiments where, if people are trained to understand the procedure, the endowment effect disappears. As I mentioned in a previous post, Levine also points to some interesting literature on anchoring.
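For reference, a minimal sketch of the buyer’s side of the Becker-DeGroot-Marschak procedure (the price range and names are mine):

```python
import random

def bdm_purchase(stated_wtp, max_price=10.0):
    """One round of BDM for a buyer.

    The subject states a willingness to pay, then a price is drawn at
    random. If the drawn price is at or below the stated WTP, the subject
    buys at the drawn price (not at their stated WTP); otherwise there is
    no trade. Because the stated WTP never sets the price paid, truthful
    reporting is optimal - but only if the subject understands this.
    """
    price = random.uniform(0, max_price)
    return ("buy", price) if price <= stated_wtp else ("no trade", price)

random.seed(1)
print(bdm_purchase(stated_wtp=4.0))
```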

Levine closes with a quote from Loewenstein and Ubel that is worth repeating:

… [behavioral economics] has its limits. As policymakers use it to devise programs, it’s becoming clear that behavioral economics is being asked to solve problems it wasn’t meant to address. Indeed, it seems in some cases that behavioral economics is being used as a political expedient, allowing policymakers to avoid painful but more effective solutions rooted in traditional economics.

Behavioral economics should complement, not substitute for, more substantive economic interventions. If traditional economics suggests that we should have a larger price difference between sugar-free and sugared drinks, behavioral economics could suggest whether consumers would respond better to a subsidy on unsweetened drinks or a tax on sugary drinks.

But that’s the most it can do.

Underneath Levine’s critique you sense this is what is really bugging him. Despite the critiques, traditional economic approaches still have a lot of power. And for some people, that seems to have been forgotten along the way.

Replicating anchoring effects

The classic Ariely, Loewenstein, and Prelec experiment (ungated pdf) ran as follows. Students are asked to think of the last two digits of their social security number – essentially a random number – as a dollar price. They are then asked whether they would be willing to buy certain consumer goods for that price. Finally, they are asked the most they would be willing to pay for each of these goods.

The result was that those with a higher starting price – that is, a higher last two digits on their social security number – were willing to pay more for the consumer goods. That random number “anchored” how much they were willing to pay.
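The replication debate below turns on two statistics: the correlation between the anchor and the stated valuation, and the ratio of valuations in the top anchor quintile to those in the bottom. A minimal sketch with simulated data (the effect size and noise are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Last two SSN digits as the anchor, plus a hypothetical anchoring effect
# on stated willingness to pay for a consumer good.
anchor = rng.integers(0, 100, size=n)
wtp = (15 + 0.1 * anchor + rng.normal(0, 8, size=n)).clip(min=0)

pearson_r = np.corrcoef(anchor, wtp)[0, 1]
low, high = np.quantile(anchor, [0.2, 0.8])
quintile_ratio = wtp[anchor >= high].mean() / wtp[anchor <= low].mean()
print(f"Pearson r = {pearson_r:.2f}, "
      f"top/bottom quintile ratio = {quintile_ratio:.2f}")
```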

Reading David Levine’s Is Behavioural Economics Doomed? (review to come soon), Levine mentions the following attempted replication:

On the Robustness of Anchoring Effects in WTP and WTA Experiments (ungated pdf)

Drew Fudenberg, David K. Levine, and Zacharias Maniadis

We reexamine the effects of the anchoring manipulation of Ariely, Loewenstein, and Prelec (2003) on the evaluation of common market goods and find very weak anchoring effects. We perform the same manipulation on the evaluation of binary lotteries, and find no anchoring effects at all. This suggests limits on the robustness of anchoring effects.

And from the body of the article:

Our first finding is that we are unable to replicate the results of ALP [Ariely, Loewenstein, and Prelec]: we find very weak anchoring effects both with WTP [willingness to pay] and with WTA [willingness to accept]. The Pearson correlation coefficients between the anchor and stated valuation are generally much lower than in ALP, and the magnitudes of the anchoring effects (as measured by the ratio of top to bottom quintile) are smaller. Repeating the ALP procedure for lotteries we do not find any anchoring effects at all.

Unlike ALP, we carried out laboratory rather than classroom experiments. This necessitated some minor changes—discussed below—from ALP’s procedures. It is conceivable that these changes are responsible for the differences in our findings; if so the robustness of their results is limited.

Our results do not confirm the very strong anchoring effects found in ALP. They are more in agreement with the results of Simonson and Drolet (2004) and Alevy, Landry, and List (2011). Simonson and Drolet (2004) used the same SSN-based anchor as ALP, and found no anchoring effects on WTA, and moderate anchoring effects on WTP for four common consumer goods. Alevy, Landry, and List (2011) performed a field experiment, eliciting the WTP for peanuts and collectible sports cards, and they found no anchoring effects. Bergman et al. (2010) also used the design of ALP for six common goods, and found anchoring effects, but of smaller magnitude than in ALP.

Tufano (2010) and Maniadis, Tufano, and List (2011) also failed to confirm the robustness of the magnitude of the anchoring effects of ALP, using hedonic experiences, rather than common goods. Tufano (2010) used the anchoring manipulation to increase the variance in subjects’ WTA for a bad-tasting liquid, but the manipulation had no effect. Notice that this liquid offers a simple (negative) hedonic experience, like the “annoying sounds” used in Experiment 2 of ALP. Maniadis, Tufano, and List (2011) replicated Experiment 2 of ALP and found weaker (and nonsignificant) anchoring effects. Overall our results suggest that anchoring is real—it is hard to reconcile otherwise the fact that in the WTA treatment with goods the ratios between highest and lowest quintile is always bigger than one—but that quantitatively the effect is small. Additionally our data supports the idea that anchoring goes away when bidding on objects with greater familiarity, such as lotteries.