O’Neil’s Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy

In her interesting Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy, Cathy O’Neil defines Weapons of Math Destruction based on three criteria – opacity, unfairness and scale.

Opacity makes it hard to assess the fairness of mathematical models (I’ll use the term algorithms through most of this post), and it facilitates (or might even be a key component of) an algorithm’s effectiveness if it relies on naive subjects. “These bonds have been rated by maths geniuses – buy them.” Unfairness relates to whether the algorithm operates in the interest of the modelled subject. Scale is not just that algorithms can affect large numbers of people. Scale can also lead to the establishment of norms that do not allow anyone to escape the operation of the algorithm.

These three factors are common across most of the problematic algorithms O’Neil discusses, and she makes a strong and persuasive case that many algorithms could be developed or used better. But the way she combines many of her points, together with her politics, often makes it unclear what exactly the problem is or what the potential solutions could (or should) be.

A distinction that might have made this clearer (or at least one I found useful) is between algorithms that don’t do what the developer intends, algorithms that work as intended but have poor consequences for those on the wrong side of their application, and algorithms that have unintended consequences once released into the wild. The first is botched math, the second is math done well to the detriment of others, and the third is good or bad math naively applied.

For this post I am going to break O’Neil’s case into these three categories.

Math done poorly

When it comes to the botched math, O’Neil is at her best. Her tale of teacher scoring algorithms in Washington DC is a case where the model is not helping anyone. Teachers were scored based on the deviations of student test scores from those predicted by models of the students. The bottom 2% to 5% of teachers were fired. But the combination of modelled target student scores and small class sizes made the scoring of teachers little better than random. There was almost no correlation in a teacher’s scores from one year to the next.
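
To see why, here is a minimal simulation sketch (the class size, the assumed spread of true teacher effects and the noise level are illustrative assumptions, not figures from the book): when a modest teacher effect is buried in student-level noise averaged over a small class, the year-to-year correlation in scores collapses towards zero.

```python
import numpy as np

rng = np.random.default_rng(0)

n_teachers = 1000
class_size = 25      # small classes mean noisy class averages (assumption)
teacher_sd = 0.05    # assumed spread of true teacher effects, in test-score SDs
student_sd = 1.0     # student-level noise around the model's predictions

true_effect = rng.normal(0, teacher_sd, n_teachers)

def yearly_score(true_effect):
    # A teacher's score: class-average deviation of students from predicted results.
    noise = rng.normal(0, student_sd, (n_teachers, class_size)).mean(axis=1)
    return true_effect + noise

year1, year2 = yearly_score(true_effect), yearly_score(true_effect)
print(np.corrcoef(year1, year2)[0, 1])  # close to zero: the ranking is mostly noise
```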

Her critique of the way many models are developed is also important. Are we checking the model is working, rather than just assuming that the teachers we fired are the right ones? She contrasts the effort typically spent testing a recidivism model (for use in determining prison sentences) to the way Amazon learns about its customers. Amazon doesn’t simply develop a single “recidivism score” equivalent and take that as determinative. Instead they continue to test and learn as much as they can about their interactions with customers to make the best models they can.

The solutions to the botched math are simpler (at least in theory) than many of the other problems she highlights. The teacher scoring models simply require someone with the competence to work out what they want to measure and whether it can be measured in a statistically meaningful way. If it can’t, so be it. A willingness to concede that a meaningful model cannot be built is important, particularly when the model is designed to inform high-stakes decisions. Similarly, recidivism scoring algorithms should be subject to constant scrutiny.

But this raises the question of how you assess an algorithm. What is the appropriate benchmark? Perfection? Or the system it is replacing? At times O’Neil places a heavy focus on the errors of the algorithm, with little focus on the errors of the alternative – the humans it replaced. Many of O’Neil’s stories involve false positives, leading to a focus on the obvious algorithm errors, with the algorithm’s greater accuracy and the human errors unseen. A better approach might be to simply compare alternative approaches and see which is better, rather than having the human as the default. Once the superior alternative is selected, we also need to remain cognisant that the best option still might not be very good.

As O’Neil argues, some of the poor models would also be less harmful if they were transparent. People could pull the models apart and see whether they were working as intended. A still cleaner version might be to just release the data and let people navigate it themselves (e.g. education data), although this is not without problems. Whatever is the most salient way of sorting and ranking will become the new de facto model. If we don’t do it ourselves, someone will take that data and give us the ranking we crave.

Math done well (for the user anyhow)

When it comes to math done well, O’Neil’s three limbs of the WMD definition – opacity, unfairness and scale – are a good description of the problems she sees. O’Neil’s critique is usually not so much about the maths, but the unfair use of the models for purposes such as targeting the poor (think predatory advertising by private colleges or payday lenders) or treating workers as cogs in the machine through the use of scheduling software.

In these cases, it is common that the person being modelled does not even know about the model (opacity). And if they could see the model, it may be hard to understand what characteristics are driving the outcome (although this is not so different to the opacity of human decision-making). The outcome then determines how we are treated, the ads we see, the prices we see, and so on.

One of O’Neil’s major concerns about fairness is that the models discriminate. She suggests they discriminate against the poor, African-Americans and those with mental illness. This is generally not through a direct intention to discriminate against these groups, although O’Neil gives the example of a medical school algorithm rejecting applicants based on birthplace due to biased training data. Rather, the models use proxies for the variables of interest, and those proxies also happen to correlate with certain group features.

This points to the tension in the use of many of these algorithms. Their very purpose is to discriminate. They are developed to identify the features that, say, employers or lenders want. Given there is almost always a correlation between those features and some groups, you will inevitably “discriminate” against them.

So what is appropriate discrimination? O’Neil objects to tarring someone with group features. If you live in a certain postcode, is it fair to be categorised with everyone else in that postcode? Possibly not. But if you have an IQ that is judged likely to result in poor job performance or creditworthiness based on the past performance of other people with that IQ, is that acceptable? What of having a degree?

The use of features such as postcodes, IQ or degrees comes from the need to identify proxies for the traits people want to identify, such as whether someone will pay back a loan or deliver good work performance. Each proxy varies in the strength of prediction, so the obvious solution seems to be to get more data and better proxies. Which of these is going to give us the best prediction of what we actually care about?

But O’Neil often balks at this step. She tells the story of a chap who can’t get a minimum wage job due to his results on a five-factor model personality test, despite his “near perfect SAT”. The scale of the use of this test means he runs into this barrier with most employers. When O’Neil points out that personality is only one-third as predictive as cognitive tests, she doesn’t make the argument that employers should be allowed to use cognitive tests. She even suggests that employers are rightfully barred from using IQ tests in recruitment (as per a 1971 Supreme Court case). But absent the cognitive tests, do employers simply turn to the next best thing?

Similarly, when O’Neil complains about the use of “e-scores” (proxies for credit scores) in domains where entities are not legally allowed to use credit scores to discriminate, she complains that they are using a “sloppy substitute”. But again she does not complain about the ban on using the more direct measures.

There are also two sides to the use of these proxies. While the use of the proxies may result in some people being denied a job or a loan, it may allow someone else to get that job or loan, or to pay a better price, when a cruder measure might have seen that person being rejected.

O’Neil gives the example of ZestFinance, a payday lender that typically charges 60% less than the industry standard. ZestFinance does this by finding every proxy for creditworthiness it can, picking out proxies such as correct use of capitalisation on the application form, and whether the applicant read the terms and conditions. O’Neil complains about those who are accepted for a loan but have to pay higher fees because of, say, poor spelling, something the poor and uneducated are more likely to incur. But her focus is on one type of outcome, those with more expensive loans (although probably still cheaper than from other payday lenders), leaving unseen those people receiving the cheapest loans. Should we deny this class of people access to the cheaper finance these algorithms allow?

One interesting case in the book concerns the pricing of car insurance. An insurer wants to know who is the better driver, so they develop algorithms to price the risk appropriately. Credit scores are predictive of driving performance, so those with worse credit scores end up paying more for this.

But insurers also want to price discriminate to the extent that they can. That is, they want to charge each individual the highest price they will tolerate. Price discrimination can be positive for the poor. It allows many airlines to offer cheap seats in the back of the plane while the business crowd insists on paying extra for a few inches of leg room. I benefited from the academic pricing of software for years, and we regularly see discounted pricing for students and seniors. But price discrimination can also allow the uninformed, the lazy and those without options to be stripped of a few extra dollars. The insurer pricing algorithms are designed to price discriminate in addition to pricing the policy based on risk.

It turns out that credit score is not just predictive of driving performance, but also of buyer response to price changes. The resultant insurance pricing is an interaction of these two dimensions. O’Neil gives an example from Florida, where adults with clean driving records but poor credit scores paid $1,552 more (on average) than drivers with excellent credit but a drunk driving conviction, although it is unclear how much of this reflects risk and how much price discrimination.

Naive math

One of O’Neil’s examples of what I will call naive math is the class of algorithms that create a self-reinforcing feedback loop. The model does what it is supposed to do – say, predict an event – but once used in a system, the model’s classification of a certain cohort becomes self-fulfilling or self-reinforcing.

For example, if longer prison sentences make someone more likely to offend on their release, any indicator that results in longer sentences will in effect become more strongly correlated with re-offending. Even if the model is updated to disentangle this problem, allowing the effect of the longer sentences to be isolated, the person who received a longer sentence is doomed the next time they are scored.

In a sense, the model does exactly what it should, predicting who will re-offend, and there is ample evidence that such models do better than humans. But the application of the model does more than simply predict recidivism. It might ultimately affirm itself.

Another example of a feedback loop is a person flagged as a poor credit risk. As they can’t get access to cheap credit, they then go to an expensive payday lender and ultimately run into trouble. That trouble is flagged in the credit scoring system, making it even harder for them to access financial services. If the algorithm made an error in the first instance – the person was actually a good credit risk – that person might then become a poor risk because the model effectively pushed them into more expensive products.

The solutions to these feedback loops are difficult. On the one hand, vigilant investigation and updating the models will help ameliorate the problems. O’Neil persuasively argues that we don’t do this enough. Entities such as ZestFinance that use a richer set of data can also break the cycle for some people.

But it is hard to solve the case for individual mis-classification. Any model will have false positives and false negatives. The model development process can only try to limit them, often with a trade-off between the two.

In assessing this problem we also need to focus on the alternative. Before these algorithms were developed, people would be denied credit, parole and jobs for all sorts of whimsical reasons on the part of the human decision makers. Those decisions would then result in feedback loops as their failures were reflected in future outcomes. The algorithms might be imperfect, but they can be an improvement.

This is where O’Neil’s scale point becomes interesting. In a world of diverse credit scoring mechanisms, a good credit risk who is falsely identified as a poor risk under one measure might be accurately classified under another. The false positive is not universal, allowing them to shop around for the right deal. But if every credit provider uses the same scoring system, someone could be universally barred. The pre-algorithm world, for all its flaws, possibly provided more opportunities for someone to find the place where they are not incorrectly classified.

A final point on naive models (although O’Neil has more) is that models reflect goals and ideology. Sometimes this is uncontroversial – we want to keep dangerous criminals off the street. Sometimes this is more complicated – what risk of false positives are we willing to tolerate in keeping those criminals off the street? In many ways the influence of O’Neil’s politics on her critique provide the case in support of this point.

Solutions

Before reading the book, I listened to O’Neil on an EconTalk episode with Russ Roberts. There she makes the point that where we run into flawed algorithms, we shouldn’t always be going back to the old way of doing things (she made that comment in the context of judges). We should be making the algorithms better.

That angle was generally absent from the book. O’Neil takes the occasional moment to acknowledge that many algorithms are not disrupting perfect decision-making systems, but are replacing biased judges, bank managers who favoured their friends, and unstructured job interviews with no predictive power. But through the book she seems quite willing to rip those gains down in the name of fairness.

More explicitly, O’Neil asks whether we should sacrifice efficiency for fairness. For instance, should we leave some data out? In many cases we already do this, by not including factors such as race. But should this extend to factors such as who someone knows, their job or their credit score?

O’Neil’s choice of factors in this instance is telling. She asks whether someone’s connections, job or credit score should be used in a recidivism model, and suggests no as they would be inadmissible in court. But this is a misunderstanding of the court process. Those factors are inadmissible in determining guilt or innocence, but form a central part of sentencing decisions. Look at the use of referees or stories about someone’s tough upbringing. So is O’Neil’s complaint about the algorithm, or the way we dispense criminal justice in general? This reflects a feeling I had many times in the book that O’Neil’s concerns are much deeper than the effect of algorithms and extend to the nature of the systems themselves.

Possibly the point on which I disagree with O’Neil most is her suggestion that human decision-making has a benefit in that it can evolve and adapt. In contrast, a biased algorithm does not adapt until someone fixes it. The simple question I ask is where is the evidence of human adaptation? You just need to look at all the programs to eliminate workplace bias with no evidence of effectiveness for a taste of how hard it is to deliberately change people. We continue to be prone to seeing spurious correlations, and making inconsistent and unreliable decisions. For many human decisions there is simply no feedback loop as to whether we made the right decision. How will a human lender ever know they rejected a good credit risk?

While automated systems are stuck until someone fixes them, someone can fix them. And that is often what happens. Recently several people forwarded to me an article on the inability of some facial recognition systems to recognise non-Caucasian faces. But beyond the point that humans also have this problem (yes, “they all look alike”), the problem with facial recognition algorithms has been identified and, even though it is a tough problem, there are major efforts to fix it. (Considering some of the major customers of this technology are police and security services, there is an obvious interest in solving it.) In the meantime, those of us raised in a largely homogeneous population are stuck with our cross-racial face blindness.

Is it irrational?

Over at Behavioral Scientist magazine my second article, Rationalizing the ‘Irrational’, is up.

In the article I suggest that an evolutionary biology lens can give us some insight into what drives people’s actions. By understanding someone’s actual objectives, we are better able to determine whether their actions are likely to achieve their goals. Are they behaving “rationally”?

Although the major thread of the article is evolutionary, in some ways that is not the main point. For me the central argument is simply that when we observe someone else’s actions, we need to exercise a degree of humility in assessing whether they are “rational”. We possibly don’t even know what they are trying to achieve, let alone whether their actions are the best way to achieve it.

Obviously, this new article pursues a somewhat different theme to my first in Behavioral Scientist, which explored the balance between human and algorithmic decision making. After discussing possible topics for my first article with the editor who has been looking after me to date (DJ Neri), I sent sketches of two potential articles. We decided to progress both.

My plan for my next article is to return to the themes from the first. I’ve recently been thinking and reading about algorithm aversion, and why we resist using superior decision tools when they are available. Even if the best solution is to simply use the algorithm or statistical output, the reality is that people will typically be involved. How can we develop systems where they don’t mess things up?

Kasparov’s Deep Thinking: Where Machine Intelligence Ends and Human Creativity Begins

In preparation for my recent column in The Behavioral Scientist, which opened with the story of world chess champion Garry Kasparov’s defeat by the computer Deep Blue, I read Kasparov’s recently released Deep Thinking: Where Machine Intelligence Ends and Human Creativity Begins.

Despite the title and Kasparov’s interesting observations on the computer-human relationship, Deep Thinking is more a history of man versus machine in chess than a deep analysis of human or machine intelligence. Kasparov takes us from the earliest chess program, produced by Alan Turing on a piece of paper in 1952, through to a detailed account of Kasparov’s 1997 match against the computer Deep Blue, and then beyond.

Kasparov’s history provides an interesting sense of not just the process toward a machine defeating the world champion, but also when computers overtook the rest of us. In 1977 Kasparov had the machines ahead of all but the top 5% of humans. From the perspective of the average human versus machine, the battle is over decades before the machine is better than the best human. And even then the competition at the top levels is brief. As Kasparov puts it, we have:

Thousands of years of status quo human dominance, a few decades of weak competition, a few years of struggle for supremacy. Then, game over. For the rest of human history, as the timeline draws into infinity, machines will be better than humans at chess. The competition period is a tiny dot on the historical timeline.

As Kasparov also discusses, his defeat did not completely end the competition between humans and computers in chess. He describes a 2005 competition in what was called “freestyle chess”, whereby people were free to mix humans and machines as they saw fit. To his surprise, the winners of this competition were not a grandmaster teamed with a computer, but a pair of American amateurs using three computers at the same time. As Kasparov puts it, a weak human + machine + better process is superior to a strong human + machine + inferior process. There is still hope for the humans.

That hope, however, and the human-computer partnership more broadly, is short-lived. Kasparov notes that the algorithms will continue to improve and the hardware will get faster until the human partner adds nothing to the mix. Kasparov’s position does not seem that different to my own.

One thing clear through Kasparov’s tale is that he does not consider chess to be the best forum for exploring machine intelligence. This was due to both the nature of chess itself, and the way in which those trying to develop a machine to defeat a world champion (particularly IBM) went about the task.

On the nature of chess: it is simply not complex enough. Its constraints – an eight by eight board with sixteen pieces a side – meant it was amenable to algorithms built from a combination of fixed human knowledge and brute force computational power. From the 1970s onward, developers of chess computers realised this was the case, so much of the focus went into increasing computational power and refining algorithms for efficiency until they inevitably reached world champion standard.

The nature of these algorithms is best understood in the context of two search techniques described by Claude Shannon in 1949. Type A search is the process of going through every possible combination of moves deeper and deeper with each pass – one move deep, two moves deep and so on. The Type B search is more human-like, focusing on the few most promising moves and examining those in great depth. The development of Type B processes would provide more insight into machine intelligence.

The software that defeated Kasparov, along with most other chess software, used what Kasparov calls alpha-beta search. Alpha-beta search is a Type A approach that stops searching down any particular path whenever a move being examined has a lower value than the currently selected move. This process and increases in computational power were the keys to chess being vulnerable to the brute force attack. Although enormous amounts of work also went into Deep Blue’s openings and evaluation function, another few years would have seen Kasparov or his successor defeated by something far less highly tuned. His defeat was somewhat inevitable.
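
For readers who have not met the technique, here is a generic minimal sketch of alpha-beta search (illustrative only, not Deep Blue’s implementation; the evaluate, moves and apply_move functions are placeholders you would supply for a real game): branches that cannot improve on a move already available are cut off, which is what keeps the brute-force Type A approach tractable.

```python
import math

def alpha_beta(state, depth, alpha, beta, maximizing, evaluate, moves, apply_move):
    """Minimal alpha-beta search sketch (Shannon Type A with cut-offs).

    evaluate(state)  -> heuristic score of a position
    moves(state)     -> legal moves from a position
    apply_move(s, m) -> position after playing move m in state s
    """
    legal = moves(state)
    if depth == 0 or not legal:
        return evaluate(state)
    if maximizing:
        value = -math.inf
        for move in legal:
            value = max(value, alpha_beta(apply_move(state, move), depth - 1,
                                          alpha, beta, False, evaluate, moves, apply_move))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # prune: this branch cannot beat a move already available
        return value
    value = math.inf
    for move in legal:
        value = min(value, alpha_beta(apply_move(state, move), depth - 1,
                                      alpha, beta, True, evaluate, moves, apply_move))
        beta = min(beta, value)
        if alpha >= beta:
            break  # prune: the maximiser will never allow this line
    return value

# Toy usage on a hand-built two-ply game tree where leaves carry their own values.
tree = {"root": ["a", "b"], "a": [3, 5], "b": [2, 9]}
best = alpha_beta("root", 2, -math.inf, math.inf, True,
                  evaluate=lambda s: s if isinstance(s, int) else 0,
                  moves=lambda s: tree.get(s, []),
                  apply_move=lambda s, m: m)
print(best)  # maximiser picks branch "a": min(3, 5) beats min(2, 9)
```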

IBM’s approach to the contest also did not add much to the exploration of machine intelligence. As became clear to Kasparov in the lead up to the Deep Blue rematch (he had defeated Deep Blue in 1996), IBM was not interested in the science behind the enterprise, but simply wanted to win. It provided great advertising for IBM, but the machine logs of the contest were not made available and Deep Blue was later trashed. It’s an interesting contrast to IBM’s approach with Jeopardy winning Watson, which now seems to be everywhere.

As a result, Kasparov sees the AlphaGo project as a more interesting AI project than anything behind the top chess machines. The complexity of Go – a 19 by 19 board and 361 stones – requires the use of techniques such as neural networks. AlphaGo had to teach itself to play.

Even though Kasparov’s offerings on human and machine intelligence are relatively thin, the chess history in itself makes the book worth reading. Kasparov’s story differs from some of the “myths” that have spread about that contest over the last 20 years, and Kasparov is critical of many commentators’ interpretations of events.

One story Kasparov attacks is Nate Silver’s version in The Signal and the Noise (at which point Kasparov also takes a few swings at Silver’s understanding of chess). Silver’s story starts at the conclusion of game 1 of the match. When Kasparov considered his victory near complete, Deep Blue moved a rook in a highly unusual move – a move that turned out to be a “bug” in Deep Blue’s programming. As he did not understand it was a bug, Kasparov saw the move as a sign that the machine could see mate by Kasparov in 20 or more moves, and was seeking to delay this defeat. Kasparov was so impressed by the depth of Deep Blue’s calculations that it affected his play for the rest of the match and was the ultimate cause of his loss.

As Kasparov tells it, he simply discarded Deep Blue’s move as the type of inexplicable move computers tend to make when lost. Instead, his state of mind suffered most severely when he was defeated in game 2. Through game 2 he played an unnatural (to him) style of anti-computer chess, and overlooked a potential chance to draw the game through perpetual check (he was informed of his missed opportunity the next day). He simply wasn’t looking for opportunities that he thought a computer would have spotted.

Humans vs algorithms

My first column over at the Behavioral Scientist is live.

The column is an attempt to bring together two potentially conflicting stories.

The first is that the best decisions result from humans and machines working together. This is encapsulated in the story of freestyle chess, whereby the best software is trumped by a human-computer team.

The other is the deep literature on whether humans or algorithms make better decisions, starting with Paul Meehl’s classic Clinical Versus Statistical Prediction. The common story in this literature is that there are few domains where humans outperform statistical or algorithmic approaches (even relatively simple ones). There is also an admittedly thinner literature on what happens when humans can have the result of the algorithm and decide whether to use or overrule it, and the story there is that people should generally leave the algorithm alone.

If you take the latter to be the usual case, the world will not be so much like freestyle chess, but more a case of steady replacement of humans decision by decision. The humans will remain relevant not because they can improve the algorithm’s decisions, but because there are inputs we need to provide, there are domains the algorithms cannot go yet, or we just don’t want to hand over control.

You can read the column here.

The “effect is too large” heuristic

Daniel Lakens writes:

I was listening to a recent Radiolab episode on blame and guilt, where the guest Robert Sapolsky mentioned a famous study [by Danziger and friends] on judges handing out harsher sentences before lunch than after lunch. The idea is that their mental resources deplete over time, and they stop thinking carefully about their decision – until having a bite replenishes their resources. The study is well-known, and often (as in the Radiolab episode) used to argue how limited free will is, and how much of our behavior is caused by influences outside of our own control. I had never read the original paper, so I decided to take a look.

During the podcast, it was mentioned that the percentage of favorable decisions drops from 65% to 0% over the number of cases that are decided upon. This sounded unlikely. I looked at Figure 1 from the paper (below), and I couldn’t believe my eyes. Not only is the drop indeed as large as mentioned – it occurs three times in a row over the course of the day, and after a break, it returns to exactly 65%!

I think we should dismiss this finding, simply because it is impossible. When we interpret how impossibly large the effect size is, anyone with even a modest understanding of psychology should be able to conclude that it is impossible that this data pattern is caused by a psychological mechanism. As psychologists, we shouldn’t teach or cite this finding, nor use it in policy decisions as an example of psychological bias in decision making.

I was aware of one explanation for why the effect reported by Danziger and friends was so large. Andreas Glockner explored what would occur if favourable rulings took longer than unfavourable rulings, and the judge (rationally) planned ahead, stopping for a break if they believed the next case would take longer than the time left in the session. Simulating this scenario, Glockner generated an effect of similar magnitude to the original paper.
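
A toy simulation in the spirit of Glockner’s account (my own sketch: the session length and case durations are made-up numbers, not his actual model, with the 65% baseline taken from the paper) shows how this kind of scheduling alone can push the favourable rate for the last case before a break well below the overall rate:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_sessions(n_sessions=5000, session_minutes=180,
                      p_favourable=0.65, t_favourable=20, t_unfavourable=5):
    """Favourable rulings take longer, and a case expected not to fit in the
    time remaining is deferred until after the break."""
    last_case_favourable = []
    for _ in range(n_sessions):
        time_left, outcomes = session_minutes, []
        while True:
            favourable = rng.random() < p_favourable
            duration = t_favourable if favourable else t_unfavourable
            if duration > time_left:   # judge defers this case past the break
                break
            outcomes.append(favourable)
            time_left -= duration
        if outcomes:
            last_case_favourable.append(outcomes[-1])
    return np.mean(last_case_favourable)

print("Overall favourable rate: 0.65 (by construction)")
print("Favourable rate for the last case before a break:", round(simulate_sessions(), 2))
```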

However, I was never convinced the case ordering was random, a core assumption behind Danziger and friends’ finding. In my brief legal career I often attended preliminary court hearings where matters were listed in a long (possibly random) court list. Then the order emerged. Those with legal representation would go first. Senior lawyers would get priority over junior lawyers. Matters for immediate adjournment would be early. And so on. There was no formal procedure for this to occur other than discussion with the court orderly before and during the session.

It turns out that these Israeli judges (or, I should say, a panel of a judge, a criminologist and a social worker) experienced a similar dynamic. Lakens points to a PNAS paper in which Keren Weinshall-Margela (of the Israeli Supreme Courts research division) and John Shapard investigated whether the ordering of cases was actually random. The answer was no:

We examined data provided by the authors and obtained additional data from 12 hearing days (n = 227 decisions). We also interviewed three attorneys, a parole panel judge, and five personnel at Israeli Prison Services and Court Management, learning that case ordering is not random and that several factors contribute to the downward trend in prisoner success between meal breaks. The most important is that the board tries to complete all cases from one prison before it takes a break and to start with another prison after the break. Within each session, unrepresented prisoners usually go last and are less likely to be granted parole than prisoners with attorneys.

Danziger and friends have responded to these claims and attempted to resuscitate their article, but there is something to be said for the “effect is too large” heuristic proposed by Lakens. No amount of back and forth about the finer details of the methodology can avoid that point.

The famous story about the effect of defaults on organ donation provides another example. When I first heard the claim that 99.98% of Austrians, but only 12% of Germans, are organ donors due to the default organ donation option in their driver licence renewal, I simply thought the size of the effect was unrealistic. Do only 2 in 10,000 Austrians tick the box? I would assume more than 2 in 10,000 would tick it by mistake, thinking that would make them organ donors. And when you turn to the original paper or examine the actual organ donation process, you see it has nothing to do with driver’s licences or ticking boxes. The claimed effect size and the story simply did not line up.

Andrew Gelman often makes a similar point. Much research in the social sciences reflects an attempt to find tiny effects in noisy data, and any large effects we find are likely gross overestimates of the true effect (to the extent the effect exists). Gelman and John Carlin call this a Type M error.
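
A quick simulation sketch of the Type M idea (the true effect size, noise level and sample size below are arbitrary illustrative choices): in a low-powered study, the estimates that happen to cross the significance threshold overstate the true effect several times over.

```python
import numpy as np

rng = np.random.default_rng(42)

true_effect = 0.1            # small true effect (illustrative)
sd, n = 1.0, 50              # noisy outcome measure, low-powered study
se = sd / np.sqrt(n)         # standard error of the estimated effect

estimates = rng.normal(true_effect, se, 200_000)    # many replications of the study
significant = np.abs(estimates) > 1.96 * se         # those that reach p < .05

print("Power:", round(significant.mean(), 2))                         # about 0.11
print("Exaggeration ratio:",
      round(np.abs(estimates[significant]).mean() / true_effect, 1))  # several-fold
```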

Finally, I intended to include Glockner’s paper in my critical behavioural economics and behavioural science reading list, but it slipped my mind. I have now included it and these other articles for a much richer story.

A critical behavioural economics and behavioural science reading list

This reading list is a balance to the one-dimensional view in many popular books, TED talks, or conferences. For those who feel they have a good understanding of the literature after reading Thinking, Fast and Slow, Predictably Irrational and Nudge, this is for you.

The purpose of this reading list is not to argue that all behavioural economics or behavioural science is bunk (it’s not). It is also not designed to be balanced – you can combine this list with plenty of good reading lists from elsewhere for that (for example).

Please let me know if there are any other books or articles I should add, or if there are any particularly good replies to what I have listed. This list is a first cut based on articles and books I could recall, so I am sure I have missed some good ones. I have set a mild quality bar on what I have included – I don’t agree with all the arguments, but everything on the list has at least one interesting idea.

Books

Gerd Gigerenzer, Peter Todd and the ABC Research Group’s Simple Heuristics That Make Us Smart: Simple heuristics can be both fast and accurate, particularly when we assess real-life performance rather than conformity with the principles of rationality.

Doug Kenrick and Vlad Griskevicius’s The Rational Animal: How Evolution Made Us Smarter Than We Think: A good introduction to the idea that evolutionary psychology could add a lot of value to behavioural economics, but has the occasional straw man discussion of economics and a heavy reliance on priming research (and you will see below how that is panning out).

David Levine’s Is Behavioural Economics Doomed?: A good but slightly frustrating read. I agree with Levine’s central argument that rationality is underweighted, but the book is littered with straw man arguments.

Phil Rosenzweig’s Left Brain, Right Stuff: How Leaders Make Winning Decisions: An entertaining examination of how behavioural economics findings hold up for real world decision-making.

Gilles Saint-Paul’s The Tyranny of Utility: Behavioral Social Science and the Rise of Paternalism: Sometimes hard to share Saint-Paul’s anger, but some important underlying points.

General and methodological critiques

Nathan Berg and Gerd Gigerenzer’s As-if Behavioral Economics: Neoclassical economics in disguise: “As-if’ arguments are frequently put forward in behavioral economics to justify ‘psychological’ models that add new parameters to fit decision outcome data rather than specifying more realistic or empirically supported psychological processes that genuinely explain these data.” Includes a critique of prospect theory’s lack of realism as a decision-making process (pdf of working paper)

Ken Binmore’s Economic Man – or Straw Man? (pdf): The claim that “economic man” is a failure can both attack a position not held by economics and ignore the experimental evidence of people behaving like “economic man”.

Ken Binmore and Avner Shaked’s Experimental economics: Where next? (pdf): “[W]e urge experimentalists to … join the rest of the scientific community in adopting a more skeptical attitude when far-reaching claims about human behavior are extrapolated from very slender data”. See Avner Shaked’s webpage documenting the subsequent debate.

Gerd Gigerenzer debates Daniel Kahneman and Amos Tversky: Gigerenzer tees off (pdf). Kahneman and Tversky respond (pdf – this pdf also includes a rejoinder to Gigerenzer’s later piece). Gigerenzer returns (pdf). I’m a fan of a lot of Gigerenzer’s work, but his strength has never been the direct attack. Kahneman and Tversky get the better of this exchange.

Joseph Henrich, Steven Heine and Ara Norenzayan’s The weirdest people in the world?: “[W]e need to be less cavalier in addressing questions of human nature on the basis of data drawn from this particularly thin, and rather unusual, slice of humanity.”

Owen Jones’s Why Behavioral Economics Isn’t Better, and How it Could Be: “… Behavioral Economics, and those who rely on it, are falling behind with respect to new developments in other disciplines that also bear directly on the very same mysteries of human decision-making.”

Douglas Kenrick and colleagues’ Deep Rationality: The Evolutionary Economics of Decision Making: Many of our biases are in fact deeply rational. (My post).

David Levine and Jie Zheng’s The Relationship Between Economic Theory and Experiments (pdf): “[T]he impression that economic theory has little or no significance for explaining experimental results is misleading. Economic theory makes strong predictions about many situations and is generally quite accurate in predicting behavior in the laboratory. In situations where the theory is thought to fail, the failure is in the application of theory rather than the theory failing to explain the evidence.”

Steven Levitt and John List’s Homo economicus Evolves (pdf): “Economic models can benefit from incorporating insights from psychology, but behavior in the lab might be a poor guide to real-world behavior.”

Steven Levitt and John List’s What Do Laboratory Experiments Measuring Social Preferences Reveal About the Real World?: “[G]reat caution is required when attempting to generalize lab results out of sample: both to other populations and to other situations.”

Pete Lunn and Tim Harford debate Behavioural economics: is it such a big deal?: “[T]he idea that the very foundations of economics are being undermined is absurd.”

The Open Science Collaboration’s Estimating the reproducibility of psychological science (pdf): “Thirty-six percent of replications had significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result.” Social psychology fares particularly poorly.

A note by Ariel Rubinstein on a couple of behavioural economics papers by Colin Camerer and Matthew Rabin: ” For Behavioral Economics to be a revolutionary program of research rather than a passing episode, it must become more open-minded and much more critical of itself.”

Counterpoints to famous biases, effects and stories

Choice overload: Mark Lepper and Sheena Iyengar’s famous jam study (pdf). A meta-analysis by Benjamin Scheibehenne and friends (pdf) – the mean effect size of changing the number of choices across the studies was virtually zero.

The Cornell Food and Brand Lab’s catalogue of eating biases: Jesse Singal catalogues the events. I think it’s fair to say that we shouldn’t place much weight on results coming out of that lab.

Depletion of willpower: Daniel Engber summarises the state of affairs. The meta-analysis referred to by Engber. And the failed replication that triggered the article.

Disfluency: The original N=40 paper (pdf). The N=7000 replication (pdf). Terry Burnham tells the story. (And interestingly, Adam Alter, author of the first paper, suggests that the law of small numbers should be more widely known).

The Florida effect: The poster child for the replication crisis. Ed Yong catalogues the story nicely.

Grit: The book. Daniel Engber reviews. (I like the way Angela Duckworth deals with criticism. Also listen to this EconTalk episode.)

Growth mindset: The Wikipedia summary. The book. Scott Alexander’s initial exploration and clarification.

The hot hand illusion: The original Thomas Gilovich, Robert Vallone and Amos Tversky paper arguing people are seeing a hot hand in basketball when none exists. Work by Joshua Miller and Adam Sanjurjo shows the original argument was based on a statistical mistake. The hot hand does exist in basketball. (Although I will say that there is plenty of evidence of people seeing patterns where they don’t exist.) ESPN explores.

Hungry judges: Shai Danziger and friends find that favourable rulings by Israeli parole boards plunge in the lead up to meal breaks (from 65% to near 0). Andreas Glockner suggests this might be a statistical artefact. Keren Weinshall-Margela and John Shapard point out that the hearing order is not random (Danziger and friends respond). And Daniel Lakens suggests we should dismiss the finding as simply being impossible. (My post)

Hyperbolic discounting: Ariel Rubinstein’s “Economics and Psychology”? The Case of Hyperbolic Discounting (pdf) – “[T]he same type of evidence, which rejects the standard constant discount utility functions, can just as easily reject hyperbolic discounting as well.”

Illusion of control: Francesca Gino, Zachariah Sharek and Don Moore’s Keeping the illusion of control under control: Ceilings, floors, and imperfect calibration (pdf) – “[B]y focusing on situations marked by low control, prior research has created the illusion that people systematically overestimate their level of control.” (My post)

Money priming: Doug Rohrer, Harold Pashler and Christine Harris’s Do subtle reminders of money change people’s political views? – A replication finding no effect (pdf). Kathleen Vohs fights back (pdf). Miguel Vadillo, Tom Hardwicke and David R. Shanks respond – Analysis of the broader literature on money priming suggests, among other things, massive publication bias.

Organ donation: Does Austria have a 99.98% organ donation rate because of the design of their driver’s licence application? No.

Overconfidence: Don Moore and Paul Healy’s “The Trouble with Overconfidence” (pdf) – What does someone actually mean when they say “people tend to be overconfident”? (My post)

Power pose: Jesse Singal on Dana Carney’s shift from author of the classic power pose paper (pdf) to skeptic. Carney’s posted a document about her shift on her website.

Priming mating motives: Shanks and friends on Romance, risk, and replication: Can consumer choices and risk-taking be primed by mating motives? (pdf): A failed replication, plus “a meta-analysis of this literature reveals strong evidence of either publication bias or p-hacking.” (I have cited some of these studies approvingly in published work – a mistake.)

Scarcity: The book. My review.  Leandro Carvalho, Stephan Meier and Stephanie Wang’s Poverty and Economic Decision-Making: Evidence from Changes in Financial Resources at Payday – “We find that participants surveyed before and after payday performed similarly on a number of cognitive function tasks.”

Applications of behavioural economics (and nudging)

Philip Booth’s Behavioural economics – a critique of its policy conclusions: “We seem to have gone … to a situation where we have regulators who use economics 101 supplemented with behavioural economics to try to bring perfection to markets that simply cannot be perfected and perhaps cannot be improved.”

John Cochrane’s Homo economicus or homo paleas?: “The case for the free market is not that each individual’s choices are perfect. The case for the free market is long and sorry experience that government bureaucracies are pretty awful at making choices for people.” Noah Smith responds.

Reuben Finighan’s Beyond Nudge: The Potential of Behavioural Policy (pdf): “Policymakers often mistakenly see behavioural policy as synonymous with “nudging”. Yet nudges are only one part of the value of the behavioural revolution—and not even the lion’s share”

Ted Gayer’s Energy efficiency, risk and uncertainty, and behavioral public choice: “[T]he main failure of rationality is not with the energy-using consumers and firms, but instead the main failure of rationality is with the regulators themselves.” And two related papers by Gayer and W. Kip Viscusi: Overriding Consumer Preferences With Energy Regulations (pdf) and Behavioral Public Choice: The Behavioral Paradox of Government Policy (pdf)

Tim Harford on Behavioural Economics and Public Policy: “The appeal of a behavioural approach is not that it is more effective but that it is less unpopular.” (Google the article and go through that link if you hit the paywall.)

George Loewenstein and Peter Ubel’s Economics Behaving Badly: “[B]ehavioral economics is being used as a political expedient, allowing policymakers to avoid painful but more effective solutions rooted in traditional economics.”

If you want some background

I know this list is of critiques, but here are the first three books I would recommend if you want a basic background.

Daniel Kahneman’s Thinking, Fast and Slow: Still the best overview of behavioural science. However, it is not standing the test of time particularly well. Here is a fantastic analysis of the priming chapter, and Kahneman’s response to that review in the comments (you can see why Kahneman is a giant in the field).

Richard Thaler’s Misbehaving: A pretty good (although very US-centric) history of behavioural economics.

The Behavioral Foundations of Public Policy: Probably the best book I have read about effective applications (and not the same old stories you always hear).

Behavioral Scientist is live

The folks at ideas42, the Center for Decision Research, and the Behavioral Science and Policy Association have kicked off a new online magazine, The Behavioral Scientist.

I am one of the founding columnists, and it looks like I am part of a pretty good line up. My first column should appear in late July.

You can sign up to the Behavioral Scientist email edition on the homepage, or follow on twitter.

As an aside, I’ve been quiet on the blogging front recently, but contemplating article ideas for The Behavioral Scientist has reminded me how important blogging is for the development of my thinking. So expect to see a higher frequency of posts over the next few months.

Simple Heuristics That Make Us Smart

I have recommended Gerd Gigerenzer, Peter Todd and the ABC Research Group’s  Simple Heuristics That Make Us Smart enough times on this blog that I figured it was time to post a synopsis or review.

After re-reading it for the first time in five or so years, this book will still be high on my recommended reading list. It provides a nice contrast to the increasing use of complex machine learning algorithms for decision making, although it is that same increasing use that makes some parts of the book seem a touch dated.

The crux of the book is that much human (or other animal) decision making is based on fast and frugal heuristics. These heuristics are fast in that they do not rely on heavy computation, and frugal in that they only search for or use some of the available information.

Importantly, fast and frugal heuristics do not simply trade-off speed for accuracy. They can be both fast and accurate as the tradeoff is between generality versus specificity. The simplicity of fast and frugal heuristics allows them to be robust in the face of environmental change and generalise well to new situations, leading to more accurate predictions for new data than a complex, information-guzzling strategy. The heuristics avoid the problem of overfitting as they don’t assume every detail to be of utmost relevance, and tend to ignore the noise in many cues by looking for the cues that swamp all others.

These fast and frugal heuristics often fail the test of logical coherence, a point often made in the heuristics and biases program kicked off by Kahneman and Tversky. But as Gigerenzer and Todd argue in the opening chapter, pursuing rationality of this nature as an ideal is misguided, as many of our forms of reasoning are powerful and accurate despite not being logically coherent. The function of heuristics is not to be coherent. Their function is to make reasonable adaptive inference with limited time and knowledge.

As a result, Gigerenzer and Todd argue that we should replace the coherence criteria with an assessment of real-world functionality. Heuristics are the way the mind takes advantage of the structure of the environment. They are not unreliable aids used by humans despite their inferior performance.

This assessment of the real-world functionality is also not a general assessment. Heuristics will tend to be domain specific solutions, which means that “ecological rationality” is not simply a feature of the heuristic, but a result of the interaction between the heuristic and the environment.

Bounded rationality

If you have read much Gigerenzer you will have seen his desire to make clear what bounded rationality actually is.

Bounded rationality is often equated with decision making under constraints (particularly in economics). Instead of having perfect foresight, information must be obtained through search. Search is conducted until the costs of search balance the benefits of the additional information.

One of the themes of the first chapter is mocking the idea that decision making under constraints brings us closer to a model of human decision making. Gigerenzer and Todd draw on the example of Charles Darwin, who created a list of the pros and cons of marriage to assist his decision. This unconstrained optimisation problem is difficult. How do you balance children and the charms of female chit chat against the conversation of clever men at clubs?

But suppose a constrained Darwin is starting this list from scratch. He already has two reasons for marriage. Should he try to find another? To understand whether he should continue his search he effectively needs to know the costs and benefits of all the possible third options and understand how each would affect his final decision. He effectively needs to know and consider more than the unconstrained man. You could even go the next order of consideration and look at the costs and benefits of all the cost and benefit calculations, and so on. Infinite regress.

So rather than bounded rationality being decision making under constraints, Gigerenzer argues for something closer to Herbert Simon’s conception, where bounded rationality is effectively adaptive decision making. The mind is computationally constrained, and uses approximations to achieve most tasks as optimal solutions often do not exist or are not tractable (think the relatively simple world of chess). The effectiveness of this approximation is then assessed in the environment in which the mind makes the decisions, resulting in what Gigerenzer terms the “ecological rationality” of the decision.

The recognition heuristic

The first fast and frugal heuristic to be examined in detail in the book is the recognition heuristic. Goldstein and Gigerenzer (the authors of that chapter) define the  recognition heuristic as “If one of two objects is recognized and the other is not, then infer that the recognized object has the higher value.”

The recognition heuristic is frugal as it requires a lack of knowledge to work – a failure to recognise one of the alternatives. The lack of computation required to apply it points to its speed. Goldstein and Gigerenzer argue that the recognition heuristic is a good model for how people actually choose, and present evidence that it is often applied despite conflicting or additional information being available.
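
The heuristic is almost trivially simple to write down. Here is a minimal sketch (the set of recognised cities is hypothetical):

```python
import random

def recognition_heuristic(a, b, recognised):
    """If exactly one of two objects is recognised, infer it has the higher value."""
    if (a in recognised) != (b in recognised):
        return a if a in recognised else b
    return random.choice([a, b])  # both or neither recognised: the heuristic is silent

# Hypothetical knowledge of a non-German asked which city is larger.
recognised = {"Berlin", "Munich", "Hamburg"}
print(recognition_heuristic("Munich", "Herne", recognised))  # -> Munich
```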

Recognition is different from the concept of “availability” developed by Tversky and Kahneman. The availability heuristic works by drawing on the most immediate or recent examples when making an evaluation.  Availability refers to the availability of terms or concepts in memory, whereas recognition relies on the differences between things in and out of memory.

As an example application (and success) of the recognition heuristic, American and German students were asked to compare pairs of German or American cities and select the larger. American students comparing pairs of American cities did worse than Germans on those same American cities – the Americans knew too much to apply the recognition heuristic. The Americans did about as well on the less familiar German cities as they did on the American cities.

The success of the recognition heuristic results in what could be described as a “less is more” effect. There are situations where decisions based on missing information can be more accurate than those made with more knowledge. There is information implicit in the failure to recognise something.

A second chapter on the recognition heuristic by Borges and friends involves the authors using the recognition heuristic to guide their stock market purchases. They surveyed US and German experts and laypeople about US and German shares and invested based on those that were recognised.

Overall, the authors’ returns beat the aggregate market indices. A German share portfolio based on the recognition of any of the US and German experts or US and German laypeople outperformed the market indices, as did the US stock portfolio based on recognition by Germans. The only group for which recognition delivered lower returns was the US portfolio based on US expert or layperson recognition.

Borges and friends did note that this was a one-off experiment in a bull market, so there is a question of whether it would generalise to other market conditions (or even if it was more than a stroke of luck). But the next chapter took the question of the robustness of simple heuristics somewhat more seriously.

The competition

One of the more interesting chapters in the book is a contest across a terrain of 20 datasets between a fast and frugal heuristic, “take-the-best”, and a couple of other approaches, including the more computationally intensive multiple linear regression. In each of these 20 contests, the competitors were tasked with selecting, for every pair of options, which had the higher value. This included predicting which of two schools had the higher dropout rate, which of two stretches of highway had the higher accident rate, or which of two people had the higher body fat percentage.

The take-the-best heuristic works as follows: Choose the cue most likely to distinguish correctly between the two. If the two choices differ on that cue, select the one with the highest value, and end the search. If they are the same, move to the cue with the next highest validity and repeat.

For example, suppose you are comparing the size of two German cities and the best predictor (cue) of size is whether they are a capital city. If neither is a capital city, you then move to the next best cue of whether they have a soccer team. If one does and the other doesn’t, select the city with the soccer team as being the larger.
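
Here is a minimal sketch of take-the-best along those lines (the cities, cue values and cue ordering are invented for illustration, not data from the book):

```python
def take_the_best(a, b, cues):
    """Compare two options on cues ordered from most to least valid, deciding on
    the first cue that discriminates and ignoring everything further down the list."""
    for cue in cues:
        va, vb = cue(a), cue(b)
        if va != vb:
            return a if va > vb else b
    return None  # no cue discriminates: guess

# Hypothetical cue values for a which-city-is-larger task.
cities = {
    "Berlin": {"capital": 1, "soccer_team": 1},
    "Bochum": {"capital": 0, "soccer_team": 1},
    "Herne":  {"capital": 0, "soccer_team": 0},
}
cues = [lambda c: cities[c]["capital"],       # most valid cue first
        lambda c: cities[c]["soccer_team"]]

print(take_the_best("Berlin", "Bochum", cues))  # capital cue decides: Berlin
print(take_the_best("Bochum", "Herne", cues))   # falls through to the soccer cue: Bochum
```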

The general story is that in terms of fitting the full dataset, take-the-best performs well but is narrowly beaten by multiple regression (75% to 77% – although multiple regression was only fed cue direction, not quantitative variables). The closeness across the range of datasets suggests that the power of take-the-best is not restricted to one environment.

The story changes further in favour of take-the-best when the assessment shifts to out-of-sample prediction, with multiple regression suffering a more severe penalty. Regression accuracy dropped to 68%, whereas take-the-best dropped only to 71%.

There was a model in the competition – the minimalist – which considered a single randomly chosen cue and checked whether it pointed in one direction or the other. If so, it selected that option; if not, it drew another cue. The performance of the minimalist suggested frugality can be pushed too far, although it still performed only 3 percentage points below regression in out-of-sample prediction.

The results of the challenge suggest that take-the-best tends not to sacrifice accuracy for its frugality. The relative performance of take-the-best is particularly strong when there is a small number of training examples, with regression having less chance of overfitting in larger environments. Regression tended to perform relatively worse when there were fewer examples per cue. One point that favoured take-the-best is that the trial didn’t have many large environments. Only two had more than 100 examples, and many had between 10 and 30.

The restriction of regression to cue direction rather than the quantitative variables also dampened its effectiveness. When able to use quantitative predictors, regression tied take-the-best at 76% out of sample, even though take-the-best doesn’t use these quantitative values. There was effectively no penalty for the frugality.

A later chapter added computationally expensive Bayesian models to the competition. Bayesian networks won the out-of-sample competition by three percentage points over take-the-best. Again, take-the-best was at its best, relative to the alternatives, when there were small numbers of examples. The more frugal naive Bayes also did pretty well – falling somewhere between the two approaches.

The results suggest that each approach has its place. Use fast and frugal approaches when you need to be quick and have few examples, and use Bayesian approaches when you have time, computational power and knowledge. This is where some of the examples start to feel dated, as datasets in many domains grow rapidly at the same time as computational power gets cheaper.

This dated feel is even more apparent in the competition between another heuristic, categorisation by elimination, and neural networks across 3 datasets.

Categorisation by elimination is a classification algorithm that walks through examples and cues, starting from the cue with the highest probability of success. If the example can be categorised, categorise it and move to the next example. If not, move to the next cue, with the possible categories limited to those consistent with the earlier cues. Repeat until the example is classified.
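As a rough sketch of the idea – the cues, values and category mappings below are invented for illustration, so don’t take the mushroom example seriously:

```python
def categorise_by_elimination(example, ordered_cues, compatible, all_categories):
    """Classify `example` by walking cues in order of validity.

    `compatible[cue][value]` is the set of categories consistent with seeing
    `value` on `cue`. Stop as soon as only one category remains.
    """
    candidates = set(all_categories)
    for cue in ordered_cues:
        candidates &= compatible[cue][example[cue]]
        if len(candidates) == 1:
            return candidates.pop()
    # More than one category still possible: pick one (here, arbitrarily).
    return sorted(candidates)[0] if candidates else None

# Hypothetical toy example: classify a mushroom as "edible" or "poisonous".
compatible = {
    "odour":      {"none": {"edible", "poisonous"}, "foul": {"poisonous"}},
    "cap_colour": {"white": {"edible"}, "red": {"edible", "poisonous"}},
}
example = {"odour": "none", "cap_colour": "white"}
print(categorise_by_elimination(example, ["odour", "cap_colour"], compatible,
                                {"edible", "poisonous"}))  # -> "edible"
```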

In measured performance, categorisation by elimination was only a few percentage points behind neural networks, although the datasets contained only 150, 178 and 8124 examples. The performance of the neural networks also capped out at 100% on the largest, mushroom, dataset (not a bad result when choosing what you should eat, given the consequences) and 94% and 96% on the other two. There wasn’t much room for a larger victory.

A couple of the chapters are also just a touch too keen to show the effectiveness of the simple heuristics, and this was one such case. An additional competition was run giving the neural networks only a limited number of cues, in which case their performance plunged. But these cues were chosen based on the number of cues used by categorisation by elimination, rather than being a random selection.

The 37% rule

One interesting chapter is on the “secretary problem” and the resulting 37% rule. The basic idea is that you have a series of candidates you are interviewing for the role of secretary (this conception of the problem spread in the 1950s). You view each candidate one by one and must decide on the spot if you will stop your search there and hire the candidate in front of you. If you move to the next candidate, the past candidate is gone forever.

To maximise your probability of finding the best candidate, you should view the first 37% of the candidates without making any choice, and then accept the next candidate who is better than all you have seen to date. This rule gives (coincidentally) a 37% chance of ending up with the best candidate.

But this rule is not without risks. If the best candidate was in that first 37%, you will end up with the last person you see, effectively a random person from the population. So there is effectively a 37% chance of a random choice. Because of that random choice, the 37% rule leaves you with a 9% chance you will end up with someone in the bottom quartile.

But what if, like most people, you have a degree of risk aversion – particularly if you are applying the rule to serious questions such as mate choice? Suppose there are 100 candidates and you want someone in the top 10%. In that case you only need to look at the first 14% of candidates before choosing the next candidate who is better than all previous candidates. That gives you an 83% chance of a top 10% candidate. If you will settle for the top 25%, you only need look at the first 7% for a 92% chance of getting someone in the top quartile.

In larger populations, you need to look at even fewer. With 1000 people, you need only look at 3% of the candidates to maximise your chance of someone in the top 10%, with a 97% probability of success. For a top 25% mate, you should only check out the first 1 to 2%.
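These numbers are easy to check with a quick Monte Carlo simulation of the rule: look at some fraction of candidates without committing, then take the next candidate better than everything seen so far. The parameters below are just the cases discussed above, and the estimates are rough.

```python
import random

def simulate(n_candidates, look_fraction, target_quantile, trials=5_000):
    """Estimate the chance the chosen candidate is in the top `target_quantile`."""
    look = int(n_candidates * look_fraction)
    threshold_rank = max(1, int(n_candidates * target_quantile))  # rank 1 = best
    successes = 0
    for _ in range(trials):
        ranks = list(range(1, n_candidates + 1))
        random.shuffle(ranks)
        best_seen = min(ranks[:look]) if look else n_candidates + 1
        chosen = ranks[-1]                  # default: stuck with the last candidate
        for rank in ranks[look:]:
            if rank < best_seen:            # better than everyone seen so far
                chosen = rank
                break
        successes += chosen <= threshold_rank
    return successes / trials

print(simulate(100, 0.37, 0.01))    # classic 37% rule, aiming for the single best
print(simulate(100, 0.14, 0.10))    # look at 14%, settle for the top 10%
print(simulate(1000, 0.03, 0.10))   # 1000 candidates, look at 3%, top 10%
```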

The net result is that the 37% rule sets aspirations too high unless you will settle for nothing but the best. It is less robust than other rules.

This exploration points to the potential for a simple search heuristic. “Try a dozen” will generally outperform the 37% rule across most population sizes for getting a good but not perfect mate. “Try a few dozen” is a great rule for someone in New York who wants close to the best.

Then there is the issue that the success of the 37% rule depends on your own value as a candidate. When you finally propose to your chosen mate, what is the probability that this two-sided choice ends with them saying yes? In domains such as mate choice, only one or two people could get away with applying that rule – and that leads to a whole new range of considerations.

Odds and ends

The book is generally interesting throughout. Here are a few odds and ends:

  • One chapter argues that hindsight bias is the product of a fast and frugal approach to recalling decisions. We update our knowledge when new information is received. If we cannot recall the original decision, we approximate it by going through the same process that generated the decision the first time. But because our knowledge has since been updated, we get a new answer.
  • As mentioned, some chapters are a bit out of date. One chapter is on using heuristics to predict intention from motion. I expect neural networks are now in another league in domains such as this compared to when the book was written.
  • Another chapter is on investment in offspring. Heuristics such as “invest in the oldest” do almost as well as the optimal investment rules developed by Becker, despite their relative lack of complexity. The best rule for a particular time will depend on the harshness of the environment.

Domingos’s The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World

My view of Pedro Domingos’s The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World depends on which part of the book I am thinking about.

The opening and the close of the book verge on techno-Panglossianism. The five chapters on the various “tribes” of machine learning, plus the chapter on learning without supervision, are excellent. And I simply don’t have the knowledge to judge the value of Domingos’s reports on his own progress to the master algorithm.

Before getting to the details, it is worth being clear that The Master Algorithm is a book about machine learning. Machine learning involves the development of algorithms that can learn from data. Domingos describes it as computers programming themselves, but I would prefer to describe it as humans engaging in a higher level of programming: give the computer some data and an objective, provide a framework for developing the solution (each of Domingos’s tribes has a different approach to this), and let the computer develop it.

Machine learning’s value is becoming more apparent as increasing numbers of problems involve “big data” and mountains of variables that cannot feasibly be incorporated into explicitly designed programs. Tasks such as predicting the tastes of Amazon’s customers or deciding which updates to show each Facebook user are effectively intractable given the millions of choices available. In response, the Facebooks and Amazons of the world are designing learning algorithms that use the massive amounts of data available to attempt to determine what their customers or users want.

Similarly, explicitly programming a self-driving car for every possible scenario is not feasible. But train it on massive amounts of data and it can learn to drive itself.

The master algorithm of the book’s title is a learning algorithm that can be used across all domains. Today there are five tribes (as categorised by Domingos), each with their own master algorithm. The ultimate master algorithm combines them into a general purpose learning machine.

The first tribe, the symbolists, believe in the power of logic. Their algorithms build sets of rules that can classify the examples in front of them. Induction – or, as Domingos notes, inverse deduction – can be used to generate further rules to fill in the gaps.

To give the flavour of this approach, suppose you are trying to find the conditions under which certain genes are expressed. You run a series of experiments and your algorithm generates an initial set of rules from the results.

If the temperature is high, gene A is expressed.

If the temperature is high, genes B and D are not expressed.

If gene C is expressed, gene D is not.

Gaps in these rules can then be filled in by inverse deduction. From the above, the algorithm might induce “If gene A is expressed and gene B is not, gene C is expressed.” This could then be tested in experiments and possibly form the basis for further inductions. These rules are then applied to new examples to predict whether the gene will be expressed or not.

One tool in the symbolist toolbox is the decision tree. Start at the first rule, and go down the branch pointed to by the answer. Keep going until you reach the end of a branch. Considering massive bodies of rules together is computationally intensive, but the decision tree saves on this by ordering the rules and going through them one-by-one until you get the class. (This also solves the problem of conflicting rules.)
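A toy hand-built decision tree in this spirit might look something like the following – the questions and outcomes are made up (including the “promoter_active” cue), but the structure shows the idea of following branches until you hit a class:

```python
# Hypothetical tree: will gene A be expressed? (Illustrative only.)
tree = {
    "question": "temperature_high",
    "yes": {"class": "gene A expressed"},
    "no": {"question": "promoter_active",
           "yes": {"class": "gene A expressed"},
           "no": {"class": "gene A not expressed"}},
}

def classify(example, node):
    # Start at the first rule and follow the branch given by the answer
    # until reaching a leaf that holds a class.
    while "class" not in node:
        node = node["yes"] if example[node["question"]] else node["no"]
    return node["class"]

print(classify({"temperature_high": False, "promoter_active": True}, tree))
# -> "gene A expressed"
```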

The second tribe are the connectionists. The connectionists take their inspiration from the workings of the brain. Similar to the way that connections between neurons in our brain are shaped by experience, the connectionists build a model of neurons and connect them in a network. The strength of the connections between the neurons is then determined by training on the data.

Of the tribes, the connectionists could be considered to be in the ascendancy at the moment. Increases in computational power and data have laid the foundations for the success of their deep learning algorithms – effectively stacks or chains of connectionist networks – in applications such as image recognition, natural language processing and driving cars.
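At its simplest, the connectionist idea can be sketched as a single artificial neuron whose connection weights are nudged to fit the data – a long way from a deep network, but the same principle. The toy AND problem below is invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)       # toy target: logical AND

w = rng.normal(size=2)                        # "connection strengths"
b = 0.0

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for _ in range(5000):                         # nudge the weights to fit the data
    pred = sigmoid(X @ w + b)
    error = pred - y
    w -= 0.5 * X.T @ error / len(X)
    b -= 0.5 * error.mean()

print(np.round(sigmoid(X @ w + b), 2))        # predictions move towards [0, 0, 0, 1]
```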

The third tribe are the evolutionaries, who use the greatest algorithm on earth as their inspiration. The evolutionaries test learning algorithms by their “fitness”, a scoring function as to how well the algorithm meets its purpose. The fitter algorithms are more likely to live. The successful algorithms are then mutated and recombined (sex) to produce new algorithms that can continue the competition for survival. Eventually an algorithm will find a fitness peak where further mutations or recombination do not increase the algorithm’s success.

A major contrast with the connectionists is the nature of evolutionary progress. Neural networks start with a predetermined structure, whereas genetic algorithms can learn their structure (although a general form would be specified). Backpropagation, the staple process by which neural networks are trained, starts from an initial random point for a single hypothesis and then proceeds deterministically in steps to the solution. A genetic algorithm has a sea of hypotheses competing at any one moment, with the randomness of mutation and sex potentially producing big jumps at any point, but also generating many useless algorithm children.
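A toy genetic algorithm might look something like this. The task – maximise the number of 1s in a bit string – is purely illustrative, but the score-survive-recombine-mutate loop is the essence of the approach:

```python
import random

GENOME_LEN, POP_SIZE, GENERATIONS = 20, 30, 50

def fitness(genome):
    return sum(genome)                     # more 1s = fitter

def crossover(a, b):
    cut = random.randint(1, GENOME_LEN - 1)
    return a[:cut] + b[cut:]               # recombine two parents ("sex")

def mutate(genome, rate=0.02):
    return [1 - g if random.random() < rate else g for g in genome]

population = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
              for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    # Fitter genomes survive and get to reproduce.
    population.sort(key=fitness, reverse=True)
    parents = population[:POP_SIZE // 2]
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP_SIZE - len(parents))]
    population = parents + children

print(max(fitness(g) for g in population))   # approaches GENOME_LEN (20)
```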

The fourth tribe are the Bayesians. The Bayesians start with a set of hypotheses that could be used to explain the data, each of which has a probability of being true (their ‘priors’). Those hypotheses are then tested against the data, with the hypotheses that better explain the data increasing in their probability of being true, and those that can’t explain it decreasing in probability. This updating of the probabilities is done through Bayes’ Rule. The effective result of this approach is that there is always a degree of uncertainty – although often the uncertainty attached to improbable hypotheses is negligible.
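The core updating step is simple enough to sketch in a few lines. The coin hypotheses and likelihoods below are invented for illustration; a real Bayesian learner would work over far richer hypothesis spaces.

```python
# Start with priors over hypotheses and reweight them by how well each
# explains the observed data (Bayes' Rule).
priors = {"fair coin": 0.5, "biased coin (70% heads)": 0.5}
likelihood_heads = {"fair coin": 0.5, "biased coin (70% heads)": 0.7}

def update(current, likelihoods):
    unnormalised = {h: current[h] * likelihoods[h] for h in current}
    total = sum(unnormalised.values())
    return {h: p / total for h, p in unnormalised.items()}

posterior = priors
for _ in range(3):               # observe three heads in a row
    posterior = update(posterior, likelihood_heads)
print(posterior)                 # the biased-coin hypothesis becomes more probable
```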

This Bayesian approach is typically implemented through Bayesian networks, which are arrangements of events that each have specified probabilities and conditional probabilities (the probability that an event will occur conditional on another event or set of events occurring). To prevent explosions in the number of probability combinations required to specify a network, assumptions about the degree of independence between events are typically made. Despite these possibly unrealistic assumptions, Bayesian networks can still be quite powerful.

The fifth and final tribe are the analogisers, who, as the name suggests, reason by analogy. Domingos suggests this is perhaps the loosest tribe, and some members might object to being grouped together, but he suggests their common reliance on similarity justifies their common banner.

The two dominant approaches in this tribe are nearest neighbour and support vector machines. Domingos describes nearest neighbour as a lazy learner, in that there is no learning process. The work occurs when a new test example arrives and it needs to be compared across all existing examples for similarity. Each data point (or group of data points for k-nearest neighbour) is its own classifier, in that the new example is classified into the same class as that nearest neighbour. Nearest neighbour is particularly useful in recommender systems such as those run by the Netflixes and Amazons of the world.
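A minimal one-nearest-neighbour sketch, with made-up points and labels, shows how little “learning” there is – all the work happens at prediction time:

```python
import math

# Stored examples: (features, class). Purely illustrative data.
train = [((1.0, 1.0), "likes sci-fi"), ((1.2, 0.8), "likes sci-fi"),
         ((5.0, 5.2), "likes romance"), ((4.8, 5.0), "likes romance")]

def nearest_neighbour(x):
    # Compare the new example against every stored example and copy the
    # class of the closest one.
    _, label = min(train, key=lambda item: math.dist(item[0], x))
    return label

print(nearest_neighbour((1.1, 0.9)))   # -> "likes sci-fi"
```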

Support vector machines are a demonstration of the effectiveness of gratuitously complex models. They classify examples by developing boundaries between the positive and negative examples, with a specified “margin” of safety between the examples. They do this by mapping the points into a higher-dimensional space and developing boundaries that are straight lines (hyperplanes) in that space. The examples along the margin are the “support vectors”.
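If you want to play with one, scikit-learn’s support vector machine takes only a few lines (assuming scikit-learn is installed; the toy data below is invented). The kernel handles the mapping into a higher-dimensional space implicitly, and the fitted model exposes the support vectors sitting on the margin.

```python
from sklearn.svm import SVC

# Toy, obviously separable data: two clusters with labels 0 and 1.
X = [[0, 0], [1, 1], [1, 0], [4, 4], [5, 5], [4, 5]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="rbf")      # radial basis function kernel
clf.fit(X, y)
print(clf.predict([[0.5, 0.5], [4.5, 4.8]]))   # -> [0 1]
print(clf.support_vectors_)                    # the examples on the margin
```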

Of Domingos’s tribes, I feel a degree of connection to them all. Simple decision trees can be powerful decision tools, despite their simplicity (or possibly because of it). It is hard not to admire the progress of the connectionists in recent years in not just technical improvement but also practical applications in areas such as medical imaging and driverless cars. Everyone seems to be a Bayesian nowadays (or wants to be), including me. And having played around with support vector machines a bit, I’m both impressed and perplexed by their potential.

From a machine learning perspective, it is the evolutionaries I feel possibly the least connection with. Despite my interest and background in evolutionary biology, it’s the one group I haven’t seen practically applied in any of the domains I operate in. I’ve read a few John Holland books and articles (Holland being one of the main protagonists in the evolutionary chapter) and always appreciate the ideas, but have never felt close to the applications.

Outside of the chapters on the five tribes, Domingos’s Panglossianism grates, but it is relatively contained to the opening and closing of the book. In Domingos’s view, the master algorithm will make stock market crashes fewer and smaller, and the play of our personal algorithms with everyone else’s will make our lives happier, longer and more productive. Every job will be better than it is today. Democracy will work better because of higher bandwidth communication between voters and politicians.

But Domingos gives little thought to what occurs when people have different algorithms, different objectives, different data to train their algorithms on and, in effect, different beliefs. Little thought is given to the complex high-speed interaction of these algorithms.

There are a few other interesting threads in the book worth highlighting. One is the idea that you need bias to learn. If you don’t have preconceived notions of the world, you could conceive of a world where everything you haven’t seen is the opposite of what you predict (known as the ‘no free lunch’ theorem).

Another is the idea that once computers get to a certain level of advancement, the work of scientists will largely be trying to understand the outputs of computers rather than generate the outputs themselves.

So, all up, a pretty good read. For a snapshot of the book, the EconTalk episode featuring Domingos is (as usual) excellent.

Coursera’s Executive Data Science Specialisation: A Review

As my day job has shifted toward a statistics and data science focus, I’ve been reviewing a lot of online materials to get a feel for what is available – both for my learning and to see what might be good training for others.

One course I went through was Coursera’s Executive Data Science Specialisation, created by Johns Hopkins University. Billed as the qualification to allow you to run a data science team, it is made up of five “one week” courses covering the basics of data science, building data science teams and managing data analysis processes.

There are some good parts to the courses but, despite the tagline that you will learn what you need to know “to begin assembling and leading a data science enterprise”, the specialisation falls some way short of that benchmark. For managers who have data scientists sitting under them, or who use a data science team in their organisation, it might give them a sense of what is possible and an understanding of how data scientists think. But it is not much more than that.

If I were to recommend any part of the specialisation, it would be the third and fourth courses – Managing Data Analysis and Data Science in Real Life (notes below). They offer a better crash course in data science than the first unit, A Crash Course in Data Science, and might help those unfamiliar with data science processes to understand how to think about statistical problems. That said, someone doing them with zero statistical knowledge will likely find themselves lost.

With Coursera’s subscription option you can subscribe to the specialisation for $50 or so per month and smash through all five units in a few days (as I did – you could do it in one day if you had nothing else on). From that perspective, it’s not bad value, although the only material difference between paying and auditing is the ability to submit the multiple choice quizzes. Otherwise, just pick the videos that look interesting.

Here are a few notes on the five courses:

  1. A Crash Course in Data Science: Not bad, but likely too shallow to give someone much of a feel for data science. The later units provide a better crash course for managers as they focus on methodology and practice rather than techniques.
  2. Building a Data Science Team: Some interesting thoughts on the skills required in a team, but the material on managing teams and communication was generic.
  3. Managing Data Analysis: A good crash course in data science – better than the course with that title. Walks through the data science process.
  4. Data Science in Real Life: Another good crash course in data science, although you will likely need some statistical background to fully benefit. A reality check on how the data science process is likely to go relative to the perfect scenario.
  5. Executive Data Science Capstone: You appreciate the effort that went into producing an interactive “choose your own adventure”, but the whole exercise amounted to around half a dozen decisions in less than an hour.