The benefit of doing nothing

From Tim Harford:

[I]n many areas of life we demand action when inaction would serve us better.

The most obvious example is in finance, where too many retail investors trade far too often. One study, by Brad Barber and Terrance Odean, found that the more retail investors traded, the further behind the market they lagged: active traders underperformed by more than 6 percentage points (a third of total returns) while the laziest investors enjoyed the best performance.

This is because dormant investors not only save on trading costs but avoid ill-timed moves. Another study, by Ilia Dichev, noted a distinct tendency for retail investors to pile in when stocks were riding high and to sell out at low points. …

The same can be said of medicine. It is a little unfair on doctors to point out that when they go on strike, the death rate falls. Nevertheless it is true. It is also true that we often encourage doctors to act when they should not. In the US, doctors tend to be financially rewarded for hyperactivity; everywhere, pressure comes from anxious patients. Wiser doctors resist the temptation to intervene when there is little to be gained from doing so — but it would be better if the temptation was not there. …

Harford also reflects on the competition between humans and computers, covering similar territory to that in my Behavioral Scientist article Don’t Touch the Computer (even referencing the same joke).

The argument for passivity has been strengthened by the rise of computers, which are now better than us at making all sorts of decisions. We have been resisting this conclusion for 63 years, since the psychologist Paul Meehl published Clinical vs. Statistical Prediction. Meehl later dubbed it “my disturbing little book”: it was an investigation of whether the informal judgments of experts could outperform straightforward statistical predictions on matters such as whether a felon would violate parole.

The experts almost always lost, and the algorithms are a lot cleverer these days than in 1954. It is unnerving how often we are better off without humans in charge. (Cue the old joke about the ideal co-pilot: a dog whose job is to bite the pilot if he touches the controls.)

The full article is here.

Alter’s Irresistible: Why We Can’t Stop Checking, Scrolling, Clicking and Watching

I have a lot of sympathy for Adam Alter’s case in Irresistible: Why We Can’t Stop Checking, Scrolling, Clicking and Watching. Despite the abundant benefits of being online, the hours I have burnt over the last 20 years through aimless internet wandering and social media engagement could easily have delivered a book or another PhD.

It’s unsurprising that we are surrounded by addictive tech. Game, website and app designers all craft their products to gain and hold our attention. In particular, the tools at the disposal of modern developers are fantastic at introducing what Alter describes as the six ingredients of behavioural addiction:

[C]ompelling goals that are just beyond reach; irresistible and unpredictable positive feedback; a sense of incremental progress and improvement; tasks that become slowly more difficult over time; unresolved tensions that demand resolution; and strong social connections.

Behavioural addictions have a lot in common with substance addictions (some people question whether we should distinguish between them at all). They activate the same brain regions. They are fuelled by some of the same human needs, such as the need for social engagement and support, mental stimulation and a sense of effectiveness. [Parts of the book seem to be a good primer on addiction, although see my endnote.]

Based on one survey of the literature, as many as 41 per cent of the population may have suffered a behavioural addiction in the past month. While having so many people classified as addicts dilutes the concept of “addiction”, it does not seem unrealistic given the way many people use tech.

As might be expected given the challenge, Alter’s solutions on how we can manage addiction in the modern world fall somewhat short of providing a fix. For one, Alter suggests we need to start training the young when they are first exposed to technology. However, the traps present in later life are likely to be quite different from those encountered when young. After all, most of Alter’s examples of addicts were born well before the advent of World of Warcraft, the iPhone or the iPad that derailed them.

Further, the ability of tech to capture our attention is only in its infancy. It is not hard to imagine the eventual creation of immersive virtual worlds so attractive that some people will never want to leave.

Alter’s chapter on gamification is interesting. Gamification is the idea of turning a non-game experience into a game. One of the more inane but common examples of gamification is turning a set of stairs into a piano to encourage people to take those stairs in preference to the neighbouring escalator (see on YouTube). People get more exercise as a result.

The flip side is that gamification is part of the problem itself (unsurprising given the theme of Alter’s book). For example, exercise addicts using wearables can lose sight of why they are exercising. They push on for their gamified goals despite injuries and other costs. One critic introduced by Alter is particularly scathing:

Bogost suggested that gamification “was invented by consultants as a means to capture the wild, coveted beast that is video games and to domesticate it.” Bogost criticized gamification because it undermined the “gamer’s” well-being. At best, it was indifferent to his well-being, pushing an agenda that he had little choice but to pursue. Such is the power of game design: a well-designed game fuels behavioral addiction. …

But Bogost makes an important point when he says that not everything should be a game. Take the case of a young child who prefers not to eat. One option is to turn eating into a game—to fly the food into his mouth like an airplane. That makes sense right now, maybe, but in the long run the child sees eating as a game. It takes on the properties of games: it must be fun and engaging and interesting, or else it isn’t worth doing. Instead of developing the motivation to eat because food is sustaining and nourishing, he learns that eating is a game.

Taking this critique further, Alter notes that “[c]ute gamified interventions like the piano stairs are charming, but they’re unlikely to change how people approach exercise tomorrow, next week, or next year.” [Also read this story about Bogost and his game Cow Clicker.]

There are plenty of other interesting snippets in the book. Here’s one on uncertainty of reward:

Each one [pigeon] waddled up to a small button and pecked persistently, hoping that it would release a tray of Purina pigeon pellets. … During some trials, Zeiler would program the button so it delivered food every time the pigeons pecked; during others, he programmed the button so it delivered food only some of the time. Sometimes the pigeons would peck in vain, the button would turn red, and they’d receive nothing but frustration.

When I first learned about Zeiler’s work, I expected the consistent schedule to work best. If the button doesn’t predict the arrival of food perfectly, the pigeon’s motivation to peck should decline, just as a factory worker’s motivation would decline if you only paid him for some of the gadgets he assembled. But that’s not what happened at all. Like tiny feathered gamblers, the pigeons pecked at the button more feverishly when it released food 50–70 percent of the time. (When Zeiler set the button to produce food only once in every ten pecks, the disheartened pigeons stopped responding altogether.) The results weren’t even close: they pecked almost twice as often when the reward wasn’t guaranteed. Their brains, it turned out, were releasing far more dopamine when the reward was unexpected than when it was predictable.

I have often wondered to what extent surfing is attractive due to the uncertain arrival of waves during a session, or the inconsistency in swell from day to day.

———

Now for a closing gripe. Alter tells the following story:

When young adults begin driving, they’re asked to decide whether to become organ donors. Psychologists Eric Johnson and Dan Goldstein noticed that organ donation rates in Europe varied dramatically from country to country. Even countries with overlapping cultures differed. In Denmark the donation rate was 4 percent; in Sweden it was 86 percent. In Germany the rate was 12 percent; in Austria it was nearly 100 percent. In the Netherlands, 28 percent were donors, while in Belgium the rate was 98 percent. Not even a huge educational campaign in the Netherlands managed to raise the donation rate. So if culture and education weren’t responsible, why were some countries more willing to donate than others?

The answer had everything to do with a simple tweak in wording. Some countries asked drivers to opt in by checking a box:

If you are willing to donate your organs, please check this box: □

Checking a box doesn’t seem like a major hurdle, but even small hurdles loom large when people are trying to decide how their organs should be used when they die. That’s not the sort of question we know how to answer without help, so many of us take the path of least resistance by not checking the box, and moving on with our lives. That’s exactly how countries like Denmark, Germany, and the Netherlands asked the question—and they all had very low donation rates.

Countries like Sweden, Austria, and Belgium have for many years asked young drivers to opt out of donating their organs by checking a box:

If you are NOT willing to donate your organs, please check this box: □

The only difference here is that people are donors by default. They have to actively check a box to remove themselves from the donor list. It’s still a big decision, and people still routinely prefer not to check the box. But this explains why some countries enjoy donation rates of 99 percent, while others lag far behind with donation rates of just 4 percent.

This story is rubbish, as I have posted about here, here, here and here. This difference has nothing to do with ticking boxes on driver’s licence forms. In Austria they are never even asked. 99 per cent of Austrians aren’t organ donors in the way anyone would normally define the term. Rather, 99 per cent are presumed to consent, and if they happen to die their organs might still not be taken because the family objects (or whatever other obstacle gets in the way) in the absence of any understanding of the actual intentions of the deceased.

To top it off, Alter embellishes the incorrect version of the story as told by Daniel Kahneman or Dan Ariely with phrasing from driver’s licence forms that simply don’t exist. Did he even read the Johnson and Goldstein paper (ungated copy)?

After reading a well-written and entertaining book about a subject I don’t know much about, I’m left questioning whether this is a single slip or Alter’s general approach to his writing and research. How many other factoids from the book simply won’t hold up once I go to the original source?

Rats in a casino

From Adam Alter’s Irresistible: Why We Can’t Stop Checking, Scrolling, Clicking and Watching:

Juice refers to the layer of surface feedback that sits above the game’s rules. It isn’t essential to the game, but it’s essential to the game’s success. Without juice, the same game loses its charm. Think of candies replaced by gray bricks and none of the reinforcing sights and sounds that make the game fun. …

Juice is effective in part because it triggers very primitive parts of the brain. To show this, Michael Barrus and Catharine Winstanley, psychologists at the University of British Columbia, created a “rat casino.” The rats in the experiment gambled for delicious sugar pellets by pushing their noses through one of four small holes. Some of the holes were low-risk options with small rewards. One, for example, produced one sugar pellet 90 percent of the time, but punished the rat 10 percent of the time by forcing him to wait five seconds before the casino would respond to his next nose poke. (Rats are impatient, so even small waits register as punishments.) Other holes were high-risk options with larger rewards. The riskiest hole produced four pellets, but only 40 percent of the time—on 60 percent of trials, the rat was forced to wait in time-out for forty seconds, a relative eternity.

Most of the time, rats tend to be risk-averse, preferring the low-risk options with small payouts. But that approach changed completely for rats who played in a casino with rewarding tones and flashing lights. Those rats were far more risk-seeking, spurred on by the double-promise of sugar pellets and reinforcing signals. Like human gamblers, they were sucked in by juice. “I was surprised, not that it worked, but how well it worked,” Barrus said. “We expected that adding these stimulating cues would have an effect. But we didn’t realize that it would shift decision making so much.”

I’ll post some other thoughts on the book later this week.

Ip’s Foolproof: Why Safety Can Be Dangerous and How Danger Makes Us Safe

Greg Ip’s framework in Foolproof: Why Safety Can Be Dangerous and How Danger Makes Us Safe is the contrast between what he calls the ecologists and engineers. Engineers seek to use the sum of our human knowledge to make us safer and the world more stable. Ecologists recognise that the world is complex and that people adapt, meaning that many of our solutions will have unintended consequences that can be worse than the problems we are trying to solve.

Much of Ip’s book is a catalogue of the failures of engineering. Build more and larger levees, and people will move into those flood-protected areas. When the levees eventually fail, the damage is larger than it would otherwise have been. There is a self-reinforcing link between flood protection and development, ensuring the disasters grow in scale.

Similarly, if you put out every forest fire as soon as it pops up, eventually a large fire will get out of control and take advantage of the build-up of fuel that occurred due to the suppression of the earlier fires.

Despite these engineering failures, there is often pressure for regulators or those with responsibility to keep us safe to act as engineers. In Yellowstone National Park, the “ecologists” had taken the perspective that fires did not have to be suppressed immediately, as in combination with prescribed burning they could reduce the build-up of fuel. But the economic interests around Yellowstone, largely associated with tourism, fought this use of fire. After all, prescribed burning and letting fires burn for a while is neither costless nor risk free. But the build-up of fuel that comes from failing to bear those short-term costs and risks, as much of the pressure pushed the park to do, creates the long-term risk of a massive fire.

Despite the problems with engineers, Ip suggests we need to take the best of both the engineering and ecologist approaches in addressing safety. Engineers have made car crashes more survivable. Improved flood protection allows us to develop areas that were previously out of reach. What we need to do, however, is not expect too much of the engineers. You cannot eliminate risks and accidents. Some steps to do so will simply shift, change or exacerbate the risk.

One element of Ip’s case for retaining parts of the engineering approach is confidence. People need a degree of confidence or they won’t take any risks. There are many risks we want people to take, such as starting a business or trusting their money with a bank. The evaporation of confidence can be the problem itself, so if you prevent the loss of confidence, you don’t actually need to deploy the safety device. Deposit insurance is the classic example.

Ip ultimately breaks down the balance of engineering and ecology to a desire to maximise the units of innovation per unit of instability. An acceptance of instability is required for people to innovate. This could be through granting people the freedom to take risks, or by creating an impression of safety (and a degree of moral hazard – the taking of risks when the costs are not borne by the risk taker) to retain confidence.

Despite being an attempt to balance the two approaches, the innovation versus instability formula sounds much like what an engineer might suggest. I agree with Ip that the simple ecologist solution of removing the impression of safety to expunge moral hazard is not without costs. But it is not clear to me that you would ever get this balance right through design. Part of the appeal of the ecologist approach is the acceptance of the complexity of these systems and an acknowledgement of the limits of our knowledge about them.

Another way that Ip frames his balanced landing point is that we should accept small risks and their benefits, and save the engineering for the big problems. Ip hints at, but does not directly get to, Taleb’s concept of anti-fragility in this idea. Anti-fragility would see us develop a system where those small shocks strengthen it, rather than simply being a cost we incur to avoid moral hazard.

The price of risk

Some of Ip’s argument is captured by what is known as the Peltzman effect, named after University of Chicago economist Sam Peltzman. Peltzman published a paper in 1975 examining the effect of safety improvements in cars over the previous 10 years. Peltzman found a reduction in deaths per mile travelled for vehicle occupants, but also an increase in pedestrian injuries and property damage.

Peltzman’s point was that risky driving has a price. If safety improvements reduce that price, people will take more risk. The costs of that additional risk can offset the safety gains.

While this is in some ways an application of basic economics – make something cheaper and people will consume more – the empirical evidence on the Peltzman effect is interesting.

On one level, it is obvious that the Peltzman effect does not make all safety improvements a waste of effort. The large declines in driver deaths relative to the distance travelled over the last 50 years, without fully offsetting pedestrian deaths or other damage, establishes this case.

But when you look at individual safety improvements, there are some interesting outcomes. In the case of seat belts, empirical evidence suggests the absence of the Peltzman effect. For example, one study looked at the effects across states as each introduced seatbelt laws and found a decrease in deaths but no increase in pedestrian fatalities.

In contrast, anti-lock brakes were predicted to materially reduce crashes, but the evidence suggests effectively no net change. Drivers with anti-lock brakes drive faster and brake harder. While reducing some risks – fewer front-end collisions – they increase others – such as the rear-end collisions induced by their harder braking.

So why the difference between seatbelts and anti-lock brakes? Ip argues that the difference depends on what the safety improvement allows us to do and how it feeds back into our behaviour. Anti-lock brakes give a driver a feeling of control and a belief they can drive faster. This belief is correct, but occasionally it backfires and they have an accident they would not have had otherwise. With seatbelts, most people want to avoid a crash and a car crash remains unpleasant even when wearing a seatbelt. Much of the time the seatbelt is not even in people’s minds.

Irrational risk taking?

One of the interesting threads through the book (albeit one that I wish Ip had explored in more detail) is the mix of rational and irrational decision making in our approach to risk.

Much of this “irrationality” concerns our myopia. We rebuild on sites where hurricanes and storms have swept away or destroyed the previous structures. The lack of personal experience with the disaster leads people to underweight the probability. We also have short memories, with houses built immediately after a hurricane being more likely to survive the next hurricane than those built a few years later.

A contrasting effect is our fear response to vivid events, which leads us to overweight them in our decision making even when the alternatives carry larger costs.

But despite the ease in spotting these anomalies, for many of Ip’s real-world examples of individual actions that might be myopic or irrational it wouldn’t be hard to craft an argument that the individual might be making a good decision. If the previous building on the site was destroyed by a hurricane, can you still get flood insurance (possibly subsidised), making it a good investment all the same? As Ip points out, there are also many benefits to living in disaster-prone areas, which are often sites of great economic opportunity (such as proximity to water).

In a similar vein, Ip points to the individual irrationality of “overconfident” entrepreneurs, whose businesses will more often than not end up failing. But as catalogued by Phil Rosenzweig, the idea that these “failed” businesses generally involve large losses is wrong. Overconfident is a poor word to describe these entrepreneurs’ actions (see also here on overconfidence).

I have a few other quibbles with the book. One was that Ip’s discussion of our response to uncertainty conflated risk aversion with loss aversion, the certainty effect and the endowment effect. But as I say, they are just quibbles. Ip’s book is well worth the read.

Does presuming you can take a person’s organs save lives?

I’ve pointed out several times on this blog the confused story about organ donation arising from Johnson and Goldstein’s Do Defaults Save Lives? (ungated pdf). Even greats such as Daniel Kahneman are not immune from misinterpreting what is going on.

Again, here’s Dan Ariely explaining the paper:

One of my favorite graphs in all of social science is the following plot from an inspiring paper by Eric Johnson and Daniel Goldstein. This graph shows the percentage of people, across different European countries, who are willing to donate their organs after they pass away. …

But you will notice that pairs of similar countries have very different levels of organ donations. For example, take the following pairs of countries: Denmark and Sweden; the Netherlands and Belgium; Austria and Germany (and depending on your individual perspective France and the UK). These are countries that we usually think of as rather similar in terms of culture, religion, etc., yet their levels of organ donations are very different.

So, what could explain these differences? It turns out that it is the design of the form at the DMV. In countries where the form is set as “opt-in” (check this box if you want to participate in the organ donation program) people do not check the box and as a consequence they do not become a part of the program. In countries where the form is set as “opt-out” (check this box if you don’t want to participate in the organ donation program) people also do not check the box and are automatically enrolled in the program. In both cases large proportions of people simply adopt the default option.

[Figure: Johnson and Goldstein (2003), organ donation rates in Europe]

I keep hearing this story in new places, so it’s clearly got some life to it (and I’ll keep harping on about it). The problem is that there is no DMV form. These aren’t people “willing” to donate their organs. And a turn to the second page of Johnson and Goldstein’s paper makes it clear that the translation from “presumed consent” to donation appears mildly positive but is far from direct. 99.98% of Austrians (or deceased Austrians with organs suitable for donation) are not organ donors.

Although Johnson and Goldstein should not be blamed for the incorrect stories arising from their paper, I suspect their choice of title – particularly the word “default” – has played some part in allowing the incorrect stories to linger. What of an alternative title “Does presuming you can take a person’s organs save lives?”

One person who is clear on the story is Richard Thaler. In his surprisingly good book Misbehaving (I went in with low expectations after reading some reviews), Thaler gives his angle on this story:

In other cases, the research caused us to change our views on some subject. A good example of this is organ donations. When we made our list of topics, this was one of the first on the list because we knew of a paper that Eric Johnson had written with Daniel Goldstein on the powerful effect of default options in this domain. Most countries adopt some version of an opt-in policy, whereby donors have to take some positive step such as filling in a form in order to have their name added to the donor registry list. However, some countries in Europe, such as Spain, have adopted an opt-out strategy that is called “presumed consent.” You are presumed to give your permission to have your organs harvested unless you explicitly take the option to opt out and put your name on a list of “non-donors.”

The findings of Johnson and Goldstein’s paper showed how powerful default options can be. In countries where the default is to be a donor, almost no one opts out, but in countries with an opt-in policy, often less than half of the population opts in! Here, we thought, was a simple policy prescription: switch to presumed consent. But then we dug deeper. It turns out that most countries with presumed consent do not implement the policy strictly. Instead, medical staff members continue to ask family members whether they have any objection to having the deceased relative’s organs donated. This question often comes at a time of severe emotional stress, since many organ donors die suddenly in some kind of accident. What is worse is that family members in countries with this regime may have no idea what the donor’s wishes were, since most people simply do nothing. That someone failed to fill out a form opting out of being a donor is not a strong indication of his actual beliefs.

We came to the conclusion that presumed consent was not, in fact, the best policy. Instead we liked a variant that had recently been adopted by the state of Illinois and is also used in other U.S. states. When people renew their driver’s license, they are asked whether they wish to be an organ donor. Simply asking people and immediately recording their choices makes it easy to sign up. In Alaska and Montana, this approach has achieved donation rates exceeding 80%. In the organ donation literature this policy was dubbed “mandated choice” and we adopted that term in the book.

O’Neil’s Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy

In her interesting Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy, Cathy O’Neil defines Weapons of Math Destruction based on three criteria – opacity, unfairness and scale.

Opacity makes it hard to assess the fairness of mathematical models (I’ll use the term algorithms through most of this post), and it facilitates (or might even be a key component of) an algorithm’s effectiveness if it relies on naive subjects. “These bonds have been rated by maths geniuses – buy them.” Unfairness relates to whether the algorithm operates in the interest of the modelled subject. Scale is not just that algorithms can affect large numbers of people. Scale can also lead to the establishment of norms that do not allow anyone to escape the operation of the algorithm.

These three factors are common across most of the problematic algorithms O’Neil discusses, and she makes a strong and persuasive case that many algorithms could be developed or used better. But the way she combines many of her points, together with her politics, often makes it unclear what exactly the problem is or what potential solutions could (should) be.

A distinction that might have made this clearer (or at least that I found useful) is between algorithms that don’t do what the developer intends, algorithms working as intended but that have poor consequences for those on the wrong side of their application, and algorithms that have unintended consequences once released into the wild. The first is botched math, the second is math done well to the detriment of others, while the third is good or bad math with naive application.

For this post I am going to break O’Neil’s case into these three categories.

Math done poorly

When it comes to the botched math, O’Neil is at her best. Her tale of teacher scoring algorithms in Washington DC is a case where the model is not helping anyone. Teachers were scored based on the deviations of student test scores from those predicted by models of the students. The bottom 2% to 5% of teachers were fired. But the combination of modelled target student scores and small class sizes made the scoring of teachers little better than random. There was almost no correlation in a teacher’s scores from one year to the next.
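
A rough simulation makes the statistical point (the numbers below are invented, not taken from the DC system): score each teacher by the average deviation of their students’ results from a prediction, with a class of 25 students and student-level noise that swamps the true differences between teachers, and the year-to-year correlation of scores comes out close to zero.

```python
import random
import statistics

random.seed(0)

N_TEACHERS = 500
CLASS_SIZE = 25          # students per class (assumed)
TEACHER_SD = 1.0         # spread of true teacher effects, in score points (assumed)
STUDENT_NOISE_SD = 15.0  # noise in an individual student's deviation from prediction (assumed)

def yearly_score(true_effect):
    # A teacher's score is the average deviation of their students' results
    # from the model's prediction: the true teacher effect plus student-level noise.
    return statistics.mean(true_effect + random.gauss(0, STUDENT_NOISE_SD)
                           for _ in range(CLASS_SIZE))

true_effects = [random.gauss(0, TEACHER_SD) for _ in range(N_TEACHERS)]
year_one = [yearly_score(t) for t in true_effects]
year_two = [yearly_score(t) for t in true_effects]

# Requires Python 3.10+ for statistics.correlation
print(f"Year-to-year correlation of teacher scores: "
      f"{statistics.correlation(year_one, year_two):.2f}")
```

With these made-up parameters the correlation comes out at about 0.1 – little better than random, consistent with the near-zero year-to-year correlation reported for the DC scores.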

Her critique of the way many models are developed is also important. Are we checking the model is working, rather than just assuming that the teachers we fired are the right ones? She contrasts the effort typically spent testing a recidivism model (for use in determining prison sentences) to the way Amazon learns about its customers. Amazon doesn’t simply develop a single “recidivism score” equivalent and take that as determinative. Instead they continue to test and learn as much as they can about their interactions with customers to make the best models they can.

The solutions to the botched math are simpler (at least in theory) than many of the other problems she highlights. The teacher scoring models simply require someone with competence to consider what it is they want to measure, and then to work out whether it can be measured in a statistically meaningful way. If it can’t, so be it. The willingness to concede that a meaningful model cannot be developed is important, particularly when the model is designed to inform high-stakes decisions. Similarly, recidivism scoring algorithms should be subject to constant scrutiny.

But this raises the question of how you assess an algorithm. What is the appropriate benchmark? Perfection? Or the system it is replacing? At times O’Neil places a heavy focus on the errors of the algorithm, with little focus on the errors of the alternative – the humans it replaced. Many of O’Neil’s stories involve false positives, leading to a focus on the obvious algorithm errors, with the algorithm’s greater accuracy and the human errors unseen. A better approach might be to simply compare alternative approaches and see which is better, rather than having the human as the default. Once the superior alternative is selected, we also need to remain cognisant that the best option still might not be very good.

As O’Neil argues, some of the poor models would also be less harmful if they were transparent. People could pull the models apart and see whether they were working as intended. A still cleaner version might be to just release the data and let people navigate it themselves (e.g. education data), although this is not without problems. Whatever is the most salient way of sorting and ranking the data will become the new de facto model. If we don’t do it ourselves, someone will take that data and give us the ranking we crave.

Math done well (for the user anyhow)

When it comes to math done well, O’Neil’s three limbs of the WMD definition – opacity, unfairness and scale – are a good description of the problems she sees. O’Neil’s critique is usually not so much about the maths, but about the unfair use of the models for purposes such as targeting the poor (think predatory advertising by private colleges or payday lenders) or treating workers as cogs in the machine through the use of scheduling software.

In these cases, it is common that the person being modelled does not even know about the model (opacity). And if they could see the model, it may be hard to understand what characteristics are driving the outcome (although this is not so different to the opacity of human decision-making). The outcome then determines how we are treated, the ads we see, the prices we are offered, and so on.

One of O’Neil’s major concerns about fairness is that the models discriminate. She suggests they discriminate against the poor, African-Americans and those with mental illness. This is generally not through a direct intention to discriminate against these groups, although O’Neil gives the example of a medical school algorithm rejecting applicants based on birthplace due to biased training data. Rather, the models use proxies for the variables of interest, and those proxies also happen to correlate with certain group features.

This points to the tension in the use of many of these algorithms. Their very purpose is to discriminate. They are developed to identify the features that, say, employers or lenders want. Given there is almost always a correlation between those features and some groups, you will inevitably “discriminate” against them.
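
One way that correlation can play out is easy to sketch (all numbers below are invented). In this toy setup the model never sees group membership and simply applies a cut-off to a proxy; but because the proxy sits lower for one group – think of a postcode that reads as riskier – that group is rejected more often, even though the trait the lender actually cares about is distributed identically in both groups.

```python
import random

random.seed(3)

N = 100_000
results = {"group A": [0, 0], "group B": [0, 0]}   # [rejected, total]

for _ in range(N):
    group = "group A" if random.random() < 0.5 else "group B"
    # The trait the lender actually cares about (say, likelihood of repaying)
    # is distributed identically in both groups in this toy setup.
    trait = random.gauss(0, 1)
    # The model never sees `group`. It only sees a noisy proxy (a postcode
    # score, say) that happens to sit lower for group B, independent of the trait.
    proxy = trait + random.gauss(0, 1) - (0.5 if group == "group B" else 0.0)
    rejected = proxy < 0                 # a simple cut-off rule on the proxy
    results[group][0] += rejected
    results[group][1] += 1

for group, (rejected, total) in results.items():
    print(f"{group}: {rejected / total:.1%} of applicants rejected")
```

Whether that counts as unacceptable discrimination or as the unavoidable consequence of relying on proxies is exactly the tension running through this part of O’Neil’s argument.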

So what is appropriate discrimination? O’Neil objects to tarring someone with group features. If you live in a certain postcode, is it fair to be categorised with everyone else in that postcode? Possibly not. But if you have an IQ that is judged likely to result in poor job performance or creditworthiness based on the past performance of other people with that IQ, is that acceptable? What of having a degree?

The use of features such as postcodes, IQ or degrees comes from the need to identify proxies for the traits people want to identify, such as whether someone will pay back a loan or deliver good work performance. Each proxy varies in the strength of its prediction, so the obvious solution seems to be to get more data and better proxies. Which of these is going to give us the best prediction of what we actually care about?

But O’Neil often balks at this step. She tells the story of a chap who can’t get a minimum wage job due to his results on a five-factor model personality test, despite his “near perfect SAT”. The scale of the use of this test means he runs into this barrier with most employers. When O’Neil points out that personality tests are only one-third as predictive as cognitive tests, she doesn’t make the argument that employers should be allowed to use cognitive tests. She even suggests that employers are rightfully barred from using IQ tests in recruitment (as per a 1971 Supreme Court case). But absent the cognitive tests, do employers simply turn to the next best thing?

Similarly, when O’Neil complains about the use of “e-scores” (proxies for credit scores) in domains where entities are not legally allowed to use credit scores to discriminate, she complains that they are using a “sloppy substitute”. But again she does not complain about the ban on using the more direct measures.

There are also two sides to the use of these proxies. While the use of the proxies may result in some people being denied a job or a loan, it may allow someone else to get that job or loan, or to pay a better price, when a cruder measure might have seen that person being rejected.

O’Neil gives the example of ZestFinance, a payday lender that typically charges 60% less than the industry standard. ZestFinance does this by finding every proxy for creditworthiness it can, picking out proxies such as correct use of capitalisation on the application form, and whether the applicant read the terms and conditions. O’Neil complains about those who are accepted for a loan but have to pay higher fees because of, say, poor spelling – something the poor and uneducated are more likely to incur. But her focus is on one type of outcome, those with more expensive loans (although probably still cheaper than from other payday lenders), leaving the people receiving the cheapest loans unseen. Should we deny this class of people access to the cheaper finance these algorithms allow?

One interesting case in the book concerns the pricing of car insurance. An insurer wants to know who is the better driver, so they develop algorithms to price the risk appropriately. Credit scores are predictive of driving performance, so those with worse credit scores end up paying more for this.

But insurers also want to price discriminate to the extent that they can. That is, they want to charge each individual the highest price they will tolerate. Price discrimination can be positive for the poor. Price discrimination allows many airlines to offer cheap seats in the back of the plane when the business crowd insists on paying extra for a few inches of leg room. I benefited from the academic pricing of software for years, and we regularly see discounted pricing for students and seniors. But price discrimination can also allow the uninformed, the lazy and those without options to be stripped of a few extra dollars. In the case of the insurer pricing algorithms, they are designed to price discriminate in addition to pricing the policy based on risk.

It turns out that credit score is not just predictive of driving performance, but also of buyer response to price changes. The resultant insurance pricing is an interaction of these two dimensions. O’Neil gives an example from Florida, where adults with clean driving records but poor credit scores paid $1,552 more (on average) than drivers with excellent credit but a drunk driving conviction, although it is unclear how much of this reflects risk and how much price discrimination.

Naive math

One of O’Neil’s examples of what I will call naive math is the algorithm that creates a self-reinforcing feedback loop. The model does what it is supposed to do – say, predict an event – but once used in a system, the model’s classification of a certain cohort becomes self-fulfilling or self-reinforcing.

For example, if longer prison sentences make someone more likely to offend on their release, any indicator that results in longer sentences will in effect become more strongly correlated with re-offending. Even if the model is updated to disentangle this problem, allowing the effect of the longer sentences to be isolated, the person who received a longer sentence is doomed the next time they are scored.
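
That dynamic is easy to sketch with a few invented probabilities: an indicator only weakly associated with reoffending, a model that responds to the indicator with a longer sentence, and a longer sentence that itself raises the risk of reoffending. The observed correlation between the indicator and reoffending strengthens simply because the model acted on it – and that stronger correlation is what the next version of the model gets trained on.

```python
import random
import statistics

random.seed(7)

N = 200_000
BASE_RATE = 0.30          # reoffending probability with no indicator and no sentence effect (assumed)
INDICATOR_LIFT = 0.05     # how much riskier flagged individuals really are (assumed)
SENTENCE_PENALTY = 0.15   # extra reoffending risk caused by the longer sentence itself (assumed)

def indicator_correlation(model_acts_on_indicator):
    flags, reoffended = [], []
    for _ in range(N):
        flagged = random.random() < 0.5
        p = BASE_RATE + (INDICATOR_LIFT if flagged else 0.0)
        # The model hands flagged individuals a longer sentence, which itself
        # raises their chance of reoffending on release.
        if model_acts_on_indicator and flagged:
            p += SENTENCE_PENALTY
        flags.append(1.0 if flagged else 0.0)
        reoffended.append(1.0 if random.random() < p else 0.0)
    # Requires Python 3.10+ for statistics.correlation
    return statistics.correlation(flags, reoffended)

print(f"Indicator-reoffending correlation before the model is used: "
      f"{indicator_correlation(False):.3f}")
print(f"Indicator-reoffending correlation once the model acts on it: "
      f"{indicator_correlation(True):.3f}")
```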

In a sense, the model does exactly what it should, predicting who will re-offend, and there is ample evidence that such models do better than humans. But the application of the model does more than simply predict recidivism. It might ultimately reinforce itself.

Another example of a feedback loop is a person flagged as a poor credit risk. As they can’t get access to cheap credit, they then go to an expensive payday lender and ultimately run into trouble. That trouble is flagged in the credit scoring system, making it even harder for them to access financial services. If the algorithm made an error in the first instance – the person was actually a good credit risk – that person might then become a poor risk because the model effectively pushed them into more expensive products.

The solutions to these feedback loops are difficult. On the one hand, vigilant investigation and updating the models will help ameliorate the problems. O’Neil persuasively argues that we don’t do this enough. Entities such as ZestFinance that use a richer set of data can also break the cycle for some people.

But it is hard to solve the case for individual mis-classification. Any model will have false positives and false negatives. The model development process can only try to limit them, often with a trade-off between the two.

In assessing this problem we also need to focus on the alternative. Before these algorithms were developed, people would be denied credit, parole and jobs for all sorts of whimsical decisions on the part of the human decision makers. Those decisions would then result in feedback loops as their failures are reflected in future outcomes. The algorithms might be imperfect, but can be an improvement.

This is where O’Neil’s scale point becomes interesting. In a world of diverse credit scoring mechanisms, a good credit risk who is falsely identified as a poor risk under one measure might be accurately classified under another. The false positive is not universal, allowing them to shop around for the right deal. But if every credit provider uses the same scoring system, someone could be universally barred. The pre-algorithm world, for all its flaws, possibly provided more opportunities for someone to find the place where they are not incorrectly classified.

A final point on naive models (although O’Neil has more) is that models reflect goals and ideology. Sometimes this is uncontroversial – we want to keep dangerous criminals off the street. Sometimes this is more complicated – what risk of false positives are we willing to tolerate in keeping those criminals off the street? In many ways the influence of O’Neil’s politics on her critique provides the case in support of this point.

Solutions

Before reading the book, I listened to O’Neil on an Econtalk episode with Russ Roberts. There she makes the point that where we run into flawed algorithms, we shouldn’t always be going back to the old way of doing things (she made that comment in the context of judges). We should be making the algorithms better.

That angle was generally absent from the book. O’Neil takes the occasional moment to acknowledge that many algorithms are not disrupting perfect decision-making systems, but are replacing biased judges, bank managers who favoured their friends, and unstructured job interviews with no predictive power. But through the book she seems quite willing to rip those gains down in the name of fairness.

More explicitly, O’Neil asks whether we should sacrifice efficiency for fairness. For instance, should we leave some data out? In many cases we already do this, by not including factors such as race. But should this extend to factors such as who someone knows, their job or their credit score?

O’Neil’s choice of factors in this instance is telling. She asks whether someone’s connections, job or credit score should be used in a recidivism model, and suggests not, as they would be inadmissible in court. But this is a misunderstanding of the court process. Those factors are inadmissible in determining guilt or innocence, but form a central part of sentencing decisions. Look at the use of referees or stories about someone’s tough upbringing. So is O’Neil’s complaint about the algorithm, or about the way we dispense criminal justice in general? This reflects a feeling I had many times in the book that O’Neil’s concerns are much deeper than the effect of algorithms and extend to the nature of the systems themselves.

Possibly the point on which I disagree with O’Neil most is her suggestion that human decision-making has a benefit in that it can evolve and adapt. In contrast, a biased algorithm does not adapt until someone fixes it. The simple question I ask is where is the evidence of human adaptation? You just need to look at all the programs to eliminate workplace bias with no evidence of effectiveness for a taste of how hard it is to deliberately change people. We continue to be prone to seeing spurious correlations, and making inconsistent and unreliable decisions. For many human decisions there is simply no feedback loop as to whether we made the right decision. How will a human lender ever know they rejected a good credit risk?

While automated systems are stuck until someone fixes them, someone can fix them. And that is often what happens. Recently several people forwarded to me an article on the inability of some facial recognition systems to recognise non-Caucasian faces. But beyond the point that humans also have this problem (yes, “they all look alike”), the problem with facial recognition algorithms has been identified and, even though it is a tough problem, there are major efforts to fix it. (Considering some of the major customers of this technology are police and security services, there is an obvious interest in solving it.) In the meantime, those of us raised in a largely homogeneous population are stuck with our cross-racial face blindness.

Is it irrational?

Over at Behavioral Scientist magazine my second article, Rationalizing the ‘Irrational’, is up.

In the article I suggest that an evolutionary biology lens can give us some insight into what drives people’s actions. By understanding someone’s actual objectives, we are better able to determine whether their actions are likely to achieve their goals. Are they behaving “rationally”?

Although the major thread of the article is evolutionary, in some ways that is not the main point. For me the central argument is simply that when we observe someone else’s actions, we need to exercise a degree of humility in assessing whether they are “rational”. We possibly don’t even know what they are trying to achieve, let alone whether their actions are the best way to achieve it.

Obviously, this new article pursues a somewhat different theme to my first in Behavioral Scientist, which explored the balance between human and algorithmic decision making. After discussing possible topics for my first article with the editor who has been looking after me to date (DJ Neri), I sent sketches of two potential articles. We decided to progress both.

My plan for my next article is to return to the themes from the first. I’ve recently been thinking and reading about algorithm aversion, and why we resist using superior decision tools when they are available. Even if the best solution is to simply use the algorithm or statistical output, the reality is that people will typically be involved. How can we develop systems where they don’t mess things up?

Kasparov’s Deep Thinking: Where Machine Intelligence Ends and Human Creativity Begins

In preparation for my recent column in The Behavioral Scientist, which opened with the story of world chess champion Garry Kasparov’s defeat by the computer Deep Blue, I read Kasparov’s recently released Deep Thinking: Where Machine Intelligence Ends and Human Creativity Begins.

Despite the title and Kasparov’s interesting observations on the computer-human relationship, Deep Thinking is more a history of man versus machine in chess than a deep analysis of human or machine intelligence. Kasparov takes us from the earliest chess program, produced by Alan Turing on a piece of paper in 1952, through to a detailed account of Kasparov’s 1997 match against the computer Deep Blue, and then beyond.

Kasparov’s history provides an interesting sense of not just the process toward a machine defeating the world champion, but also when computers overtook the rest of us. In 1977 Kasparov had the machines ahead of all but the top 5% of humans. From the perspective of the average human versus machine, the battle is over decades before the machine is better than the best human. And even then the competition at the top levels is brief. As Kasparov puts it, we have:

Thousands of years of status quo human dominance, a few decades of weak competition, a few years of struggle for supremacy. Then, game over. For the rest of human history, as the timeline draws into infinity, machines will be better than humans at chess. The competition period is a tiny dot on the historical timeline.

As Kasparov also discusses, his defeat did not completely end the competition between humans and computers in chess. He describes a 2005 competition in what was called “freestyle chess”, whereby people were free to mix humans and machines as they saw fit. To his surprise, the winners of this competition were not a grandmaster teamed with a computer, but a pair of amateur Americans using three computers at the same time. As Kasparov puts it, a weak human + machine + better process is superior to a strong human + machine + inferior process. There is still hope for the humans.

That hope, however, and the human-computer partnership, is also short-lived.  Kasparov notes that the algorithms will continue to improve and the hardware will get faster until the human partnership adds nothing to the mix. Kasparov’s position does not seem that different to my own.

One thing clear through Kasparov’s tale is that he does not consider chess to be the best forum for exploring machine intelligence. This was due to both the nature of chess itself, and the way in which those trying to develop a machine to defeat a world champion (particularly IBM) went about the task.

On the nature of chess, the game is just not complex enough. Its constraints – an eight by eight board with sixteen pieces a side – meant that it was amenable to algorithms built using a combination of fixed human knowledge and brute-force computational power. From the 1970s onward, developers of chess computers realised that this was the case, so much of the focus was on increasing computational power and refining algorithms for efficiency until they inevitably reached world champion standard.

The nature of these algorithms is best understood in the context of two search techniques described by Claude Shannon in 1949. Type A search is the process of going through every possible combination of moves deeper and deeper with each pass – one move deep, two moves deep and so on. The Type B search is more human-like, focusing on the few most promising moves and examining those in great depth. The development of Type B processes would provide more insight into machine intelligence.

The software that defeated Kasparov, along with most other chess software, used what Kasparov calls alpha-beta search. Alpha-beta search is a Type A approach that stops searching down any particular path whenever a move being examined has a lower value than the currently selected move. This process and increases in computational power were the keys to chess being vulnerable to the brute force attack. Although enormous amounts of work also went into Deep Blue’s openings and evaluation function, another few years would have seen Kasparov or his successor defeated by something far less highly tuned. His defeat was somewhat inevitable.
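
For readers who have not seen it, the pruning idea fits in a few lines. The sketch below is a generic minimax search with alpha-beta cut-offs, not Deep Blue’s implementation; `evaluate`, `moves` and `apply_move` are placeholder functions standing in for an engine’s evaluation function, legal-move generator and move application.

```python
def alphabeta(state, depth, alpha, beta, maximising, evaluate, moves, apply_move):
    """Minimal alpha-beta (Type A) search sketch.

    A search from the root would be called as:
        alphabeta(start, 6, float("-inf"), float("inf"), True,
                  evaluate, moves, apply_move)
    """
    legal = moves(state)
    if depth == 0 or not legal:
        return evaluate(state)
    if maximising:
        value = float("-inf")
        for move in legal:
            value = max(value, alphabeta(apply_move(state, move), depth - 1,
                                         alpha, beta, False,
                                         evaluate, moves, apply_move))
            alpha = max(alpha, value)
            if alpha >= beta:
                break   # prune: this branch cannot beat an alternative already found
        return value
    value = float("inf")
    for move in legal:
        value = min(value, alphabeta(apply_move(state, move), depth - 1,
                                     alpha, beta, True,
                                     evaluate, moves, apply_move))
        beta = min(beta, value)
        if beta <= alpha:
            break       # prune
    return value
```

Alpha-beta returns the same value as a full minimax search of the same depth – it only skips branches that cannot affect the result – which is why raw increases in hardware speed translated so directly into playing strength.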

IBM’s approach to the contest also did not add much to the exploration of machine intelligence. As became clear to Kasparov in the lead-up to the Deep Blue rematch (he had defeated Deep Blue in 1996), IBM was not interested in the science behind the enterprise, but simply wanted to win. It provided great advertising for IBM, but the machine logs of the contest were not made available and Deep Blue was later trashed. It’s an interesting contrast to IBM’s approach with the Jeopardy!-winning Watson, which now seems to be everywhere.

As a result, Kasparov sees the AlphaGo project as a more interesting AI project than anything behind the top chess machines. The complexity of Go – a 19 by 19 board and 361 stones – requires the use of techniques such as neural networks. AlphaGo had to teach itself to play.

Even though Kasparov’s offerings on human and machine intelligence are relatively thin, the chess history in itself makes the book worth reading. Kasparov’s story differs from some of the “myths” that have spread about that contest over the last 20 years, with Kasparov critical of many commentators’ interpretations of events.

One story Kasparov attacks is Nate Silver’s version in The Signal and the Noise (at which time Kasparov also takes a few swings at Silver’s understanding of chess). Silver’s story starts at the conclusion of game 1 of the match. When Kasparov considered his victory near complete, Deep Blue moved a rook in a highly unusual move – a move that turned out to be a “bug” in Deep Blue’s programming. As he did not understand it was a bug, Kasparov saw the move as a sign that the machine could see mate by Kasparov in 20 or more moves, and was seeking to delay this defeat. Kasparov was so impressed by the depth of Deep Blue’s calculations that it affected his play for the rest of the match and was the ultimate cause of his loss.

As Kasparov tells it, he simply discarded Deep Blue’s move as the type of inexplicable move computers tend to make when lost. Instead, his state of mind suffered most severely when he was defeated in game 2. Through game 2 he played an unnatural (to him) style of anti-computer chess, and overlooked a potential chance to draw the game through perpetual check (he was informed of his missed opportunity the next day). He simply wasn’t looking for opportunities that he thought a computer would have spotted.

Humans vs algorithms

My first column over at the Behavioral Scientist is live.

The column is an attempt to bring together two potentially conflicting stories.

The first is that the best decisions result from humans and machines working together. This is encapsulated in the story of freestyle chess, whereby the best software is trumped by a human-computer team.

The other is the deep literature on whether humans or algorithms make better decisions, starting with Paul Meehl’s classic Clinical Versus Statistical Prediction. The common story in this literature is that there are few domains where humans outperform statistical or algorithmic approaches (even relatively simple ones). There is also an admittedly thinner literature on what happens when humans are given the algorithm’s output and can decide whether to use or overrule it, and the story there is that people should generally leave the algorithm alone.

If you take the latter to be the usual case, the world will not be so much like freestyle chess, but more a case of the steady replacement of humans, decision by decision. The humans will remain relevant not because they can improve the algorithm’s decisions, but because there are inputs we need to provide, there are domains into which the algorithms cannot yet go, or we just don’t want to hand over control.

You can read the column here.

The “effect is too large” heuristic

Daniel Lakens writes:

I was listening to a recent Radiolab episode on blame and guilt, where the guest Robert Sapolsky mentioned a famous study [by Danziger and friends] on judges handing out harsher sentences before lunch than after lunch. The idea is that their mental resources deplete over time, and they stop thinking carefully about their decision – until having a bite replenishes their resources. The study is well-known, and often (as in the Radiolab episode) used to argue how limited free will is, and how much of our behavior is caused by influences outside of our own control. I had never read the original paper, so I decided to take a look.

During the podcast, it was mentioned that the percentage of favorable decisions drops from 65% to 0% over the number of cases that are decided upon. This sounded unlikely. I looked at Figure 1 from the paper (below), and I couldn’t believe my eyes. Not only is the drop indeed as large as mentioned – it occurs three times in a row over the course of the day, and after a break, it returns to exactly 65%!

I think we should dismiss this finding, simply because it is impossible. When we interpret how impossibly large the effect size is, anyone with even a modest understanding of psychology should be able to conclude that it is impossible that this data pattern is caused by a psychological mechanism. As psychologists, we shouldn’t teach or cite this finding, nor use it in policy decisions as an example of psychological bias in decision making.

I was aware of one explanation for why the effect reported by Danziger and friends was so large. Andreas Glockner explored what would occur if favourable rulings take longer than unfavourable rulings and the judge (rationally) plans ahead, stopping for a break if they believe the next case will take longer than the time left in the session. Simulating this scenario, Glockner generated an effect of similar magnitude to the original paper.
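
Glockner’s mechanism is straightforward to reproduce (the parameter values below are mine, for illustration, not his): favourable rulings take longer, and the panel defers any case that will not fit into the time remaining before the scheduled break. Outcomes are generated independently of position, yet the favourable share falls away over the course of each simulated session and resets after each break.

```python
import random

random.seed(42)

P_FAVOURABLE = 0.65      # baseline probability a case is decided favourably (assumed)
T_FAVOURABLE = 15.0      # minutes a favourable ruling takes (assumed)
T_UNFAVOURABLE = 5.0     # minutes an unfavourable ruling takes (assumed)
SESSION_LENGTH = 120.0   # minutes between breaks (assumed)
N_SESSIONS = 100_000

favourable_by_position = {}   # ordinal position in session -> list of outcomes

for _ in range(N_SESSIONS):
    time_left = SESSION_LENGTH
    position = 0
    while True:
        # Each case's outcome (and hence its duration) is independent of its position.
        favourable = random.random() < P_FAVOURABLE
        duration = T_FAVOURABLE if favourable else T_UNFAVOURABLE
        # Glockner's scheduling assumption: if the case will not fit before the
        # break, the panel defers it and takes the break now.
        if duration > time_left:
            break
        favourable_by_position.setdefault(position, []).append(favourable)
        time_left -= duration
        position += 1

for pos in sorted(favourable_by_position):
    outcomes = favourable_by_position[pos]
    share = 100 * sum(outcomes) / len(outcomes)
    print(f"case {pos + 1:2d} in session: {share:5.1f}% favourable (n={len(outcomes)})")
```

No depletion of mental resources is required to produce the pattern.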

However, I was never convinced the case ordering was random, a core assumption behind Danziger and friends’ finding. In my brief legal career I often attended preliminary court hearings where matters were listed in a long (possibly random) court list. Then the order emerged. Those with legal representation would go first. Senior lawyers would get priority over junior lawyers. Matters for immediate adjournment would be early. And so on. There was no formal procedure for this to occur other than discussion with the court orderly before and during the session.

It turns out that these Israeli judges (or, I should say, a panel of a judge, a criminologist and a social worker) experienced a similar dynamic. Lakens points to a PNAS paper in which Keren Weinshall-Margela (of the Israeli Supreme Court’s research division) and John Shapard investigated whether the ordering of cases was actually random. The answer was no:

We examined data provided by the authors and obtained additional data from 12 hearing days (n = 227 decisions). We also interviewed three attorneys, a parole panel judge, and five personnel at Israeli Prison Services and Court Management, learning that case ordering is not random and that several factors contribute to the downward trend in prisoner success between meal breaks. The most important is that the board tries to complete all cases from one prison before it takes a break and to start with another prison after the break. Within each session, unrepresented prisoners usually go last and are less likely to be granted parole than prisoners with attorneys.

Danziger and friends have responded to these claims and attempted to resuscitate their article, but there is something to be said for the “effect is too large” heuristic proposed by Lakens. No amount of back and forth about the finer details of the methodology can avoid that point.

The famous story about the effect of defaults on organ donation provides another example. When I first heard the claim that 99.98% of Austrians, but only 12% of Germans, are organ donors due to the default organ donation option in their driver’s licence renewal, I simply thought the size of the effect was unrealistic. Do only 2 in 10,000 Austrians tick the box? I would assume more than 2 in 10,000 would tick it by mistake, thinking that would make them organ donors. So when you turn to the original paper or examine the actual organ donation process, you will see this has nothing to do with driver’s licences or ticking boxes. The claimed effect size and the story simply did not line up.

Andrew Gelman often makes a similar point. Much research in the social sciences reflects an attempt to find tiny effects in noisy data, and any large effects we find are likely gross overestimates of the true effect (to the extent the effect exists). Gelman and John Carlin call this a Type M error.
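
Gelman and Carlin’s point is easy to demonstrate with a small simulation (the parameters below are invented). When the true effect is small relative to the noise, few studies reach statistical significance, and the ones that do necessarily report estimates several times larger than the truth – the Type M (magnitude) error.

```python
import random
import statistics

random.seed(1)

TRUE_EFFECT = 0.1    # small true effect, in standard-deviation units (assumed)
N_PER_GROUP = 50     # sample size per group (assumed)
N_STUDIES = 20_000

significant_estimates = []

for _ in range(N_STUDIES):
    control = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
    treated = [random.gauss(TRUE_EFFECT, 1) for _ in range(N_PER_GROUP)]
    estimate = statistics.mean(treated) - statistics.mean(control)
    se = (2 / N_PER_GROUP) ** 0.5        # standard error, treating both sds as known (= 1)
    if abs(estimate / se) > 1.96:        # the study is "statistically significant"
        significant_estimates.append(abs(estimate))

print(f"Share of studies reaching significance: "
      f"{len(significant_estimates) / N_STUDIES:.1%}")
print(f"Average significant estimate is "
      f"{statistics.mean(significant_estimates) / TRUE_EFFECT:.1f} times the true effect")
```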

Finally, I intended to include Glockner’s paper in my critical behavioural economics and behavioural science reading list, but it slipped my mind. I have now included it and these other articles for a much richer story.