Coursera’s Data Science Specialisation: A Review

As I mentioned in my comments on Coursera’s Executive Data Science specialisation, I have looked at a lot of online data science and statistics courses to find useful training material, to understand the skills of people who have completed these online courses, and to learn a bit myself.

One of the best known sets of courses is Coursera’s Data Science Specialisation, created by Johns Hopkins University. It is a ten-course program that covers the data science process from data collection to the production of data science products. It focuses on implementing the data science process in R.

This specialisation is a signal that someone is familiar with data analysis in R – and the units are not bad if learning R is your goal. But neither this specialisation nor any other course of similar length that I have reviewed to date offers a shortcut to the statistical knowledge necessary for good data science. A few university-length units seem to be the minimum, and even they need to be paired with experience and self-directed study (not to mention some skepticism of what we can determine).

The specialisation assessments are such that you can often pass the courses without understanding what you have been taught. Points for some courses are awarded for “effort” (see Statistical Inference below). The multiple-choice quizzes are capped at three attempts per eight hours, but in practice that means effectively unlimited attempts. I don’t have a great deal of faith in university assessment processes either – particularly in Australia, where no-one wants to disrupt the flood of fees from international students by failing someone – but the assessment in these specialisations requires even less knowledge or effort. They’re not much of a signal of anything.

If you are wondering whether you should audit or pay for the specialisation, you can’t submit the assignments under the audit option. But the quizzes are basic and you can find plenty of assignment submissions on GitHub or RPubs against which you can check your work.

Here are some notes on each course. I looked through each of these over a year or so, so there might be some updates to the earlier courses (although a quick revisit suggests my comments still apply).

  1. The Data Scientist’s Toolbox: Little more than an exercise in installing R and git, together with an overview of the other courses in the specialisation. If you are familiar with R and git, skip.
  2. R Programming: In some ways the specialisation could have been called R Programming. This unit is one of the better of the ten, and gives a basic grounding in R.
  3. Getting and Cleaning Data: Not bad for getting a grasp of the various ways of extracting data into R, but watching video after video of imports of different formats makes for less-than-exciting viewing. The principles on tidy data are important – the unit is worth doing for this alone (a small illustration of the idea follows this list).
  4. Exploratory Data Analysis: Really a course in charting in R, but a decent one at that. There is some material on principal components analysis and clustering that will likely go over most people’s heads – too much material in too little time.
  5. Reproducible Research: The subject of this unit – literate (statistical) programming – is one of the more important subjects covered in the specialisation. However, this unit seemed cobbled together – lectures repeated points and didn’t seem to follow a logical structure. The last lecture is a conference video (albeit one worth watching). Compared to the (outstanding) production effort that has gone into the Applied Data Science with Python specialisation, this unit fares poorly.
  6. Statistical Inference: Likely too basic for someone with a decent stats background, but confusing for someone without. This unit hits home that it isn’t possible to build a stats background in a couple of hours a week over four weeks. The peer assessment caters to this through criteria such as “Here’s your opportunity to give this project +1 for effort.”, with the option “Yes, this was a nice attempt (regardless of correctness)”.
  7. Regression Models: As per Statistical Inference, but possibly even more confusing for those without a stats background.
  8. Practical Machine Learning: Not a bad course for getting across implementing a few machine learning models in R, but there are better background courses. Start with Andrew Ng’s Machine Learning, and then work through Stanford’s Statistical Learning (which also has great R materials). Then return to this unit for a slightly different perspective. Like many of the other specialisation units, it is pitched too high for someone with no background. For instance, at no point do they actually describe what machine learning is.
  9. Developing Data Products: This course is quite good, covering some of the major publishing tools, such as Shiny, R Markdown and Plotly (although skip the videos on Swirl). The strength of this specialisation is training in R, and that is what this unit focuses on.
  10. Data Science Capstone: This course is best thought of as a commitment device that will force you to learn a certain amount about natural language processing in R (the topic of the project). You are given a task with a set of milestones, and you’re left to figure it out for yourself. Unless you already know something about natural language processing, you will have to review other courses and materials and spend a lot of time on the discussion boards to get yourself across the line. Skip it and do a natural language processing course such as Coursera’s Applied Text Mining in Python (although this assumes a fair bit of skill in Python). Besides, you can only access the capstone if you have paid for and completed the other nine units in the specialisation.
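
To make the tidy data point concrete: the idea is one variable per column and one observation per row. The course does this in R, but here is a minimal sketch of the same reshaping step in Python with pandas, using made-up data and column names.

```python
import pandas as pd

# A "wide" table: one row per country, one column per year (made-up numbers).
wide = pd.DataFrame({
    "country": ["Austria", "Denmark"],
    "donors_2001": [120, 95],
    "donors_2002": [130, 90],
})

# Tidy form: one variable per column, one observation per row.
tidy = wide.melt(id_vars="country", var_name="year", value_name="donors")
tidy["year"] = tidy["year"].str.replace("donors_", "").astype(int)

print(tidy)  # four rows: one per country-year combination
```

Once data is in this shape, the grouping, filtering and charting covered in the later units become one-liners, which is why the tidy data material is worth doing.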

Perrow’s Normal Accidents: Living with High-Risk Technologies

A typical story in Charles Perrow’s Normal Accidents: Living with High-Risk Technologies runs like this.

We start with a plant, airplane, ship, biology laboratory, or other setting with a lot of components (parts, procedures, operators). Then we need two or more failures among components that interact in some unexpected way. No one dreamed that when X failed, Y would also be out of order and the two failures would interact so as to both start a fire and silence the fire alarm. Furthermore, no one can figure out the interaction at the time and thus know what to do. The problem is just something that never occurred to the designers. Next time they will put in an extra alarm system and a fire suppressor, but who knows, that might just allow three more unexpected interactions among inevitable failures. This interacting tendency is a characteristic of a system, not of a part or an operator; we will call it the “interactive complexity” of the system.

For some systems that have this kind of complexity, … the accident will not spread and be serious because there is a lot of slack available, and time to spare, and other ways to get things done. But suppose the system is also “tightly coupled,” that is, processes happen very fast and can’t be turned off, the failed parts cannot be isolated from other parts, or there is no other way to keep the production going safely. Then recovery from the initial disturbance is not possible; it will spread quickly and irretrievably for at least some time. Indeed, operator action or the safety systems may make it worse, since for a time it is not known what the problem really is.

Take this example:

A commercial airplane … was flying at 35,000 feet over Iowa at night when a cabin fire broke out. It was caused by chafing on a bundle of wire. Normally this would cause nothing worse than a short between two wires whose insulations rubbed off, and there are fuses to take care of that. But it just so happened that the chafing took place where the wire bundle passed behind a coffee maker, in the service area in which the attendants have meals and drinks stored. One of the wires shorted to the coffee maker, introducing a much larger current into the system, enough to burn the material that wrapped the whole bundle of wires, burning the insulation off several of the wires. Multiple shorts occurred in the wires. This should have triggered a remote-control circuit breaker in the aft luggage compartment, where some of these wires terminated. However, the circuit breaker inexplicably did not operate, even though in subsequent tests it was found to be functional. … The wiring contained communication wiring and “accessory distribution wiring” that went to the cockpit.

As a result:

Warning lights did not come on, and no circuit breaker opened. The fire was extinguished but reignited twice during the descent and landing. Because fuel could not be dumped, an overweight (21,000 pounds), night, emergency landing was accomplished. Landing flaps and thrust reversing were unavailable, the antiskid was inoperative, and because heavy braking was used, the brakes caught fire and subsequently failed. As a result, the aircraft overran the runway and stopped beyond the end where the passengers and crew disembarked.

As Perrow notes, there is nothing complicated in putting a coffee maker on a commercial aircraft. But in a complex interactive system, simple additions can have large consequences.

Accidents of this type in complex, tightly coupled systems are what Perrow calls “normal accidents”. When Perrow uses the word “normal”, he does not mean these accidents are expected or predictable. Many of these accidents are baffling. Rather, it is an inherent property of the system that it will experience an interaction of this kind from time to time.

While it is fashionable to talk of culture as a solution to organisational failures, in complex and tightly coupled systems even the best culture is not enough. There is no improvement to culture, organisation or management that will eliminate the risk. That we continue to have accidents in industries with mature processes, good management and decent incentives not to blow up suggests there might be something intrinsic about the system behind these accidents.

Perrow’s message on how we should deal with systems prone to normal accidents is that we should stop trying to fix them in ways that only make them riskier. Adding more complexity is unlikely to work. We should focus instead on reducing the potential for catastrophe when there is failure.

In some cases, Perrow argues that the potential scale of the catastrophe is such that the systems should be banned. He argues nuclear weapons and nuclear energy are both out on this count. In other systems, the benefit is such that we should continue tinkering to reduce the chance of accidents, but accept they will occur despite our best efforts.

One possible approach to complex, tightly coupled systems is to reduce the coupling, although Perrow does not dwell deeply on this. He suggests that the aviation industry has done this to an extent through measures such as corridors that exclude certain types of flights. But in most of the systems he examines, decoupling appears difficult.

Despite Perrow’s thesis being that accidents are normal in some systems, and that no organisational improvement will eliminate them, he dedicates considerable effort to critiquing management error, production pressures and general incompetence. The book could have been half the length with a more focused approach, but it does suggest that despite the inability to eliminate normal accidents, many complex, tightly coupled systems could be made safer through better incentives, competent management and the like.

Other interesting threads:

  • Normal Accidents was published in 1984, but the edition I read had an afterword written in 1999 in which Perrow examined new domains to which normal accident theory might be applied. Foreshadowing how I first came across the concept, he points to financial markets as a new domain for application. I first heard of “normal accidents” in Tim Harford’s discussion of financial markets in Adapt. Perrow’s analysis of the then-upcoming Y2K bug under his framework seems slightly overblown in hindsight.
  • The maritime accident chapter introduced (to me) the concepts of radar-assisted collisions and non-collision-course collisions. Radar-assisted collisions are a great example of the Peltzman effect, whereby vessels that would once have remained stationary or crawled through fog now speed through. The first vessels with radar were comforted that they could see all the stationary or slow-moving obstacles as dots on their radar screen. But as the number of vessels with radar increased and those other dots also started moving with speed, radar-assisted collisions became more common. On non-collision-course collisions, Perrow notes that most collisions involve two (or more) ships that were not on a collision course, but on becoming aware of each other managed to change course to effect a collision. Coordination failures are rife.
  • Perrow argues that nuclear weapon systems are so complex and prone to failure that there is inherent protection against catastrophic accident. Not enough pieces are likely to work to give us the catastrophe. Of course, this gives reason for concern about whether they will work when we actually need them (again, maybe a positive). Perrow even asks if complexity and coupling can be so problematic that the system ceases to exist.
  • Perrow spends some time critiquing hindsight bias in assessing accidents. He gives one example of a Union Carbide plant that received a glowing report from a US government department. Following an accidental gas release some months later, that same government department described the plant as an accident waiting to happen. I recommend Phil Rosenzweig’s The Halo Effect for a great analysis of this problem in assessing the factors behind business performance after the fact.

The benefit of doing nothing

From Tim Harford:

[I]n many areas of life we demand action when inaction would serve us better.

The most obvious example is in finance, where too many retail investors trade far too often. One study, by Brad Barber and Terrance Odean, found that the more retail investors traded, the further behind the market they lagged: active traders underperformed by more than 6 percentage points (a third of total returns) while the laziest investors enjoyed the best performance.

This is because dormant investors not only save on trading costs but avoid ill-timed moves. Another study, by Ilia Dichev, noted a distinct tendency for retail investors to pile in when stocks were riding high and to sell out at low points. …

The same can be said of medicine. It is a little unfair on doctors to point out that when they go on strike, the death rate falls. Nevertheless it is true. It is also true that we often encourage doctors to act when they should not. In the US, doctors tend to be financially rewarded for hyperactivity; everywhere, pressure comes from anxious patients. Wiser doctors resist the temptation to intervene when there is little to be gained from doing so — but it would be better if the temptation was not there. …

Harford also reflects on the competition between humans and computers, covering similar territory to that in my Behavioral Scientist article Don’t Touch the Computer (even referencing the same joke).

The argument for passivity has been strengthened by the rise of computers, which are now better than us at making all sorts of decisions. We have been resisting this conclusion for 63 years, since the psychologist Paul Meehl published Clinical vs. Statistical Prediction. Meehl later dubbed it “my disturbing little book”: it was an investigation of whether the informal judgments of experts could outperform straightforward statistical predictions on matters such as whether a felon would violate parole.

The experts almost always lost, and the algorithms are a lot cleverer these days than in 1954. It is unnerving how often we are better off without humans in charge. (Cue the old joke about the ideal co-pilot: a dog whose job is to bite the pilot if he touches the controls.)

The full article is here.

Alter’s Irresistible: Why We Can’t Stop Checking, Scrolling, Clicking and Watching

I have a lot of sympathy for Adam Alter’s case in Irresistible: Why We Can’t Stop Checking, Scrolling, Clicking and Watching. Despite the abundant benefits of being online, the hours I have burnt over the last 20 years through aimless internet wandering and social media engagement could easily have delivered a book or another PhD.

It’s unsurprising that we are surrounded by addictive tech. Game, website and app designers are all designing their products to gain and hold our attention. In particular, the tools at the disposal of modern developers are fantastic at introducing what Alter describes as the six ingredients of behavioural addiction:

[C]ompelling goals that are just beyond reach; irresistible and unpredictable positive feedback; a sense of incremental progress and improvement; tasks that become slowly more difficult over time; unresolved tensions that demand resolution; and strong social connections.

Behavioural addictions have a lot of similarity with substance addictions (some people question whether we should distinguish between them at all). They activate the same brain regions. They are fueled by some of the same human needs, such as the need for social engagement and support, mental stimulation and a sense of effectiveness. [Parts of the book seem to be a good primer on addiction, although see my endnote.]

Based on one survey of the literature, as many as 41 per cent of the population may have suffered a behavioural addiction in the past month. While having so many people classified as addicts dilutes the concept of “addiction”, it does not seem unrealistic given the way many people use tech.

As might be expected given the challenge, Alter’s solutions on how we can manage addiction in the modern world fall somewhat short of providing a fix. For one, Alter suggests we need to start training the young when they are first exposed to technology. However, it is likely that the traps present in later life will be much different from those present when young. After all, most of Alter’s examples of addicts were born well before the advent of World of Warcraft, the iPhone or the iPad that derailed them.

Further, the ability of tech to capture our attention is only in its infancy. It is not hard to imagine the eventual creation of immersive virtual worlds so attractive that some people will never want to leave.

Alter’s chapter on gamification is interesting. Gamification is the idea of turning a non-game experience into a game. One of the more inane but common examples of gamification is turning a set of stairs into a piano to encourage people to take those stairs in preference to the neighbouring escalator (see on YouTube). People get more exercise as a result.

The flip side is that gamification is part of the problem itself (unsurprising given the theme of Alter’s book). For example, exercise addicts using wearables can lose sight of why they are exercising. They push on for their gamified goals despite injuries and other costs. One critic introduced by Alter is particularly scathing:

Bogost suggested that gamification “was invented by consultants as a means to capture the wild, coveted beast that is video games and to domesticate it.” Bogost criticized gamification because it undermined the “gamer’s” well-being. At best, it was indifferent to his well-being, pushing an agenda that he had little choice but to pursue. Such is the power of game design: a well-designed game fuels behavioral addiction. …

But Bogost makes an important point when he says that not everything should be a game. Take the case of a young child who prefers not to eat. One option is to turn eating into a game—to fly the food into his mouth like an airplane. That makes sense right now, maybe, but in the long run the child sees eating as a game. It takes on the properties of games: it must be fun and engaging and interesting, or else it isn’t worth doing. Instead of developing the motivation to eat because food is sustaining and nourishing, he learns that eating is a game.

Taking this critique further, Alter notes that “[c]ute gamified interventions like the piano stairs are charming, but they’re unlikely to change how people approach exercise tomorrow, next week, or next year.” [Also read this story about Bogost and his game Cow Clicker.]

There are plenty of other interesting snippets in the book. Here’s one on uncertainty of reward:

Each one [pigeon] waddled up to a small button and pecked persistently, hoping that it would release a tray of Purina pigeon pellets. … During some trials, Zeiler would program the button so it delivered food every time the pigeons pecked; during others, he programmed the button so it delivered food only some of the time. Sometimes the pigeons would peck in vain, the button would turn red, and they’d receive nothing but frustration.

When I first learned about Zeiler’s work, I expected the consistent schedule to work best. If the button doesn’t predict the arrival of food perfectly, the pigeon’s motivation to peck should decline, just as a factory worker’s motivation would decline if you only paid him for some of the gadgets he assembled. But that’s not what happened at all. Like tiny feathered gamblers, the pigeons pecked at the button more feverishly when it released food 50–70 percent of the time. (When Zeiler set the button to produce food only once in every ten pecks, the disheartened pigeons stopped responding altogether.) The results weren’t even close: they pecked almost twice as often when the reward wasn’t guaranteed. Their brains, it turned out, were releasing far more dopamine when the reward was unexpected than when it was predictable.

I have often wondered to what extent surfing is attractive due to the uncertain arrival of waves during a session, or the inconsistency in swell from day-to-day.

———

Now for a closing gripe. Alter tells the following story:

When young adults begin driving, they’re asked to decide whether to become organ donors. Psychologists Eric Johnson and Dan Goldstein noticed that organ donations rates in Europe varied dramatically from country to country. Even countries with overlapping cultures differed. In Denmark the donation rate was 4 percent; in Sweden it was 86 percent. In Germany the rate was 12 percent; in Austria it was nearly 100 percent. In the Netherlands, 28 percent were donors, while in Belgium the rate was 98 percent. Not even a huge educational campaign in the Netherlands managed to raise the donation rate. So if culture and education weren’t responsible, why were some countries more willing to donate than others?

The answer had everything to do with a simple tweak in wording. Some countries asked drivers to opt in by checking a box:

If you are willing to donate your organs, please check this box: □

Checking a box doesn’t seem like a major hurdle, but even small hurdles loom large when people are trying to decide how their organs should be used when they die. That’s not the sort of question we know how to answer without help, so many of us take the path of least resistance by not checking the box, and moving on with our lives. That’s exactly how countries like Denmark, Germany, and the Netherlands asked the question—and they all had very low donation rates.

Countries like Sweden, Austria, and Belgium have for many years asked young drivers to opt out of donating their organs by checking a box:

If you are NOT willing to donate your organs, please check this box: □

The only difference here is that people are donors by default. They have to actively check a box to remove themselves from the donor list. It’s still a big decision, and people still routinely prefer not to check the box. But this explains why some countries enjoy donation rates of 99 percent, while others lag far behind with donation rates of just 4 percent.

This story is rubbish, as I have posted about here, here, here and here. This difference has nothing to do with ticking boxes on driver’s licence forms. In Austria they are never even asked. 99 per cent of Austrians aren’t organ donors in the way anyone would normally define it. Rather, 99 per cent are presumed to consent, and if they happen to die their organs might not be taken because the family objects (or whatever other obstacle gets in the way) in the absence of any understanding of the actual intentions of the deceased.

To top it off, Alter embellishes the incorrect version of the story as told by Daniel Kahneman or Dan Ariely with phrasing from driver’s licence forms that simply don’t exist. Did he even read the Johnson and Goldstein paper (ungated copy)?

After reading a well-written and entertaining book about a subject I don’t know much about, I’m left questioning whether this is a single slip or Alter’s general approach to his writing and research. How many other factoids from the book simply won’t hold up once I go to the original source?

Rats in a casino

From Adam Alter’s Irresistible: Why We Can’t Stop Checking, Scrolling, Clicking and Watching:

Juice refers to the layer of surface feedback that sits above the game’s rules. It isn’t essential to the game, but it’s essential to the game’s success. Without juice, the same game loses its charm. Think of candies replaced by gray bricks and none of the reinforcing sights and sounds that make the game fun. …

Juice is effective in part because it triggers very primitive parts of the brain. To show this, Michael Barrus and Catharine Winstanley, psychologists at the University of British Columbia, created a “rat casino.” The rats in the experiment gambled for delicious sugar pellets by pushing their noses through one of four small holes. Some of the holes were low-risk options with small rewards. One, for example, produced one sugar pellet 90 percent of the time, but punished the rat 10 percent of the time by forcing him to wait five seconds before the casino would respond to his next nose poke. (Rats are impatient, so even small waits register as punishments.) Other holes were high-risk options with larger rewards. The riskiest hole produced four pellets, but only 40 percent of the time—on 60 percent of trials, the rat was forced to wait in time-out for forty seconds, a relative eternity.

Most of the time, rats tend to be risk-averse, preferring the low-risk options with small payouts. But that approach changed completely for rats who played in a casino with rewarding tones and flashing lights. Those rats were far more risk-seeking, spurred on by the double-promise of sugar pellets and reinforcing signals. Like human gamblers, they were sucked in by juice. “I was surprised, not that it worked, but how well it worked,” Barrus said. “We expected that adding these stimulating cues would have an effect. But we didn’t realize that it would shift decision making so much.”

I’ll post some other thoughts on the book later this week.

Ip’s Foolproof: Why Safety Can Be Dangerous and How Danger Makes Us Safe

Greg Ip’s framework in Foolproof: Why Safety Can Be Dangerous and How Danger Makes Us Safe is the contrast between what he calls the ecologists and engineers. Engineers seek to use the sum of our human knowledge to make us safer and the world more stable. Ecologists recognise that the world is complex and that people adapt, meaning that many of our solutions will have unintended consequences that can be worse than the problems we are trying to solve.

Much of Ip’s book is a catalogue of the failures of engineering. Build more and larger levees, and people will move into those flood-protected areas. When the levees eventually fail, the damage is larger than it would otherwise have been. There is a self-reinforcing link between flood protection and development, ensuring the disasters grow in scale.

Similarly, if you put out every forest fire as soon as it pops up, eventually a large fire will get out of control and take advantage of the build-up of fuel that occurred due to the suppression of the earlier fires.

Despite these engineering failures, there is often pressure on regulators, or on those with responsibility for keeping us safe, to act as engineers. In Yellowstone National Park, the “ecologists” had taken the perspective that fires did not have to be suppressed immediately, as in combination with prescribed burning they could reduce the build-up of fuel. But the economic interests around Yellowstone, largely associated with tourism, fought this use of fire. After all, prescribed burning and letting fires burn for a while is neither costless nor risk free. But the build-up of fuel that comes from refusing to bear those short-term costs or risks – as much of the pressure on the park pushed it to do – creates the long-term risk of a massive fire.

Despite the problems with engineers, Ip suggests we need to take the best of both the engineering and ecologist approaches in addressing safety. Engineers have made car crashes more survivable. Improved flood protection allows us to develop areas that were previously out of reach. What we need to do, however, is not expect too much of the engineers. You cannot eliminate risks and accidents. Some steps to do so will simply shift, change or exacerbate the risk.

One element of Ip’s case for retaining parts of the engineering approach is confidence. People need a degree of confidence or they won’t take any risks. There are many risks we want people to take, such as starting a business or trusting their money with a bank. The evaporation of confidence can be the problem itself, so if you prevent the loss of confidence, you don’t actually need to deploy the safety device. Deposit insurance is the classic example.

Ip ultimately breaks down the balance of engineering and ecology to a desire to maximise the units of innovation per unit of instability. An acceptance of instability is required for people to innovate. This could be through granting people the freedom to take risks, or by creating an impression of safety (and a degree of moral hazard – the taking of risks when the costs are not borne by the risk taker) to retain confidence.

Despite being an attempt to balance the two approaches, the innovation versus instability formula sounds much like what an engineer might suggest. I agree with Ip that the simple ecologist solution of removing the impression of safety to expunge moral hazard is not without costs. But it is not clear to me that you would ever get this balance right through design. Part of the appeal of the ecologist approach is the acceptance of the complexity of these systems and an acknowledgement of the limits of our knowledge about them.

Another way that Ip frames his balanced landing point is that we should accept small risks and their benefits, and save the engineering for the big problems. Ip hints at, but does not directly get to, Taleb’s concept of antifragility in this idea. Antifragility would see us develop a system where those small shocks strengthen the system, rather than simply being a cost we incur to avoid moral hazard.

The price of risk

Some of Ip’s argument is captured by what is known as the Peltzman effect, named after University of Chicago economist Sam Peltzman. Peltzman published a paper in 1975 examining the effect of safety improvements in cars over the previous 10 years. Peltzman found a reduction in deaths per mile travelled for vehicle occupants, but also an increase in pedestrian injuries and property damage.

Peltzman’s point was that risky driving has a price. If safety improvements reduce that price, people will take more risk. The costs of that additional risk can offset the safety gains.
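
To make the mechanism concrete, here is a toy calculation – my own made-up numbers, not Peltzman’s or Ip’s – showing how behavioural offsetting can erode part of an engineering gain and shift some of the cost onto pedestrians.

```python
# Toy Peltzman-effect arithmetic: a safety feature halves occupant harm per
# crash, but because risky driving is now cheaper, drivers crash more often.
# All numbers are invented for illustration.

def expected_harm(crash_rate, occupant_harm, pedestrian_harm):
    """Expected harm per unit of driving, split by who bears it."""
    return crash_rate * occupant_harm, crash_rate * pedestrian_harm

# Before the safety feature.
occ_before, ped_before = expected_harm(crash_rate=1.0, occupant_harm=10, pedestrian_harm=2)

# After: occupant harm per crash is halved, but drivers respond by driving
# faster, so the crash rate rises by 30 per cent.
occ_after, ped_after = expected_harm(crash_rate=1.3, occupant_harm=5, pedestrian_harm=2)

print(f"Occupant harm:   {occ_before} -> {occ_after}")   # 10.0 -> 6.5: better, but not halved
print(f"Pedestrian harm: {ped_before} -> {ped_after}")   # 2.0 -> 2.6: worse
```

The pattern matches the broad shape of Peltzman’s finding: occupants end up safer than before, but less safe than the engineering alone would suggest, and some of the risk is transferred to people outside the car.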

While this is in some ways an application of basic economics – make something cheaper and people will consume more – the empirical evidence on the Peltzman effect is interesting.

On one level, it is obvious that the Peltzman effect does not make all safety improvements a waste of effort. The large declines in driver deaths relative to the distance travelled over the last 50 years – declines that have not been fully offset by pedestrian deaths or other damage – establish this case.

But when you look at individual safety improvements, there are some interesting outcomes. In the case of seat belts, empirical evidence suggests the absence of the Peltzman effect. For example, one study looked at the effects across states as each introduced seatbelt laws and found a decrease in deaths but no increase in pedestrian fatalities.

In contrast, anti-lock brakes were predicted to materially reduce crashes, but the evidence suggests effectively no net change. Drivers with anti-lock brakes drive faster and brake harder. While reducing some risks – fewer front-end collisions – they increase others, such as the rear-end collisions induced by harder braking.

So why the difference between seatbelts and anti-lock brakes? Ip argues that the difference depends on what the safety improvement allows us to do and how it feeds back into our behaviour. Anti-lock brakes give a driver a feeling of control and a belief they can drive faster. This belief is correct, but occasionally it backfires and they have an accident they would not have had otherwise. With seatbelts, most people want to avoid a crash, and a car crash remains unpleasant even when wearing a seatbelt. Much of the time the seatbelt is not even in people’s minds.

Irrational risk taking?

One of the interesting threads through the book (albeit one that I wish Ip had explored in more detail) is the mix of rational and irrational decision making in our approach to risk.

Much of this “irrationality” concerns our myopia. We rebuild on sites where hurricanes and storms have swept away or destroyed the previous structures. The lack of personal experience with the disaster leads people to underweight the probability. We also have short memories, with houses built immediately after a hurricane being more likely to survive the next hurricane than those built a few years later.

A contrasting effect is our fear response to vivid events, which leads us to overweight them in our decision making despite the larger costs of the alternative.

But despite the ease in spotting these anomalies, for many of Ip’s real-world examples of individual actions that might be myopic or irrational it wouldn’t be hard to craft an argument that the individual might be making a good decision. If the previous building on the site was destroyed by a hurricane, can you still get flood insurance (possibly subsidised), making it a good investment all the same? As Ip points out, there are also many benefits to living in disaster prone areas, which are often sites of great economic opportunity (such as proximity to water).

In a similar vein, Ip points to the individual irrationality of “overconfident” entrepreneurs, whose businesses will more often than not end up failing. But as catalogued by Phil Rosenzweig, the idea that these “failed” businesses generally involve large losses is wrong. Overconfident is a poor word to describe these entrepreneurs’ actions (see also here on overconfidence).

I have a few other quibbles with the book. One was when Ip’s discussion of our response to uncertainty conflated risk aversion with loss aversion, the certainty effect and the endowment effect. But as I say, they are just quibbles. Ip’s book is well worth the read.

Does presuming you can take a person’s organs save lives?

I’ve pointed out several times on this blog the confused story about organ donation arising from Johnson and Goldstein’s Do Defaults Save Lives? (ungated pdf). Even greats such as Daniel Kahneman are not immune from misinterpreting what is going on.

Again, here’s Dan Ariely explaining the paper:

One of my favorite graphs in all of social science is the following plot from an inspiring paper by Eric Johnson and Daniel Goldstein. This graph shows the percentage of people, across different European countries, who are willing to donate their organs after they pass away. …

But you will notice that pairs of similar countries have very different levels of organ donations. For example, take the following pairs of countries: Denmark and Sweden; the Netherlands and Belgium; Austria and Germany (and depending on your individual perspective France and the UK). These are countries that we usually think of as rather similar in terms of culture, religion, etc., yet their levels of organ donations are very different.

So, what could explain these differences? It turns out that it is the design of the form at the DMV. In countries where the form is set as “opt-in” (check this box if you want to participate in the organ donation program) people do not check the box and as a consequence they do not become a part of the program. In countries where the form is set as “opt-out” (check this box if you don’t want to participate in the organ donation program) people also do not check the box and are automatically enrolled in the program. In both cases large proportions of people simply adopt the default option.

[Figure: Johnson and Goldstein (2003), organ donation rates in Europe]

I keep hearing this story in new places, so it’s clearly got some life to it (and I’ll keep harping on about it). The problem is that there is no DMV form. These aren’t people “willing” to donate their organs. And a turn to the second page of Johnson and Goldstein’s paper makes it clear that the translation from “presumed consent” to donation appears mildly positive but is far from direct. 99.98% of Austrians (or deceased Austrians with organs suitable for donation) are not organ donors.

Although Johnson and Goldstein should not be blamed for the incorrect stories arising from their paper, I suspect their choice of title – particularly the word “default” – has played some part in allowing the incorrect stories to linger. What of an alternative title: “Does presuming you can take a person’s organs save lives?”

One person who is clear on the story is Richard Thaler. In his surprisingly good book Misbehaving (I went in with low expectations after reading some reviews), Thaler gives his angle on this story:

In other cases, the research caused us to change our views on some subject. A good example of this is organ donations. When we made our list of topics, this was one of the first on the list because we knew of a paper that Eric Johnson had written with Daniel Goldstein on the powerful effect of default options in this domain. Most countries adopt some version of an opt-in policy, whereby donors have to take some positive step such as filling in a form in order to have their name added to the donor registry list. However, some countries in Europe, such as Spain, have adopted an opt-out strategy that is called “presumed consent.” You are presumed to give your permission to have your organs harvested unless you explicitly take the option to opt out and put your name on a list of “non-donors.”

The findings of Johnson and Goldstein’s paper showed how powerful default options can be. In countries where the default is to be a donor, almost no one opts out, but in countries with an opt-in policy, often less than half of the population opts in! Here, we thought, was a simple policy prescription: switch to presumed consent. But then we dug deeper. It turns out that most countries with presumed consent do not implement the policy strictly. Instead, medical staff members continue to ask family members whether they have any objection to having the deceased relative’s organs donated. This question often comes at a time of severe emotional stress, since many organ donors die suddenly in some kind of accident. What is worse is that family members in countries with this regime may have no idea what the donor’s wishes were, since most people simply do nothing. That someone failed to fill out a form opting out of being a donor is not a strong indication of his actual beliefs.

We came to the conclusion that presumed consent was not, in fact, the best policy. Instead we liked a variant that had recently been adopted by the state of Illinois and is also used in other U.S. states. When people renew their driver’s license, they are asked whether they wish to be an organ donor. Simply asking people and immediately recording their choices makes it easy to sign up. In Alaska and Montana, this approach has achieved donation rates exceeding 80%. In the organ donation literature this policy was dubbed “mandated choice” and we adopted that term in the book.

O’Neil’s Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy

In her interesting Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy, Cathy O’Neil defines Weapons of Math Destruction based on three criteria – opacity, unfairness and scale.

Opacity makes it hard to assess the fairness of mathematical models (I’ll use the term algorithms through most of this post), and it facilitates (or might even be a key component of) an algorithm’s effectiveness if it relies on naive subjects. “These bonds have been rated by maths geniuses – buy them.” Unfairness relates to whether the algorithm operates in the interest of the modelled subject. Scale is not just that algorithms can affect large numbers of people. Scale can also lead to the establishment of norms that do not allow anyone to escape the operation of the algorithm.

These three factors are common across most of the problematic algorithms O’Neil discusses, and she makes a strong and persuasive case that many algorithms could be developed or used better. But the way she combines many of her points, together with her politics, often makes it unclear what exactly the problem is or what potential solutions could (should) be.

A distinction that might have made this clearer (or at least that I found useful) is between algorithms that don’t do what the developer intends, algorithms working as intended but that have poor consequences for those on the wrong side of their application, and algorithms that have unintended consequences once released into the wild. The first is botched math, the second is math done well to the detriment of others, while the third is good or bad math with naive application.

For this post I am going to break O’Neil’s case into these three categories.

Math done poorly

When it comes to the botched math, O’Neil is at her best. Her tale of teacher scoring algorithms in Washington DC is a case where the model is not helping anyone. Teachers were scored based on the deviations of student test scores from those predicted by models of the students. The bottom 2% to 5% of teachers were fired. But the combination of modelled target student scores and small class sizes made the scoring of teachers little better than random. There was almost no correlation in a teacher’s scores from one year to the next.
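
The statistical problem is easy to reproduce. Here is a minimal simulation of my own (not the district’s actual model, and with invented noise levels): when a stable teacher effect is small relative to classroom-level noise and classes are small, the estimated scores barely correlate from one year to the next.

```python
import numpy as np

rng = np.random.default_rng(0)

n_teachers, class_size = 1000, 25
true_effect = rng.normal(0, 1, n_teachers)  # stable teacher quality (sd = 1)

def observed_score(true_effect):
    # A teacher's measured score is their true effect plus the average of
    # per-student noise (student mix, test noise, model error; sd = 10).
    noise = rng.normal(0, 10, (n_teachers, class_size)).mean(axis=1)
    return true_effect + noise

year1 = observed_score(true_effect)
year2 = observed_score(true_effect)

# With these made-up numbers the year-to-year correlation comes out around
# 0.2, even though the underlying teacher quality has not changed at all.
print(np.corrcoef(year1, year2)[0, 1])
```

With these numbers the expected correlation is var(teacher) / (var(teacher) + var(noise)) = 1 / (1 + 100/25) = 0.2, so most of the movement in a teacher’s score is noise. Whether the DC model was this noisy is an empirical question, but the near-zero year-to-year correlation O’Neil reports suggests something in this territory.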

Her critique of the way many models are developed is also important. Are we checking the model is working, rather than just assuming that the teachers we fired are the right ones? She contrasts the effort typically spent testing a recidivism model (for use in determining prison sentences) to the way Amazon learns about its customers. Amazon doesn’t simply develop a single “recidivism score” equivalent and take that as determinative. Instead they continue to test and learn as much as they can about their interactions with customers to make the best models they can.

The solutions to the botched math are simpler (at least in theory) than many of the other problems she highlights. The teacher scoring models simply require someone with competence to consider what it is they want to care about and measure, and then work out whether it can be measured in a statistically meaningful way. If it can’t, so be it. The willingness to concede that a meaningful model can’t be developed is important, particularly when the model is designed to inform high-stakes decisions. Similarly, recidivism scoring algorithms should be subject to constant scrutiny.

But this raises the question of how you assess an algorithm. What is the appropriate benchmark? Perfection? Or the system it is replacing? At times O’Neil places a heavy focus on the errors of the algorithm, with little focus on the errors of the alternative – the humans it replaced. Many of O’Neil’s stories involve false positives, leading to a focus on the obvious algorithm errors, with the algorithm’s greater accuracy and the human errors unseen. A better approach might be to simply compare alternative approaches and see which is better, rather than having the human as the default. Once the superior alternative is selected, we also need to remain cognisant that the best option still might not be very good.

As O’Neil argues, some of the poor models would also be less harmful if they were transparent. People could pull the models apart and see whether they were working as intended. A still cleaner version might be to just release the data and let people navigate it themselves (e.g. education data), although this is not without problems. Whatever is the most salient way of sorting and ranking will become the new de facto model. If we don’t do it ourselves, someone will take that data and give us the ranking we crave.

Math done well (for the user anyhow)

When it comes to math done well, O’Neil’s three limbs of the WMD definition – opacity, unfairness and scale – are a good description of the problems she sees. O’Neil’s critique is usually not so much about the maths as about the unfair use of the models for purposes such as targeting the poor (think predatory advertising by private colleges or payday lenders) or treating workers as cogs in the machine through the use of scheduling software.

In these cases, it is common that the person being modelled does not even know about the model (opacity). And if they could see the model, it may be hard to understand what characteristics are driving the outcome (although this is not so different to the opacity of human decision-making). The outcome then determines how we are treated, the ads we see, the prices we see, and so on.

One of O’Neil’s major concerns about fairness is that the models discriminate. She suggests they discriminate against the poor, African-Americans and those with mental illness. This is generally not through a direct intention to discriminate against these groups, although O’Neil gives the example of a medical school algorithm rejecting applicants based on birthplace due to biased training data. Rather, the models use proxies for the variables of interest, and those proxies also happen to correlate with certain group features.

This points to the tension in the use of many of these algorithms. Their very purpose is to discriminate. They are developed to identify the features that, say, employers or lenders want. Given there is almost always a correlation between those features and some groups, you will inevitably “discriminate” against them.

So what is appropriate discrimination? O’Neil objects to tarring someone with group features. If you live in a certain postcode, is it fair to be categorised with everyone else in that postcode? Possibly not. But if you have an IQ that is judged likely to result in poor job performance or creditworthiness based on the past performance of other people with that IQ, is that acceptable? What of having a degree?

The use of features such as postcodes, IQ or degrees comes from the need to identify proxies for the traits people want to identify, such as whether someone will pay back the loan or deliver good work performance. Each proxy varies in the strength of prediction, so the obvious solution seems to be to get more data and better proxies. Which of these is going to give us the best prediction of what we actually care about?

But O’Neil often balks at this step. She tells the story of a chap who can’t get a minimum wage job due to his results on a five-factor model personality test, despite his “near perfect SAT”. The scale of the use of this test means he runs into this barrier with most employers. When O’Neil points out that personality is only one-third as predictive as cognitive tests, she doesn’t make the argument that employers should be allowed to use cognitive tests. She even suggests that employers are rightfully barred from using IQ tests in recruitment (as per a 1971 Supreme Court case). But absent the cognitive tests, do employers simply turn to the next best thing?

Similarly, when O’Neil complains about the use of “e-scores” (proxies for credit scores) in domains where entities are not legally allowed to use credit scores to discriminate, she calls them a “sloppy substitute”. But again she does not complain about the ban on using the more direct measures.

There are also two sides to the use of these proxies. While the use of the proxies may result in some people being denied a job or a loan, it may allow someone else to get that job or loan, or to pay a better price, when a cruder measure might have seen that person being rejected.

O’Neil gives the example of ZestFinance, a payday lender that typically charges 60% less than the industry standard. ZestFinance does this by finding every proxy for creditworthiness it can, picking out proxies such as correct use of capitalisation on the application form, and whether the applicant read the terms and conditions. O’Neil complains about those who are accepted for a loan but have to pay higher fees because of, say, poor spelling. This is something the poor and uneducated are more likely to incur. But her focus is on one type of outcome, those with more expensive loans (although probably still cheaper than from other payday lenders), leaving the people receiving the cheapest loans unseen. Should we deny this class of people access to the cheaper finance these algorithms allow?

One interesting case in the book concerns the pricing of car insurance. An insurer wants to know who is the better driver, so they develop algorithms to price the risk appropriately. Credit scores are predictive of driving performance, so those with worse credit scores end up paying more for this.

But insurers also want to price discriminate to the extent that they can. That is, they want to charge each individual the highest price they will tolerate. Price discrimination can be positive for the poor. It allows many airlines to offer cheap seats in the back of the plane when the business crowd insists on paying extra for a few inches of leg room. I benefited from the academic pricing of software for years, and we regularly see discounted pricing for students and seniors. But price discrimination can also allow the uninformed, the lazy and those without options to be stripped of a few extra dollars. In the case of the insurer pricing algorithms, they are designed to price discriminate in addition to pricing the policy based on risk.

It turns out that credit score is not just predictive of driving performance, but also of buyer response to price changes. The resultant insurance pricing is an interaction of these two dimensions. O’Neil gives an example from Florida, where adults with clean driving records but poor credit scores paid $1,552 more (on average) than drivers with excellent credit but a drunk driving conviction, although it is unclear how much of this reflects risk and how much price discrimination.

Naive math

One of O’Neil’s examples of what I will call naive math is the set of algorithms that create a self-reinforcing feedback loop. The model does what it is supposed to do – say, predict an event – but once used in a system, the model’s classification of a certain cohort becomes self-fulfilling or self-reinforcing.

For example, if longer prison sentences make someone more likely to offend on their release, any indicator that results in longer sentences will in effect become more strongly correlated with re-offending. Even if the model is updated to disentangle this problem, allowing the effect of the longer sentences to be isolated, the person who received a longer sentence is doomed the next time they are scored.

In a sense, the model does exactly what it should, predicting who will re-offend or not, and there is ample evidence that such models do better than humans. But the application of the model does more than simply predict recidivism. It might ultimately affirm itself.

Another example of a feedback loop is a person flagged as a poor credit risk. As they can’t get access to cheap credit, they then go to an expensive payday lender and ultimately run into trouble. That trouble is flagged in the credit scoring system, making it even harder for them to access financial services. If the algorithm made an error in the first instance – the person was actually a good credit risk – that person might then become a poor risk because the model effectively pushed them into more expensive products.
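
A stylised sketch of that credit loop – my own toy model with invented numbers, not anything from the book – shows how a single misclassification can compound:

```python
# Toy credit-scoring feedback loop: a false "poor risk" flag pushes a borrower
# into expensive credit, which genuinely raises their risk of trouble, which
# then keeps the flag in place. All numbers are invented.

def simulate(flagged_poor_risk, years=5):
    default_prob = 0.05                 # true underlying risk of a good borrower
    for _ in range(years):
        if flagged_poor_risk:
            # Only expensive payday-style credit is available, which raises
            # the real chance of getting into financial trouble.
            default_prob = min(1.0, default_prob + 0.10)
        if default_prob > 0.15:
            # Trouble feeds back into the score, so the flag persists.
            flagged_poor_risk = True
    return default_prob

print(simulate(flagged_poor_risk=False))  # 0.05 – never enters the loop
print(simulate(flagged_poor_risk=True))   # roughly 0.55 – the initial error becomes "true"
```

The model that made the original error ends up looking vindicated by data it helped create, which is exactly why the vigilant auditing and richer data mentioned below matter.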

The solutions to these feedback loops are difficult. On the one hand, vigilant investigation and updating the models will help ameliorate the problems. O’Neil persuasively argues that we don’t do this enough. Entities such as ZestFinance that use a richer set of data can also break the cycle for some people.

But it is hard to solve the case for individual mis-classification. Any model will have false positives and false negatives. The model development process can only try to limit them, often with a trade-off between the two.
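
To see the trade-off, here is a minimal sketch with invented score distributions (not any real recidivism model): because the two groups overlap, every choice of threshold buys fewer false negatives only at the price of more false positives, or vice versa.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical risk scores: re-offenders score higher on average, but the
# distributions overlap, so no threshold separates the groups cleanly.
reoffenders = rng.normal(0.7, 0.15, 10_000)
non_reoffenders = rng.normal(0.4, 0.15, 10_000)

for threshold in (0.5, 0.6, 0.7):
    false_negatives = (reoffenders < threshold).mean()       # risky people cleared
    false_positives = (non_reoffenders >= threshold).mean()  # safe people flagged
    print(f"threshold {threshold}: FN {false_negatives:.0%}, FP {false_positives:.0%}")
```

Raising the threshold clears more people who will re-offend; lowering it flags more people who would not have. Where to sit on that curve is a value judgment about which error is worse, not something the maths can settle on its own.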

In assessing this problem we also need to focus on the alternative. Before these algorithms were developed, people would be denied credit, parole and jobs because of all sorts of whimsical decisions by human decision makers. Those decisions would then result in feedback loops as their failures were reflected in future outcomes. The algorithms might be imperfect, but they can be an improvement.

This is where O’Neil’s scale point becomes interesting. In a world of diverse credit scoring mechanisms, a good credit risk who is falsely identified as a poor risk under one measure might be accurately classified under another. The false positive is not universal, allowing them to shop around for the right deal. But if every credit provider uses the same scoring system, someone could be universally barred. The pre-algorithm world, for all its flaws, possibly provided more opportunities for someone to find the place where they are not incorrectly classified.

A final point on naive models (although O’Neil has more) is that models reflect goals and ideology. Sometimes this is uncontroversial – we want to keep dangerous criminals off the street. Sometimes this is more complicated – what risk of false positives are we willing to tolerate in keeping those criminals off the street? In many ways the influence of O’Neil’s politics on her critique provide the case in support of this point.

Solutions

Before reading the book, I listened to O’Neil on an EconTalk episode with Russ Roberts. There she makes the point that where we run into flawed algorithms, we shouldn’t always be going back to the old way of doing things (she made that comment in the context of judges). We should be making the algorithms better.

That angle was generally absent from the book. O’Neil takes the occasional moment to acknowledge that many algorithms are not disrupting perfect decision-making systems, but are replacing biased judges, bank managers who favoured their friends, and unstructured job interviews with no predictive power. But through the book she seems quite willing to rip those gains down in the name of fairness.

More explicitly, O’Neil asks whether we should sacrifice efficiency for fairness. For instance, should we leave some data out? In many cases we already do this, by not including factors such as race. But should this extend to factors such as who someone knows, their job or their credit score?

O’Neil’s choice of factors in this instance is telling. She asks whether someone’s connections, job or credit score should be used in a recidivism model, and suggests not, as they would be inadmissible in court. But this is a misunderstanding of the court process. Those factors are inadmissible in determining guilt or innocence, but form a central part of sentencing decisions. Look at the use of referees or stories about someone’s tough upbringing. So is O’Neil’s complaint about the algorithm, or about the way we dispense criminal justice in general? This reflects a feeling I had many times in the book that O’Neil’s concerns are much deeper than the effect of algorithms and extend to the nature of the systems themselves.

Possibly the point on which I disagree with O’Neil most is her suggestion that human decision-making has the benefit of being able to evolve and adapt, whereas a biased algorithm does not adapt until someone fixes it. The simple question I ask is: where is the evidence of human adaptation? You only need to look at the many programs to eliminate workplace bias that show no evidence of effectiveness to get a taste of how hard it is to deliberately change people. We remain prone to seeing spurious correlations and to making inconsistent, unreliable decisions. For many human decisions there is simply no feedback loop telling us whether we made the right call. How will a human lender ever know they rejected a good credit risk?

While automated systems are stuck until someone fixes them, someone can fix them. And that is often what happens. Recently several people forwarded me an article on the inability of some facial recognition systems to recognise non-Caucasian faces. But beyond the point that humans also have this problem (yes, “they all look alike”), the problem with facial recognition algorithms has been identified and, even though it is a tough one, there are major efforts to fix it. (Considering some of the major customers of this technology are police and security services, there is an obvious interest in solving it.) In the meantime, those of us raised in a largely homogeneous population are stuck with our cross-racial face blindness.

Is it irrational?

Over at Behavioral Scientist magazine my second article, Rationalizing the ‘Irrational’, is up.

In the article I suggest that an evolutionary biology lens can give us some insight into what drives people’s actions. By understanding someone’s actual objectives, we are better able to determine whether their actions are likely to achieve their goals. Are they behaving “rationally”?

Although the major thread of the article is evolutionary, in some ways that is not the main point. For me the central argument is simply that when we observe someone else’s actions, we need to exercise a degree of humility in assessing whether they are “rational”. We possibly don’t even know what they are trying to achieve, let alone whether their actions are the best way to achieve it.

Obviously, this new article pursues a somewhat different theme to my first in Behavioral Scientist, which explored the balance between human and algorithmic decision making. After discussing possible topics for my first article with the editor who has been looking after me to date (DJ Neri), I sent sketches of two potential articles. We decided to progress both.

My plan for my next article is to return to the themes from the first. I’ve recently been thinking and reading about algorithm aversion, and why we resist using superior decision tools when they are available. Even if the best solution is to simply use the algorithm or statistical output, the reality is that people will typically be involved. How can we develop systems where they don’t mess things up?

Kasparov’s Deep Thinking: Where Machine Intelligence Ends and Human Creativity Begins

In preparation for my recent column in The Behavioral Scientist, which opened with the story of world chess champion Garry Kasparov’s defeat by the computer Deep Blue, I read Kasparov’s recently released Deep Thinking: Where Machine Intelligence Ends and Human Creativity Begins.

Despite the title and Kasparov’s interesting observations on the computer-human relationship, Deep Thinking is more a history of man versus machine in chess than a deep analysis of human or machine intelligence. Kasparov takes us from the earliest chess program, produced on paper by Alan Turing in 1952, through to a detailed account of his own 1997 match against the computer Deep Blue, and then beyond.

Kasparov’s history gives an interesting sense not just of the progression toward a machine defeating the world champion, but also of when computers overtook the rest of us. In 1977, Kasparov has the machines ahead of all but the top 5% of human players. From the perspective of the average human, the battle against the machine was over decades before a machine was better than the best human. And even then, the contest at the top level was brief. As Kasparov puts it, we have:

Thousands of years of status quo human dominance, a few decades of weak competition, a few years of struggle for supremacy. Then, game over. For the rest of human history, as the timeline draws into infinity, machines will be better than humans at chess. The competition period is a tiny dot on the historical timeline.

As Kasparov also discusses, his defeat did not completely end the competition between humans and computers in chess. He describes a 2005 competition in what was called “freestyle chess”, in which entrants were free to combine humans and machines as they saw fit. To his surprise, the winner was not a grandmaster teamed with a computer, but a pair of American amateurs running three computers at the same time. As Kasparov puts it, a weak human + machine + better process is superior to a strong human + machine + inferior process. There is still hope for the humans.

That hope, however, and the human-computer partnership itself, are also short-lived. Kasparov notes that the algorithms will continue to improve and the hardware will get faster until the human half of the partnership adds nothing to the mix. Kasparov’s position does not seem that different from my own.

One thing that is clear throughout Kasparov’s tale is that he does not consider chess the best forum for exploring machine intelligence. This is due both to the nature of chess itself and to the way in which those trying to build a machine to defeat a world champion (particularly IBM) went about the task.

On the nature of chess: it is simply not complex enough. Its constraints – an eight by eight board with sixteen pieces a side – meant it was amenable to algorithms built from a combination of fixed human knowledge and brute force computational power. From the 1970s onward, developers of chess computers realised this was the case, so much of the focus went into increasing computational power and refining the algorithms for efficiency until they inevitably reached world champion standard.

The nature of these algorithms is best understood in the context of two search techniques described by Claude Shannon in 1949. Type A search works through every possible combination of moves, going deeper with each pass – one move deep, then two moves deep, and so on. Type B search is more human-like, focusing on the few most promising moves and examining those in great depth. The development of Type B processes would provide more insight into machine intelligence.

The software that defeated Kasparov, along with most other chess software, used what Kasparov calls alpha-beta search. Alpha-beta is a Type A approach that stops searching down a line of play as soon as a move under examination proves worse than an option already in hand, so the rest of that line can be discarded. This pruning, combined with ever-increasing computational power, is what left chess vulnerable to a brute force attack. Although enormous amounts of work also went into Deep Blue’s openings and evaluation function, another few years would have seen Kasparov or his successor defeated by something far less highly tuned. His defeat was all but inevitable.
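For readers who want to see the mechanics, here is a bare-bones sketch of alpha-beta pruning in R over a toy game tree of nested lists – my own illustration of the general technique, not anything resembling Deep Blue’s implementation, and the tree and scores are invented.

```r
# Minimal alpha-beta search over a toy game tree. Leaves are numeric
# evaluations from the maximising player's point of view; internal nodes
# are lists of the moves available at that point.
alphabeta <- function(node, alpha = -Inf, beta = Inf, maximising = TRUE) {
  if (!is.list(node)) return(node)            # leaf: return static evaluation
  if (maximising) {
    value <- -Inf
    for (child in node) {
      value <- max(value, alphabeta(child, alpha, beta, FALSE))
      alpha <- max(alpha, value)
      if (alpha >= beta) break                # prune: opponent will never allow this line
    }
  } else {
    value <- Inf
    for (child in node) {
      value <- min(value, alphabeta(child, alpha, beta, TRUE))
      beta <- min(beta, value)
      if (alpha >= beta) break                # prune
    }
  }
  value
}

# Two candidate moves, each with two replies. The reply scored 9 is never
# examined: once the second move allows a reply of 2, it is already worse
# than the first move (worth 3), so the rest of that line is cut off.
tree <- list(list(3, 5), list(2, 9))
alphabeta(tree)                               # returns 3
```

The saving is trivial on a four-leaf tree, but compounded over a deep full-width search it is what makes the Type A approach practical.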

IBM’s approach to the contest also did not add much to the exploration of machine intelligence. As became clear to Kasparov in the lead-up to the Deep Blue rematch (he had defeated Deep Blue in 1996), IBM was not interested in the science behind the enterprise; it simply wanted to win. The victory provided great advertising for IBM, but the machine logs of the contest were not made available and Deep Blue was later trashed. It is an interesting contrast to IBM’s approach with its Jeopardy!-winning Watson, which now seems to be everywhere.

As a result, Kasparov sees AlphaGo as a more interesting AI project than anything behind the top chess machines. The complexity of Go – a 19 by 19 board and 361 stones – requires the use of techniques such as neural networks. AlphaGo had to teach itself to play.

Even though Kasparov’s offerings on human and machine intelligence are relatively thin, the chess history alone makes the book worth reading. Kasparov’s story differs from some of the “myths” that have spread about the contest over the last 20 years, and he is critical of many commentators’ interpretations of events.

One story Kasparov attacks is Nate Silver’s version in The Signal and the Noise (while also taking a few swings at Silver’s understanding of chess). Silver’s story starts at the conclusion of game 1 of the match. Just as Kasparov considered his victory near complete, Deep Blue moved a rook in a highly unusual way – a move that turned out to be a “bug” in Deep Blue’s programming. Not understanding that it was a bug, Kasparov took the move as a sign that the machine could see a mate by Kasparov in 20 or more moves and was seeking to delay that defeat. In Silver’s telling, Kasparov was so impressed by the depth of Deep Blue’s calculation that it affected his play for the rest of the match and was the ultimate cause of his loss.

In Kasparov’s version, he simply discarded Deep Blue’s rook move as the type of inexplicable move computers tend to make when lost. Instead, his state of mind suffered most when he was defeated in game 2. Through game 2 he played an unnatural (to him) style of anti-computer chess, and overlooked a potential chance to draw the game through perpetual check (he was informed of his missed opportunity the next day). He simply wasn’t looking for opportunities that he assumed the computer would have spotted and avoided.