Angela Duckworth’s Grit: The Power of Passion and Perseverance

In Grit: The Power of Passion and Perseverance, Angela Duckworth argues that outstanding achievement comes from a combination of passion – a focused approach to something you deeply care about – and perseverance – a resilience and desire to work hard. Duckworth calls this combination of passion and perseverance “grit”.

For Duckworth, grit is important as focused effort is required to both build skill and turn that skill into achievement. Talent plus effort leads to skill. Skill plus effort leads to achievement. Effort appears twice in the equation. If one expends that effort across too many domains (no focus through lack of passion), the necessary skills will not be developed and those skills won’t be translated into achievement.
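One way to make the “effort counts twice” point concrete is a toy multiplicative sketch (my illustration, not Duckworth’s exact formulation): because effort feeds both the skill-building and the skill-deploying stages, spreading the same effort thinly across domains is costly.

```python
# Toy sketch only: effort appears at both stages, so achievement scales roughly
# with effort squared, and splitting effort across many domains is punished.
def achievement(talent, effort):
    skill = talent * effort      # effort's first appearance: building skill
    return skill * effort        # effort's second appearance: turning skill into achievement

print(achievement(talent=1.0, effort=4.0))        # focused on one domain: 16.0
print(4 * achievement(talent=1.0, effort=1.0))    # same effort split over four domains: 4.0
```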

While sounding almost obvious written this way, Duckworth’s claims go deeper. She argues that in many domains grit is more important than “talent” or intelligence. And she argues that we can increase people’s grit through the way we parent, educate, coach and manage.

Three articles from 2016 (in Slate, The New Yorker and npr) critiquing Grit and the associated research make a lot of the points that I would. But before turning to those articles and my thoughts, I will say that Duckworth appears to be one of the most open recipients of criticism in academia that I have come across. She readily concedes good arguments, and appears caught between her knowledge of the limitations of the research and the need to write or speak in a strong enough manner to sell a book or make a TED talk.

That said, I am sympathetic with the Slate and npr critiques. Grit is not the best predictor of success. To the extent there is a difference between “grit” and the big five trait of conscientiousness, it is minor (making grit largely an old idea rebranded with a funkier name). A meta-analysis (working paper) by Marcus Credé, Michael Tynan and Peter Harms makes this case (and forms the basis of the npr piece).

Also critiqued in the npr article is Duckworth’s example of grittier cadets being more likely to make it through the seven-week West Point training program Beast Barracks, which features in the book’s opening. As she states, “Grit turned out to be an astoundingly reliable predictor of who made it through and who did not.”

The West Point research comes from two papers by Duckworth and colleagues from 2007 (pdf) and 2009 (pdf). The difference in drop out rate is framed as rather large in the 2009 article:

“Cadets who scored a standard deviation higher than average on the Grit-S were 99% more likely to complete summer training”

But to report the results another way, 95% of all cadets made it through. 98% of the top quartile in grit stayed. As Marcus Credé states in the npr article, there is only a three percentage point difference between the average drop out rate and that of the grittiest cadets. Alternatively, you can consider that 88% of the bottom quartile made it through. That appears a decent success rate for these low grit cadets. (The number reported in the paper references the change in odds, which is not the way most people would interpret that sentence. But on Duckworth being a great recipient of criticism, she concedes in the npr article she should have put it another way.)
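A rough back-of-the-envelope check shows why the odds framing overstates the practical difference. The numbers below are mine, anchored to the roughly 95% overall completion rate quoted above:

```python
# "99% more likely" in the paper is an increase in odds, not in the completion rate itself.
base_completion = 0.95                               # roughly the overall completion rate
base_odds = base_completion / (1 - base_completion)  # 0.95 / 0.05 = 19
gritty_odds = base_odds * 1.99                       # "99% greater odds" for +1 SD in grit
gritty_completion = gritty_odds / (1 + gritty_odds)
print(round(gritty_completion, 3))                   # ~0.974: only a few percentage points higher
```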

Having said this, I am sympathetic to the argument that there is something here that West Point could benefit from. If low grit were the underlying cause of cadet drop-outs, reducing the drop out rate of the least gritty half to that of the top half could cut the drop out rate by more than 50%. If they found a way of doing this (which I am more sceptical about), it could be a worthwhile investment.

One thing that I haven’t been able to determine from the two papers with the West Point analysis is the distribution of grit scores for the West Point cadets. Are they gritty relative to the rest of the population? In Duckworth’s other grit studies, the already high achievers (spelling bee contestants, Stanford students, etc.) look a lot like the rest of us. Why does it take no grit to enter into domains which many people would already consider to be success? Is this the same for West Point?

Possibly the biggest question I have about the West Point study is why people drop out. As Duckworth talks about later in the book (repeatedly), there is a need to engage in search to find the thing you are passionate about. Detours are to be expected. When setting top-level goals, don’t be afraid to erase an answer that isn’t working out. Finishing what you begin could be a way to miss opportunities. Be consistent over time, but first find a thing to be consistent with. If your mid-level goals are not aligned with your top level objective, abandon them. And so on. Many of the “grit paragons” that Duckworth interviewed for her book explored many different avenues before settling on the one that consumes them.

So, are the West Point drop-outs leaving because of low grit, or are they shifting to the next phase of their search? If we find them later in their life (at a point of success), will they then score higher on grit as they have found something they are passionate about that they wish to persevere with? How much of the high grit score of the paragons is because they have succeeded in their search? To what extent is grit simply a reflection of current circumstances?

One of the more interesting sections of the book addresses whether there are limits to what we can achieve due to talent. Duckworth’s major point is that we are so far from whatever limits we have that they are irrelevant.

On the one hand, that is clearly right – in almost every domain people could improve through persistent effort (and deliberate practice). But another consideration is where their personal limits lie relative to the degree of skill required to successfully achieve a person’s goals. I am a long way from my limits as a tennis player, but my limits are well short of that required to ever make a living from it.

Following from this, Duckworth is of the view that people should follow their passion and argues against the common advice that following your passion is the path to poverty. I’m with Cal Newport on this one, and think that “follow your passion” is horrible advice. If you don’t have anything of value to offer related to your passion, you likely won’t succeed.

Duckworth’s evidence behind her argument is mixed. She notes that people are more satisfied with jobs when they follow a personal interest, but this is not evidence that people who want to find a job that matches their interest are more satisfied. Where are those who failed? Duckworth also notes that these people perform better, but again, what is the aggregate outcome of all the people who started out with this goal?

One chapter concerns parenting. Duckworth concedes the evidence here is thin, incomplete and that there are no randomised controlled trials. But she then suggests that she doesn’t have time to wait for the data to come in (which I suppose you don’t if you are already raising children).

She cites research on supportive versus demanding parenting, derived from measures such as surveys of students. These demonstrate that students with more demanding parents have higher grades. Similarly, research on world-class performers shows that their parents are models of work ethic. The next chapter reports on the positive relationship between extracurricular activities while at school and job outcomes, particularly where they stick with the same activity for two or more years (i.e. consistent parents).

But Duckworth does not address the typical problem of studies in this domain – they all ignore biology. Do the students receive higher grades because their parents are more demanding, or because they are the genetic descendants of two demanding people? Are they world-class performers because their parents model a work ethic, or because they have inherited a work ethic? Are they consistent with their extracurricular activities because their parents consistently keep them at it, or because they are the type of people likely to be consistent?

These questions might appear speculation in themselves, but the large catalogue of twin, adoption and now genetic studies points to the answers. To the degree children resemble their parents, this is largely genetic. The effect of the shared environment – i.e. parenting – is low (and in many studies zero). That is not to say interventions cannot be developed. But they are not reflected in the variation in parenting that is the subject of these studies.

Duckworth does briefly turn to genetics when making her case for the ability to change someone’s grit. Like a lot of other behavioural traits, the heritability of grit is moderate: 37% for perseverance, 20% for passion (the study referenced is here). Grit is not set in stone, so Duckworth takes this as a case for the effect of environment.

However, a heritability less than one provides little evidence that deliberate changes in environment can change a trait. The same study finding moderate heritability also found no effect of shared environment (e.g. parenting). The evidence of influence is thin.
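The logic here is the standard twin-study variance decomposition. A minimal sketch, using the perseverance figure quoted above and the study’s near-zero shared-environment estimate (labelling the remainder is my gloss):

```python
# ACE-style decomposition: total variance = A (genetic) + C (shared environment) + E (everything else).
A = 0.37             # heritability of perseverance, as reported above
C = 0.00             # shared environment (e.g. parenting), estimated near zero in the cited study
E = 1.0 - A - C      # non-shared environment plus measurement error
print(A, C, round(E, 2))   # 0.37 0.0 0.63 -- "not genetic" is not the same as "due to parenting"
```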

Finally, Duckworth cites the Flynn effect as evidence of the malleability of IQ – and how similar effects could play out with grit – but she does not reference the extended trail of failed interventions designed to increase IQ (although a recent meta-analysis shows some effect of education). I can understand Duckworth’s aims, but feel that the literature in support of them is somewhat thin.

Other random points or thoughts:

  • As for any book that contains colourful stories of success linked to the recipe it is selling, the stories of the grit paragons smack of survivorship bias. Maybe the coach of the Seattle Seahawks pushes toward a gritty culture, but I’m not sure the other NFL teams go and get ice-cream every time training gets tough. Jamie Dimon, CEO of JP Morgan, is praised for the $5 billion profit JP Morgan gained through the GFC (let’s skate over the $13 billion in fines). How would another CEO have gone?
  • Do those with higher grit display a higher level of sunk cost fallacy, being unwilling to let go?
  • Interesting study – Tsay and Banaji, Naturals and strivers: Preferences and beliefs about sources of achievement. The abstract:

To understand how talent and achievement are perceived, three experiments compared the assessments of “naturals” and “strivers.” Professional musicians learned about two pianists, equal in achievement but who varied in the source of achievement: the “natural” with early evidence of high innate ability, versus the “striver” with early evidence of high motivation and perseverance (Experiment 1). Although musicians reported the strong belief that strivers will achieve over naturals, their preferences and beliefs showed the reverse pattern: they judged the natural performer to be more talented, more likely to succeed, and more hirable than the striver. In Experiment 2, this “naturalness bias” was observed again in experts but not in nonexperts, and replicated in a between-subjects design in Experiment 3. Together, these experiments show a bias favoring naturals over strivers even when the achievement is equal, and a dissociation between stated beliefs about achievement and actual choices in expert decision-makers.

  • A follow up study generalised the naturals and strivers research over some other domains.
  • Duckworth reports on the genius research of Catharine Cox, in which Cox looked at 300 eminent people and attempted to determine what it was that makes them a genius. All 300 had an IQ above 100. The average of the top 10 was 146. The average of the bottom 10 was 143. Duckworth points to the trivial link between IQ and ranking within that 300, with the substantive differentiator being level of persistence. But note those average IQ scores…

Dealing with algorithm aversion

Over at Behavioral Scientist is my latest contribution. From the intro:

The first American astronauts were recruited from the ranks of test pilots, largely due to convenience. As Tom Wolfe describes in his incredible book The Right Stuff, radar operators might have been better suited to the passive observation required in the largely automated Mercury space capsules. But the test pilots were readily available, had the required security clearances, and could be ordered to report to duty.

Test pilot Al Shepard, the first American in space, did little during his 15-minute flight beyond being observed by cameras and a rectal thermometer (more on the “little” he did do later). Pilots rejected by Project Mercury dubbed Shepard “spam in a can.”


Astronaut Ham.

Other pilots were quick to note that “a monkey’s gonna make the first flight.” Well, not quite a monkey. Before Shepard, the first to fly in the Mercury space capsule was a chimpanzee named Ham, only 18 months removed from his West African home. Ham performed with aplomb.

But test pilots are not the type to like relinquishing control. The seven Mercury astronauts felt uncomfortable filling a role that could be performed by a chimp (or spam). Thus started the astronauts’ quest to gain more control over the flight and to make their function more akin to that of a pilot. A battle for decision-making authority—man versus automated decision aid—had begun.

Head on over to Behavioral Scientist to read the rest.

While the article draws quite heavily on Tom Wolfe’s The Right Stuff, the use of the story of the Mercury astronauts was somewhat inspired by Charles Perrow’s Normal Accidents. Perrow looks at the two sides of the problems that emerged during the Mercury missions – the operator error, which formed the opening of my article, and the designer error, which features in the close.

One issue that became apparent to me during drafting was the distinction between an algorithm determining a course of action, and the execution of that action through mechanical, electronic or other means. The example of the first space flights clearly has this issue. Many of the problems were not that the basic calculations (the algorithms) were faulty. Rather, the execution failed. In early drafts of the article I tried to draw this distinction out, but it made the article clunky. I ultimately reduced this point to a mention in the close. It’s something I might explore at a later time, because I suspect “algorithm aversion” when applied to self-driving cars relates to both decision making and execution.

Another issue that became stark was the limit of the superiority of algorithms. In the first draft, I did not return to the Mercury missions for the close. It was easy to talk of bumbling humans in the first space flights and how to guide them toward better use of algorithms. But that story was too neat, particularly given the example I had chosen. During the early flights there were plenty of times where the astronauts had to step in and save themselves. Perhaps if I had used a medical diagnosis or more typical decision scenario in the opening I could have written a cleaner article.

Regardless, the mix of operator and designer error (to use Perrow’s framing) has led me down a path of exploring how to use algorithms when the decision is idiosyncratic or is being made in a less developed system. The early space flights are one example, but strategic business decisions might be another. What is the right balance of algorithms and humans there? At this point, I’m planning for that to be the focus of my next Behavioral Scientist piece.

Dan Ariely’s Payoff: The Hidden Logic That Shapes Our Motivations

If you have read Dan Ariely’s The Upside of Irrationality, there will be few surprises for you in his TED book Payoff: The Hidden Logic That Shapes Our Motivations. TED books are designed to be slightly longer explorations of topics from TED talks, but short enough to be read in one sitting. That makes it an easy, enjoyable, but not particularly deep read, with most of the results covered in The Upside. (Ariely’s TED talk can be viewed at the bottom of this post.)

The focus of Payoff is how we are motivated in the workplace, how easy it is to kill that motivation, and why we value the things we have made ourselves. It also touches on (in a slightly out-of-place and underdeveloped final chapter) how our actions are affected by what people will think about us after death.

As in The Upside of Irrationality, Ariely swings between interesting experimental results and not particularly convincing riffs on their application to the real world. Take the following example (the major experimental result that appears unique to Payoff). Workers in a semiconductor plant in Israel were sent a message on day one of their four-day work stretch offering one of the following incentives if they met their target for the day:

  • A $30 bonus
  • A pizza voucher
  • A thank you text message from the boss
  • No message (the control group)

For people who were offered one of the three incentives, there was a boost to productivity on that day relative to the control: 4.9% for the cash group, 6.7% for the pizza group, and 6.6% for the thank you group.

The more interesting result was over the next three days. On day two, the group that had been incentivised with cash on day one had their productivity drop to 13.2% less than the control group. Absent the cash reward, they took their foot off the gas. On day three productivity was 6.2% worse. And on day four it was 2.9% worse. Over the four days, the productivity of the cash incentive group was 6.5% below that of the control. In contrast, the thank you group had no crash in productivity, with the pizza group somewhere in between. It seems the cash reward on day one, but not the other days, had sent a signal that day one was the only day when production mattered. Or the cash reward displaced some other form of motivation. What exactly is unclear.

Ariely turns the result into an attack on the idea that people work for pay and that more compensation will result in greater output. This is where Ariely’s riff and my take on the experimental results part ways.

I agree that there is more to work than merely the exchange of money for labour. Poorly designed incentives can backfire. You can crush motivation despite paying well. The way an incentive is designed can magnify or destroy its effect.

But Ariely sells the cash incentive short by making almost no comment on alternative designs. What if the bonus persisted, rather than being in place for only one day? How would a daily cash incentive perform against a canned thank you every day? What would productivity look like after a year?

I suspect Ariely is over-interpreting a narrow finding. The experiment was designed to demonstrate the poor structure of the existing incentive (the $30 bonus on day one) and to elicit an interesting effect, not to determine the best incentive structure. You only need to look at the overly creative ways people use to meet incentivised sales targets in financial services (e.g. Wells Fargo) to get a sense of how strongly people can be motivated by monetary bonuses. (Whether that is a good thing for the business is another matter. And to be honest, I haven’t actually checked that the Wells Fargo staff weren’t creating these fake accounts to receive more thank yous.)

So yes, think of motivation as being about more than money. Test whatever incentive systems you put in place. Test them over the long-term. But don’t start paying your staff in thank yous just yet.

Of those experiments reported in the Upside of Irrationality and repeated in Payoff, one of the more interesting is the destruction of motivation in a pointless task. People were paid to construct Lego Bionicles at a decreasing pay scale. After constructing one, they were then asked if they would like to construct another at a new lower rate. These people were grouped into two conditions. In one, their recently completed Bionicle would be placed to the side. In the other, the Bionicle would be destroyed in front of them and placed back into the box (the Sisyphic condition).

Those who saw their creations destroyed constructed fewer. Most notably, the decline in productivity in the Sisyphic group was strongest among those who liked making Bionicles, reducing their productivity to the level of those who couldn’t care less.

Other random thoughts on the book:

  • Ariely suggests that we value our food, gardens and houses less by getting others to take care of them for us, and suggests we should invest more ourselves (related to the IKEA effect). But what would the opportunity cost of this investment be?
  • Ariely takes a number of unfair pokes at Adam Smith and his story of the pin factory. Ariely suggests that specialisation and trade will destroy motivation as the person cannot see the whole (a la Marx), and that Smith’s idea is no longer relevant. I trust he makes his own pins.
  • One scenario where I felt the opposite inclination to Ariely was the following:

Imagine, for example, that you worked for me and I asked you to stay late three times over the next week to help complete a project ahead of deadline. At the end of the week, you will not have seen your family but will have come close to a caffeine overdose. As an expression of my gratitude I present you with one of two rewards. In option one, I tell you how much your extra hard work meant to me. I give you a warm and sincere hug and invite you and your family to dinner. In option two, I tell you that I have calculated your marginal contribution to the company’s bottom line, it totaled $27,800, and I tell you that I will give you a bonus of 5 percent of this amount ($1,390). Which scenario is more likely to maximise your goodwill toward the company and me, not just on that day, but moving forward? Which will inspire you to push extra hard to meet the next deadline?

AI in medicine: Outperforming humans since the 1970s

From an interesting a16z podcast episode Putting AI in Medicine, in Practice (I hope I got the correct names against who is saying what):

Mintu Turakhia (cardiologist at Stanford and Director of the Centre for Digital Health): AI is not new to medicine. Automated systems in healthcare have been described since the 1960s. And they went through various iterations of expert systems and neural networks and were called many different things.

Hanne Tidnam: In what way would those show up in the 60s and 70s?

Mintu Turakhia: So at that time there was no high resolution, there weren’t too many sensors, and it was about a synthetic brain that could take what a patient describes as the inputs and what a doctor finds on the exam as the inputs.

Hanne Tidnam: Using verbal descriptions?

Mintu Turakhia: Yeah, basically words. People created, you know, what are called ontologies and classification structures. But you put in the ten things you felt and a computer would spit out the top 10 diagnoses in order of probability and even back then, they were outperforming sort of average physicians. So this is not a new concept.

This point about “average physicians” is interesting. In some circumstances you might be able to find someone who outperforms the AI. The truly extraordinary doctor. But most people are not treated by that star.

They continue:

Brandon Ballinger (CEO and founder of Cardiogram): So an interesting case study is the Mycin system which is from 1978 I believe. And so, this was an expert system trained at Stanford. It would take inputs that were just typed in manually and it would essentially try to predict what a pathologist would show. And it was put to the test against five pathologists. And it beat all five of them.

Hanne Tidnam: And it was already outperforming.

Brandon Ballinger: And it was already outperforming doctors, but when you go to the hospital they don’t use Mycin or anything similar. And I think this illustrates that sometimes the challenge isn’t just the technical aspects or the accuracy. It’s the deployment path, and so some of the issues around there are, OK, is there a convenient way to deploy this to actual physicians? Who takes the risk? What’s the financial model for reimbursement? And so if you look at the way the financial incentives work there are some things that are backwards, right. For example, if you think about kind of a hospital from the CFO’s perspective, misdiagnosis actually earns them more money because when you misdiagnose you do follow up tests, right, and our billing system is fee for service, so every little test that’s done is billed for.

Hanne Tidnam: But nobody wants to be giving out wrong diagnoses. So where is the incentive? The incentive is just in the system, the money that results from it.

Brandon Ballinger: No-one wants to give an incorrect diagnosis. On the other hand there’s no budget to invest in making better diagnoses. And so I think that’s been part of the problem. And things like fee for value are interesting because now you’re paying people for, say, an accurate diagnosis, or for a reduction in hospitalisations, depending on the exact system, so I think that’s a case where accuracy is rewarded with greater payment, which sets up the incentives so that AI can actually win in this circumstance.

Vijay Pande (a16z General Partner): Where I think AI has come back at us with a force is it came to healthcare as a hammer looking for a nail. What we’re trying to figure out is where you can implement it easily and safely with not too much friction and with not a lot of physicians going crazy, and where it’s going to be very very hard.

For better diagnoses, I’d be willing to drive a few physicians crazy.

The section on the types of error was also interesting:

Mintu Turakhia: There may be a point that it truly outperforms the cognitive abilities of physicians, and we have seen that with imaging so far. And some of the most promising aspects of the imaging studies and the EKG studies are that the confusion matrices, the way humans misclassify things, is recapitulated by the convolutional neural networks. …

A confusion matrix is a way to graph the errors and which directions they go. And so for rhythms on an EKG, a rhythm that’s truly atrial fibrillation could get classified as normal sinus rhythm, or atrial tachycardia, or super-ventricular tachycardia, the names are not important. What’s important is that the algorithms are making the same type of mistakes that humans are doing. It’s not that it’s making a mistake that’s necessarily more lethal, and just nonsensical so to speak. It recapitulates humans. And to me that’s the core thesis of AI in medicine, because if you can show that you are recapitulating human error, you’re not going to make it perfect, but that tells you that, in check and with control, you can allow this to scale safely since it’s liable to do what humans do. …

Hanne Tidnam: And so you’re just saying it doesn’t have to be better. It just has to be making the same kinds of mistakes to feel that you can trust the decision maker.

Mintu Turakhia: Right. And you dip your toe in the water by having it be assistive. And then at some point we as a society will decide if it can go fully auto, right, fully autonomous without a doctor in the loop. That’s a societal issue. That’s not a technical hurdle at this point.

That is a heavy bias toward the status quo. I’d prefer something with better net performance even if some of the mistakes are different.
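To make the confusion matrix idea above concrete, here is a minimal, hypothetical sketch. The rhythm labels come from the quote; the calls themselves are invented. If the model’s off-diagonal counts sit in the same cells as the human readers’, it is “recapitulating” human error:

```python
# Hypothetical example: compare where a model's errors land against where human readers' errors land.
from sklearn.metrics import confusion_matrix

labels = ["afib", "sinus", "atrial_tach", "svt"]

truth       = ["afib", "afib", "afib", "sinus", "svt", "atrial_tach"]
model_calls = ["afib", "sinus", "afib", "sinus", "svt", "svt"]
human_calls = ["afib", "sinus", "sinus", "sinus", "svt", "svt"]

# Rows are the true rhythm, columns the assigned rhythm; off-diagonal cells are the errors.
print(confusion_matrix(truth, model_calls, labels=labels))
print(confusion_matrix(truth, human_calls, labels=labels))
```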

Is there a “backfire effect”?

I saw the answer hinted at in a paper released mid last year (covered on WNYC), but Daniel Engber has now put together a more persuasive case:

Ten years ago last fall, Washington Post science writer Shankar Vedantam published an alarming scoop: The truth was useless.

His story started with a flyer issued by the Centers for Disease Control and Prevention to counter lies about the flu vaccine. The flyer listed half a dozen statements labeled either “true” or “false”—“Not everyone can take flu vaccine,” for example, or “The side effects are worse than the flu” —along with a paragraph of facts corresponding to each one. Vedantam warned the flyer’s message might be working in reverse. When social psychologists had asked people to read it in a lab, they found the statements bled together in their minds. Yes, the side effects are worse than the flu, they told the scientists half an hour later. That one was true—I saw it on the flyer.

This wasn’t just a problem with vaccines. According to Vedantam, a bunch of peer-reviewed experiments had revealed a somber truth about the human mind: Our brains are biased to believe in faulty information, and corrections only make that bias worse.

These ideas, and the buzzwords that came with them—filter bubbles, selective exposure, and the backfire effect—would be cited, again and again, as seismic forces pushing us to rival islands of belief.

Fast forward a few years:

When others tried to reproduce his research [Ian Skurnik’s vaccine research], though, they didn’t always get the same result. Kenzie Cameron, a public health researcher and communications scholar at Northwestern’s Feinberg School of Medicine, tried a somewhat similar experiment in 2009. … “We found no evidence that presenting both facts and myths is counterproductive,” Cameron concluded in her paper, which got little notice when it was published in 2013.

There have been other failed attempts to reproduce the Skurnik, Yoon, and Schwarz finding. For a study that came out last June, Briony Swire, Ullrich Ecker, and “Debunking Handbook” co-author Stephan Lewandowsky showed college undergrads several dozen statements of ambiguous veracity (e.g. “Humans can regrow the tips of fingers and toes after they have been amputated”).  … But the new study found no sign of this effect.

And on science done right (well done Brendan Nyhan and Jason Reifler):

Brendan Nyhan and Jason Reifler described their study, called “When Corrections Fail,” as “the first to directly measure the effectiveness of corrections in a realistic context.” Its results were grim: When the researchers presented conservative-leaning subjects with evidence that cut against their prior points of view—that there were no stockpiled weapons in Iraq just before the U.S. invasion, for example—the information sometimes made them double-down on their pre-existing beliefs. …

He [Tom Wood] and [Ethan] Porter decided to do a blow-out survey of the topic. Instead of limiting their analysis to just a handful of issues—like Iraqi WMDs, the safety of vaccines, or the science of global warming—they tried to find backfire effects across 52 contentious issues. … They also increased the sample size from the Nyhan-Reifler study more than thirtyfold, recruiting more than 10,000 subjects for their five experiments.

In spite of all this effort, and to the surprise of Wood and Porter, the massive replication effort came up with nothing. That’s not to say that Wood and Porter’s subjects were altogether free of motivated reasoning.

The people in the study did give a bit more credence to corrections that fit with their beliefs; in those situations, the new information led them to update their positions more emphatically. But they never showed the effect that made the Nyhan-Reifler paper famous: People’s views did not appear to boomerang against the facts. Among the topics tested in the new research—including whether Saddam had been hiding WMDs—not one produced a backfire.

Nyhan and Reifler, in particular, were open to the news that their original work on the subject had failed to replicate. They ended up working with Wood and Porter on a collaborative research project, which came out last summer, and again found no sign of backfire from correcting misinformation. (Wood describes them as “the heroes of this story.”) Meanwhile, Nyhan and Reifler have found some better evidence of the effect, or something like it, in other settings. And another pair of scholars, Brian Schaffner and Cameron Roche, showed something that looks a bit like backfire in a recent, very large study of how Republicans and Democrats responded to a promising monthly jobs report in 2012. But when Nyhan looks at all the evidence together, he concedes that both the prevalence and magnitude of backfire effects could have been overstated and that it will take careful work to figure out exactly when and how they come into play.

Read Engber’s full article. It covers a lot more territory, including some interesting history on how the idea spread.

I have added this to the growing catalogue of readings on my critical behavioural economics and behavioural science reading list. (Daniel Engber makes a few appearances.)

Benartzi’s (and Lehrer’s) The Smarter Screen: Surprising Ways to Influence and Improve Online Behaviour

The replication crisis has ruined my ability to relax while reading a book built on social psychology foundations. The rolling sequence of interesting but small sample and possibly not replicable findings leaves me somewhat on edge. Shlomo Benartzi’s (with Jonah Lehrer) The Smarter Screen: Surprising Ways to Influence and Improve Online Behavior (2015) is one such case.

Sure, I accept there is a non-zero probability that a 30 millisecond exposure to the Apple logo could make someone more creative than exposure to the IBM logo. Closing a menu after making my choice might make me more satisfied by giving me closure. Reading something in Comic Sans might lead me to think about it in a different way. But on net, most of these interesting results won’t hold up. Which? I don’t know.

That said, like a Malcolm Gladwell book, The Smarter Screen does have some interesting points and directed me to plenty of interesting material elsewhere. Just don’t bet your house on the parade of results being right.

The central thesis in The Smarter Screen is that since so many of our decisions are now made on screens, we should invest more time in designing these screens for better decision making. Agreed.

I saw Benartzi present about screen decision-making a few years ago, when he highlighted how some biases play out differently on screens compared to other mediums. For example, he suggested that defaults were less sticky on screens (we are quick to un-check the pre-checked box). While that particular example didn’t appear in The Smarter Screen, other examples followed a similar theme.

As a start, we read much faster on screens. Benartzi gives the example of a test with a written instruction at the front not to answer the following questions. Experimental subjects on a computer failed at more than double the rate – up from around 20% to 46% – skipping over the instruction and answering questions they should not have answered.

People are also more truthful on screens. For instance, people report more health problems and drug use to screens. Men report fewer sexual partners, women more. We order pizza closer to our preferences (no embarrassment about those idiosyncratic tastes).

Screens can also exacerbate biases as the digital format allows for more extreme environments, such as massive ranges of products. The thousands of each type of pen on Amazon or the maze of healthcare plans on HealthCare.gov are typically not seen in stores or in hard copy.

The choice overload experienced on screens is a theme through the book, with many of Benartzi’s suggestions focused on making the choice manageable. Use categories to break up the choice. Use tournaments where small sets of comparisons are presented and the winners face off against each other (do you need to assume transitivity of preferences for this to work?). All sound suggestions worth trying.
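As a rough sketch of the tournament idea (the option set, group size and preference rule below are all illustrative assumptions of mine), note that with intransitive preferences the winner can depend on how the groups happen to be drawn:

```python
import random

def tournament(options, prefer, group_size=4):
    """Whittle a large option set down via small head-to-head rounds.

    `prefer(group)` stands in for the user choosing one option from a small set on screen.
    """
    pool = list(options)
    while len(pool) > 1:
        random.shuffle(pool)
        groups = [pool[i:i + group_size] for i in range(0, len(pool), group_size)]
        pool = [prefer(group) for group in groups]   # winners advance to the next round
    return pool[0]

# Illustrative example: 32 hypothetical healthcare plans, "preference" = lowest premium.
plans = [{"name": f"plan-{i}", "premium": random.randint(200, 900)} for i in range(32)]
winner = tournament(plans, prefer=lambda group: min(group, key=lambda p: p["premium"]))
print(winner["name"])
```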

One interesting complaint of Benartzi’s is about Amazon’s massive range. They have over 1,000 black roller-ball pens! An academic critiquing one of the world’s largest companies – a company built on offering massive choice and with a reputation for A/B testing – warrants some scepticism. Maybe Amazon could be even bigger? (Interestingly, after critiquing Amazon for not allowing “closure” and reducing satisfaction by suggesting similar products after purchase, Benartzi suggests Amazon already knows about this issue.)

The material on choice overload reflects Benartzi’s habit through the book of giving a relatively uncritical discussion of his preferred underlying literature. Common examples such as the jam experiment are trotted out, with no mention of the failed replications or the meta-analysis showing a mean effect of zero from changing the number of choices. Benartzi’s message that we need to test these ideas covers him to a degree, but a more sceptical reporting of the literature would have been helpful.

Some other sections have a similar shallowness. The material on subliminal advertising ignores the debates around it. Some of the cited studies have all the hallmarks of a spurious result, with multiple comparisons and effects only under specific conditions. For example, people are more likely to buy Mountain Dew if the Mountain Dew ad played at 10 times speed is preceded by an ad for a dissimilar product like a Honda. There is no effect when an ad for a (similar) Hummer is played first. Really?

Or take disfluency and the study by Adam Alter and friends. Forty students were exposed to two versions of the cognitive reflection task. A typical question in the cognitive reflection task is the following:

A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?
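(For reference, the intuitive answer of 10 cents fails the stated constraint; the ball costs 5 cents. A quick check:)

```python
ball = 0.05
bat = ball + 1.00                          # the bat costs $1.00 more than the ball
assert abs((bat + ball) - 1.10) < 1e-9     # together they cost $1.10
```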

The two versions differed in that one used a small light grey font that made the questions hard to read. Those exposed to the harder to read questions achieved higher scores. Exciting stuff.

But 16 replications involving a total of around 7,000 people found nothing (Terry Burnham discusses these replications in more detail here). Here’s how Benartzi deals with the replications:

It’s worth pointing out, however, that not every study looking at disfluent fonts gets similar results. For reasons that remain unclear, many experiments have found little to no effect when counterintuitive math problems, such as those in the CRT, are printed in hard-to-read letters. While people take longer to answer the questions, this extra time doesn’t lead to higher scores. Clearly, more research is needed.

What is Benartzi’s benchmark for accepting that a cute experimental result hasn’t stood up to further examination and that we can move on to more prospective research? Sixteen studies involving 7,000 people in total showing no effect, one study with 40 people showing a result. The jury is still out?

One feeling I had at the end of the book was that the proposed solutions were “small”. Behavioural scientists are often criticised for proposing small solutions, which is generally unfair given the low cost of many of the interventions. The return on investment can be massive. But the absence of new big ideas at the close of the book raised the question (at least for me) of where the next big result might come from.

Benartzi was, of course, at the centre of one of the greatest triumphs in the application of behavioural science – the Save More Tomorrow plan he developed with Richard Thaler. Many of the other large successful applications of behavioural science rely on the same mechanism, defaults.

So when Benartzi’s closing idea is to create an app for smartphones to increase retirement saving, it feels slightly underwhelming. The app would digitally alter portraits of the user to make them look old and help relate them to their future self. The app would make saving effortless through pre-filled information and the like. Just click a button. But you first have to get people to download it. What is the marginal effect on these people already motivated enough to download the app? (Although here is some tentative evidence that at least among certain cohorts this effect is above zero.)

Other random thoughts:

  • One important thread through the book is the gap between identifying behaviours we want to change and changing them. Feedback is simply not enough. Think of a bathroom scale. It is cheap, available, accurate, and most people have a good idea of their weight. Bathroom scales haven’t stopped the increase in obesity.
  • Benartzi discusses the potential of query theory, which proposes that people arrive at decisions by asking themselves a series of internal questions. How can we shape decisions by posing the questions externally?
  • Benartzi references a study in which 255 students received an annual corporate report. One report was aesthetically pleasing, the other less attractive. Despite both reports containing the same information, the students gave a higher valuation for the company with the attractive report (more than double). Benartzi suggests the valuations should have been the same, but I am not sure. In the same way that wasteful advertising can be a signal that the brand has money and will stick around, the attractive report provides a signal about the company. If a company doesn’t have the resources to make its report look decent, how much should you trust the data and claims in it?
  • Does The Smarter Screen capture a short period where screens have their current level of importance? Think of ordering a pizza. Ten years ago we might have phoned, been given an estimated time of delivery and then waited. Today we can order our pizza on our smartphone, then watch it move through the process of construction, cooking and delivery. Shortly (if you’re not already doing this), you’ll simply order your pizza through your Alexa.
  • Benartzi discusses how we could test people through a series of gambles to determine their loss aversion score. When people later face decisions, an app with knowledge of their level of loss aversion could help guide their decision. I have a lot of doubt about the ability to get a specific, stable and useful measure of loss aversion for a particular person, and am a fan of the approach of Greg Davies to the bigger question of how we should consider attitudes to risk and short-term behavioural responses.
  • In the pointers at the end of one of the chapters, Benartzi asks “Are you trusting my advice too much? While there is a lot of research to back up my recommendations, it is equally important to test the actual user experience and choice quality and adjust the design accordingly.” Fair point!

Best books I read in 2017

The best books I read in 2017 – generally released in other years – are below (in no particular order). Where I have reviewed, the link leads to that review.

Don Norman’s The Design of Everyday Things (2013): In a world where so much attention is on technology, a great discussion of the need to consider the psychology of the users.
David Epstein’s The Sports Gene: Inside the Science of Extraordinary Athletic Performance (2013): The best examination of nature versus nurture as it relates to performance that I have read. I will write about The Sports Gene some time in 2018.
Cathy O’Neil’s Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy (2016) – Although O’Neil is too quick to turn back to all-too-flawed humans as the solution to problematic algorithms, her critique has bite.
Kasparov’s Deep Thinking: Where Machine Intelligence Ends and Human Creativity Begins (2017) – Deep Thinking does not contain much deep analysis of human versus machine intelligence, but the story of Kasparov’s battle against Deep Blue is worth reading.
Gerd Gigerenzer, Peter Todd and the ABC Research Group’s Simple Heuristics That Make Us Smart (1999) – A re-read for me (and now a touch dated), but a book worth revisiting.
Pedro Domingos’s The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World (2015) – On the list for the five excellent chapters on the various “tribes” of machine learning. The rest is either techno-Panglossianism or beyond my domain of expertise to assess.
Christian and Griffiths’s Algorithms to Live By: The Computer Science of Human Decisions (2016) – An excellent analysis of decision making, with the benchmark the solutions of computer science. As they say, “the best algorithms are all about doing what makes the most sense in the least amount of time, which by no means involves giving careful consideration to every factor and pursuing every computation to the end.”
William Finnegan’s Barbarian Days: A Surfing Life – Simply awesome, although I suspect of more interest to surfers (that said, it did win a Pulitzer). I also read a lot of great fiction during the year. Fahrenheit 451 and The Dice Man were among those I enjoyed the most.

Psychology as a knight in shining armour, and other thoughts by Paul Ormerod on Thaler’s Misbehaving

I have been meaning to write some notes on Richard Thaler’s Misbehaving: The Making of Behavioral Economics for some time, but having now come across a review by Paul Ormerod (ungated pdf) – together with his perspective on the position of behavioural economics in the discipline – I feel somewhat less need. Below are some interesting sections of Ormerod’s review.

First, on the incorporation of psychology into economics:

With a few notable exceptions, psychologists themselves have not engaged with the area. ‘Behavioral economics has turned out to be primarily a field in which economists read the work of psychologists and then go about their business of doing research independently’ (p. 179). One reason for this which Thaler gives is that few psychologists have any attachment to the rational choice model, so studying deviations from it is not interesting. Another is that ‘the study of “applied” problems in psychology has traditionally been considered a low status activity’ (p. 180).

It is fashionable in many social science circles to deride economics, and to imagine that if only these obstinate and ideological economists would import social science theories into the discipline, all would be well. All manner of things would be well, for somehow these theories would not only be scientifically superior, but their policy implications would lead to the disappearance of all sorts of evils, such as austerity and even neo-liberalism itself. This previous sentence deliberately invokes a caricature, but one which will be all too recognisable to economists in Anglo-Saxon universities who have dealings with their colleagues in the wider social sciences.

A recent article in Science (Open Science Collaboration 2015) certainly calls into question whether psychology can perform this role of knight in shining armour. A team of no fewer than 270 co-authors attempted to replicate the results of 100 experiments published in leading psychology journals. … [O]nly 36 per cent of the attempted replications led to results which were statistically significant. Further, the average size of the effects found in the replicated studies was only half that reported in the original studies. …

Either the original or the replication work could be flawed, or crucial differences between the two might be unappreciated. … So the strategy adopted by behavioural economists of choosing for themselves which bits of psychology to use seems eminently sensible.

On generalising behavioural economics:

The empirical results obtained in behavioural economics are very interesting and some, at least, seem to be well established. But the inherent indeterminacy discussed above is the main reason for unease with the area within mainstream economics. Alongside Misbehaving, any economist interested in behavioural economics should read the symposium on bounded rationality in the June 2013 edition of the Journal of Economic Literature. …

In a paper titled ‘Bounded-Rationality Models: Tasks to Become Intellectually Competitive’, Harstad and Selten make a key point that although models have been elaborated which incorporate insights of boundedly rational behaviour, ‘the collection of alternative models has made little headway supplanting the dominant paradigm’ (2013, p. 496). Crawford’s symposium paper notes that ‘in most settings, there is an enormous number of logically possible models… that deviate from neoclassical models. In attempting to improve upon neoclassical models, it is essential to have some principled way of choosing among alternatives’ (2013, p. 524). He continues further on the same page ‘to improve on a neoclassical model, one must identify systematic deviations; otherwise one would do better to stick with a noisier neoclassical model’.

Rabin is possibly the most sympathetic of the symposium authors, noting for example that ‘many of the ways humans are less than fully rational are not because the right answers are so complex. They are instead because the wrong answers are so enticing’ (2013, p. 529). Rabin does go on, however, to state that ‘care should be taken to investigate whether the new models improve insight on average… in my view, many new models and explanations for experimental findings look artificially good and artificially insightful in the very limited domain to which they are applied’ (2013, p. 536). …

… Misbehaving does not deal nearly as well with the arguments that in many situations agents will learn to be rational. The arguments in the Journal of Economic Literature symposium both encompass and generalise this problem for behavioural economics. The authors accept without question that in many circumstances deviations from rationality are observed. However, no guidelines, no heuristics, are offered as to the circumstances in which systematic deviations might be expected, and circumstances where the rational model is still appropriate. Further, the theoretical models developed to explain some of the empirical findings in behavioural economics are very particular to the area of investigation, and do not readily permit generalisation.

On applying behavioural economics to policy:

In the final part (Part VIII) he discusses a modest number of examples where the insights of behavioural economics seem to have helped policymakers. He is at pains to point out that he is not trying to ‘replace markets with bureaucrats’ (p. 307). He discusses at some length the term he coined with Sunstein, ‘libertarian paternalism’. …

We might perhaps reflect on why it is necessary to invent this term at all. The aim of any democratic government is to improve the lot of the citizens who have elected it to power. A government may attempt to make life better for everyone, for the interest groups who voted for it, for the young, for the old, or for whatever division of the electorate which we care to name. But to do so, it has to implement policies that will lead to outcomes which are different from those which would otherwise have happened. They may succeed, they may fail. They may have unintended consequences, for good or for ill. By definition, government acts in paternalist ways. By the use of the word ‘libertarian’, Thaler could be seen as trying to distance himself from the world of the central planner.

… And yet the suspicion remains that the central planning mind set lurks beneath the surface. On page 324, for example, Thaler writes that ‘in our increasingly complicated world, people cannot be expected to have the experience to make anything close to the optimal decisions in all the domains in which they are forced to choose’. The implication is that behavioural economics both knows what is optimal for people and can help them get closer to the optimum.

Further, we read that ‘[a] big picture question that begs for more thorough behavioral analysis is the best way to encourage people to start new businesses (especially those which might be successful)’ (p. 351). It is the phrase in brackets which is of interest. Very few people, we can readily conjecture, start new businesses in order for them to fail. But most new firms do exactly that. Failure rates are very high, especially in the first two or three years of life. How exactly would we know whether a start-up was likely to be successful? There is indeed a point from the so-called ‘Gauntlet’ of orthodox economics which is valid in this particular context. Anyone who had a good insight into which start-ups were likely to be successful would surely be extremely rich.

Unchanging humans

One interesting thread to Don Norman’s excellent The Design of Everyday Things is the idea that while our tools and technologies are subject to constant change, humans stay the same. The fundamental psychology of humans is a relative constant.

Evolutionary change to people is always taking place, but the pace of human evolutionary change is measured in thousands of years. Human cultures change somewhat more rapidly over periods measured in decades or centuries. Microcultures, such as the way by which teenagers differ from adults, can change in a generation. What this means is that although technology is continually introducing new means of doing things, people are resistant to changes in the way they do things.

I feel this is generally the right perspective to think about human interaction with technology. There are certainly biological changes to humans based on their life experience. Take the larger hippocampus of London taxi drivers, increasing height through industrialisation, or the Flynn effect. But the basic building blocks are relatively constant. The humans of today and twenty years ago are close to being the same.

Every time I hear arguments about changing humans (or any discussion of millennials, generation X and the like), I recall the following quote from Bill Bernbach (I think first pointed out to me by Rory Sutherland):

It took millions of years for man’s instincts to develop. It will take millions more for them to even vary. It is fashionable to talk about changing man. A communicator must be concerned with unchanging man, with his obsessive drive to survive, to be admired, to succeed, to love, to take care of his own.

(If I were making a similar statement, I’d use a shorter time period than “millions”, but I think Bernbach’s point still stands.)

But for how long will this hold? Don Norman again:

For many millennia, even though technology has undergone radical change, people have remained the same. Will this hold true in the future? What happens as we add more and more enhancements inside the human body? People with prosthetic limbs will be faster, stronger, and better runners or sports players than normal players. Implanted hearing devices and artificial lenses and corneas are already in use. Implanted memory and communication devices will mean that some people will have permanently enhanced reality, never lacking for information. Implanted computational devices could enhance thinking, problem-solving, and decision-making. People might become cyborgs: part biology, part artificial technology. In turn, machines will become more like people, with neural-like computational abilities and humanlike behavior. Moreover, new developments in biology might add to the list of artificial supplements, with genetic modification of people and biological processors and devices for machines.

I suspect much of this, at least in the short term, will only relate to some humans. The masses will experience these changes with some lag.

(See also my last post on the human-machine mix.)

Getting the right human-machine mix

Much of the storytelling about the future and humans and machines runs with a theme that machines will not replace us, but that we will work with machines to create a combination greater than either alone. If you have heard the freestyle chess example, which now seems to be everywhere, you will understand the idea. (See my article in Behavioral Scientist if you haven’t.)

An interesting angle to this relationship is just how unsuited some of our existing human-machine combinations are for the unique skills a human brings. As Don Norman writes in his excellent The Design of Everyday Things:

People are flexible, versatile, and creative. Machines are rigid, precise, and relatively fixed in their operations. There is a mismatch between the two, one that can lead to enhanced capability if used properly. Think of an electronic calculator. It doesn’t do mathematics like a person, but can solve problems people can’t. Moreover, calculators do not make errors. So the human plus calculator is a perfect collaboration: we humans figure out what the important problems are and how to state them. Then we use calculators to compute the solutions.

Difficulties arise when we do not think of people and machines as collaborative systems, but assign whatever tasks can be automated to the machines and leave the rest to people. This ends up requiring people to behave in machine like fashion, in ways that differ from human capabilities. We expect people to monitor machines, which means keeping alert for long periods, something we are bad at. We require people to do repeated operations with the extreme precision and accuracy required by machines, again something we are not good at. When we divide up the machine and human components of a task in this way, we fail to take advantage of human strengths and capabilities but instead rely upon areas where we are genetically, biologically unsuited.

The result is that at the moments when we expect the humans to act, we have set them up for failure:

We design equipment that requires people to be fully alert and attentive for hours, or to remember archaic, confusing procedures even if they are only used infrequently, sometimes only once in a lifetime. We put people in boring environments with nothing to do for hours on end, until suddenly they must respond quickly and accurately. Or we subject them to complex, high-workload environments, where they are continually interrupted while having to do multiple tasks simultaneously. Then we wonder why there is failure.

And:

Automation keeps getting more and more capable. Automatic systems can take over tasks that used to be done by people, whether it is maintaining the proper temperature, automatically keeping an automobile within its assigned lane at the correct distance from the car in front, enabling airplanes to fly by themselves from takeoff to landing, or allowing ships to navigate by themselves. When the automation works, the tasks are usually done as well as or better than by people. Moreover, it saves people from the dull, dreary routine tasks, allowing more useful, productive use of time, reducing fatigue and error. But when the task gets too complex, automation tends to give up. This, of course, is precisely when it is needed the most. The paradox is that automation can take over the dull, dreary tasks, but fail with the complex ones.

When automation fails, it often does so without warning. … When the failure occurs, the human is “out of the loop.” This means that the person has not been paying much attention to the operation, and it takes time for the failure to be noticed and evaluated, and then to decide how to respond.

There is an increasing catalogue of these types of failures. Air France flight 447, which crashed into the Atlantic in 2009, is a classic case. Due to an airspeed indicator problem, the autopilot suddenly handed an otherwise well-functioning plane back to the pilots, leading to disaster. But perhaps this new type of failure is an acceptable result of the overall improvement in system safety or performance.

This human-machine mismatch is also a theme in Charles Perrow’s Normal Accidents. Perrow notes that many systems are poorly suited to human psychology, with long periods of inactivity interspersed by bunched workload. The humans are often pulled into the loop just at the moments things are starting to go wrong. The question is not how much work humans can safely do, but how little.