Measurement nihilism

Following from my recent post on Scott Barry Kaufman’s heritability measurement nihilism, Jonah Lehrer has gone a step further and taken a swipe at measurement in general, and in particular, at short-term tests. Lehrer argues that:

The larger lesson is that we’ve built our society around tests of performance that fail to predict what really matters: what happens once the test is over.

I’m not averse to arguments that some people use measurements in inappropriate ways. However, Lehrer overstates his case in this article because he never establishes a crucial element of his argument: that people are actually misusing the performance measures in the way he suggests.

Lehrer pulls out three examples in support of his position. First, Lehrer notes that tests of short-term cashier speed had a surprisingly weak correlation with longer-term speed as measured by the electronic cashier system. There was a gap between the cashiers’ maximum performance when they were being tested and their typical performance when they weren’t. All I can say on this example is that I am sure that grocery store workers don’t get tested once for 30 items and then get left alone for the rest of their scanning careers.

Second, Lehrer points out that while SAT scores can predict around 12 per cent of the variation in freshman grade point average, they are less effective in predicting post-graduation achievement. Similarly, the LSAT had almost no relationship with career success. (I do not know which studies Lehrer is referring to here, so I can’t comment on the specific results.) On this example, I might be concerned if SAT scores were the sole entry measurement and they had such low predictive power, but SAT scores are not used on their own. College admission departments combine them with grade point averages, interviews and examination of the applicant’s CV. The SAT score may give context to the grade point average by indicating the mix of talent and hard work that led to their high school performance.
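To put the 12 per cent figure in perspective: for a single predictor in a simple linear regression, the share of variance explained is the square of the correlation coefficient, so 12 per cent corresponds to a correlation of roughly 0.35. The following is a minimal sketch of that arithmetic; the function name is my own and is purely illustrative.

```python
import math


def correlation_from_variance_explained(r_squared: float) -> float:
    """Magnitude of the correlation implied by a share of variance explained.

    Assumes a simple one-predictor linear relationship, where the variance
    explained equals the squared correlation coefficient.
    """
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("r_squared must be between 0 and 1")
    return math.sqrt(r_squared)


# 12 per cent of the variation in freshman GPA, as quoted above.
r = correlation_from_variance_explained(0.12)
print(round(r, 2))  # 0.35
```

A correlation of that size is modest but not negligible, which fits the point that the SAT is one input among several rather than a sole entry measurement.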

Third, Lehrer calls the NFL Scouting Combine (a week-long event showcasing around 300 NFL aspirants, during which they undergo a series of tests) “a big waste of time” because, according to a recent study, there is no consistent statistical relationship between the results of the Combine and NFL performance. Lehrer draws this conclusion from a study by Kuzmits and Adams. Looking at the original paper, one of the performance indicators used in the study was draft position. The authors found no consistent statistical relationship between Combine performance and draft position, except for the speed of running backs. The lack of a correlation between Combine performance and draft position suggests that the teams making the draft decisions already know that many of the Combine’s tests have little predictive ability.

The study authors also noted that a range of other activities take place at the Combine, such as team interviews, injury evaluations and urine tests. If we also consider the advertising and other fan interest generated by the event, there are a number of plausible benefits. Lehrer has instead taken an indicator that the teams do not use and argued that, because that measurement does not matter, the Combine is a waste.

While my perspective is that these measurements aren’t being abused in the way Lehrer implies, Lehrer identifies the problem with these short-term tests as the failure to identify “grit”. Lehrer states:

The problem, of course, is that students don’t reveal their levels of grit while taking a brief test. Grit can only be assessed by tracking typical performance for an extended period. Do people persevere, even in the face of difficulty? How do they act when no one else is watching? Such traits often matter more than raw talent. We hear about them in letters of recommendation, but hard numbers take priority.

It is interesting that Lehrer does not consider whether any short-term tests might indicate how people act in the face of difficulty or when no one is watching. Lehrer is certainly aware of Walter Mischel’s marshmallow test and the large predictive power this test (given to four-year-old children) had for future life success. Lehrer also seems to ignore that when colleges or NFL teams make decisions using SAT scores or Combine results, the decision makers also use long-term performance data in the form of high school grades and college football performance.

I wonder what Lehrer would recommend be done about these tests. Would he simply abolish the Combine and SAT? My perspective is that, if anything, we are not measuring enough. I am sympathetic to the view, which Michael Lewis argued in Moneyball held for baseball drafting, that we pay too little attention to measurement and too much to gut instinct. For Lehrer to convincingly argue that short-term measurement is playing too much of a role, I’d like to see some evidence that there are alternative measurements that outperform those indicators that are actually being used.

3 comments

  1. If I recall (from Lehrer’s blog), grit was measured with a very short test, and a self-report at that. And the gap between short-term and long-term performance seems related to the Hawthorne effect (which I think is its name: the effect that you perform better when observed than otherwise). Issues with measurement are, of course, something you spend an awful lot of time thinking about when you do research, and no measures are perfect.

    Also, what struck me a bit about the ‘non-predictiveness’ of the SAT and LSAT is that you use those tests to truncate the sample.

    1. The truncation of the samples (and the nature of the samples in general) is one reason in particular why I would like to know which studies Lehrer was alluding to.

  2. The American SMPY study has found that SAT-M and SAT-V scores have substantial predictive power. Kids who scored at the top 1/10,000 level were more likely to become professors at top 50 US universities than graduate students from the top 15 math-intensive graduate programs. Now, this might not result purely from enhanced ability, but it shows measurements are *useful,* albeit imperfect.
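The truncation point raised in the first comment can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the true correlation of 0.5, the 10 per cent admission rate and all function names are hypothetical and do not come from any of the studies discussed. It shows that when only the top scorers are observed, the correlation between score and later outcome within that selected group can be much weaker than in the full population.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_truncation(true_r=0.5, n=20000, admit_fraction=0.1, seed=0):
    """Return (full-population correlation, correlation among top scorers).

    All parameter values are illustrative assumptions, not real data.
    """
    rng = random.Random(seed)
    # Standardised test scores and an outcome correlated with them at true_r.
    scores = [rng.gauss(0, 1) for _ in range(n)]
    noise_scale = math.sqrt(1 - true_r ** 2)
    outcomes = [true_r * s + noise_scale * rng.gauss(0, 1) for s in scores]

    # Keep only the top admit_fraction of scorers, as a selective test would.
    cutoff = sorted(scores)[int(n * (1 - admit_fraction))]
    admitted = [(s, o) for s, o in zip(scores, outcomes) if s >= cutoff]

    full_r = pearson(scores, outcomes)
    restricted_r = pearson([s for s, _ in admitted], [o for _, o in admitted])
    return full_r, restricted_r


full_r, restricted_r = simulate_truncation()
print(f"full population: {full_r:.2f}, admitted only: {restricted_r:.2f}")
```

In this toy setup the test looks far less predictive among those admitted than it is across everyone who sat it, which is why the nature of the samples in the studies Lehrer cites matters.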
