Genome Wide Association Studies and socioeconomic outcomes

A few months back, I posted about a Conference on Genetics and Behaviour held by the Human Capital and Economic Opportunity Global Working Group at the University of Chicago. In that post, I linked to a series of videos from the first session on the effect of genes on socioeconomic aggregates.

Over the last couple of days, I watched the videos from the session on Genome Wide Association Studies (GWAS). As with the first set of videos, they are technical (as you might expect from a bunch of academics), particularly the questions, but they cover some important points.

In early studies linking genetic factors to behaviour and socioeconomic outcomes, candidate gene studies were the dominant method. In a candidate gene study, a gene is hypothesised to have an effect, and that hypothesis is tested directly. However, there are some major problems with candidate gene studies, with the literature littered with claims of the “gene for X” that simply can’t be replicated.

David Cesarini opened the session by pointing to this low level of replication of candidate gene studies. He suggests three problems might be causing this failure to replicate. These are multiple hypothesis testing coupled with publication bias, population stratification, and the low power of the small samples typically used.

Multiple hypothesis testing in candidate gene studies arises because more than one gene tends to be tested. In that case, the significance level of the tests should be adjusted to account for the multiple tests. But the reality is that the many negative tests never see the light of day, and the successful ones are presented as meeting a threshold appropriate for a single test. Publication bias exacerbates that problem, as negative results tend not to be published and you don’t know how many tests have been conducted.

In contrast, GWAS is a hypothesis-free approach. All SNPs in a sample are tested for association with a trait (SNPs are single nucleotide polymorphisms: DNA sequence variations in which a single nucleotide varies in the population). As there are as many hypotheses being tested as there are SNPs, very high significance thresholds are applied to avoid false positives. But as the number of SNPs in an array is known from the start, there is no doubt about the appropriate threshold.
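To make the threshold point concrete, here is a minimal sketch of the simplest adjustment, a Bonferroni correction, which divides the overall error rate by the number of tests. The speakers do not say which correction they use, so treat the method, the function name and the 500,000-SNP array size as my assumptions for illustration.

```python
def bonferroni_threshold(family_alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the chance of any
    false positive across n_tests at or below family_alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return family_alpha / n_tests


# Hypothetical array of 500,000 SNPs, each tested once against the trait.
threshold = bonferroni_threshold(0.05, 500_000)
print(f"per-SNP threshold: {threshold:.1e}")
```

Because the number of SNPs on the array is fixed in advance, the divisor is known before any test is run, which is exactly what is missing when an unknown number of candidate genes has been tried.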

Cesarini’s talk focused on the second problem, population stratification. This occurs where allele (variants of a gene) frequencies correlate with confounding variables. A classic example is analysing a mixed population of Asians and Caucasians and discovering the chopsticks gene. This can be overcome in GWAS by a technique called principal components analysis, which can be used to model the ancestry of the population and correct for stratification before conducting the analysis.
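The chopsticks example can be worked through with a few lines of arithmetic. The sketch below is my own illustration, not something from the talk: it treats both the allele and the trait as simple yes/no variables, assumes they are unrelated inside each group, and uses made-up frequencies.

```python
from math import sqrt


def pooled_correlation(weight_a, allele_a, trait_a, allele_b, trait_b):
    """Correlation between carrying an allele and having a trait in a
    pooled sample of two groups, when the allele and the trait are
    independent inside each group."""
    weight_b = 1.0 - weight_a
    # Marginal frequencies in the pooled sample.
    p_allele = weight_a * allele_a + weight_b * allele_b
    p_trait = weight_a * trait_a + weight_b * trait_b
    # Joint frequency: independence within each group, then mix the groups.
    p_both = weight_a * allele_a * trait_a + weight_b * allele_b * trait_b
    covariance = p_both - p_allele * p_trait
    return covariance / sqrt(p_allele * (1 - p_allele) * p_trait * (1 - p_trait))


# Hypothetical numbers: group A has the allele 80% of the time and uses
# chopsticks 90% of the time; group B has 20% and 10%. Equal group sizes.
mixed = pooled_correlation(0.5, 0.8, 0.9, 0.2, 0.1)
# The same calculation on group A alone (weight 1.0) shows no association.
group_a_only = pooled_correlation(1.0, 0.8, 0.9, 0.2, 0.1)
print(f"pooled: {mixed:.2f}, group A only: {group_a_only:.2f}")
```

The allele does nothing in either group, yet the pooled sample shows a sizeable correlation purely because both frequencies differ between the groups. Correcting for ancestry is an attempt to get back to the within-group comparison.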

The next speaker, Daniel Benjamin, spoke on the third problem – the low power of candidate gene studies. Power is the ability to statistically demonstrate an association when that association exists. A test with low power will miss the associations most of the time.

The low power of candidate gene studies is partly due to their typically low sample size, usually between 50 and 3,000 people. Benjamin points out that there may not be any genes in social science with effects large enough to be detected in samples of this size.

The low power of a study has an important implication beyond the inability to find any effects that exist. If real results are rare, they will be swamped by the false positives, which would occur for 1 in 20 tests using the typical significance level. Benjamin runs through some numerical examples and shows that given the expected effect sizes of genes on social science outcomes, you simply shouldn’t trust most candidate gene study results. False positives will drown the real findings. This contrasts with GWAS. Once you get to decent sample sizes in the order of 100,000, you can be relatively confident that what you do find (even though you miss a lot) will be true.
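I don’t have Benjamin’s numbers to hand, but the logic can be sketched with a standard calculation of the share of significant results that are real. The inputs below (one tested variant in 1,000 being truly associated, 10 per cent power for a small study, 50 per cent power for a large GWAS) are hypothetical values chosen by me for illustration.

```python
def share_of_true_findings(prior: float, power: float, alpha: float) -> float:
    """Fraction of significant results that reflect a real association.

    prior: share of tested variants that truly affect the trait
    power: chance a real association reaches significance
    alpha: chance a null variant reaches significance (false positive rate)
    """
    true_hits = prior * power
    false_hits = (1 - prior) * alpha
    return true_hits / (true_hits + false_hits)


# Small candidate gene study: low power, conventional 5% threshold.
candidate = share_of_true_findings(prior=0.001, power=0.10, alpha=0.05)
# Large GWAS: moderate power, very stringent threshold.
gwas = share_of_true_findings(prior=0.001, power=0.50, alpha=5e-8)
print(f"candidate gene study: {candidate:.1%} of hits are real")
print(f"large GWAS: {gwas:.2%} of hits are real")
```

Under these assumed inputs nearly every significant candidate gene result is a false positive, while nearly every GWAS hit is real, even though the GWAS still misses half of the true associations.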

Benjamin also talks about the Social Science Genetic Association Consortium (SSGAC), which is an attempt to build datasets large enough to apply GWAS to social outcomes such as IQ and risk aversion. The proof of concept was on educational attainment, which the next speaker covers in more detail.

Philipp Koellinger opens by asking why there are so many null results in the search for genetic influences. Is it because the effects are small? Because they are non-linear? Or there are gene-environment interactions? Maybe the results of twin studies showing most social outcomes are heritable are wrong?

Part of the answer was given by a study of educational attainment in which Koellinger and the previous two speakers were involved. They used a GWAS to search for SNPs that affected educational attainment in an initial sample of 100,000 people. They then replicated the result in another sample of 25,000 people. All three SNPs found in the discovery stage were replicated.

Importantly, the effect sizes were smaller than expected, with those three SNPs explaining 0.02% of the variation in educational attainment. If you added up the effects of all the SNPs in their sample, you could explain around 2 to 2.5% of the variation.

While this sounds low, it provides a basis for hope. Based on projections for larger sample sizes, it should be possible to explain 20% of the variation in educational attainment through genetic factors.

Jason Fletcher was next, and he asked two main questions. First, how much should we believe GWAS results given how differently GWAS is done compared to normal scientific procedure? Second, what use are GWAS results? He spends more time on the second question and points out the usual possibilities, such as providing measures for latent variables. For example, if you don’t know the IQ of your sample but have their genomes and know how genes affect intelligence, the genetic information could be used to attempt to determine the effect of IQ on a certain outcome.

Fletcher also points to the potential for exploration of gene-environment effects. He gives the example of people responding differently to tobacco taxation based on having different alleles. His paper on this topic is here.

Within his talk, Fletcher asks an interesting question about whether the SSGAC will become a natural monopoly in GWAS. Do we need a second SSGAC to enable people to check the results, and is it feasible for one to emerge? Others may be more viable as genetic testing becomes cheaper, but the tendency for one to dominate may still remain.

In the questions to Fletcher’s presentation, Benjamin makes the important point that the use of GWAS results as control variables could give much more precision to the estimates of the effect that a social science experiment is designed to measure. He gives the example of the Perry pre-school project – expensive educational interventions with a small sample, in which any added precision as to their effects would be of great value.

The last speaker, Dalton Conley, returned to the population stratification problem. His argument is that it may not be as easy to solve as it seems. Conley refers mainly to a technique called Genomic-relatedness-matrix restricted maximum likelihood (GREML) or Genome-wide complex trait analysis (GCTA) (which I have posted about before). This technique seeks to determine the contribution of all the sampled SNPs combined to variation in a trait. The output is a lower bound estimate of heritability. This technique relies, however, on an assumption that people who are less related than second cousins (more closely related pairs are removed from the sample) share alleles in a way that is uncorrelated with any similarity in environment.

Conley argues that this assumption is false, and shows that using GREML, he can obtain a finding that birth in an urban or rural environment is heritable, in direct violation of the assumption. This result does not disappear after controlling for population stratification.

To deal with this problem, consideration should be given to testing for variation within families, as any differences in genes between siblings will truly be random. The problem with this is that most massive datasets for which GWAS is performed don’t have pedigree data of that nature. The good news, however, is that the violation of the assumption does not seem to puncture the GWAS results. It is violated but the consequences are trivial. A paper by Conley and friends on this topic can be found here.

Genoeconomics and the ENCODE project

The ENCODE (Encyclopedia of DNA Elements) project is an international collaboration that intends “to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active.”

The project has made a splash in the last couple of days with the publication of thirty open access papers across Nature, Genome Research and Genome Biology describing some of the results. Much of the blogosphere has been hosing down the declarations of the accompanying press releases, so don’t expect any revolutions to come out of this work just yet. Similarly, the ENCODE project is not about to spur the genoeconomics revolution (the use of molecular genetics in economics). However, the project is a reminder that there is some very cool work going on (at least for those of us not already in the loop).

One important consideration for genoeconomics is how the ENCODE project might affect genome wide association studies (GWAS). ENCODE outputs were compared with previous results of GWAS for disease, and support was found for previous results. As described on the Nature News site:

Since 2005, genome-wide association studies (GWAS) have spat out thousands of points on the genome in which a single-letter difference, or variant, seems to be associated with disease risk. But almost 90% of these variants fall outside protein-coding genes, so researchers have little clue as to how they might cause or influence disease.

The map created by ENCODE reveals that many of the disease-linked regions include enhancers or other functional sequences. And cell type is important. Kellis’s group looked at some of the variants that are strongly associated with systemic lupus erythematosus, a disease in which the immune system attacks the body’s own tissues. The team noticed that the variants identified in GWAS tended to be in regulatory regions of the genome that were active in an immune-cell line, but not necessarily in other types of cell and Kellis’s postdoc Lucas Ward has created a web portal called HaploReg, which allows researchers to screen variants identified in GWAS against ENCODE data in a systematic way. “We are now, thanks to ENCODE, able to attack much more complex diseases,” Kellis says.

The problem for the genoeconomics enterprise is that the existing GWAS on economic traits are often of questionable value. Any results that are not spurious are of such small effect that biochemical analysis is not much use. Further, converting genetic activity to outcomes such as time or risk preference is a much more difficult proposition than examining disease pathways.

So, for the moment, the genoeconomics enterprise is probably best left examining twin studies, GREML analysis or other techniques that don’t need a particular gene and trait to be nailed down. That said, despite being a long way from being able to control for genetic effects by examining someone’s genome, we are not short of information that we can use.

The more interesting part of the events of the last couple of days, as has been noted in many blogs, is the publication model adopted for this release of the ENCODE results. While not without problems (Daniel MacArthur’s mixed reaction is one example worth reading), the information available and the way it is presented are quite cool and hopefully another step towards more open access to data in the field. You can download an iPad app which has the thirty open access papers, plus an interesting feature called “threads” which allows exploration of issues across the papers. Much of it is heavy going for someone not in the field, and it is useful to use the blogosphere to interpret the information, but there are worse ways to get up to speed with what is happening.

Genoeconomics: molecular genetics and economics

The Journal of Economic Perspectives has an excellent article by Beauchamp and colleagues titled Molecular Genetics and Economics (ungated pdf here). It is a nice contrast to another article in the same issue, Charles Manski’s bashing of the heritability straw man.

The authors argue that “genoeconomics”, the use of molecular genetics in economics, has the potential to supplement traditional behavioural genetic studies and build an understanding of the biology underlying economically relevant traits. They note that behavioural genetics, particularly research into heritability, has produced compelling evidence of the link between economically important characteristics and DNA. Molecular genetics is an “exciting tool” that can now be turned to this area.

However, potential pitfalls mar the way forward. These pitfalls are beautifully illustrated by a study that the authors undertook in which they sampled over half a million single-nucleotide polymorphisms (SNPs) from each of 7,500 people. An SNP is a DNA sequence variation where a single nucleotide differs between people. They then searched for SNPs associated with educational attainment. They found a large number of associations, many passing significance tests of 10⁻⁶. Passing this test means that, if an SNP had no real effect, an association this strong would turn up only about one time in a million (of course, there were 500,000 chances). If they took this result to the right journal, they might have had their study published and got some headlines about “the education gene”.
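The “500,000 chances” point is easy to put numbers on. This is my own back-of-the-envelope sketch rather than anything from the paper, and it leans on two idealised assumptions: that no SNP has any real effect, and that the tests are independent of each other (which real SNPs, being correlated with their neighbours, are not).

```python
from math import expm1, log1p


def chance_hits(n_tests: int, p_threshold: float):
    """Expected number of purely chance hits, and the probability of at
    least one, across n_tests independent tests of null effects."""
    expected = n_tests * p_threshold
    # 1 - (1 - p)^n, computed in a numerically stable way.
    prob_any = -expm1(n_tests * log1p(-p_threshold))
    return expected, prob_any


expected, prob_any = chance_hits(500_000, 1e-6)
print(f"expected chance hits: {expected:.2f}")
print(f"probability of at least one: {prob_any:.0%}")
```

So a one-in-a-million result stops being remarkable once half a million SNPs have been tested. Under these assumptions, a hit or two at that threshold is roughly what pure noise delivers.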

However, the authors took the 20 most significant associations from the first sample and checked them against the SNPs from another sample of 9,500 people. In the second sample, none of these 20 SNPs significantly affected educational attainment, even using a weak five per cent significance test. This showed that the results from the first sample were spurious.

The authors noted some important lessons from this. The first is that given the low sample size of many studies, the probability of a true association being discovered among the noise is minute. The studies are underpowered – power being the probability that an association between an SNP and the trait of interest will be found when there is a relationship. The fact that almost all SNPs reported in the literature can explain very little of the variation in most traits exacerbates this problem as the studies are trying to detect small effects. For example, no marker has been found to predict more than one per cent of the variation in height between people. As a result, very large samples are required to find true associations and sort them from the noise.

For example, with a five per cent threshold test for significance and an SNP that explains 0.1 per cent of variation in a trait, you need a sample of 4,000 subjects before the association has a 50 per cent chance of being found. Yet, in a 500,000 SNP panel there are likely to be thousands of false positives that meet the five per cent significance level (around 25,000 if no SNP has any real effect).
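For those who want to play with these trade-offs, here is a rough power calculation. It uses a textbook normal approximation for a two-sided test of a single SNP, which is my assumption and not necessarily the calculation the authors used, so it will not reproduce every figure in the paper exactly. The function and parameter names are mine.

```python
from statistics import NormalDist


def approx_power(n: int, r2: float, alpha: float) -> float:
    """Approximate chance of detecting an SNP that explains a share r2 of
    the variance in a trait, with n subjects and a two-sided threshold
    alpha, using a normal approximation to the test statistic."""
    std_normal = NormalDist()
    # Critical value for a two-sided test at level alpha.
    z_crit = -std_normal.inv_cdf(alpha / 2)
    # Expected size of the test statistic when the effect is real.
    shift = (n * r2 / (1 - r2)) ** 0.5
    # Probability of landing beyond either critical value.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# 4,000 subjects, SNP explaining 0.1 per cent of variation, 5% threshold.
power_5pct = approx_power(4_000, 0.001, 0.05)
print(f"power at the 5% threshold: {power_5pct:.0%}")
```

With these inputs the approximation lands close to the 50 per cent quoted above. Passing a much smaller `alpha` with the same sample shows how quickly power collapses when the threshold is tightened without adding subjects.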

If the significance threshold is tightened by a factor of five million, from five per cent to 10⁻⁸, which is appropriate given the huge number of potential associations being tested for in a 500,000 SNP panel, the required sample size increases. For an SNP that explains 0.1 per cent of variation in a trait, the study will need a sample of around 25,000 to have a 50 per cent chance of detecting the relationship. If the SNP explains 0.01 per cent of the variation, a sample size of 200,000 results in only a 20 per cent chance of finding the relationship. The more stringent significance test does reduce the number of false positives, but that reduction comes at the cost of power, which must be compensated for by increased sample size. At this time, there is little useful genetic data available in samples of this size.

Beyond the power issue, the authors identified publication bias as a problem. Papers which find interesting relationships are more likely to be published, which creates incentives for data mining and the write-up of results that are interesting but not robust. It is not easy to find a publisher for a paper that shows no relationship. This paper by Beauchamp and colleagues is the exception that proves the rule. To get their negative finding published they turned it into an analysis of the broader use of genetics in economics.

They do note, however, that data mining in genoeconomics is not in itself bad. It is when it is not accompanied by robust methodologies and stringent review processes that the problems arise.

Beauchamp and colleagues close their paper by noting some benefits of the genoeconomics enterprise. They endorse the use of genetic information in policy, even where the causal mechanisms are not known. They give the example of targeting children with markers for dyslexia with alternative teaching methods. This is a good long-term goal, but we will want to have SNPs explaining more variation in traits before this will be useful. For now, family history or information about siblings and twins is more useful information. How much of that information is being used now?

More interestingly, they suggest that this genetic data could be used as a control variable in other economics studies. If it is known that, say, income varies with certain SNPs, those SNPs might be used as a control in a study of how certain environmental factors affect income.

Their last suggestion is that the information obtained from genoeconomics could be used to understand variation in policy response across people. Compared to the standard economic assumption that everyone is the same, this might be the most radical effect of the genoeconomics enterprise.

Beauchamp, J., Cesarini, D., Johannesson, M., van der Loos, M., Koellinger, P., Groenen, P., Fowler, J., Rosenquist, J., Thurik, A., & Christakis, N. (2011). Molecular Genetics and Economics. Journal of Economic Perspectives, 25(4), 57-82. DOI: 10.1257/jep.25.4.57