Power and limitations of genetics as a tool to study racial differences

The recent New York Times opinion piece by the geneticist David Reich on genetics of differences between human races has generated much discussion among my scientist friends. It was followed by a rebuttal from 67 scholars of diverse backgrounds, and a sympathetic article by the blogger and essayist Andrew Sullivan. I typically stay away from this topic because, given its horrific history, it deserves very careful treatment by experts in a variety of disciplines. I am not well equipped to discuss most aspects of this, but it did strike me that a few key points are either missing in the debate or scattered across the various contributions. I will attempt to collect and lay out the most essential arguments here. I will confine myself to a fairly narrow scope, in order to not stray beyond my area of expertise. Truth be told, I am writing this to some extent in order to clarify my own thinking, but I hope it will help others as well.

In a nutshell, my concerns are that a somewhat ill-defined concept of "population" is used interchangeably with the even more nebulous, and historically fraught, concept of "race"; that too much attention is focused on small effects, while variation around these estimates is all but ignored; that within-population observations of trait heritability are used to infer the nature of among-population differences; and that the existence of a genetic component of variation in a trait is used to imply that it cannot be modified by environmental interference.

Before I go into detail on these points, let me lay a bit of groundwork. For the purposes of this discussion, a phenotype is a measurable characteristic of an individual, such as height, and I use this term interchangeably with "trait." We are concerned with the variation of phenotypic values among individuals, and want to know how genetic variation contributes to it. Note that this is not the same as studying the genetic basis of the trait itself. For example, we can look at two people of differing height. Each had to attain their stature by developing from an egg. A multitude of genes had to be turned on or off, executing a developmental program that produced the outcomes we see in these people. However, the DNA sequence of most of these genes is identical between our pair of individuals, and thus these genes do not contribute to the variation we observe.

We next need a way to talk about genetic variation, since this is the material that can potentially contribute to phenotypic differences. Genetic variation is simply a difference of DNA sequence among genomes of individuals. To find such variation, we use any of the available technologies to determine the DNA sequence of each person we picked for the study. We then take advantage of the fact that more than 99% of the genome will be identical between any two people on average to line up the DNA sequences. Looking at this alignment, we find the positions in the genome where there are mismatches. We are interested not only in the number of such positions, but also the frequency of the alternative state (which can span one or more nucleotides). We will call such states at the same position "alleles." For example, if out of 100 individuals two people carry one allele, while the rest carry another, we say that the minor allele frequency is 2%.
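The procedure just described (line up the sequences, find mismatched positions, count allele frequencies) can be sketched in a few lines of Python. This is only a toy illustration under simplifying assumptions: the function name `minor_allele_frequencies` is made up for this example, the sequences are assumed to be already aligned and of equal length, and each individual is represented by a single sequence, as in the 2% example above.

```python
from collections import Counter


def minor_allele_frequencies(sequences):
    """Return {position: minor allele frequency} for every variable site.

    `sequences` is a list of already-aligned, equal-length strings,
    one per sampled individual (a simplification for illustration).
    Positions are 0-based.
    """
    if not sequences:
        return {}
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")

    n = len(sequences)
    frequencies = {}
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences)
        if len(counts) > 1:  # a mismatch: this site is variable
            major_count = counts.most_common(1)[0][1]
            # Everything that is not the most common allele counts as "minor".
            frequencies[pos] = (n - major_count) / n
    return frequencies


# 100 individuals, two of which carry a different allele at position 2.
toy = ["ACGT"] * 98 + ["ACTT"] * 2
print(minor_allele_frequencies(toy))  # {2: 0.02}
```

Real variant-calling pipelines are far more involved (diploid genotypes, sequencing errors, insertions and deletions), but the bookkeeping at the end is essentially this counting step.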

Populations and races

Now we are equipped to discuss the issues raised by David Reich. From the start, he clearly lays out the evidence that historical notions of race are ill-defined and do not necessarily line up with "populations." But what are these populations? The bloodless technical definition is: groups of individuals that can freely interbreed, but are at least somewhat reproductively isolated from other groups. The emphasis on mating is not accidental or prurient. Groups that do not interbreed do not exchange genetic information and thus can evolve independently from each other. Typically (but not always), population boundaries reflect some sort of geographical separation. So it is with humans. Throughout the piece, when Reich gives examples of research findings, he talks about geographical populations (such as "West Africa"). Yet, despite clearly describing the difficulty of relating the social concept of race to the geographical definition of populations, Reich still slips up and uses them interchangeably. Look at the following two paragraphs:

Beginning in 1972, genetic findings began to be incorporated into this argument. That year, the geneticist Richard Lewontin published an important study of variation in protein types in blood. He grouped the human populations he analyzed into seven "races" -- West Eurasians, Africans, East Asians, South Asians, Native Americans, Oceanians and Australians -- and found that around 85 percent of variation in the protein types could be accounted for by variation within populations and "races," and only 15 percent by variation across them. To the extent that there was variation among humans, he concluded, most of it was because of "differences between individuals."

In this way, a consensus was established that among human populations there are no differences large enough to support the concept of "biological race." Instead, it was argued, race is a "social construct," a way of categorizing people that changes over time and across countries.

Note the transition from Lewontin's study of geographical populations to the description of the consensus opinion on "biological race." A casual reader (as most readers undoubtedly are) would be forgiven for assuming the two terms are interchangeable. This mingling of concepts led to the rebuke by the group of 67 researchers voiced in the BuzzFeed piece. The rebuttal justifiably complains that Reich does not draw a sharp enough line between geographical populations and the historically-defined races. However, it then appears to deny any population differences whatsoever. The most offensive paragraph is

Given random variation, you could genotype all Red Sox fans and all Yankees fans and find that one group has a statistically significant higher frequency of a number of particular genetic variants than the other group -- perhaps even the same sort of variation that Reich found for the prostate cancer–related genes he studied. This does not mean that Red Sox fans and Yankees fans are genetically distinct races (though many might try to tell you they are).

It is absolutely true that, given enough genetic variants, any random partition of a group of humans into sub-groups will result in some of these variants differing in frequency between the sets of individuals. But every population geneticist knows that, and robust methods that are theoretically well motivated and extensively tested in practice exist to correct for these random differences. Claiming otherwise betrays a lack of understanding of even the basic methods used to analyze genetic data.

To recap, there are robustly identifiable differences in allele frequencies among geographically defined human populations, beyond what we would expect by chance. That said, these populations are clearly not the same as "races," a term with a horrific history of abuse and unclear definition. The study of the relationship, if any, between the two concepts requires cooperation at a minimum between geneticists, anthropologists, and sociologists. Although I am by no means an expert in the latter two areas, it seems fairly clear that races are at most poor proxies for geographical populations and thus not particularly useful for any purely genetic research program.

Small differences, large variation

If geographical populations are entities that we can discuss from a genetic perspective, why did I say in the beginning that they are "ill-defined"? If you go back to the definition I presented above, you will note that it is quantitative. There is more interbreeding within than among populations. But what level should we set as a boundary? In practice, if we are using genetic information to estimate the number of populations in our data set, we look for evidence of allele frequency mis-match between groups. If we find such evidence, beyond what we would expect by chance, we declare that the groups are "populations." As we get more and more data, adding individuals and DNA sequence variants, our power to detect subtle population structure increases. Lewontin, in his 1972 study mentioned in Reich's article, had data for 14 markers and a range of a dozen to 100 individuals, depending on the marker. His estimate was that 15% of variation in allele frequency among people from different continents was explainable by their geographic origin. A newer examination, published in 2012 using three million markers and 602 individuals, came up with a remarkably similar estimate (12%). But while these additional data do not seem to overturn the older results at continental scale, we are now able to make finer geographical distinctions, even on the scale of countries within Europe.

Despite the greater statistical confidence that we can attach to these fine distinctions, the magnitudes of differences are still tiny. Only about 1% of genetic variation is attributable to differences among populations within a continent. There are two ways to understand what this means. If we take two people from the same population on a continent and compare their DNA sequences, they will be roughly 12% more similar to each other than if we pick pairs from different continents, but only 1% more similar than two people from distinct populations on a continent. Another way to look at this is to try to predict a person's genetic composition. If we know nothing about the individual other than the fact that they are human, we can go through each site in their genome that is variable in the whole human population and assign the major allele at that site as our predicted value. Because most humans deviate from the mean, we will often be wrong in our prediction. If we additionally know what continent the person is from, we can increase our accuracy roughly by 35% (square root of 12%). If we further know the population, that information only gives us a 10% advantage (square root of 1%). The upshot is that well over half of the difference between people coming from different continents is due to individual, not geographic, factors. Once again, we are not talking about racial groups here. Since the alignment between race and population is at best imperfect, the explanatory power of race is even less.
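The square-root arithmetic in this paragraph is easy to check. The snippet below is just that check, taking the fractions of variance from the text (12% among continents, 1% among populations within a continent); the helper name `accuracy_gain` is made up for this illustration.

```python
import math


def accuracy_gain(variance_fraction):
    """Square root of the fraction of variance explained by group membership."""
    return math.sqrt(variance_fraction)


# Fractions of genetic variance quoted in the text.
among_continents = 0.12
among_populations_within_continent = 0.01

print(f"continent known:  {accuracy_gain(among_continents):.1%}")
print(f"population known: {accuracy_gain(among_populations_within_continent):.1%}")
```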

Finally, let me point out that up to now we did not even mention any phenotypes. We have been dealing only with differences in DNA sequence. Reich moves quickly between discussing allele frequencies and phenotypes, in my opinion not doing enough to direct the reader's attention to the transition. The extent to which variation in phenotypes among individuals is explained by the variation in their genetic makeup is called heritability. The statistical machinery used to calculate it is the same as the one used for estimating what fraction of genetic variation is due to geographical factors, but the biological meaning is obviously different. Heritability depends on many factors, and can be vastly different among phenotypes. Thus, when Reich writes that "The ancestors of East Asians, Europeans, West Africans and Australians were, until recently, almost completely isolated from one another for 40,000 years or longer, which is more than sufficient time for the forces of evolution to work," he is talking about the divergence of genotype frequencies. Much difficult work is still required to establish what, if any, fraction of this genetic differentiation has phenotypic consequences, and how the genetics interact with the distinct environments found at these locations.

Heritability of traits and population differences

The crucial thing to understand about heritability is that it can be estimated only within populations. While in theory DNA variation may contribute to between-population differences in phenotype, we cannot directly observe these effects. We instead rely on statistical models to infer the genetic contribution to phenotypic variation. Random mating within the sample is a crucial assumption in these models. While they depend on the data set structure and the kind of data available (most notably, do we have genome sequence or are we relying on similarities between relatives?), essentially these models amount to estimating correlations between genotypes and phenotypes. As most people know, correlation is not causation. To convince yourself of this you can play with the data sets in Google Correlate. Since we know from biology that information flows exclusively from genotype to phenotype, we are not concerned about misinterpreting the direction of causality from the correlations between genotypes and phenotypes. But spurious correlations can also arise if the things we measure are both influenced by an unobserved factor. In our case, population structure is one of the most problematic confounding factors in heritability estimates. Population differences in phenotypes can be due to local environmental effects, while genotype differences can be due to historical contingencies that we have no way of re-tracing. More subtle, and harder to control for, problems such as sampling bias can also occur.
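The confounding effect of population structure described above can be seen in a small simulation. This is a toy sketch with made-up numbers, not an analysis of real data: two populations differ in the frequency of one allele and, independently, in an environmental effect on the phenotype. The genotype has no causal effect on the phenotype at all, yet pooling the populations produces a clear genotype-phenotype correlation. All names and parameter values are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_population(rng, n, allele_freq, env_mean):
    """Genotypes (0, 1, or 2 allele copies) and phenotypes for one population.

    The phenotype depends only on the local environment plus noise;
    the genotype plays no causal role.
    """
    genotypes = [
        (rng.random() < allele_freq) + (rng.random() < allele_freq)
        for _ in range(n)
    ]
    phenotypes = [env_mean + rng.gauss(0.0, 1.0) for _ in range(n)]
    return genotypes, phenotypes


rng = random.Random(1)
geno_a, pheno_a = simulate_population(rng, 2000, allele_freq=0.2, env_mean=0.0)
geno_b, pheno_b = simulate_population(rng, 2000, allele_freq=0.8, env_mean=1.0)

r_within_a = pearson(geno_a, pheno_a)
r_within_b = pearson(geno_b, pheno_b)
r_pooled = pearson(geno_a + geno_b, pheno_a + pheno_b)

print(f"within A: {r_within_a:+.3f}")
print(f"within B: {r_within_b:+.3f}")
print(f"pooled:   {r_pooled:+.3f}")
```

Within each population the correlation hovers around zero, as it should. The pooled correlation is substantial only because both the allele frequency and the environment differ between the two groups.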

Given this background, it is easy to see that heritability of a trait within populations says nothing about the genetic basis of between-population differences. This is the glaring error made by Sullivan in his article:

... genetics have a significant part to play (heritability ranges from 0.4 to 0.8) in explaining different racial outcomes in intelligence tests...

David Reich of course knows better, but even he makes a more subtle form of this mistake. After discussing the studies of the genetics of educational attainment and IQ (at least this is my best guess at the papers he mentioned), he writes

Is performance on an intelligence test or the number of years of school a person attends shaped by the way a person is brought up? Of course. But does it measure something having to do with some aspect of behavior or cognition? Almost certainly. And since all traits influenced by genetics are expected to differ across populations (because the frequencies of genetic variations are rarely exactly the same across populations), the genetic influences on behavior and cognition will differ across populations, too.

But the studies he cites were all conducted within populations. Furthermore, as is typical when estimating genome-wide associations between phenotypes and genotypes, the authors explicitly control for population structure within their samples. Non-genetic differences between populations can work in the opposite direction of the genetic variants identified within populations. Worse, the relationship can be non-linear, with the magnitude and direction of the genetic effect itself depending on the environment. Let me say this again: the magnitude of within-population heritability says absolutely nothing about the genetics of between-population differences.

I am not saying that there are no meaningful genetically-influenced phenotypic differences between human populations. I am saying that our research on trait heritability and genome-wide associations does not help us find such differences and does not allow us to form any expectations as to what the eventual answers will look like. I believe this is true even for model systems and well-studied agricultural species, and remains an area of active research.

While we are on the subject of genome-wide associations, let me loop back to what I said before about effect magnitudes. As with the power to detect allele frequency differences between populations, our ability to detect genotype-phenotype associations grows with sample size. The studies mentioned above have very large samples, and so are able to measure very subtle effects. But tiny effects are all they find. Okbay and colleagues, for example, find that the variants they identify explain between 0.01 and 0.035% (yes, percent) of the variance. Knowing an individual's genotype at these positions in the genome yields virtually no useful information if we want to predict that person's educational attainment. As a fellow geneticist, I understand why David Reich is excited about these results. There are reasons other than phenotype prediction to go after such effects, although it is appropriate to question whether genome-wide associations like these are cost-effective given limited research budgets. But when we communicate these results to the public we must make an effort to step back and put them in proper perspective, and realize with a measure of humility that genetics is not all there is to biology.

Directional environmental intervention and genetics

So far, I have argued that even though we can confidently identify even subtle differences in genotype frequencies among, and associations of DNA variants with phenotypes within, populations, we still have no idea what, if anything, the genetic variation contributes to among-population phenotypic differentiation. Nevertheless it is possible, although not very likely, that genetic variation will turn out to drive most population differences even for moderately heritable phenotypes such as IQ. Does this mean that we can do nothing to mitigate these differences with environmental interventions? The answer is unequivocally "no." David Reich is very emphatic on this point. The 67 signatories do not appear to accept that populations are in any way genetically differentiated. I wonder if they are driven to this untenable position by the misconception that a finding of a genetic basis of such differences would mean that nothing can be done about them, a possibility they would find catastrophic. Andrew Sullivan comes at this from the opposite perspective. While he does not quite argue that nothing can, and therefore should, be done about group disparities in IQ because of their genetic nature, he does write

It’s both undeniable to me that much human progress has occurred, especially on race, gender, and sexual orientation; and yet I’m suspicious of the idea that our core nature can be remade or denied.

Charles Murray, a co-author of The Bell Curve, and Sam Harris, in their conversation recorded about a year ago, go much further and flatly say that there is nothing we can do about traits, IQ in particular, that are "moderately heritable," i.e. about 50%.

To see why this is completely wrong, we can look at men's height measurements. This is a highly-heritable trait, with 89 to 93% of variation explained by genetic factors. Nevertheless, average male height has been steadily increasing over the past 150 years. Most, if not all, of this increase is likely due to changes in environmental factors such as nutrition. Perhaps an even more striking example is provided by the inborn errors of metabolism, diseases that are caused by mutations in genes required in metabolic pathways. These are not only close to 100% heritable, but also caused by single mutations. This is as genetic as we can get, yet many of these disorders can be managed and sometimes completely cured by dietary modifications.

Given the dark history of bigotry and xenophobia, discussions of race cannot be viewed as purely intellectual exercises. The influence of these debates on public policy and attitudes affects the lives of millions of people. Geneticists have an important role to play in the discussions, and I am happy to see a prominent scientist like David Reich calmly and very ably engaging the public on this topic. However, as a close look at his contribution reveals, it is extremely hard to present a clear, yet nuanced view of the genetics of racial differences. We have to work extra hard to choose our words appropriately and always be clear what aspect of this tangled problem we are talking about. Unconscious professional biases also creep in. The only remedy is a group effort, with constructive criticism and engagement with scholars from other disciplines and the public. Those who are bent on misconstruing genetics to rationalize bad acts are probably mostly unreachable, but the majority of the public will likely respond reasonably given correct information. Well-meaning policy makers who are worried that any discovery of a genetic basis for socially important traits like educational attainment or IQ would render their efforts moot should be reassured that genes are not destiny, even when they play important roles in trait expression.

Fast ordered sampling of SNPs from large files

Ever since I left my academic position and became independent, I have been interested in the idea of using minimal computer resources to perform big-data statistical analyses. I do not have a permanent office, so the only computer I own is a laptop, although it is a close to maximal-spec 15-inch MacBook Pro (mid-2015). I can use Amazon's AWS for really big jobs, but the necessity to quickly trouble-shoot software and pipelines remains.

The major bottleneck for me is the size of DNA variant data sets. Phenotype data are increasing as well, but so far lag genomic sets by a couple of orders of magnitude. Binary compression, most notably the .bed format developed by the plink team, goes a long way towards making it possible to manipulate large genotype data sets on a laptop. To speed development even further, I needed to generate random subsets of large variant files. The samples had to reflect whole-data-set properties; most importantly, loci had to be ordered along chromosomes the same way as in the original file. I wrote a tool I call sampleSNPs to do this. As I was using it, I realized that it can have many more applications. For example, samples could be used for jack-knife uncertainty estimates for whole-genome statistics. While I personally stick with .bed files, I extended the functionality to other popular variant file formats (VCF, TPED, and HapMap). The only program I could find that also does this is plink, but it does it wrong. You give plink the fraction of variants you need (say, 0.2) and it will include each locus with that probability. As a result, the total number of SNPs in your sample varies from run to run. Here is an illustration, after running plink 1000 times:
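The run-to-run variation in sample size is easy to mimic in a toy setting. The Python sketch below is neither plink's code nor sampleSNPs. It only contrasts the two ideas described above: including each locus independently with a given probability, versus drawing a fixed number of loci and keeping them in their original order. The function names and the numbers (10,000 loci, 100 repeats) are made up for illustration.

```python
import random


def bernoulli_sample(n_loci, fraction, rng):
    """Include each locus independently with probability `fraction`."""
    return [i for i in range(n_loci) if rng.random() < fraction]


def fixed_size_sample(n_loci, fraction, rng):
    """Draw exactly round(n_loci * fraction) loci, returned in original order.

    This holds every locus index in memory, which is fine for a toy
    example but is not how one would stream through a very large file.
    """
    k = round(n_loci * fraction)
    return sorted(rng.sample(range(n_loci), k))


rng = random.Random(42)
n_loci, fraction = 10_000, 0.2

bernoulli_sizes = [len(bernoulli_sample(n_loci, fraction, rng)) for _ in range(100)]
fixed_sizes = [len(fixed_size_sample(n_loci, fraction, rng)) for _ in range(100)]

print("per-locus inclusion:", min(bernoulli_sizes), "to", max(bernoulli_sizes), "loci")
print("fixed-size sampling:", min(fixed_sizes), "to", max(fixed_sizes), "loci")
```

With per-locus inclusion, the sample size follows a binomial distribution around the target. With fixed-size sampling it is the same on every run.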


When helping people with genome-wide association, mapping, and genome prediction analyses I am frequently asked to estimate the rate of decline of linkage disequilibrium with distance. This problem seems easy, but it blows up quickly as the number of markers grows and it becomes infeasible to estimate all pairwise LD. Sliding window approaches work, but waste disk space by over-sampling loci that are close to each other. I applied my sampling scheme to pairs of markers, resulting in a quick method that still produces reasonable estimates while using trivial computer resources.

The sampling scheme itself deserves a mention. Designing an algorithm like this is surprisingly hard. While most classical computational problems of this sort were solved in the late 70s and early 80s, the on-line ordered sampling algorithm solution did not appear until 1987 in a paper by Jeffrey Vitter. Maybe because it came later than most, this work has been languishing in relative obscurity. I stumbled upon it by chance while reading some unrelated blog posts on computational methods.
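To give a feel for what on-line ordered sampling means, here is a Python sketch of the older and much simpler selection-sampling scheme (often called Algorithm S). It is not Vitter's algorithm and not the implementation discussed in this post. It makes a random decision for every record, whereas Vitter's contribution was to generate how many records to skip, so that most records are never examined. The function name and arguments are made up for this illustration, and the total number of records is assumed to be known in advance.

```python
import random


def ordered_sample(records, n_total, n_wanted, rng=random):
    """Yield exactly `n_wanted` of `n_total` records, preserving their order.

    Selection sampling: each record is kept with probability
    (records still needed) / (records still remaining), which gives
    every subset of size `n_wanted` the same chance of being chosen.
    Assumes `records` really yields `n_total` items.
    """
    if not 0 <= n_wanted <= n_total:
        raise ValueError("need 0 <= n_wanted <= n_total")
    still_needed = n_wanted
    for seen, record in enumerate(records):
        if still_needed == 0:
            break
        remaining = n_total - seen
        if rng.random() * remaining < still_needed:
            yield record
            still_needed -= 1


picked = list(ordered_sample(range(1000), 1000, 50, random.Random(0)))
print(len(picked), picked[:5])
```

Because the records are visited in a single pass, this works on a file that is read line by line, and the output is automatically in the same order as the input.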

I came up with my own C++ implementation of the Vitter algorithm, using Kevin Lawler's C code as a guide. I wrote up a manuscript, available as a pre-print here and here while I am working on getting it published in a journal. The whole project is on GitHub. The included C++ library I wrote as a basis for the stand-alone programs has additional functionality and can be used in other software. It has no dependencies and is being released under the three-clause BSD license.

While I created these tools to scratch my own itch, I hope other people will find them useful. I hope in particular that researchers with limited access to powerful hardware can use this to break into genomics and statistical genetics. I would love to get user feedback, especially if there are any problems.

Accidental benchmarking of Apple's new file system, APFS

Apple recently released High Sierra, the new version of macOS. While it is billed as one of those "stability releases" with few user-facing changes, it introduces a new file system, APFS. This system has already been rolled out for iOS devices. This is not something users interact with directly, but Apple lists a number of features (such as on-disk system snapshots) that are enabled by APFS.

Coincidentally, I have been working on a method to randomly sample records from files. High Sierra became available in the middle of the project, while I was repeatedly benchmarking my code. I will report the details of the method soon (watch this space!). To test a base-line method, I wrote a program (in C++, compiled with Apple's llvm with -O3 optimization) that reads a line of the target file and then probabilistically decides whether it will save it to an output file or not. The files can be text or binary. Given that the sampling is reasonably sparse, the execution is dominated by file reading operations. Binary files are read with the read() ifstream method, while text files are processed with an overload of the getline() function. I then use the clock() function to time execution. I vary the number of records sampled, and perform 15 replicates to estimate execution time variability, which can be due to any number of factors. For example, since I execute the program on my laptop, other processes running at the same time can interfere by commandeering file I/O facilities.
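The structure of the base-line test can be sketched as follows. This is a Python analogue written for illustration, not the C++ program used for the benchmarks: it covers only the text-file case, uses `time.process_time()` as a rough counterpart of `clock()`, and the file names, record count, and sampling probability are all made up. Absolute timings from it are not comparable to the numbers reported in this post.

```python
import os
import random
import tempfile
import time


def sample_lines(in_path, out_path, keep_prob, rng):
    """Read a text file line by line, keeping each line with `keep_prob`.

    Returns the number of lines written and the CPU time used, in
    milliseconds.
    """
    kept = 0
    start = time.process_time()
    with open(in_path, "r") as src, open(out_path, "w") as dst:
        for line in src:
            if rng.random() < keep_prob:
                dst.write(line)
                kept += 1
    elapsed_ms = (time.process_time() - start) * 1000.0
    return kept, elapsed_ms


# Small self-contained demonstration on a throw-away file.
with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "records.txt")
    out_path = os.path.join(tmp, "sample.txt")
    with open(in_path, "w") as fh:
        fh.writelines(f"record {i}\n" for i in range(10_000))

    kept, elapsed_ms = sample_lines(in_path, out_path, 0.01, random.Random(7))

    with open(out_path) as fh:
        n_written = sum(1 for _ in fh)

print(f"kept {kept} of 10000 lines in {elapsed_ms:.2f} ms of CPU time")
```

Repeating the call several times per setting, as described above, gives an idea of how much the timing varies from run to run.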

I ran my program on my MacBook Pro 15-inch (mid-2015) laptop with an SSD. Re-running it recently, after I updated to High Sierra and APFS, I noticed an approximately two-fold speed-up on a binary input file. This is shown below in two plots: the one on the left was generated when I was still on Sierra and therefore HFS+, while the plot on the right was generated on the same computer after updating to High Sierra with APFS. The amount of free space on the drive was comparable, and the disk was encrypted with FileVault under both systems.


The x-axes indicate the number of samples taken from a file of fixed size. The y-axes are the time it took to perform each operation (in milliseconds). Note that the average time taken, as well as that for each sample size, was reduced by half after updating to APFS.

Execution timing of the same scheme on a text file did not decrease, however. The following pair of plots is organized the same as before, and the y-axis values are the same before and after the update.


The outlier observations that pop up on the new OS are not related to the system update. They occurred (fairly inconsistently) under HFS+, too.

Note that all operations were done on an SSD. APFS apparently does not support spinning drives yet. I saw another set of file system benchmarks that also showed some speed-ups, but my results may provide a useful extra data point for people running file I/O-intensive applications. I was unable to find any other comparisons that separate binary and text file operations. I would love to hear about other users' experiences. Please contact me with questions or comments. I will release the source code once I complete the project that was actually the point of this exercise.

The Google diversity memo makes an important mistake

The recently-published memo (full text can be found at the end of this article) from a Google employee about the causes of gender disparity in the number of people working at the company generated a lively discussion. The arguments are as wide-ranging as the original post and involve many important topics. Given my area of expertise, I want to focus on a fairly narrow but fundamental part of the case made by the memo's author. He summarizes it as follows:

Differences in distributions of traits between men and women may in part explain why we don't have 50% representation of women in tech and leadership.

I think it is fair to say that essentially everything else in the Google diversity memo flows from this assertion. Before we begin examining it, let us lay down some groundwork.

To fix ideas, I start with a somewhat abstract and simple situation. Suppose we have two populations (call them A and B) that, on average, differ in some characteristic. While there is a real difference between means, each population consists of non-identical individuals, and so there are some distributions around the means. These distributions overlap substantially. I illustrate this scenario in the graph below (long vertical lines are the means).


Now suppose we are looking for people to hire and we need them to score 2 or above on this nameless attribute. Assume further that we can exactly determine each person's score. We can then screen members of both groups and select everyone who passes our test. If each group is equally represented in the applicant pool, the chosen population will have more people from group B than from group A (in the plotted example the A/B ratio is 0.43). Given all this, would we be justified in only considering individuals from group B for employment? I would argue no. The reason is simple: there are people who belong to group A and are qualified under our test. It is unfair to exclude them by disregarding their personal characteristics. Ignoring individuality in favor of group labeling is pretty much the definition of discrimination. While this is a value judgment and may not be universal, our society clearly accepts it, to the point where it is codified in law. In fact, the Google memo author emphasizes that he agrees with it:

Many of these differences are small and there’s significant overlap between men and women, so you can’t say anything about an individual given these population level distributions.

He, instead, argues that while we cannot use group labels to discriminate against individuals, a disparity in group representation does not imply bias. This is because, as we see from our toy example, even when we screen applicants fairly and precisely, group A is underrepresented in the resulting pool of successful candidates. Since a dearth of people from minority groups in organizations is almost universally used as a measure of discrimination, this argument would seem to undermine one of the very few (if not the only) quantitative measures of bias we have. The Google employee's memo is not the first to raise this problem. Larry Summers, then Harvard president, got in no small amount of hot water more than a decade ago for presenting a similar argument. The only difference is that he chose to highlight between-population differences in variance rather than mean.
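The toy screening example can be put into numbers with a few lines of Python. The parameters below are hypothetical: two normal distributions with equal spread, means of 0 and 0.5, and a cutoff of 2. They are not the parameters behind the plotted example, so the resulting ratio will not match the 0.43 quoted earlier.

```python
from statistics import NormalDist


def pass_fraction(mean, sd, threshold):
    """Fraction of a normally distributed group scoring above `threshold`."""
    return 1.0 - NormalDist(mean, sd).cdf(threshold)


# Hypothetical parameters, chosen only for illustration.
mean_a, mean_b, sd, threshold = 0.0, 0.5, 1.0, 2.0

frac_a = pass_fraction(mean_a, sd, threshold)
frac_b = pass_fraction(mean_b, sd, threshold)
ratio = frac_a / frac_b

print(f"group A passing: {frac_a:.4f}")
print(f"group B passing: {frac_b:.4f}")
print(f"A/B ratio among those who pass: {ratio:.2f}")
```

With equal numbers of applicants from each group, a modest difference in means turns into a much larger imbalance among those above the cutoff, even though a sizable number of people from group A still qualify.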

On the face of it, this seems like a compelling argument. It would appear that my model also supports it. But dig a bit deeper and we start running into problems. The crucial simplification that makes our theoretical exercise work is that groups A and B are defined by the very characteristic we wish to assess for employment suitability. This is emphatically not the case when the groups are defined by the shape of their genitalia or the color of their skin. These attributes are not directly relevant to most occupations, certainly not software engineering. Rather, we are told that group traits are associated with mental abilities that in turn predict success in engineering or some other desirable occupation. The absurdity of this position can be illustrated if instead of gender identity we group people by, say, the size of their pinky toe. We would demand that anyone asserting a relationship between such a trait and aptitude in writing computer code provide convincing evidence of an association. A long history of systematically propagated bias is the only reason gender and race are not met with the same level of skepticism when offered as group identifiers that supposedly predict skill at particular tasks. Of course, there are actually reasons to believe that, for example, gender is not inherently a good predictor of talent in STEM fields. But in any case the burden of proof is on the person arguing that gender identity predicts job performance. The Google employee's memo provides no such evidence and is essentially a somewhat long-winded exercise in begging the question.

Measures of minority underrepresentation in workplaces and whole fields have an important role to play in assessment of discrimination. Naturally, these assessments have to be done carefully and used thoughtfully in conjunction with other data. But they should not be disregarded.

Quickly extracting SNPs from alignments

Working on a Drosophila population genetics project, I needed to extract SNPs from a large-ish data set of sequence alignments from the Drosophila Genome Nexus. The alignments are in a slightly strange but handy format: each line and chromosome arm is in a separate file. Each sequence sits on a single line, as in FASTA, but without the traditional FASTA header. I needed 283 lines plus the D. simulans outgroup. I decided to come up with a way to do the SNP extraction on my laptop in reasonable time, and to save the result in the BED format used by plink, which is popular (at least in quantitative genetics). Briefly, this SNP table format uses two bits to represent a genotype (since there are only four states: reference, alternative, heterozygote, or missing) and is therefore really compact.
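To make the two-bit encoding concrete, here is a minimal sketch of how the genotypes at one SNP can be packed into bytes. It is not the align2bed code: the names `Genotype`, `packSnp` and `unpackGenotype` are hypothetical and chosen only for illustration. The bit codes and the packing order (first sample in the lowest-order bits) follow the plink .bed layout; which allele counts as "first" depends on the accompanying .bim file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Two-bit genotype codes as stored in a plink .bed file.
enum class Genotype : std::uint8_t {
    HomFirst  = 0x0,  // 00: homozygous for the first allele in the .bim file
    Missing   = 0x1,  // 01: missing genotype
    Het       = 0x2,  // 10: heterozygous
    HomSecond = 0x3   // 11: homozygous for the second allele in the .bim file
};

// Pack all samples at one SNP into bytes, four samples per byte.
// Sample 0 goes into the two lowest-order bits of byte 0.
// Unused bit pairs in the last byte stay zero.
std::vector<std::uint8_t> packSnp(const std::vector<Genotype> &genotypes) {
    std::vector<std::uint8_t> packed((genotypes.size() + 3) / 4, 0);
    for (std::size_t i = 0; i < genotypes.size(); ++i) {
        const std::uint8_t code = static_cast<std::uint8_t>(genotypes[i]);
        packed[i / 4] |= static_cast<std::uint8_t>(code << (2 * (i % 4)));
    }
    return packed;
}

// Recover the genotype of one sample from the packed bytes.
Genotype unpackGenotype(const std::vector<std::uint8_t> &packed, std::size_t sample) {
    assert(sample / 4 < packed.size());
    const std::uint8_t code = (packed[sample / 4] >> (2 * (sample % 4))) & 0x3;
    return static_cast<Genotype>(code);
}
```

With four genotypes per byte, a SNP scored in 283 lines takes 71 bytes, which is why the format is so compact.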

I wrote a C++11 class that does the conversion and employed it in a program I call align2bed. I did not use any third-party libraries and relied only on the C++ STL. The processing of each chromosome arm was parallelized using the C++11 thread library. The align2bed program is rather narrowly tailored to the fly data set I was focusing on, but the underlying class comes as a separate library, is more general, and can be easily dropped into any project. I am releasing it under the three-clause BSD license.
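The one-thread-per-chromosome-arm idea can be sketched as follows. This is an illustration under my own assumptions, not the align2bed source: `processArm` is a hypothetical stand-in for the real conversion work (here it only counts bases that are not `N`), and `processAllArms` is likewise a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-arm work: count bases that are not 'N'.
std::size_t processArm(const std::string &armSequence) {
    std::size_t nonMissing = 0;
    for (char base : armSequence) {
        if (base != 'N') {
            ++nonMissing;
        }
    }
    return nonMissing;
}

// Launch one thread per chromosome arm and return results in input order.
std::vector<std::size_t> processAllArms(const std::vector<std::string> &arms) {
    std::vector<std::size_t> results(arms.size(), 0);
    std::vector<std::thread> workers;
    workers.reserve(arms.size());
    for (std::size_t i = 0; i < arms.size(); ++i) {
        // Each thread writes only to its own slot, so no locking is needed.
        workers.emplace_back([&arms, &results, i] { results[i] = processArm(arms[i]); });
    }
    assert(workers.size() == arms.size());
    for (std::thread &worker : workers) {
        worker.join();  // wait for every arm to finish before returning
    }
    return results;
}
```

Because the arms are independent of each other, no synchronization beyond the final `join` is required. On some systems the compiler needs the `-pthread` flag for this to link.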

The approach worked really well. I can process the data for all 283 lines in under five minutes on my 15" MacBook Pro (quad-core i7 and 16 GB of RAM). It occurred to me that other people might find this useful, so I posted the project to my GitHub page. Usage and compilation instructions are included in the distribution, and class documentation is here. The library and align2bed can be used either to process large data sets on regular desktop or laptop computers, or to crunch through a large volume of data (e.g., derived from simulations).

I would like to hear any feedback from users, especially if there are any problems. Please leave a comment after this post or contact me via the contact page.

Democratic turn-out in the 2016 election

The unexpected election of Donald Trump as the next President of the United States has generated an avalanche of articles and social media posts attempting to explain what went wrong and why the pre-election predictions were so off. This event seems of a piece with the earlier similarly unexpected (based on polling) votes for Britain to leave the EU and Colombia to scuttle the peace deal with the FARC guerillas. Much of the commentary has thus been focused on explaining the poll failure and on the possible motivations of Trump voters. Commentators typically use exit polls and percentages of the vote won by the candidates to bolster their arguments. When we look at the fraction of votes received, however, we are looking at a ratio of two quantities. Changing either the numerator or the denominator can change the value of the ratio. Might a separate examination of the changes in the number of votes won by each party tell us something about what happened? I started thinking about this after one of my friends shared a Twitter post comparing the Democratic turnout in the three latest elections. This quick analysis only looked at aggregate national numbers, an approach that likely obscures important patterns.

I decided to see what can be learned from looking at how the turnout of Democrats and Republicans has changed from the 2008 election. I chose 2008 as a baseline because that election was relatively recent, and so the overall size of the electorate has not changed much. On top of that, turnout was particularly strong in 2008. This is useful because this set of voters probably represents the maximum realistically attainable pool (at least for Democrats). I downloaded official vote count data for the 2008 and 2012 elections from here and here. Official numbers are not available for 2016 yet, so I got the most recent counts from here. I doubt that these will change enough to be noticeable on the plots I present below. To meaningfully compare vote counts, I calculated percent deviations in ballots cast from the 2008 numbers. The annotated R script I wrote to analyze and plot the data, as well as the data themselves and all the results, can be downloaded from here.
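The percent deviation is simple arithmetic. My actual analysis is in the R script linked above; the snippet below is only a sketch of the same calculation, with a hypothetical function name.

```cpp
#include <cassert>

// Percent deviation of a vote count from the baseline-year (2008) count.
// Negative values mean fewer ballots were cast than in the baseline year.
double percentDeviation(double votes, double baselineVotes) {
    assert(baselineVotes > 0.0 && "baseline vote count must be positive");
    return 100.0 * (votes - baselineVotes) / baselineVotes;
}
```

For example, a party that received 1,000,000 votes in a state in the baseline year and 900,000 in a later election (made-up numbers) has a deviation of -10 percent. Expressing every state this way puts large and small states on the same scale.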

I first checked the original observations that spurred this exercise by plotting the national aggregate deviations. I see the same pattern:


It really does seem like the Democratic turnout this year significantly dropped even from the 2012 levels, while Republicans voted in similar numbers each of these years. How does this pattern vary from state to state? Here are the results with each party plotted separately:


On the Democratic side the national trend is reflected in the state results, but there are important deviations. The states that bucked the national trend seem to have large Hispanic populations (Nevada), are located in the South (South Carolina, Louisiana), or both (Florida, Texas, North Carolina). I do not know enough about the demographic composition of these states to make any deeper observations, but it would be interesting to find out what is special about them. These numbers suggest that the predominant post-election conversation, which focuses on the identity and motivations of the Trump voter, is missing an important part of the story. Did the vaunted Clinton get-out-the-vote (GOTV) operation fail or did it manage to prevent an even larger collapse? My analyses cannot answer this question, but we can get some hints. As I mentioned, some Southern states, where the GOTV effort was probably not as active as in the battlegrounds, nevertheless bucked the trend of reduced turnout. This suggests that GOTV may not have had much of an effect. We can also look at the distributions of turnout change in "red", "blue" and "purple" (swing) states. Here is the plot (it is a box-plot; instructions on how to read it can be found here):


One would expect that the purple states have received more GOTV attention than the reliably red and blue ones. However, there is no difference in turnout among these groups. This simple analysis is of course far from definitive, but maybe it is worth re-examining critically the effectiveness of this campaign tactic.

Another possible explanation for the drop-off in Democratic participation is the new voter ID requirements passed by a number of states. There are credible reasons to think that these disproportionately affect reliable Democratic voters such as minorities and students. I found data on voter ID requirements on the BallotPedia website, where the terms listed in the plot below are explained. Here is the plot:


There does not seem to be any meaningful difference among states grouped according to voter ID requirements. This, again, is not the definitive analysis, but it does suggest that if these laws have an effect it is subtle and is not a big contributor to the 2016 election result. Perhaps precinct-level analyses would uncover some contribution of voter ID requirements, especially where they are accompanied by closures of voting places in affected communities.

While the pattern among Democrats is relatively simple, the nation-wide result for Republicans hides considerable heterogeneity. The clear increase in turnout in many Midwestern and other states affected by de-industrialization (West Virginia, for example) is offset by big drops in deep-blue states like California. An interesting outlier is Utah, where there was a credible third-party challenge from Evan McMullin. It does appear that Trump brought in some additional voters in crucial battle-ground states, despite his reportedly shambolic GOTV efforts. Understanding the motivations of these voters is important. There are two major schools of thought on this topic: the "economic anxiety" and "pure bigotry" camps. I do not claim that my analyses here can settle this debate. That said, some arguments can be examined in light of the patterns I see. For example, an interesting argument against the "pure bigotry" hypothesis was advanced by Tim Carney:

Low-income rural white voters in Pa. voted for Obama in 2008 and then Trump in 2016, and your explanation is white supremacy? Interesting.

— Tim Carney (@TPCarney) November 9, 2016

This seems persuasive at first glance, but Trump's win in Pennsylvania came as a result of fewer Democrats showing up to vote and some new Republican voters making it to the polls. Now, it is possible that the people who voted for Obama in 2008 are the same people who are now in Trump's column. But the same result could have come about without anyone changing their mind about Obama. Those voting Republican this year may not be the same people who voted for Obama in 2008, despite the impression one gets from looking at the net shift in vote share. In other states, such as Ohio, the difference was entirely due to the drop in the number of Democrats. Carney's argument does not seem dispositive.
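A toy calculation shows how a vote share can flip without a single voter switching sides. The numbers below are invented for illustration and are not Pennsylvania data; `twoPartyShare` and the `Election` struct are hypothetical names.

```cpp
#include <cassert>

// Two-party vote share of the first party, in percent.
double twoPartyShare(double partyVotes, double otherVotes) {
    assert(partyVotes + otherVotes > 0.0);
    return 100.0 * partyVotes / (partyVotes + otherVotes);
}

// Made-up vote counts for an imaginary county.
struct Election {
    double dem;
    double rep;
};

// Earlier election: Democrats win with 55 percent of the two-party vote.
const Election earlier{550.0, 450.0};

// Later election: 110 earlier Democratic voters stay home and 10 people who
// did not vote before turn out for the Republican. Nobody changes sides.
const Election later{440.0, 460.0};
```

Here the Democratic share falls from 55 percent (550 of 1,000) to just under 49 percent (440 of 900), enough to lose the county, even though every ballot cast for the Republican came from someone who never voted Democratic.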

The glaring failure of polling-based predictions should lead to a re-examination of the methods we use to study what motivates voters. Analyses of actual votes cast can be a promising tool. Even a quick and simple look at these data uncovers some interesting patterns that provide a check on the developing conventional wisdom about the forces that brought Trump to power. Anemic turnout of Democrats in mid-term elections has been widely discussed. It seems that the party has a bigger problem on its hands and has to develop a strategy to overcome this obstacle if it is to return to winning elections.

Inaugural post

After about 20 years in academic science (25 if you count volunteering in a lab as an undergrad) I finally decided to go my own way. My academic career path has been unorthodox up to now, and I have been lucky to have the support of great advisers along the way. However, the mismatch between what I want to do and what is valued for the purposes of academic career advancement has become impossible to ignore. So instead of struggling to reconcile the incompatible I decided to try a different approach.

In order to advance in academic science one generally has to publish as many first (or senior) author papers as possible. While quality is not completely irrelevant, it does seem to come second to quantity. To support their projects researchers also must secure funding, mostly from the government. The latter seems to go more and more to groups of scientists doing Big Science, and the size that matters is measured in terabytes of data rather than impact on our understanding of nature. This point seems to be in conflict with the first-author publication requirement, since in these large consortia a number of scientists make approximately equal contributions. How (or if) this gets resolved is not obvious to me.

The way I enjoy doing science does not seem to sit well in the current system. For one, I like to work on hard problems that I am not fully qualified to tackle. This means learning as I go, which takes commitment for a much longer stretch of time than is acceptable for the current publication speed requirements. As a result of this approach I have acquired expertise in a wide variety of fields, both experimental and computational. Statistical methods I learned turned out to be both unusual and useful enough that I ended up lending a hand in a number of projects. Since I was just the computational and data analysis support person, however, my contributions did not rise to the level of first authorship (appropriately so). In addition, realizing that my approaches can have a real-life use in practical breeding applications, I became interested in helping practical breeders achieve their goals by providing data analysis and training. These types of activities are not frowned upon in academia, but they do not add significantly to one's portfolio of achievements required for career advancement.

At the time I was realizing that my interests started to diverge from the standard academic path, I was working on increasingly computationally demanding projects. As part of that, I started using Amazon Web Services. This made me realize that I hardly need any infrastructure to do this work on my own. For even the most computationally intensive work, all I need is a reasonably good laptop and an internet connection. I was also impressed by a number of independent tech researchers. For example, Steve Gibson sells a piece of software to support himself but releases the bulk of his work for free. It became harder and harder to justify continuing to try and fit into the academic model.

This summer I finally decided to make the jump. I created Bayesic Research LLC to pursue my interests in applying Bayesian hierarchical models to quantitative genetics, particularly in plants where large-scale multi-site experiments with complicated replication designs are commonplace. I am working on method development, software implementation and pursuit of interesting fundamental questions in genetics. I am applying what I learn from these activities to practical breeding programs, focusing on the needs of the developing world, currently in collaboration with IRRI and CIMMYT. I do not plan to charge anyone for this work, except perhaps to cover the costs of travel and computational time, nor do I believe in charging for copies of software I generate. The source of the major C++ library I am developing is available on GitHub under the GNU public license, and the same will be true of any other software projects. To support this endeavor, I am looking for commercial clients who need consulting or custom data analysis. I estimate that if I spend about a third of my time on paid work, I can generate enough revenue to maintain my research activities.

I think that my particular set of skills and interests makes this kind of experiment practical, but it is unclear how generalizable this course of action is. I do believe that it has promise as an alternative career path, and would very much like to hear if anyone else has had or is planning a similar move. I will periodically post about my experiences here.