Fast ordered sampling of SNPs from large files

Ever since I left my academic position and became independent, I have been interested in the idea of using minimal computer resources to perform big-data statistical analyses. I do not have a permanent office, so the only computer I own is a laptop, although it is a close-to-maximal-spec 15-inch MacBook Pro (mid-2015). I can use Amazon's AWS for really big jobs, but the need to quickly troubleshoot software and pipelines remains.

The major bottleneck for me is the size of DNA variant data sets. Phenotype data are growing as well, but so far lag genomic sets by a couple of orders of magnitude. Binary compression, most notably the .bed format developed by the plink team, goes a long way towards making it possible to manipulate large genotype data sets on a laptop. To speed development even further, I needed to generate random subsets of large variant files. The samples had to reflect whole-data-set properties; most importantly, loci had to appear in the same order along chromosomes as in the original file. I wrote a tool I call sampleSNPs to do this. As I was using it, I realized that it can have many more applications. For example, samples could be used for jack-knife uncertainty estimates of whole-genome statistics. While I personally stick with .bed files, I extended the functionality to other popular variant file formats (VCF, TPED, and HapMap). The only program I could find that also does this is plink, but it does it wrong. You give plink the fraction of variants you need (say, 0.2) and it will include each locus with that probability. As a result, the total number of SNPs in your sample varies from run to run. Here is an illustration, after running plink 1000 times:


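The variable sample size is a direct consequence of per-locus Bernoulli inclusion. A minimal sketch of that behavior (my own illustration, not plink's code; the locus count and sampling fraction are made up):

```cpp
#include <cstddef>
#include <random>

// Mimic per-locus Bernoulli inclusion: each of N loci is kept
// independently with probability p, so the total number retained is
// Binomial(N, p) and varies from run to run.
size_t bernoulliSampleSize(size_t N, double p, unsigned seed) {
	std::mt19937 gen(seed);
	std::bernoulli_distribution keep(p);
	size_t kept = 0;
	for (size_t i = 0; i < N; ++i) {
		if (keep(gen)) {
			++kept;
		}
	}
	return kept;
}
```

With N = 10,000 loci and p = 0.2 the expected count is 2,000, but the standard deviation is sqrt(Np(1 − p)) ≈ 40, so successive runs essentially never agree exactly.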
When helping people with genome-wide association, mapping, and genome prediction analyses, I am frequently asked to estimate the rate of decline of linkage disequilibrium with distance. This problem seems easy, but it blows up quickly as the number of markers grows and it becomes infeasible to estimate all pairwise LD. Sliding window approaches work, but waste disk space by over-sampling loci that are close to each other. I applied my sampling scheme to pairs of markers, resulting in a quick method that still produces reasonable estimates while using trivial computer resources.

The sampling scheme itself deserves a mention. Designing an algorithm like this is surprisingly hard. While most classical computational problems of this sort were solved in the late 1970s and early 1980s, a solution to on-line ordered sampling did not appear until 1987, in a paper by Jeffrey Vitter. Maybe because it came later than most, this work has been languishing in relative obscurity. I stumbled upon it by chance while reading some unrelated blog posts on computational methods.
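For readers unfamiliar with the problem: a classical fixed-size approach that predates Vitter is selection sampling (Knuth's Algorithm S), which scans all N records and includes each with probability (slots remaining)/(records remaining). It returns exactly n records in their original order, with every subset equally likely, but must touch every record; Vitter's method achieves the same distribution while skipping directly from one selected record to the next. A sketch of the classical scheme (my notation):

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Selection sampling (Knuth's Algorithm S): choose exactly n of the
// record indices 0..N-1, preserving their original order. At each
// record, the inclusion probability is (slots left) / (records left),
// which makes every n-subset equally likely. When the slots left equal
// the records left, the probability is 1, so exactly n are always picked.
std::vector<size_t> sampleOrdered(size_t N, size_t n, std::mt19937 &gen) {
	std::vector<size_t> picked;
	picked.reserve(n);
	std::uniform_real_distribution<double> unif(0.0, 1.0);
	for (size_t i = 0; i < N && picked.size() < n; ++i) {
		const double pInclude = static_cast<double>(n - picked.size()) /
		                        static_cast<double>(N - i);
		if (unif(gen) < pInclude) {
			picked.push_back(i);
		}
	}
	return picked;
}
```

Because the indices come out already sorted, the sampled loci stay in chromosome order with no extra work, which is exactly the property needed for variant files.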

I came up with my own C++ implementation of the Vitter algorithm, using Kevin Lawler's C code as a guide. I wrote up a manuscript, available as a pre-print here and here while I am working on getting it published in a journal. The whole project is on GitHub. The included C++ library, which I wrote as the basis for the stand-alone programs, has additional functionality and can be used in other software. It has no dependencies and is being released under the BSD three-part license.

While I created these tools to scratch my own itch, I hope other people will find them useful. In particular, I hope that researchers with limited access to powerful hardware can use them to break into genomics and statistical genetics. I would love to get user feedback, especially if there are any problems.

Accidental benchmarking of Apple's new file system, APFS

Apple recently released High Sierra, the new version of macOS. While it is billed as one of those "stability releases" with few user-facing changes, it introduces a new file system, APFS, which has already been rolled out to iOS devices. A file system is not something users interact with directly, but Apple lists a number of features (such as on-disk system snapshots) that are enabled by APFS.

Coincidentally, I have been working on a method to randomly sample records from files. High Sierra became available in the middle of the project, while I was repeatedly benchmarking my code. I will report the details of the method soon (watch this space!). To test a base-line method, I wrote a program (in C++, compiled with Apple's llvm with -O3 optimization) that reads a record of the target file and then probabilistically decides whether to save it to an output file. The files can be text or binary. Given that the sampling is reasonably sparse, execution is dominated by file reading operations. Binary files are read with the ifstream read() method, while text files are processed with an overload of the getline() function. I then use the clock() function to time execution. I vary the number of records sampled and perform 15 replicates to estimate execution time variability, which can be due to any number of factors. For example, since I execute the program on my laptop, other processes running at the same time can interfere by commandeering file I/O facilities.

I ran my program on my MacBook Pro 15-inch (mid-2015) laptop with an SSD. Re-running it recently, after I updated to High Sierra and APFS, I noticed an approximately two-fold speed-up on a binary input file. This is shown below in two plots: the one on the left was generated when I was still on Sierra and therefore HFS+, while the one on the right was generated on the same computer after updating to High Sierra with APFS. The amount of free space on the drive was comparable, and the disk was encrypted with FileVault under both systems.


The x-axes indicate the number of samples taken from a file of fixed size. The y-axes are the time it took to perform each operation (in milliseconds). Note that the average time taken, as well as that for each sample size, was reduced by half after updating to APFS.

Execution timing of the same scheme on a text file did not decrease, however. The following pair of plots is organized the same as before, and the y-axis values are the same before and after the update.


The outlier observations that pop up on the new OS are not related to the system update. They occurred (fairly inconsistently) under HFS+, too.

Note that all operations were done on an SSD. APFS apparently does not support spinning drives yet. I saw another set of file system benchmarks that also showed some speed-ups, but my results may provide a useful extra data point for people running file I/O-intensive applications. I was unable to find any other comparisons that separate binary and text file operations. I would love to hear about other users' experiences. Please contact me with questions or comments. I will release the source code once I complete the project that was actually the point of this exercise.

The Google diversity memo makes an important mistake

The recently-published memo (full text can be found at the end of this article) from a Google employee about the causes of gender disparity in the number of people working at the company generated a lively discussion. The arguments are as wide-ranging as the original post and involve many important topics. Given my area of expertise, I want to focus on a fairly narrow but fundamental part of the case made by the memo's author. He summarizes it as follows:

Differences in distributions of traits between men and women may in part explain why we don't have 50% representation of women in tech and leadership.

I think it is fair to say that essentially everything else in the Google diversity memo flows from this assertion. Before we begin examining it, let us lay down some groundwork.

To fix ideas, I start with a somewhat abstract and simple situation. Suppose we have two populations (call them A and B) that, on average, differ in some characteristic. While there is a real difference between means, each population consists of non-identical individuals, and so there are some distributions around the means. These distributions overlap substantially. I illustrate this scenario in the graph below (long vertical lines are the means).


Now suppose we are looking for people to hire and we need them to score 2 or above on this nameless attribute. Assume further that we can exactly determine each person's score. We can then screen members of both groups and select everyone who passes our test. If each group is equally represented in the applicant pool, the chosen population will have more people from group B than from group A (in the plotted example the A/B ratio is 0.43). Given all this, would we be justified in only considering individuals from group B for employment? I would argue no. The reason is simple: there are people who belong to group A and are qualified under our test. It is unfair to exclude them by disregarding their personal characteristics. Ignoring individuality in favor of group labeling is pretty much the definition of discrimination. While this is a value judgment and may not be universal, our society clearly accepts it, to the point where it is codified in law. In fact, the Google memo author emphasizes that he agrees with it:

Many of these differences are small and there’s significant overlap between men and women, so you can’t say anything about an individual given these population level distributions.
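The toy numbers come from comparing normal upper-tail areas. A sketch with hypothetical parameters (means 0 and 0.5, both standard deviations 1; these are illustrative and not the values behind my plot, which gave the 0.43 ratio):

```cpp
#include <cmath>

// Upper-tail probability P(X >= x) for X ~ Normal(mu, sigma),
// computed via the complementary error function.
double upperTail(double mu, double sigma, double x) {
	return 0.5 * std::erfc((x - mu) / (sigma * std::sqrt(2.0)));
}

// Ratio of group A to group B among individuals scoring at or above
// the cutoff, assuming equal representation in the applicant pool.
double passRatio(double muA, double muB, double sigma, double cutoff) {
	return upperTail(muA, sigma, cutoff) / upperTail(muB, sigma, cutoff);
}
```

With these made-up parameters, passRatio(0.0, 0.5, 1.0, 2.0) ≈ 0.34; the qualitative point — fewer, but far from zero, qualified members of group A — does not depend on the exact numbers.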

He, instead, argues that while we cannot use group labels to discriminate against individuals, a disparity in group representation does not imply bias. This is because, as we see from our toy example, even when we screen applicants fairly and precisely, group A is underrepresented in the resulting pool of successful candidates. Since a dearth of people from minority groups in organizations is almost universally used as a measure of discrimination, this argument would seem to undermine one of the very few (if not the only) quantitative measures of bias we have. The Google employee's memo is not the first to raise this problem. Larry Summers, then Harvard president, got in no small amount of hot water more than a decade ago for presenting a similar argument. The only difference is that he chose to highlight between-population differences in variance rather than mean.

On the face of it, this seems like a compelling argument. It would appear that my model also supports it. But dig a bit deeper and we start running into problems. The crucial simplification that makes our theoretical exercise work is that groups A and B are defined by the very characteristic we wish to assess for employment suitability. This is emphatically not the case when the groups are defined by the shape of their genitalia or the color of their skin. These attributes are not directly relevant to most occupations, certainly not software engineering. Rather, we are told that group traits are associated with mental abilities that in turn predict success in engineering or some other desirable occupation. The absurdity of this position can be illustrated if instead of gender identity we group people by, say, the size of their pinky toe. We would demand that anyone asserting a relationship between such a trait and aptitude in writing computer code provide convincing evidence of an association. A long history of systematically propagated bias is the only reason gender and race are not met with the same level of skepticism when offered as group identifiers that supposedly predict skill at particular tasks. Of course, there are actually reasons to believe that, for example, gender is not inherently a good predictor of talent in STEM fields. But in any case the burden of proof is on the person arguing that gender identity predicts job performance. The Google employee's memo provides no such evidence and is essentially a somewhat long-winded exercise in begging the question.

Measures of minority underrepresentation in workplaces and whole fields have an important role to play in assessment of discrimination. Naturally, these assessments have to be done carefully and used thoughtfully in conjunction with other data. But they should not be disregarded.

Quickly extracting SNPs from alignments

Working on a Drosophila population genetics project, I needed to extract SNPs from a large-ish data set of sequence alignments from the Drosophila Genome Nexus. The alignments are in a slightly strange but handy format: each line/chromosome-arm combination is in a separate file. The sequences are all on one line in FASTA format, but with no traditional FASTA header. I needed 283 lines plus the D. simulans outgroup. I decided to come up with a way to do the SNP extraction on my laptop in reasonable time, and save the result to the popular (at least in quantitative genetics) .bed format used by plink. Briefly, this SNP table format uses two bits to represent a genotype (there are only four states: homozygous reference, homozygous alternative, heterozygous, or missing) and is therefore really compact.
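The packing itself is simple bit arithmetic: four genotypes per byte, each in a two-bit field. The sketch below uses generic codes 0-3 and puts the first genotype in the lowest-order bits; the actual .bed format assigns specific meanings to the bit pairs (and starts with a magic number), so consult the plink format documentation before reading or writing real files.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack two-bit genotype codes (values 0-3) four to a byte, with the
// first genotype in the lowest-order bit pair. This is the 4x
// compression over a one-byte-per-genotype representation.
std::vector<uint8_t> packGenotypes(const std::vector<uint8_t> &codes) {
	std::vector<uint8_t> packed((codes.size() + 3) / 4, 0);
	for (size_t i = 0; i < codes.size(); ++i) {
		packed[i / 4] |= static_cast<uint8_t>((codes[i] & 0x3) << (2 * (i % 4)));
	}
	return packed;
}

// Recover the i-th genotype code from the packed vector.
uint8_t unpackGenotype(const std::vector<uint8_t> &packed, size_t i) {
	return (packed[i / 4] >> (2 * (i % 4))) & 0x3;
}
```

For example, the codes {0, 1, 2, 3} fill one byte (0b11100100), and a fifth genotype starts a new byte.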

I wrote a C++11 class that does the conversion and employed it in a program I call align2bed. I did not use any third-party libraries and only relied on the C++ STL. The processing of each chromosome arm was parallelized using the C++11 thread library. The align2bed program is rather narrowly tailored to the fly data set I was focusing on, but the underlying class comes as a separate library, is more general, and can be easily dropped into any project. I am releasing it under the BSD three-part license.

The approach worked really well. I can process the data for all 283 lines in under five minutes on my 15" MacBook Pro (quad-core i7 and 16 GB of RAM). It occurred to me that other people might find this useful, so I posted the project to my GitHub page. Usage and compilation instructions are included in the distribution, and class documentation is here. The library and align2bed can be used either to process large data sets on regular desktop or laptop computers, or to crunch through a large volume of data (e.g., derived from simulations).

I would like to hear any feedback from users, especially if there are any problems. Please leave a comment after this post or contact me via the contact page.

Democratic turnout in the 2016 election

The unexpected election of Donald Trump as the next President of the United States has generated an avalanche of articles and social media posts attempting to explain what went wrong and why the pre-election predictions were so off. This event seems of a piece with the earlier similarly unexpected (based on polling) votes for Britain to leave the EU and Colombia to scuttle the peace deal with the FARC guerrillas. Much of the commentary has thus been focused on explaining the poll failure and on the possible motivations of Trump voters. Commentators typically use exit polls and percentages of the vote won by the candidates to bolster their arguments. When we look at the fraction of votes received, however, we are looking at a ratio of two quantities. Changing either the numerator or the denominator can change the value of the ratio. Might a separate examination of the changes in the number of votes won by each party tell us something about what happened? I started thinking about this after one of my friends shared a Twitter post comparing the Democratic turnout in the three latest elections. This quick analysis only looked at aggregate national numbers, an approach that likely obscures important patterns.

I decided to see what can be learned from looking at how the turnout of Democrats and Republicans has changed from the 2008 election. I chose 2008 as a baseline because that election was relatively recent, and so the overall size of the electorate has not changed much. On top of that, turnout was particularly strong in 2008. This is useful because this set of voters probably represents the maximum realistically attainable pool (at least for Democrats). I downloaded official vote count data for the 2008 and 2012 elections from here and here. Official numbers are not available for 2016 yet, so I got the most recent counts from here. I doubt that these will change enough to be noticeable on the plots I present below. To meaningfully compare vote counts, I calculated percent deviations in ballots cast from the 2008 numbers. The annotated R script I wrote to analyze and plot the data, as well as the data themselves and all the results, can be downloaded from here.

I first checked the original observations that spurred this exercise by plotting the national aggregate deviations. I see the same pattern:


It really does seem like the Democratic turnout this year significantly dropped even from the 2012 levels, while Republicans voted in similar numbers each of these years. How does this pattern vary from state to state? Here are the results with each party plotted separately:


On the Democratic side the national trend is reflected in the state results, but there are important deviations. The states that bucked the national trend seem to have large Hispanic populations (Nevada), are located in the South (South Carolina, Louisiana), or both (Florida, Texas, North Carolina). I do not know enough about the demographic composition of these states to make any deeper observations, but it would be interesting to find out what is special about them. These numbers suggest that the predominant post-election conversation, which focuses on the identity and motivations of the Trump voter, is missing an important part of the story. Did the vaunted Clinton get-out-the-vote (GOTV) operation fail, or did it manage to prevent an even larger collapse? My analyses cannot answer this question, but we can get some hints. As I mentioned, some Southern states, where the GOTV effort was probably not as active as in the battlegrounds, nevertheless bucked the trend in reduced turnout. This suggests that GOTV may not have had much of an effect. We can also look at the distributions of turnout change in "red", "blue" and "purple" (swing) states. Here is the plot (it is a box-plot; instructions on how to read it can be found here):


One would expect that the purple states have received more GOTV attention than the reliably red and blue ones. However, there is no difference in turnout among these groups. This simple analysis is of course far from definitive, but maybe it is worth re-examining critically the effectiveness of this campaign tactic.

Another possible explanation for the drop-off in Democratic participation is the new voter ID requirements passed by a number of states. There are credible reasons to think that these disproportionately affect reliable Democratic voters such as minorities and students. I found data on voter ID requirements on the BallotPedia website, where the terms listed in the plot below are explained. Here is the plot:


There does not seem to be any meaningful difference among states grouped according to voter ID requirements. This, again, is not the definitive analysis, but it does suggest that if these laws have an effect it is subtle and is not a big contributor to the 2016 election result. Perhaps precinct-level analyses would uncover some contribution of voter ID requirements, especially where they are accompanied by closures of voting places in affected communities.

While the pattern among Democrats is relatively simple, the nation-wide result for Republicans hides considerable heterogeneity. The clear increase in turnout in many Midwestern and other states affected by de-industrialization (West Virginia, for example) is offset by big drops in deep-blue states like California. An interesting outlier is Utah, where there was a credible third-party challenge from Evan McMullin. It does appear that Trump brought in some additional voters in crucial battle-ground states, despite his reportedly shambolic GOTV efforts. Understanding the motivations of these voters is important. There are two major schools of thought on this topic: the "economic anxiety" and "pure bigotry" camps. I do not claim that my analyses here can settle this debate. That said, some arguments can be examined in light of the patterns I see. For example, an interesting argument against the "pure bigotry" hypothesis was advanced by Tim Carney:

Low-income rural white voters in Pa. voted for Obama in 2008 and then Trump in 2016, and your explanation is white supremacy? Interesting.

— Tim Carney (@TPCarney) November 9, 2016

This seems persuasive at first glance, but Trump's win in Pennsylvania came as a result of fewer Democrats showing up to vote and some new Republican voters making it to the polls. Now, it is possible that the people who voted for Obama in 2008 are the same people who are now in Trump's column. But the same result could have come about without anyone changing their mind about Obama. Those voting Republican this year may not be the same people who voted for Obama in 2008, despite the impression one gets from looking at the net shift in vote share. In other states, such as Ohio, the difference was entirely due to the drop in the number of Democrats. Carney's argument does not seem dispositive.

The glaring failure of polling-based predictions should lead to a re-examination of the methods we use to study what motivates voters. Analyses of actual votes cast can be a promising tool. Even a quick and simple look at these data uncovers some interesting patterns that provide a check on the developing conventional wisdom about the forces that brought Trump to power. Anemic turnout of Democrats in mid-term elections has been widely discussed. It seems the party has a bigger problem on its hands and must develop a strategy to overcome this obstacle if it is to return to winning elections.

Inaugural post

After about 20 years in academic science (25 if you count volunteering in a lab as an undergrad) I finally decided to go my own way. My academic career path has been unorthodox up to now, and I have been lucky to have the support of great advisers along the way. However, the mismatch between what I want to do and what is valued for the purposes of academic career advancement has become impossible to ignore. So instead of struggling to reconcile the incompatible I decided to try a different approach.

In order to advance in academic science one generally has to publish as many first (or senior) author papers as possible. While quality is not completely irrelevant, it does seem to come second to quantity. To support their projects researchers also must secure funding, mostly from the government. The latter seems to go more and more to groups of scientists doing Big Science, and the size that matters is measured in terabytes of data rather than impact on our understanding of nature. This point seems to be in conflict with the first-author publication requirement, since in these large consortia a number of scientists make approximately equal contributions. How (or if) this gets resolved is not obvious to me.

The way I enjoy doing science does not seem to sit well in the current system. For one, I like to work on hard problems that I am not fully qualified to tackle. This means learning as I go, which takes commitment for a much longer stretch of time than is acceptable given current publication speed requirements. As a result of this approach I have acquired expertise in a wide variety of fields, both experimental and computational. The statistical methods I learned turned out to be both unusual and useful enough that I ended up lending a hand in a number of projects. Since I was just the computational and data analysis support person, however, my contributions did not rise to the level of first authorship (appropriately so). In addition, realizing that my approaches can have a real-life use in practical breeding applications, I became interested in helping practical breeders achieve their goals by providing data analysis and training. These types of activities are not frowned upon in academia, but they do not add significantly to one's portfolio of achievements required for career advancement.

At the time I was realizing that my interests started to diverge from the standard academic path, I was working on increasingly computationally demanding projects. As part of that, I started using Amazon Web Services. This made me realize that I hardly need any infrastructure to do this work on my own. For even the most computationally intensive work, all I need is a reasonably good laptop and an internet connection. I was also impressed by a number of independent tech researchers. For example, Steve Gibson sells a piece of software to support himself but releases the bulk of his work for free. It became harder and harder to justify continuing to try and fit into the academic model.

This summer I finally decided to make the jump. I created Bayesic Research LLC to pursue my interests in applying Bayesian hierarchical models to quantitative genetics, particularly in plants where large-scale multi-site experiments with complicated replication designs are commonplace. I am working on method development, software implementation and pursuit of interesting fundamental questions in genetics. I am applying what I learn from these activities to practical breeding programs, focusing on the needs of the developing world, currently in collaboration with IRRI and CIMMYT. I do not plan to charge anyone for this work, except perhaps to cover costs in travel and computational time, nor do I believe in charging for copies of software I generate. The source of the major C++ library I am developing is available on GitHub under the GNU General Public License, and the same will be true of any other software projects. To support this endeavor, I am looking for commercial clients who need consulting or custom data analysis. I estimate that if I spend about a third of my time on doing paid work, I can generate enough revenue to maintain my research activities.

I think that my particular set of skills and interests makes this kind of experiment practical, but it is unclear how generalizable this course of action is. I do believe that it has promise as an alternative career path, and would very much like to hear if anyone else has had or is planning a similar move. I will periodically post about my experiences here.