Working on a Drosophila population genetics project, I needed to extract SNPs from a large-ish data set of sequence alignments from the Drosophila Genome Nexus. The alignments are in a slightly strange but handy format: each line and chromosome arm is in a separate file. The sequences are all on one line in FASTA format, but with no traditional FASTA header. I needed 283 lines plus the D. simulans outgroup. I decided to come up with a way to do the SNP extraction on my laptop in reasonable time, and save to a popular (at least in quantitative genetics) BED format used by plink. Briefly, this SNP table format uses two bits to represent a genotype (since there are only four states: reference, alternative, heterozygote, or missing) and therefore is really compact.
I wrote a C++11 class that does the conversion and employed it in a program I call
align2bed. I did not use any third-party libraries and only relied on the C++ STL. The processing of each chromosome arm was parallelized using the C++11
thread library. The
align2bed program is rather narrowly tailored to the fly data set I was focusing on, but the underlying class comes as a separate library, is more general, and can be easily dropped into any project. I am releasing it under the BSD three-part license.
The approach worked really well. I can process the data for all 283 lines in under five minutes on my 15" MacBook Pro (quad-core i7 and 16 Gb of RAM). It occurred to me that other people might find this useful so I posted the project to my GitHub page. Usage and compilation instructions are included in the distribution, and class documentation is here. The library and
align2bed can be used either to process large data sets on regular desktop or laptop computers, or to crunch through a large volume of data (e.g., derived from simulations).
I would like to hear any feedback from users, especially if there are any problems. Please leave a comment after this post or contact me via the contact page.