Title: Scaling metagenome sequence assembly with probabilistic de Bruijn graphs
Jason Pell, Arend Hintze, Rosangela Canino-Koning, Adina Howe, James M. Tiedje, C. Titus Brown
The memory requirements for de novo assembly of short-read shotgun sequencing data from complex microbial populations are an increasingly large practical barrier to environmental studies. Here we introduce a memory-efficient graph representation with which we can analyze the k-mer connectivity of metagenomic samples, allowing us to reduce the size of the de novo assembly process for metagenomes with a "divide and conquer" algorithm. This graph representation is based on a probabilistic data structure, a Bloom filter, that allows us to store assembly graphs in as little as 4 bits per k-mer. We use this approach to achieve a 20-fold decrease in memory for the assembly of a soil metagenome sample.

Online resources and data:
- Git repository for khmer: github.com/ged-lab/khmer/tree/2012-paper-kmer-percolation
See figuregen/README for instructions on how to generate the tables and figures from the paper.
- Data from the paper (.tar.gz, 2.2 GB)
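
To make the idea in the abstract concrete, the following is a minimal sketch of a de Bruijn graph stored implicitly in a Bloom filter. It is an illustration under stated assumptions, not the khmer implementation: the names `BloomFilter`, `kmers`, and `neighbors` are hypothetical, the hashing uses SHA-256 for simplicity, and reverse complements are ignored. Only k-mer membership is stored; edges are recovered by asking whether each of the eight possible neighboring k-mers is present, so a false positive in the filter shows up as a spurious node or edge.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array queried with several hash functions (illustrative only)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        # May report false positives, but never false negatives.
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))


def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def neighbors(bf, kmer):
    """Return the k-mers adjacent to kmer that the filter reports as present.

    Edges are never stored: each of the 4 possible successors and
    4 possible predecessors is simply tested for membership.
    """
    found = set()
    for base in "ACGT":
        for candidate in (kmer[1:] + base, base + kmer[:-1]):
            if candidate != kmer and candidate in bf:
                found.add(candidate)
    return found


if __name__ == "__main__":
    k = 4
    read = "ACGTACGGT"  # toy read, chosen only for this example
    bf = BloomFilter(num_bits=1024, num_hashes=4)
    for km in kmers(read, k):
        bf.add(km)
    print(sorted(neighbors(bf, "ACGT")))
```

In this sketch the memory cost is set by `num_bits` alone, independent of k, which is what makes a budget of a few bits per k-mer possible; the price is a false positive rate that rises as the filter fills.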