Metagenomics promises insight into uncultured microbes across space and time. Yet the tsunami of low-cost sequencing meant to enable these discoveries is leaving scientists drowning in data. We present Libra, a comparative metagenomics algorithm that considers genetic distance and microbial abundance simultaneously using a vector-space model and scales using Apache Hadoop. We compare Libra to other tools to examine the effects of data reduction and distance metrics using simulated metagenomes, controlled bacterial mixtures, and metagenomes from the Human Microbiome Project and Tara Oceans Expedition. We show that Libra provides accurate, efficient, and scalable computation for discerning global patterns in microbial ecology.
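To make the vector-space idea concrete, the following is a minimal single-machine sketch, not Libra's implementation. It assumes each sample is represented as a vector of raw k-mer counts and that samples are compared by cosine distance; the abstract does not state the weighting scheme or distance metric, and the function names `kmer_counts` and `cosine_distance` are hypothetical. Hadoop-based scaling is not shown.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seqs, k):
    """Count every k-mer made only of A, C, G, T across a sample's sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers containing N or other symbols
                counts[kmer] += 1
    return counts


def cosine_distance(a, b):
    """Cosine distance between two sparse k-mer count vectors (assumed metric)."""
    dot = sum(a[kmer] * b[kmer] for kmer in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty sample as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


# Toy samples; real metagenomes contain millions of reads.
sample_a = kmer_counts(["ACGTACGT"], k=4)
sample_b = kmer_counts(["ACGTACGT"], k=4)
sample_c = kmer_counts(["TTTTTTTT"], k=4)

d_same = cosine_distance(sample_a, sample_b)
d_diff = cosine_distance(sample_a, sample_c)
```

Because the vectors hold counts rather than presence/absence, a k-mer that is abundant in one sample and rare in another increases the distance. This is how a single measure can reflect both sequence composition and abundance.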
Bacteriophages play an important role in host-driven biological processes by controlling bacterial population size, horizontally transferring genes between hosts, and expressing host-derived genes to alter host metabolism. Metagenomics provides the genetic basis for understanding the interplay between uncultured bacteria, their phages, and the environment. In particular, viral metagenomes (viromes) are providing new insight into phage-encoded host genes (i.e., auxiliary metabolic genes; AMGs) that reprogram host metabolism during infection. Yet, despite deep sequencing efforts of viral communities, the majority of sequences have no match to known proteins. Reference-independent computational techniques, such as protein clustering, contig spectra, and ecological profiling, are overcoming these barriers to examine both the known and unknown components of viromes. As the field of viral metagenomics progresses, a critical assessment of tools is required because the majority of algorithms have been developed for analyzing bacteria. The aim of this paper is to offer an overview of current computational methodologies for virome analysis and to provide an example of reference-independent approaches using human skin viromes. Additionally, we present methods to carefully distinguish genuine AMGs from host contamination. Despite computational challenges, these new methods offer novel insights into the diversity and functional roles of phages across environments.
Viruses have global impact through mortality, nutrient cycling and horizontal gene transfer, yet their study is limited by complex methodologies with little validation. Here, we use triplicate metagenomes to compare common aquatic viral concentration and purification methods across four combinations as follows: (i) tangential flow filtration (TFF) and DNase + CsCl, (ii) FeCl3 precipitation and DNase, (iii) FeCl3 precipitation and DNase + CsCl and (iv) FeCl3 precipitation and DNase + sucrose. Taxonomic data (30% of reads) suggested that purification methods were statistically indistinguishable at any taxonomic level while concentration methods were significantly different at family and genus levels. Specifically, TFF-concentrated viral metagenomes had significantly fewer abundant viral types (Podoviridae and Phycodnaviridae) and more variability among Myoviridae than FeCl3-precipitated viral metagenomes. More comprehensive analyses using protein clusters (66% of reads) and k-mers (100% of reads) showed 50-53% of these data were common to all four methods, and revealed trace bacterial DNA contamination in TFF-concentrated metagenomes and one of three replicates concentrated using FeCl3 and purified by DNase alone. Shared k-mer analyses also revealed that polymerases used in amplification impact the resulting metagenomes, with TaKaRa enriching for 'rare' reads relative to PfuTurbo. Together these results provide empirical data for making experimental design decisions in culture-independent viral ecology studies.
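The shared k-mer comparison mentioned above can be illustrated with a small sketch. This is a simplified stand-in, not the study's pipeline: it measures the fraction of distinct k-mers found in every sample, whereas the abstract reports percentages of the underlying data. The function names, sample labels, and toy reads are all hypothetical.

```python
def kmer_set(reads, k):
    """Collect the distinct k-mers observed in a list of reads."""
    kmers = set()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


def shared_fraction(samples, k):
    """Fraction of all distinct k-mers that occur in every sample."""
    sets = [kmer_set(reads, k) for reads in samples.values()]
    union = set().union(*sets)
    if not union:
        return 0.0  # no samples, or no read long enough to yield a k-mer
    shared = set.intersection(*sets)
    return len(shared) / len(union)


# Hypothetical toy reads, one entry per concentration/purification method.
toy_samples = {
    "method_1": ["ACGTA"],
    "method_2": ["ACGTT"],
}
fraction = shared_fraction(toy_samples, k=4)
```

A read-level version would instead ask, for each read, whether its k-mers are present in all other samples, which is closer to reporting the share of reads common to all methods.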
A plethora of tools exists for identifying phage sequences in bacterial genomes, single-cell amplified genomes, and host-associated and environmental metagenomes. Yet because the genetics of phages and their hosts are closely intertwined, distinguishing viral from bacterial signal remains an ongoing challenge. Further, the size, quantity and fragmentary nature of modern 'omics datasets usher in a new set of computational challenges. Here, we detail the promises and pitfalls of using currently available gene-centric or k-mer-based tools for identifying prophage sequences in genomes, and prophage and viral contigs in metagenomes. Each of these methods offers a unique piece of the puzzle in elucidating the intriguing signatures of phage-host coevolution.