Phylogeny-based classification of microbial communities

Abstract

Motivation: Next-generation sequencing coupled with metagenomics has led to the rapid growth of sequence databases and enabled a new branch of microbiology called comparative metagenomics. Comparative metagenomic analysis studies compositional patterns within and between different environments, providing a deep insight into the structure and function of complex microbial communities. It is a fast-growing field that requires the development of novel supervised learning techniques for addressing challenges associated with metagenomic data, e.g. sensitivity to the choice of sequence similarity cutoff used to define operational taxonomic units (OTUs), high dimensionality and sparsity of the data and so forth. On the other hand, the natural properties of microbial community data may provide useful information about the structure of the data. For example, similarity between species encoded by a phylogenetic tree captures the relationship between OTUs and may be useful for the analysis of complex microbial datasets where the diversity patterns comprise features at multiple taxonomic levels. Even though some of the challenges have been addressed by learning algorithms in the literature, none of the available methods takes advantage of the inherent properties of metagenomic data.

Results: We proposed a novel supervised classification method for microbial community samples, where each sample is represented as a set of OTU frequencies, which takes advantage of the natural structure in microbial community data encoded by a phylogenetic tree. This model allows us to exploit environment-specific compositional patterns that may contain features at multiple granularity levels. Our method is based on the multinomial logistic regression model with a tree-guided penalty function. Additionally, we proposed a new simulation framework for generating 16S ribosomal RNA gene read counts that may be useful in comparative metagenomics research. Our experimental results on simulated and real data show that the phylogenetic information used in our method improves the classification accuracy.

Availability and implementation: http://www.cs.ucr.edu/~tanaseio/metaphyl.htm

Contact: tanaseio@cs.ucr.edu or jiang@cs.ucr.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
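
The abstract names a multinomial logistic regression model with a tree-guided penalty but does not give the penalty's exact form. As a rough illustration only, the sketch below assumes a tree-guided group-lasso style penalty: every node of the phylogenetic tree defines a group made of the OTUs beneath it, and the penalty adds a weighted L2 norm of the coefficients of each group. The toy tree, the unit node weights, the value of `lam` and all function names are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def tree_penalty(W, groups, weights):
    """Sum over tree nodes of weight * L2 norm of all coefficients
    belonging to the OTUs (leaves) under that node.

    W is a list of rows, one row of per-class coefficients per OTU.
    """
    total = 0.0
    for leaves, w in zip(groups, weights):
        sq = sum(W[j][k] ** 2 for j in leaves for k in range(len(W[j])))
        total += w * math.sqrt(sq)
    return total


def objective(W, X, y, groups, weights, lam):
    """Mean multinomial negative log-likelihood plus lam * tree penalty."""
    n_classes = len(W[0])
    nll = 0.0
    for x, label in zip(X, y):
        scores = [sum(x[j] * W[j][k] for j in range(len(x)))
                  for k in range(n_classes)]
        nll -= math.log(softmax(scores)[label])
    return nll / len(X) + lam * tree_penalty(W, groups, weights)


# Hypothetical toy tree ((OTU0, OTU1), (OTU2, OTU3)):
# four leaves, two internal nodes and the root, each with unit weight.
GROUPS = [[0], [1], [2], [3], [0, 1], [2, 3], [0, 1, 2, 3]]
WEIGHTS = [1.0] * len(GROUPS)

# Two samples given as OTU frequencies, with environment labels 0 and 1.
X = [[0.5, 0.5, 0.0, 0.0],
     [0.0, 0.0, 0.6, 0.4]]
y = [0, 1]

W_zero = [[0.0, 0.0] for _ in range(4)]
W_one = [[3.0, 4.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]

print(objective(W_zero, X, y, GROUPS, WEIGHTS, lam=0.1))
print(tree_penalty(W_one, GROUPS, WEIGHTS))
```

Under this assumed form, a coefficient on a single OTU is charged once for its leaf and once for every ancestor node, so shrinking a whole clade to zero removes several penalty terms at once. This is one way a penalty can favor features at multiple taxonomic levels; the weighting scheme actually used by the authors may differ.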

Publication
Bioinformatics