PMID: 20678216;PMCID: PMC3161368;Abstract:
Background. There is increasing demand to test hypotheses that contrast the evolution of genes and gene families among genomes, using simulations that work across these levels of organization. The EvolSimulator program was developed recently to provide a highly flexible platform for forward simulations of amino acid evolution in multiple related lineages of haploid genomes, permitting copy number variation and lateral gene transfer. Synonymous nucleotide evolution is not currently supported, however, and would be highly advantageous for comparisons to full genome, transcriptome, and single nucleotide polymorphism (SNP) datasets. In addition, EvolSimulator creates new genomes for each simulation, and does not allow the input of user-specified sequences and gene family information, limiting the incorporation of further biological realism and/or user manipulations of the data. Findings. We present modified C++ source code for the EvolSimulator platform, which we provide as the extension module NU-IN. With NU-IN, synonymous and non-synonymous nucleotide evolution is fully implemented, and the user has the ability to use real or previously simulated sequence data to initiate a simulation of one or more lineages. Gene family membership can be optionally specified, as well as gene retention probabilities that model biased gene retention. We provide Perl scripts to assist the user in deriving this information from previous simulations. We demonstrate the features of NU-IN by simulating genome duplication (polyploidy) in the presence of ongoing copy number variation in an evolving lineage. This example is initiated with real genomic data, and produces output that we analyse directly with existing bioinformatic pipelines. Conclusions. The NU-IN extension module is a publicly available open source software (GNU GPLv3 license) extension to EvolSimulator. With the NU-IN module, users are now able to simulate both drift and selection at the nucleotide, amino acid, copy number, and gene family levels across sets of related genomes, for user-specified starting sequences and associated parameters. These features can be used to generate simulated genomic datasets under an extremely broad array of conditions, and with a high degree of biological realism. © 2010 Dlugosch et al; licensee BioMed Central Ltd.
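The distinction between synonymous and non-synonymous nucleotide changes mentioned in the abstract above can be illustrated with a minimal sketch. This is not NU-IN's C++ code or its interface: the function name `classify_substitution` and the layout of the lookup table are hypothetical, and the standard nuclear genetic code is assumed.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order
# (an assumption of this sketch; other genetic codes exist).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon: str, position: int, new_base: str) -> str:
    """Label a single-base change in a codon (hypothetical helper).

    Returns "synonymous" if the encoded amino acid (or stop signal) is
    unchanged, and "non-synonymous" otherwise.
    """
    if codon not in CODON_TABLE:
        raise ValueError(f"not a valid DNA codon: {codon!r}")
    if new_base not in BASES:
        raise ValueError(f"not a valid DNA base: {new_base!r}")
    if not 0 <= position < 3:
        raise ValueError("position must be 0, 1, or 2")
    if codon[position] == new_base:
        raise ValueError("substitution does not change the codon")

    mutated = codon[:position] + new_base + codon[position + 1:]
    if CODON_TABLE[codon] == CODON_TABLE[mutated]:
        return "synonymous"
    return "non-synonymous"


if __name__ == "__main__":
    # CTT and CTC both encode leucine; ATG (methionine) to ATA gives isoleucine.
    print(classify_substitution("CTT", 2, "C"))
    print(classify_substitution("ATG", 2, "A"))
```

A simulator that tracks only amino acids cannot represent the first kind of change at all, which is why nucleotide-level evolution matters for comparisons to SNP and transcriptome data.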
PMID: 19473382;PMCID: PMC2731706;Abstract:
While speciation can occur in the presence of gene flow, it is not clear what impact this gene flow has on genome- and range-wide patterns of differentiation. Here we examine gene flow across the entire range of the common sunflower, Helianthus annuus, its historically allopatric sister species H. argophyllus, and a more distantly related, sympatric relative, H. petiolaris. Analysis of genotypes at 26 microsatellite loci in 1015 individuals from across the range of the three species showed substantial introgression between geographically proximal populations of H. annuus and H. petiolaris, limited introgression between H. annuus and H. argophyllus, and essentially no gene flow between the allopatric pair, H. argophyllus and H. petiolaris. Analysis of sequence divergence levels among the three species in 1420 orthologs drawn from EST databases identified a subset of loci showing extremely low divergence between H. annuus and H. petiolaris and extremely high divergence between the sister species H. annuus and H. argophyllus, consistent with introgression between H. annuus and H. petiolaris at these loci. Thus, at many loci, the allopatric sister species are more genetically divergent than the more distantly related sympatric species, which have exchanged genes across much of the genome while remaining morphologically and ecologically distinct. © 2009 The Society for the Study of Evolution.
The 1,000 plants (1KP) project is an international multi-disciplinary consortium that has generated transcriptome data from over 1,000 plant species, with exemplars for all of the major lineages across the Viridiplantae (green plants) clade. Here, we describe how to access the data used in a phylogenomics analysis of the first 85 species, and how to visualize our gene and species trees. Users can develop computational pipelines to analyse these data, in conjunction with data of their own that they can upload. Computationally estimated protein-protein interactions and biochemical pathways can be visualized at another site. Finally, we comment on our future plans and how they fit within this scalable system for the dissemination, visualization, and analysis of large multi-species data sets.