The overall aim of my dissertation is to show that a novel source of phylogenetic information from the plastid genome, the tRNA interaction network, coupled with machine-learning and distance-based methods, is capable of accurately reconstructing deep phylogenetic relationships. First, I review the history of the plastid genome as a source of phylogenetic information, discuss sources of systematic bias in plastid sequence data, and introduce the transfer RNA (tRNA) interaction network as a source of phylogenetic data.
Second, I determine the phylogenetic origin of plastids within the Cyanobacteria tree of life (CyanoToL). Previous studies have strongly supported contradictory conclusions, with plastids branching either early or late within the CyanoToL. I begin by predicting the structural features that determine the charging potential of a tRNA with its cognate amino acid, termed tRNA Class Informative Features (CIFs), for 113 cyanobacterial genomes from eight cyanobacterial clades. I show that predicted tRNA CIFs differ between cyanobacterial clades in a phylogenetically informative way that can be exploited to accurately classify cyanobacterial genomes using a machine-learning algorithm known as a multilayer perceptron (MLP); I have named the resulting classifier CYANO-MLP. I then use CYANO-MLP to test competing hypotheses of the origin of plastids by classifying 440 plastid genomes. I find support for the origin of plastids among a late-branching clade of starch-producing marine/freshwater diazotrophic cyanobacteria. Finally, I show that previously used phylogenetic models are unable to accommodate systematic biases, which may explain the conflicting hypotheses.
Third, I use tRNA CIFs to determine the phylogenetic placement of gnetophytes, a small clade of plants, within the seed plant phylogeny. The placement of gnetophytes has been contentious, with phylogenomic studies supporting several different relationships with cone-bearing seed plants (conifers). Here I use the Jensen-Shannon divergence to calculate a pairwise distance matrix between seed plant clades from plastid tRNA CIFs. Using standard distance-based phylogenetic algorithms, I find support for gnetophytes as sister to conifers.
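As a rough illustration of the distance calculation described above, the following minimal sketch computes a pairwise Jensen-Shannon divergence matrix between discrete frequency profiles. It is not the dissertation's implementation: the function names, the clade labels (clade_A, clade_B, clade_C), and the count vectors are hypothetical placeholders, and it assumes each clade's CIFs can be summarized as counts over a shared set of categories.

```python
import math


def normalize(counts):
    """Convert raw counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions.

    Both inputs must be defined over the same ordered set of categories.
    With base-2 logarithms the result lies between 0 and 1.
    """
    # Mixture distribution: the midpoint of p and q.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_i == 0 contribute 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def distance_matrix(profiles):
    """Return sorted labels and the symmetric matrix of pairwise divergences."""
    names = sorted(profiles)
    matrix = [
        [jensen_shannon_divergence(profiles[a], profiles[b]) for b in names]
        for a in names
    ]
    return names, matrix


# Hypothetical toy counts, invented purely for illustration.
profiles = {
    "clade_A": normalize([8, 1, 1, 0]),
    "clade_B": normalize([7, 2, 1, 0]),
    "clade_C": normalize([1, 1, 4, 4]),
}

names, matrix = distance_matrix(profiles)
for name, row in zip(names, matrix):
    print(name, [round(d, 3) for d in row])
```

A matrix of this form is the kind of input that distance-based tree-building algorithms such as neighbor joining accept. Some analyses use the square root of the divergence instead, since that quantity satisfies the triangle inequality; which variant applies here is not stated in this abstract.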
Lastly, I describe the implementation of two software packages. The first is tsfm (tRNA structure function mapper), which provides methods for predicting tRNA CIFs. The second is FAST (FAST Analysis of Sequences Toolbox), a suite of tools modeled after GNU Textutils for processing molecular sequence data on the command line.