This dissertation shows that new and unconventional approaches to phylogenetic and classification problems, built on systems-biological data and machine learning, compare well with standard practices in computational time, power, and accuracy. First, we introduce several themes in evolutionary biology and explain the transfer RNA (tRNA) interactome.
Then, we describe a new way to classify individual organisms based on information from whole genomes. We begin by predicting the features by which proteins identify tRNAs, termed "Class Informative Features" (CIFs), using a functional information calculation based on information theory and conditional probability. Together, these features form a species-specific "identity code". We predict different, but related, codes for different groups of organisms. We then train an artificial neural network to recognize which code a new, unknown genome is most closely related to, using only primary sequence data. We apply our method to SAR11, one of the most abundant bacterial clades in the world's oceans, which has been hypothesized to be a phylogenetic sister to the last alphaproteobacterial ancestor of mitochondria. We find that different strains of SAR11 are more distantly related, both to each other and to mitochondria, than previously thought.
Next, we apply the same logic to determining the origin of the plastid within the Cyanobacteria. Using Jensen-Shannon Information Difference calculations, we retrieve a tree that groups plastids with cyanobacteria not classically thought to be associated with the cyanobacterial ancestor of the chloroplast. We also present evidence that refutes classical cyanobacterial topologies, and we identify evidence in the recent literature that offers mechanistic justification for our largest CIFs.
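As a rough illustration of the kind of quantity involved, the sketch below computes the standard Jensen-Shannon divergence between two discrete distributions. It is an assumption that the Jensen-Shannon Information Difference used in this work reduces to, or is built from, this standard quantity; the function names and the nucleotide frequencies are hypothetical and chosen only for illustration.

```python
import math


def shannon_entropy(p):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def jensen_shannon_divergence(p, q):
    """Standard Jensen-Shannon divergence, in bits, between p and q.

    Assumption: this is the textbook definition, not necessarily the exact
    "Information Difference" statistic used in the dissertation.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Mixture distribution: the pointwise average of p and q.
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return shannon_entropy(m) - (shannon_entropy(p) + shannon_entropy(q)) / 2


# Hypothetical frequencies of A, C, G, U at one tRNA position in two clades.
clade_a = [0.7, 0.1, 0.1, 0.1]
clade_b = [0.1, 0.1, 0.1, 0.7]
jsd = jensen_shannon_divergence(clade_a, clade_b)
print(f"JSD = {jsd:.3f} bits")
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 bit (distributions with no shared support), and it is symmetric in its arguments, which makes it convenient as a pairwise dissimilarity for tree building.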
Finally, we investigate the trend of CIFs across the bacterial tree of life, showing that CIFs maintain a relatively consistent G+C content in all genomes that can be classified by order. This work yields a pipeline that classifies any fully sequenced bacterial genome into a user-defined bacterial order. With modifications to the training of the classifier and improved leave-one-out cross-validation of the data scoring, we expect this method to be robust to the biological and statistical variations that affect current tree-building methods.