Thesis Defense: Xiaoyue Cui | August 20, 2025 | 9am

CBD and CPCB are proud to announce the following Thesis Defense:

TITLE: Exploring the Design Rules of Multidomain Architectures

Xiaoyue Cui
Wednesday, August 20th
9:00 AM, EDT
GHC 7101

Committee:
Dannie Durand, Chair, CMU
Maria Chikina, Pitt
Dennis Kostka, Pitt
Alex Bateman, EMBL-EBI

Abstract:

Multidomain proteins are mosaics of structural or functional units called domains. A domain architecture, the sequence of domains in a protein in N- to C-terminal order, changes via the gain and loss of domain-encoding segments over the course of evolution. In theory, these processes can generate any combination of domains. In practice, only a tiny fraction of all possible domain combinations occur in nature, suggesting that domain order and co-occurrence are highly constrained. Here, I investigate the design rules that govern multidomain architectures from multiple perspectives, combining comparative genomics, probabilistic modeling, and embedding-based analyses.
I begin by re-examining the exon shuffling hypothesis, one of the earliest and most widely studied hypotheses in protein evolution, which posits that intronic recombination creates new domain combinations, facilitating the evolution of novel protein function. Recent genome-wide studies have reported conflicting conclusions. One possible reason is the use of an unrealistic statistical model that leads to overconfidence in the significance of the results. I develop accurate, tractable null models that support genome-scale tests of this hypothesis. When applied to metazoan and fungal genomes, my tests find strong evidence for exon shuffling in metazoa, but not in fungi. I conclude that the support for exon shuffling outside of metazoa reported in prior studies may have been exaggerated by the use of unrealistic models. The extent of exon shuffling in eukaryotic genomes therefore remains an open question that deserves further investigation.
Next, I introduce a simulator for domain architecture evolution, designed to capture the stringent constraints on domain order and co-occurrence observed in nature. The simulator estimates the probability of a given domain architecture using a bigram model, in which each domain's probability depends on its immediate predecessor. These probabilities are incorporated into a Metropolis-Hastings framework, where states correspond to domain architectures, transitions correspond to domain insertion and deletion events, and proposed events are accepted or rejected according to the probabilities given by the bigram model. Applied to metazoan datasets, the simulator generates architectures that recapitulate key properties of genuine proteomes.
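To make the sampling scheme described above concrete, the following is a minimal sketch, not the simulator developed in the thesis. The boundary tokens, the probability floor for unseen bigrams, the uniform insertion/deletion proposal, and the toy probabilities are all assumptions chosen for illustration.

```python
import math
import random

START, END = "<s>", "</s>"  # hypothetical boundary tokens
FLOOR = 1e-6                # assumed probability for unseen bigrams

def log_prob(arch, bigram):
    """Log-probability of an architecture (tuple of domains) under a bigram model."""
    tokens = [START, *arch, END]
    return sum(math.log(bigram.get(pair, FLOOR))
               for pair in zip(tokens, tokens[1:]))

def mh_step(arch, bigram, vocab, rng):
    """One Metropolis-Hastings step: propose one insertion or one deletion."""
    if rng.random() < 0.5:
        # Insert a uniformly chosen domain at a uniformly chosen position.
        pos = rng.randrange(len(arch) + 1)
        proposal = arch[:pos] + (rng.choice(vocab),) + arch[pos:]
        log_hastings = math.log(len(vocab))   # q(reverse) / q(forward)
    else:
        if len(arch) == 1:
            return arch                       # never propose an empty architecture
        # Delete the domain at a uniformly chosen position.
        pos = rng.randrange(len(arch))
        proposal = arch[:pos] + arch[pos + 1:]
        log_hastings = -math.log(len(vocab))
    log_alpha = (log_prob(proposal, bigram) - log_prob(arch, bigram)
                 + log_hastings)
    # Accept with probability min(1, alpha); otherwise keep the current state.
    if rng.random() < math.exp(min(0.0, log_alpha)):
        return proposal
    return arch

# Toy bigram probabilities (invented numbers, for illustration only).
vocab = ["SH3", "SH2", "Kinase"]
bigram = {
    (START, "SH3"): 0.6, (START, "SH2"): 0.4,
    ("SH3", "SH2"): 0.5, ("SH3", END): 0.5,
    ("SH2", "Kinase"): 0.7, ("SH2", END): 0.3,
    ("Kinase", END): 1.0,
}

rng = random.Random(0)
state = ("SH3",)
for _ in range(1000):
    state = mh_step(state, bigram, vocab, rng)
print(state)
```

In this sketch the Hastings correction compensates for the fact that an insertion must also choose which domain to add, whereas the reverse deletion does not.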
The bigram model underlying the simulator is estimated directly from observed data and is therefore subject to sparsity: the majority of domain bigrams are never observed. Some bigrams may be biologically disadvantageous and therefore absent from nature, while others may be absent because the data are incomplete. To address this sparsity, I develop a unified smoothing framework based on interpolation, which can be tuned to accommodate different characteristics of bigram count data. I design several model variants within this framework that assign low probabilities to unseen bigrams that are likely excluded by biological constraints, while making appropriate adjustments for incompleteness. I evaluate these models empirically and show that they preserve key signatures of multidomain data.
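As a rough illustration of interpolation-based smoothing, the sketch below uses standard Jelinek-Mercer interpolation between the maximum-likelihood bigram estimate and a unigram background. This is a textbook technique substituted here for illustration, not the framework developed in the thesis; the function name, the weight `lam`, and the toy architectures are assumptions. In particular, it does not distinguish biologically excluded bigrams from those missing because of incomplete data.

```python
from collections import Counter

def interpolated_bigram(architectures, lam=0.8):
    """Return prob(a, b), an interpolated estimate of P(b | a).

    lam weights the maximum-likelihood bigram estimate; (1 - lam) weights
    the unigram background, so unseen bigrams get non-zero probability.
    """
    bigram_counts = Counter()
    context_counts = Counter()
    unigram_counts = Counter()
    for arch in architectures:
        unigram_counts.update(arch)
        for a, b in zip(arch, arch[1:]):
            bigram_counts[(a, b)] += 1
            context_counts[a] += 1
    total = sum(unigram_counts.values())

    def prob(a, b):
        p_uni = unigram_counts[b] / total
        if context_counts[a] == 0:
            return p_uni                      # no evidence for this context
        p_ml = bigram_counts[(a, b)] / context_counts[a]
        return lam * p_ml + (1 - lam) * p_uni

    return prob

# Toy data (invented, for illustration only).
toy = [("SH3", "SH2", "Kinase"), ("SH3", "SH2"), ("SH2", "Kinase")]
prob = interpolated_bigram(toy)
print(prob("SH3", "SH2"), prob("SH3", "Kinase"))
```

Here the pair SH3 followed by Kinase never occurs in the toy data, yet it receives a small non-zero probability from the unigram term.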
One property that is likely to influence selective pressures is protein function. To investigate the relationship between domain organization and protein function, I adapt vector embeddings, which account for local contextual signals. As a control, I use Jaccard similarity, a simple measure of shared domain content that captures the fraction of domains common to two architectures. I find that multidomain architectures that are close in the embedding space share more functional attributes than those selected based on Jaccard similarity, suggesting that context is important for understanding the relationship between domain organization and protein function. Surprisingly, domain architectures with similar functions but no common domains are found in close proximity in the embedding space more often than expected by chance, suggesting the existence of domain "synonyms".
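For reference, Jaccard similarity between two architectures can be computed as in the short sketch below. Treating each architecture as a set of domain names is an assumption made here for illustration; under that assumption the measure ignores both domain order and copy number, which is why it serves as a context-free baseline.

```python
def jaccard(arch_a, arch_b):
    """Jaccard similarity of the domain sets of two architectures."""
    a, b = set(arch_a), set(arch_b)
    if not a and not b:
        return 1.0                 # convention for two empty architectures
    return len(a & b) / len(a | b)

# Same domains in a different order are indistinguishable to Jaccard.
print(jaccard(("SH3", "SH2", "Kinase"), ("Kinase", "SH2", "SH3")))
# One shared domain out of three distinct domains.
print(jaccard(("SH3", "SH2"), ("SH2", "Kinase")))
```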
Finally, I apply the embedding framework to identify and characterize the "forbidden zone" of multidomain architectures: regions of embedding space populated by randomized architectures but absent from genuine proteomes. I find that forbidden domain architectures tend to exhibit two properties: localization incompatibility and atypical copy numbers of domains typically found in tandem repeats. I further consider cancer fusion proteins as an alternative negative set, as they represent atypical domain combinations that are not found in the genuine proteome.