University of Pittsburgh
Carnegie Mellon University

Joint CMU-Pitt Ph.D. Program in Computational Biology

Robert F. Murphy and Ivet Bahar, Directors

Seminar Series Abstracts

Fridays at 11am.
Seminars hosted by Carnegie Mellon are in the Mellon Institute conference room.
Seminars hosted by Pitt are in BST3 room 6014.

October 13, 2006 Chris Bailey-Kellogg Dartmouth College

Embedding Computation in Studies of Protein Structure and Function

Research in my lab focuses on making computation an integral part of planning, conducting, and interpreting experiments for protein three-dimensional structure and function. In our system for rapid computational-experimental investigation of protein structure, algorithms optimize and interpret sets of relatively cheap and fast experiments, each providing relatively sparse and noisy, but also complementary, geometric information. In our approach to engineering variant proteins, algorithms first infer probabilistic models that integrate information about sequence, structure, and function, and then use these models to optimize site-directed recombination experiments. Finally, in our work on data analysis in nuclear magnetic resonance spectroscopy, computation enables new approaches that can handle sparse data and can guarantee consideration of all feasible interpretations. I will discuss these projects, the computational challenges and our approaches, and the results.
October 20, 2006 Russell Schwartz Carnegie Mellon

Near-perfect Phylogeny Construction from Genetic Variation Data

The recent accumulation of large amounts of genotype data from human populations has created an unprecedented opportunity for inferring the history of the human species, identifying the forces shaping our genome, and finding links between genotype and phenotype. Making full use of these data will, however, require substantial improvements in algorithms for large-scale phylogenetic inference. Fast algorithms are available for inferring maximally parsimonious mutational histories from genetic variation data under various restrictive assumptions. The "perfect phylogeny" model, which assumes that each genetic variant appears only once within a population's history, has been particularly important to developing phylogenetic algorithms suitable for high-throughput analysis. But such restrictive models are not robust to the variability observed in real data sets. Heuristic methods, which may give solutions far from optimal, thus remain the standard in practice. In this talk, we will cover recent developments in algorithms for optimal phylogenetic inference based on the search for "near-perfect" phylogenies, those that come close to satisfying the perfect phylogeny assumption. Permitting even small deviations from perfection lets the models describe a substantially broader group of genotype data. We will examine a recent algorithm proving fixed-parameter tractability for the problem of binary parsimony-based inference when the optimal phylogeny differs from a perfect phylogeny by a small number of additional mutations. We will then cover a method for unphased genotype data, a variation of considerable practical importance in which contributions from homologous chromosomes are not separated in the input data. This problem can also be solved in polynomial time when the deviation from perfection is bounded. Finally, we will see some applications of these methods to real variation data.
This work was joint with Srinath Sridhar, Kedar Dhamdhere, Guy Blelloch, Eran Halperin, and R. Ravi.
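
For readers unfamiliar with the perfect phylogeny assumption, the sketch below illustrates the classical four-gamete test. It is not the algorithm presented in the talk. It assumes phased, binary haplotype data (rows are haplotypes, columns are variant sites), and the function names and toy data are hypothetical. A binary matrix is consistent with a perfect phylogeny exactly when no pair of sites shows all four combinations 00, 01, 10, and 11; the number of conflicting pairs gives a rough sense of how far a data set is from "perfect".

```python
from itertools import combinations


def conflicting_site_pairs(haplotypes):
    """Return the site pairs (i, j) that show all four gametes 00, 01, 10, 11.

    `haplotypes` is a list of equal-length tuples of 0/1 values (an assumption
    of this sketch: the data are phased and binary).
    """
    if not haplotypes:
        return []
    n_sites = len(haplotypes[0])
    conflicts = []
    for i, j in combinations(range(n_sites), 2):
        # Collect the distinct two-site combinations observed in the sample.
        gametes = {(row[i], row[j]) for row in haplotypes}
        if len(gametes) == 4:
            conflicts.append((i, j))
    return conflicts


def admits_perfect_phylogeny(haplotypes):
    """Four-gamete test: True when no pair of sites is in conflict."""
    return not conflicting_site_pairs(haplotypes)


# Toy data (hypothetical): each variant arises once, so the test passes.
perfect = [
    (0, 0, 0),
    (1, 0, 0),
    (1, 1, 0),
    (0, 0, 1),
]
# Adding one haplotype that carries site 1 without site 0 creates a conflict.
imperfect = perfect + [(0, 1, 0)]

print(admits_perfect_phylogeny(perfect))
print(conflicting_site_pairs(imperfect))
```

The test only detects violations of the assumption; deciding how few extra mutations explain them is the harder problem the abstract addresses.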
November 3, 2006 Lillian Chong University of Pittsburgh

Exploring protein recognition events using theory and simulation

The central goal of my lab is to use theory and simulation to understand how proteins fold, bind their partners, and catalyze reactions, with an emphasis on how malfunctions at the molecular level can be linked to clinical data for various diseases. To achieve this goal, we develop accurate approaches for simulation and subsequent analysis of protein structure and function. In this talk, I will present my postdoctoral research involving the use of distributed computing and atomistic molecular dynamics simulation approaches for studying conformational changes that occur in protein binding events. I will also discuss my current research interests involving the study of unstructured proteins.
November 10, 2006 Nathan Urban Carnegie Mellon

Noise-induced neuronal synchronization

Neurons work more effectively when they are active together. Simultaneous firing, especially oscillatory firing, is a common feature of brain activity in many areas and across many species. We are interested in uncovering computational and biophysical mechanisms of such synchronization. Using combined computational and physiological approaches we have determined which aspects of neuronal dynamics, synaptic properties and anatomical connectivity are critical for the generation of synchronized activity in large networks of neurons and have described a mechanism whereby aperiodic fluctuations can generate widespread neuronal synchronization in the absence of synaptic connections. I will discuss this work and its application to other examples of biological synchronization.
November 17, 2006 Carlos J. Camacho University of Pittsburgh
December 1, 2006 Ronald Wetzel University of Pittsburgh

Protein aggregation - challenges to our understanding of protein structure and behavior

Much of the beautiful experimental and computational work on protein structure and folding of the last two decades of the 20th Century was conducted on globular proteins. Even large complex structures like the ribosome differ more in size than in kind from relatively simple systems like ribonuclease. In contrast, polypeptides are revealing new and surprising idiosyncrasies in studies of the assembly, structures and structural dynamics of protein aggregates like amyloid. Work on the nucleation mechanism of polyglutamine aggregation important to Huntington's disease reveals a nucleus size of one, consistent with a highly unfavorable folding event. Working with the Aβ amyloid fibrils important to Alzheimer's disease, we have been able to quantify the consequences of mutations to fibril stability. These studies, coupled with counting of H-bonds in the amyloid network by hydrogen exchange, reveal a degree of structural plasticity within amyloid fibrils not normally observed in globular proteins, but not unusual in synthetic polymers. A full understanding of protein aggregation must include an appreciation of the monomer ensemble. Thus, destabilization of the monomer-aggregate equilibrium by an oligoproline sequence added to the C-terminus of an aggregating peptide appears to be due to stabilization of the monomer ensemble. The thermodynamics favoring polyglutamine aggregation can already be detected in studies of the conformational preferences of the monomer ensemble, which is disordered but condensed, consistent with a surprising aversion by the polyglutamine sequence to solvent contacts. Thus, amyloid studies are generating new challenges to computational biology that extend beyond the aggregates themselves to the amyloidogenic proteins that support their growth.
January 5, 2007 Serafim Batzoglou Stanford University

Models and algorithms for genomic sequences, proteins and networks of protein interactions

This talk has two parts: the first part is on new ways to model and analyze biological sequences; the second part describes methods for constructing and comparing protein interaction networks, which are emerging as canonical data sets of the post-genomic era.
Algorithms for biological sequence analysis. One of the most fruitful developments in bioinformatics in the past decade was the wide adoption of Hidden Markov Models (HMMs) and related graphical models to an array of applications such as gene finding, sequence alignment, and non-coding RNA folding. Conditional Random Fields (CRFs) are a recent alternative to HMMs, and provide two main advantages: (1) they enable more elaborate modeling of biosequences by allowing us to conveniently describe and select rich feature sets. For example, when comparing two residues during protein alignment, using a CRF allows leveraging in a principled manner the chemical properties of the neighborhood of those residues. (2) CRFs allow training of parameters in a way that is more effective for making predictions on new input sequences. I will describe three practical CRF-based tools that improve upon state-of-the-art methods in terms of accuracy: CONTRAlign, a protein aligner; CONTRAST, a gene finder; and CONTRAfold, a method for predicting the secondary structure of non-coding RNAs. Our tools are available at http://contra.stanford.edu
Networks of protein interactions. Graphs that summarize pairwise interactions between all proteins of an organism have emerged as canonical data sets that can be constructed using multiple sources of functional genomic data. We construct protein interaction networks for all sequenced microbes by integrating information extracted from genomic sequences as well as microarrays and other predictors of pairwise interactions. We then align these networks in multiple species using Graemlin, a tool that we developed for that purpose, and search for modules (subgraphs) of proteins that exhibit homology as well as conservation of pairwise interactions among many organisms. Graemlin provides substantial speed and sensitivity gains compared to previous network alignment methods; it can be used to compare microbial networks at http://graemlin.stanford.edu
January 19, 2007 Andy Walsh Carnegie Mellon University

Visualization and Exploration of Large Multiple Sequence Alignments

As the number of available biological sequences grows, new tools will be needed to visualize and explore the large quantities of data in a meaningful way. This is particularly evident in the field of virology, where tens of thousands of sequences are already available for certain viruses. This means that in some cases there are hundreds of unique sequences for single proteins in specific viruses, so that alignment of these sequences is meaningful and potentially fruitful. Here I will present a tool specifically designed for the challenges of visualizing and exploring an alignment containing any number of protein sequences. Many of the features of this tool will be demonstrated; these include multiple simultaneous linked views, displays which emphasize distributions over lists of data, and views highlighting physicochemical properties. I will also show some applications of the tool to studies of gp160 from HIV-1; these include studies focused on the unique properties of certain strains and on the universal properties of all strains.
February 2, 2007 Ziv Bar-Joseph Carnegie Mellon University

Data integration for understanding dynamic biological systems

A variety of high-throughput experimental methods are enabling researchers to obtain a global view of dynamics and interactions of proteins in the cell. However, these datasets are often noisy and each represents only one aspect of cellular activity. By combining different types of data from multiple species we can overcome the limitations of each of the datasets and infer a more accurate picture of the activity and interactions in the cell. In this talk I will discuss the applications of data integration for identifying the differences in cycling genes between cancer and normal human cells and for determining a core set of cycling genes from multiple species. I will also present a new method for reconstructing dynamic regulatory networks with application to modeling yeast response to stress.
February 9, 2007 Tao Jiang University of California, Riverside

A High-Throughput Combinatorial Approach to Genome-Wide Ortholog Assignment

The assignment of orthologous genes between a pair of genomes is a fundamental and challenging problem in comparative genomics. Existing methods that assign orthologs based on the similarity between DNA or protein sequences may make erroneous assignments when sequence similarity does not clearly delineate the evolutionary relationship among genes of the same families. In this paper, we present a new approach to ortholog assignment that takes into account both sequence similarity and evolutionary events at the genome level, where orthologous genes are assumed to correspond to each other in the most parsimonious evolutionary scenario under genome rearrangement and gene duplication. It is then formulated as a problem of computing the signed reversal distance with duplicates between two genomes of interest, for which an efficient heuristic algorithm was given by introducing two new optimization problems, minimum common partition and maximum cycle decomposition. Following this approach, we have implemented a high-throughput system for assigning orthologs on a genome scale, called MSOAR, and tested it on both simulated data and real genome sequence data. Our predicted orthologs between the human and mouse genomes are strongly supported by ortholog and protein function information in authoritative databases, and by predictions made by other key ortholog assignment methods such as Ensembl, Homologene, INPARANOID, and HGNC. The simulation results demonstrate that MSOAR in general performs better than David Sankoff's iterated exemplar algorithm in terms of identifying true exemplar genes.
This is joint work with X. Chen, Z. Fu, J. Zheng, V. Vacic, P. Nan, Y. Zhong, and S. Lonardi.
February 23, 2007 Arijit Chakravarty Millennium Pharmaceuticals

How much got there and what did it do: computational and mathematical techniques in Oncology drug discovery

The process of rational drug discovery and development hinges on answering three critical questions correctly: "What is the right molecular target?", "What is the right compound for the inhibition of this target?", and "What is the right dose for this compound in humans?". Of these questions, the third is often the most critical, since it occurs at the transition to the clinical trials, a set of enormously complex and expensive experiments marking the culmination of the drug development process. In my role as a preclinical pharmacologist on drug discovery and development teams, I have worked to identify the Mechanism of Action of anticancer therapeutics, and translate that Mechanism of Action into a dosing and development strategy. Although this work is deeply rooted in the fundamental biology of the disease, and often very "wet", computational techniques play a crucial supporting role. For example, we use high-content assays in tissue to assess the effect of the drug in the body, some of which rely on the development of formal machine learning classifiers. As the robustness of these assays is key to the decision-making process, we validate such assays using formal statistical techniques such as bootstrapping and variance components analysis. The results of such assays are coupled with simulation and modeling techniques (pharmacodynamic/pharmacokinetic modeling) to project the effects of particular doses or dose schedules. In my talk, I will provide an overview of the drug discovery process and an overview of the application of computational techniques in preclinical pharmacology. I will also present vignettes and case studies from my own work demonstrating the application of these techniques.
March 2, 2007 Gregory Voth University of Utah

The Multiscale Challenge for Biomolecular Systems: A Systematic Approach

A multiscale theoretical and computational methodology will be presented for characterizing biomolecular systems and assemblies across multiple length- and time-scales. The approach provides a connection between atomistic molecular dynamics, reduced mesoscopic models, and near continuum-scale mechanics. At the heart of the methodology is a new and systematic multiscale coarse-graining theory for linking the atomistic-scale interactions to the mesoscale and beyond. Applications of the overall approach will be given for membranes, peptides, and proteins.
March 16, 2007 Elodie Ghedin University of Pittsburgh

Whole-Genome Analyses and Evolutionary Dynamics of Influenza A

With the advent of high-throughput genomic methods to sequence complete human and avian influenza strains, expanded datasets are now available for in-depth analysis of viral evolution dynamics. Using comparative genomic analyses of these large sequence datasets, we highlight mutations of interest in all 8 segments of influenza. We specifically focus on characterizing regions associated with pathogenicity, reemergence of strains, and balanced mutations involved in functional cooperativity between proteins. Using a novel single sequence analysis approach called Thermodynamic Tolerance (TT) we also study mutations leading to drug resistance and adaptation to new hosts. This method calculates at various positions along the length of a gene the number of alternative sequences that would have the same thermodynamic context. TT-analysis gives an indication of how tolerant to change (i.e., mutations) certain regions of a gene are for functionality to be conserved. By using this method we can quantify the long-range correlations of mutations and ascertain their effect on active sites. By means of plasmid-based reverse genetics and site-directed mutagenesis we can then genetically engineer influenza A viruses in order to validate in vitro the effect of mutations on the sites of interest.
March 30, 2007 Ivan Maly University of Pittsburgh

How T-Killer Cells deliver the Kiss of Death: A Quantitative Systems Perspective

T cells of the immune system bind individual virus-infected or tumor cells and are able to kill them one by one, directionally, in the crowded tissue environment. The directionality of the T-cell response arises from structural polarization of the microtubule-organizing center and the associated membranous organelles in the T cell toward the interface with the cell that is the intended target. This talk will address the mechanistic origin of the functional polarization in T cells. Quantitative theories as well as measurements made in a simplified experimental model will be presented. We will specifically address the kinematics of polarization as revealed by three-dimensional, live-cell microscopy, the kinetics of receptor redistribution in the T cell, and the mechanical optimization of the T cell structure. The main thesis of the talk will be that the polarization of the microtubule-organizing center is achieved through two distinct mechanisms, both of which are different from the paradigmatic intracellular-migration mechanism.
April 13, 2007 Jing Li Case Western Reserve University

Multi-Stage Design and Multi-Stage Analysis of Genome-Wide Association Studies

With the completion of the international HapMap project and with recent advances in genotyping technology, large-scale genome-wide association studies (GWAS) for complex diseases are increasingly common. However, great statistical challenges still exist in testing hundreds of thousands of SNPs in the context of mapping complex diseases. In this talk, I will discuss some of our recent work for GWAS based on a multi-stage design and/or a multi-stage analysis. In a multi-stage design, only a fraction of samples are genotyped and tested using a dense set of SNPs in the first stage, and only a small subset of markers that show moderate associations with the disease will be genotyped in the second (or later) stage(s). In a multi-stage analysis, single-locus based approaches are performed in the first stage and gene-gene interactions are evaluated in later stages. I will mainly focus on SNP subset selection methods that promote SNPs from one stage to later stages.
April 20, 2007 Ron Dror D.E. Shaw Research

Scalable Algorithms for Fast Molecular Dynamics Simulations

Although molecular dynamics (MD) simulations of biomolecular systems often run for days to months, many events of great scientific interest and pharmaceutical relevance occur on long time scales that remain beyond reach. Such events include functionally important changes in protein structures, folding of proteins to their native three-dimensional structures, and interactions between proteins or between proteins and candidate drug molecules. We present several new algorithms that significantly accelerate parallel MD simulations compared with current state-of-the-art codes. These methods are embodied in a newly developed MD code called Desmond that achieves unprecedented simulation throughput and parallel scalability on commodity clusters. For example, on a standard benchmark, Desmond's performance on a conventional Opteron cluster with 2K processors slightly exceeded the reported performance of IBM's Blue Gene/L machine with 32K processors running its Blue Matter MD code. This performance boost has allowed us to tackle a broader range of biological problems by simulation.

This is joint work with Kevin Bowers, Edmond Chow, Huafeng Xu, Michael Eastwood, Brent Gregersen, Morten Jensen, John Klepeis, Istvan Kolossvary, Mark Moraes, Federico Sacerdoti, John Salmon, Yibing Shan, and David Shaw.

April 27, 2007 David J. States University of Michigan

Integrating Genomics and Proteomics: Novel Translation Products Identified by Mass Spectrometry

Much of what we know about the universe of proteins has been inferred from nucleic acid sequence data. Mass spectrometry based proteomics is emerging as a powerful new analytical technology, but there are multiple levels of disambiguation required to link MS spectra to protein products. Issues in data integration and statistical assessment of identification significance are discussed. Examples will be drawn from cancer proteomics.
May 4, 2007 Joel Bader Johns Hopkins University

Exploratory Data Analysis for Biological Networks

Genome sequencing has become easy, but understanding how genes and proteins assemble into a network remains hard. I will discuss our group's work on this problem. We have been developing data-mining algorithms for unsupervised prediction of gene modules from heterogeneous data. For supervised queries, we have been developing new methods based on graph diffusion kernels similar to Google's PageRank algorithm. In other work, we have been developing a computational scheme to compute gene regulatory network wiring diagrams directly from DNA and protein sequence using all-atom simulations. Finally, I will describe an extension of capture-recapture statistics that we have developed to estimate false-positive and false-negative rates for network edges (biological interactions) revealed by noisy high-throughput experiments.
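
As background for the capture-recapture idea mentioned above, the sketch below shows the classical two-sample Lincoln-Petersen estimator applied to interaction edges. It is not the extension described in the talk. It assumes two independent screens that report only true, undirected interactions, and the function names and protein labels are hypothetical. Under those assumptions the overlap between the screens yields an estimate of the total number of interactions, and from that the fraction each screen recovers.

```python
def normalize_edges(edges):
    """Treat each interaction as an undirected pair, so (a, b) equals (b, a)."""
    return {frozenset(edge) for edge in edges}


def lincoln_petersen(edges_a, edges_b):
    """Estimate the total number of interactions from two independent screens.

    Classical estimator: N = n_a * n_b / m, where m is the number of edges
    found by both screens. Assumes both screens sample the same edge
    population independently and contain no false positives.
    """
    a = normalize_edges(edges_a)
    b = normalize_edges(edges_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("no shared edges; the estimator is undefined")
    return len(a) * len(b) / overlap


# Toy screens (hypothetical protein names).
screen_a = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P4", "P5")]
screen_b = [("P2", "P1"), ("P3", "P2"), ("P5", "P6")]

estimated_total = lincoln_petersen(screen_a, screen_b)
# Estimated fraction of all interactions that screen A recovered.
coverage_a = len(normalize_edges(screen_a)) / estimated_total
print(estimated_total, coverage_a)
```

One minus the coverage is a rough false-negative rate for a screen. Estimating false positives as well requires relaxing the assumption that every reported edge is real.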