- Elizabeth White, CU-Denver Anschutz Medical Campus, United States
- Larry Hunter, CU-Denver Anschutz Medical Campus, United States
- William Baumgartner, CU-Denver Anschutz Medical Campus, United States
- Michael Bada, CU-Denver Anschutz Medical Campus, United States
Presentation Overview:
As scientists accumulate more finely grained knowledge about biology, they still struggle to leverage this new information in ways that let them build hypotheses and frame alternative explanations. Our lab has built a system that combines many data sources into a coherent biological representation using Open Biological Ontologies and OWL semantics. This allows users to explore biological molecules and the relations that connect them in various processes and pathways.
Recent work has focused on integrating information from UniProt, which contains detailed protein sequence features and variant information, with the functional relations described in Reactome to model proteins’ biochemical reactions and interactions. Integrating entities and relations from these sources into the knowledge base poses significant challenges, including when to recognize existing entities in the knowledge base and when to posit new ones; however, this additional information permits investigation of the processes that mediate modification, trafficking, and localization of proteins. Disruptions in these processes are key factors in many diseases: mislocalized or mismodified proteins can gain or lose function, coerce partners into pathological behavior, and otherwise cause varying degrees of havoc in the cell.
- Bonnie Berger, Massachusetts Institute of Technology, United States
- Yoonjoo Choi, Korea Advanced Institute of Science and Technology, South Korea
- Jacob Furlon, Dartmouth College, United States
- Ryan Amos, Princeton University, United States
- Karl Griswold, Dartmouth College, United States
- Chris Bailey-Kellogg, Dartmouth College, United States
Presentation Overview:
Motivation: Disruption of protein-protein interactions can mitigate antibody recognition of therapeutic proteins, yield monomeric forms of oligomeric proteins, and elucidate signaling mechanisms, among other applications. While designing affinity-enhancing mutations remains generally quite challenging, both statistically- and physically-based computational methods can precisely identify affinity-reducing mutations. In order to leverage this ability to design variants of a target protein with disrupted interactions, we developed the DisruPPI protein design method (DISRUpting Protein-Protein Interactions) to optimize combinations of mutations simultaneously for both disruption and stability, so that incorporated disruptive mutations do not inadvertently affect the target protein adversely.
Results: Two existing methods for predicting mutational effects on binding, FoldX and INT5, were demonstrated to be quite precise in selecting disruptive mutations from the SKEMPI and AB-Bind databases of experimentally determined changes in binding free energy. DisruPPI was implemented to use an INT5-based disruption score integrated with an AMBER-based stability assessment, and was applied to disrupt protein interactions in a set of different targets representing diverse applications. In retrospective evaluation with three different case studies, comparison of DisruPPI-designed variants to published experimental data showed that DisruPPI was able to identify more diverse interaction-disrupting and stability-preserving variants more efficiently and effectively than previous approaches. In prospective application to an interaction between enhanced green fluorescent protein (EGFP) and a nanobody, DisruPPI was used to design five EGFP variants, all of which were shown to have significantly reduced nanobody binding while maintaining function and thermostability. This demonstrates that DisruPPI may be readily utilized for effective removal of known epitopes of therapeutically-relevant proteins.
Availability: DisruPPI is implemented in the EpiSweep package, freely available under an academic use license.
- Maximilian Miller, Rutgers University, United States
- Daniel Vitale, Rutgers University, United States
- Liskin Swint-Kruse, University of Kansas Medical Center, United States
- Burkhard Rost, Technical University of Munich, Germany
- Yana Bromberg, Rutgers University, United States
Presentation Overview:
We recently described a new approach for classifying protein sequence positions to reflect the tunability of protein function. More specifically, we find that sequence positions segregate into two experimentally definable classes: those hosting variants with primarily binary on/off effects (toggle) and those hosting variants with a range of effects (rheostat). We showed that variant effects are evaluated with different levels of accuracy at different position classes. To improve variant effect prediction, we developed a machine-learning-based approach to identify position classes from sequence. We extracted experimentally validated substitution effects for five protein sequences from the literature and labeled sequence positions based on the distribution of mutational effect scores. We then trained a 2-step model using fourteen sequence-based features to predict position labels. Our validation showed high performance (accuracy > 82%) and resistance to changes in training data and to variation in selected features. Predicting the distribution of position types for all human enzymes, we found some expected patterns (smaller aliphatic residues are often rheostats, cysteines act mostly as toggles), as well as some surprising outcomes (charged residues are often mutation-tolerant). These results suggest that integrating knowledge about position types is a vital step towards developing more accurate variant effect prediction models.
- Barthélémy Caron, Institut Imagine, France
- Yufei Luo, Institut Imagine, France
- Antonio Rausell, Institut Imagine, France
Presentation Overview:
The study of rare Mendelian diseases through exome sequencing typically yields incomplete diagnostic rates (~8-70%). Whole genome sequencing of the unresolved cases makes it possible to address the hypothesis that causal variants could lie in regulatory regions. However, state-of-the-art methods to prioritize non-coding variants have been characterized on variant sets largely composed of polymorphisms associated with traits and common diseases. In this work we first curated large collections of bona fide pathogenic variants in proximal cis-regulatory regions leading to Mendelian diseases. We then systematically evaluated the ability of an exhaustive set of genomic features to predict causal variants; the features were extracted at three levels: the affected position, the flanking region and the affected gene. In addition to epigenetic features and inter-species conservation scores, a complete set of ongoing purifying selection signals in humans was explored. This is a main novelty, making it possible to exploit sequence constraints potentially associated with recently acquired human regulatory elements. Our results show that supervised learning using gradient tree boosting on the previously described sets of features outperforms current reference methods for prioritization of non-coding Mendelian disease variants. A detailed comparative benchmark is presented, and results are discussed in terms of the type of regulatory region targeted.
- Adriana Sperlea, University of California, Los Angeles, United States
- Jason Ernst, University of California, Los Angeles, United States
Presentation Overview:
Comparative genomics sequence data is an important source of information for interpreting genomes. Genome-wide annotations based on this data have largely focused on univariate scores or binary calls of evolutionary constraint. Here we present a complementary whole genome annotation approach, ConsHMM, which applies a multivariate hidden Markov model to learn de novo different ‘conservation states’ based on the combinatorial and spatial patterns of which species align to and match a reference genome in a multi-species DNA sequence alignment. We applied ConsHMM to a 100-way vertebrate sequence alignment to annotate the human genome at single nucleotide resolution into 100 different conservation states. These states have distinct enrichments for other genomic information including gene annotations, chromatin states, and repeat families, which were used to characterize their biological significance. Conservation states provide greater or complementary predictive information relative to standard constraint-based measures for a variety of genome annotations. Bases in constrained elements have distinct heritability enrichments depending on the conservation state assignment, demonstrating their relevance to analyzing phenotype-associated variation. The conservation states also highlight similarities and differences between constrained bases identified by inter- and intra-species approaches. ConsHMM conservation state annotations provide a valuable resource for interpreting genetic variation.
- Fabrizio Pucci, Université Libre de Bruxelles, Belgium
- Qingzhen Hou, Université Libre de Bruxelles, Belgium
- Jean Marc Kwasigroch, Université Libre de Bruxelles, Belgium
- Marianne Rooman, Université Libre de Bruxelles, Belgium
Presentation Overview:
It is now clear that, to better understand how missense mutations result in disease phenotypes, their impact has to be investigated in the context of biological networks. Here we present a multi-scale method to analyze the effect of mutations on the protein-protein interaction network (PPIN).
After introducing the statistical potentials that are the key ingredients of our analysis at the molecular scale, we will briefly show how this structural and energetic information can be combined to predict the change upon mutation of the folding free energy of a protein and of its binding affinity for an interacting partner.
The combination of these methods (called PoPMuSiC and BeAtMuSiC) will then be applied at the interactome scale to predict the “edgotype” of a mutation, namely whether the mutation induces a "node" removal in the PPIN, is likely to lead to an edgetic perturbation, or has essentially no effect on the network.
The systematic characterization of a mutation's "edgotype" is a fundamental step towards understanding genotype-phenotype relations and leads to a deeper understanding of the perturbed interactome, which will be an invaluable asset in drug design for the choice of therapeutic strategies.
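The three-way edgotype call described above can be sketched as a simple decision rule over predicted free-energy changes. This is only an illustration under stated assumptions: the function name, the mapping of folding destabilization to node removal and of binding loss to an edgetic perturbation, the sign convention (positive values are destabilizing), and both cutoffs are hypothetical and are not taken from the abstract or from the PoPMuSiC or BeAtMuSiC tools.

```python
def classify_edgotype(ddg_folding, ddg_binding, fold_cutoff=2.0, bind_cutoff=1.0):
    """Classify one mutation from predicted free-energy changes (kcal/mol).

    ddg_folding: predicted change in folding free energy
                 (assumed convention: positive = destabilizing).
    ddg_binding: dict mapping partner name -> predicted change in binding
                 free energy for that interaction.
    fold_cutoff, bind_cutoff: hypothetical illustration thresholds.
    """
    if ddg_folding >= fold_cutoff:
        # The protein is predicted to be strongly destabilized, so every
        # interaction of this node is treated as lost.
        return "node removal", sorted(ddg_binding)
    perturbed = sorted(p for p, ddg in ddg_binding.items() if ddg >= bind_cutoff)
    if perturbed:
        # The fold is kept but specific interactions (edges) are weakened.
        return "edgetic", perturbed
    return "no effect", []


# Made-up predictions for three hypothetical mutations of one protein.
toy_predictions = {
    "mutation_1": (3.1, {"partnerB": 0.2, "partnerC": 0.1}),
    "mutation_2": (0.4, {"partnerB": 1.8, "partnerC": 0.1}),
    "mutation_3": (0.1, {"partnerB": 0.2, "partnerC": 0.0}),
}
for name, (fold, bind) in toy_predictions.items():
    print(name, classify_edgotype(fold, bind))
```

Returning the list of affected partners alongside the label keeps the edge-level information that distinguishes an edgetic perturbation from a full node removal.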
- Jennifer Poitras, QIAGEN, United States
- Farhad Hormozdiari, Program in Genetic Epidemiology and Statistical Genetics, Harvard University, United States
- Steven Gazal, Program in Genetic Epidemiology and Statistical Genetics, Harvard University, United States
- Bryce Van De Geijn, Program in Genetic Epidemiology and Statistical Genetics, Harvard University, United States
- Hilary Finucane, Program in Genetic Epidemiology and Statistical Genetics, Harvard University, United States
- Chelsea Ju, University of California, Los Angeles, United States
- Po-Ru Loh, Program in Genetic Epidemiology and Statistical Genetics, Harvard University, United States
- Armin Schoech, Program in Genetic Epidemiology and Statistical Genetics, Harvard University, United States
- Yakir Reshef, Program in Genetic Epidemiology and Statistical Genetics, Harvard University, United States
- Xuanyao Liu, Program in Genetic Epidemiology and Statistical Genetics, Harvard University, United States
- Luke O’Connor, Program in Genetic Epidemiology and Statistical Genetics, Harvard University, United States
- Alexander Gusev, Program in Genetic Epidemiology and Statistical Genetics, Harvard University, United States
- Eleazar Eskin, University of California, Los Angeles, United States
- Alkes Price, Program in Genetic Epidemiology and Statistical Genetics, Harvard University, United States
Presentation Overview:
There is increasing evidence that many GWAS risk loci are molecular QTL such as eQTL, hQTL, sQTL, and/or meQTL. Here, we introduce a new set of functional annotations based on fine-mapped molecular cis-QTL, using data from the GTEx and BLUEPRINT consortia. We show that these annotations are strongly enriched for disease heritability across 41 independent diseases and complex traits (average N=320K). eQTL annotations that were obtained by meta-analyzing all GTEx tissues generally performed best, but tissue-specific blood eQTL annotations produced stronger enrichments for autoimmune diseases and blood cell traits and tissue-specific brain eQTL annotations produced stronger enrichments for brain-related diseases and traits. Notably, eQTL annotations restricted to loss-of-function intolerant genes from ExAC were even more strongly enriched for disease heritability. All molecular QTL except sQTL remained significantly enriched in a joint analysis, implying that each of these annotations is uniquely informative for disease and complex trait architectures.
- Mona Singh
- Lisa Gai, University of California, Los Angeles, United States
- Eleazar Eskin, University of California, Los Angeles, United States
Presentation Overview:
Many variants identified by genome-wide association studies (GWAS) have been found to affect multiple traits, either directly or through shared pathways. There is currently a wealth of GWAS data collected for numerous phenotypes, and analyzing multiple traits at once can increase power to detect shared variant effects. However, traditional meta-analysis methods are not suitable for combining studies on different traits. When applied to dissimilar studies, these meta-analysis methods can be underpowered compared to univariate analysis. The degree to which traits share variant effects is often not known, and the vast majority of GWAS meta-analyses consider only one trait at a time. Here we present a flexible method for finding associated variants from GWAS summary statistics for multiple traits. Our method estimates the degree of shared effects between traits from the data. Using simulations, we show that our method properly controls the false positive rate and increases power when an effect is present in a subset of traits. We then apply our method to the Northern Finland Birth Cohort and UK Biobank data sets using a variety of metabolic traits and discover novel loci.
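The power loss of traditional meta-analysis on dissimilar traits can be illustrated with a standard fixed-effect, inverse-variance-weighted combination of summary statistics. The sketch below shows that baseline only, not the method presented in the abstract; the function name and the toy effect sizes are made up, and it assumes independent studies with no sample overlap.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted fixed-effect meta-analysis.

    betas: per-study (here: per-trait) effect estimates for one variant.
    ses:   matching standard errors.
    Returns (combined beta, combined standard error, z-score).
    """
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    return beta, se, beta / se


# Toy variant: a real effect on trait 1 only, no effect on traits 2 and 3.
betas = [0.10, 0.0, 0.0]
ses = [0.02, 0.02, 0.02]

z_single = betas[0] / ses[0]              # univariate test on trait 1 alone
_, _, z_meta = fixed_effect_meta(betas, ses)

print(f"univariate z for trait 1: {z_single:.2f}")
print(f"fixed-effect meta z:      {z_meta:.2f}")
```

Because the fixed-effect model assumes one shared effect, the two null traits dilute the signal and the combined z-score falls below the univariate one, which is the situation the abstract describes for effects present in only a subset of traits.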
- Poulami Chaudhuri, Innovation Lab, Tata Consultancy Services, India
- Akriti Jain, Innovation Lab, Tata Consultancy Services, India
- Rajgopal Srinivasan, Innovation Lab, Tata Consultancy Services, India
Presentation Overview:
Branchpoint sites are conserved functional elements that are critical to pre-mRNA splicing, mediating intron excision through lariat formation. Mutations at branchpoints have been shown to lead to aberrant splicing, causing disease phenotypes. The rapidly growing clinical use of NGS for diagnosis/screening of disorders would benefit from approaches that can reliably identify mutations in branchpoint sites. Development of such tools has been hampered by the absence of a large enough “gold dataset” of known high-confidence branchpoint sites. Recent work that identified a high-confidence set of over 59,000 branchpoint sites provides an opportunity to fill this gap. We have used this data set to build a position weight matrix (PWM), combined with other established tools, for assessing splicing consequences arising from mutations at branchpoint sites. We show that such a PWM-based analysis approach is successful in detecting putative branchpoints in the human genome and, upon further screening through ClinVar, can identify pathogenic branchpoint variants. The analysis not only recovers known pathogenic branchpoint mutations but also uncovers other potential branchpoint or splice acceptor site associated pathogenic variants. We conclude that such an approach can be part of variant annotation pipelines that are commonly used in NGS analysis.
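As a rough illustration of how a PWM can be built from known sites and used to score a variant, the sketch below constructs a log-odds matrix from a handful of aligned sequences and compares a reference window with a mutated one. Everything specific here is an assumption for illustration: the five-base toy sites, the uniform background, the pseudocount, and the function names are made up and do not reproduce the dataset or pipeline described in the abstract.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds position weight matrix from equal-length aligned sites."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background) for b in BASES})
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds for a window as long as the PWM."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_window(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    return max(
        (score(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Made-up training sites; the fourth position plays the role of the branchpoint A.
toy_sites = ["CTAAC", "CTGAC", "TTAAC", "CTAAT", "CTCAC"]
pwm = build_pwm(toy_sites)

reference = "GGCTAACGG"
variant = "GGCTAGCGG"  # hypothetical A>G change at the branchpoint position

ref_score, ref_start = best_window(pwm, reference)
alt_score = score(pwm, variant[ref_start:ref_start + len(pwm)])
print(f"reference window at {ref_start}: {ref_score:.2f}")
print(f"same window with variant:  {alt_score:.2f}")
```

A drop in score between the reference and variant windows is the kind of signal that could flag a candidate for follow-up with other splicing tools and clinical databases.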
- Rachel Marty, University of California San Diego, United States
- Wesley Thompson, University of California San Diego, United States
- Rany Salem, University of California San Diego, United States
- Maurizio Zanetti, University of California San Diego, United States
- Hannah Carter, University of California San Diego, United States
Presentation Overview:
The anti-cancer immune response against mutated peptides of immunological relevance (neoantigens) is primarily attributed to MHC-I-restricted cytotoxic CD8+ T-cell responses. MHC-II-restricted CD4+ T-cells also drive anti-tumor responses; however, their relation to neoantigen selection, cancer susceptibility and tumor evolution has not been systematically studied. To address this, we developed a score allowing interpretation of MHC-II variation-based genotype in the context of presentation on the cell surface. Computationally modeling the potential of an individual’s MHC-II genotype to present 1,018 cancer-causing mutations in 7,137 tumors, we demonstrate that MHC-II genotype constrains the somatic mutational landscape during tumorigenesis. Poor presentation by MHC-II increased the odds of observing a mutation, even more than MHC-I. Exploiting MHC-II and MHC-I genotype complementarily increased power to predict occurrence of mutations; however, overall precision was limited, suggesting that other factors are stronger determinants of specific mutations. While MHC-I genotype correlated with age at diagnosis, MHC-II showed no such correlation, consistent with a prevalent regulatory role of CD4+ T-cells. These results implicate the immune system as a key heritable risk factor for cancer.
- Linlin Zhao, Heinrich Heine University Düsseldorf, Germany
- Nima Abedpour, University of Cologne, Germany
- Christopher Blum, Heinrich Heine University Düsseldorf, Germany
- Petra Kolkhof, Heinrich Heine University Düsseldorf, Germany
- Mathias Beller, Heinrich Heine University Düsseldorf, Germany
- Markus Kollmann, Heinrich Heine University Düsseldorf, Germany
- Emidio Capriotti, University of Bologna, Italy
Presentation Overview:
The accurate characterization of the translational mechanism is crucial for enhancing our understanding of the relationship between genotype and phenotype. In particular, predicting the impact of genetic variants on gene expression will allow the optimization of specific pathways and functions for engineering new biological systems. In this work we present PGExpress, a new regression method for predicting the log2-fold-change of the translation efficiency of an mRNA sequence in E. coli. The PGExpress algorithm takes as input 12 features corresponding to RNA folding and anti-Shine-Dalgarno hybridization free energies. The method was trained on a set of 1,772 sequence variants of 137 essential E. coli genes. For each gene, we considered 13 sequence variants of the first 33 nucleotides encoding the same amino acids, followed by the superfolder GFP.
Our gradient-boosting-based tool (PGExpress) was trained using a 10-fold gene-based cross-validation procedure on the WT-High dataset. In this test PGExpress achieved a correlation coefficient of 0.57, with a Root Mean Square Error (RMSE) of 1.4. When the regression task is cast as a classification problem, PGExpress reaches an overall accuracy of 0.73, a Matthews correlation coefficient of 0.47, and an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.80.
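Casting a regression output as a classification problem, as mentioned above, amounts to thresholding both observed and predicted values and then computing confusion-matrix metrics. The sketch below shows accuracy and the Matthews correlation coefficient on made-up numbers; the threshold of 0 on the log2 fold change, the function names, and the toy values are assumptions for illustration and are not taken from PGExpress.

```python
import math


def binarize(values, threshold=0.0):
    """Label each value as positive (True) if it exceeds the threshold."""
    return [v > threshold for v in values]


def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for two equal-length lists of booleans."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t and p for t, p in pairs)
    tn = sum((not t) and (not p) for t, p in pairs)
    fp = sum((not t) and p for t, p in pairs)
    fn = sum(t and (not p) for t, p in pairs)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up observed and predicted log2 fold changes for six variants.
observed = [1.2, -0.5, 0.3, -2.0, 0.8, -0.1]
predicted = [0.9, -0.2, -0.4, -1.1, 0.5, 0.6]

cm = confusion(binarize(observed), binarize(predicted))
print("tp, tn, fp, fn:", cm)
print("accuracy:", round(accuracy(*cm), 3))
print("MCC:", round(matthews_cc(*cm), 3))
```

The Matthews coefficient is reported alongside accuracy because it stays informative when the two classes are unbalanced.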
- Alex Kaplun, Variantyx
- Cong Ma, Carnegie Mellon University, United States
- Mingfu Shao, Carnegie Mellon University, United States
- Carl Kingsford, Carnegie Mellon University, United States
Presentation Overview:
Transcripts are frequently modified by structural variants, which lead to fused transcripts of either multiple genes (known as a fusion gene) or a gene and a previously non-transcribed sequence. These transcriptome modifications, collectively called transcriptomic structural variants (TSVs), can lead to drastic changes in downstream products and become cancer drivers. Detecting TSVs is an important and challenging computational problem, especially when only RNA-seq measurements are available.
We introduce SQUID, a novel algorithm to predict both fusion-gene and non-fusion-gene TSVs from RNA-seq alignments. SQUID attempts to rearrange genome segments to best explain the observed RNA-seq reads. TSVs are then derived from the rearrangement result. Tested on two previously studied cell lines, SQUID achieves accuracy similar to that of current fusion-gene detection methods for fusion-gene detections, but with higher accuracy for non-fusion-gene detections. SQUID is open source and available at https://github.com/Kingsford-Group/squid.
Applying SQUID to TCGA tumor samples, we observe that non-fusion-gene TSVs are more likely to be intra-chromosomal than fusion-gene TSVs for multiple cancer types. Novel non-fusion-gene TSVs are detected and involve tumor suppressor genes, such as ZFHX3 and ASXL1. It is reasonable to suspect that these TSVs may lead to loss of function in the corresponding tumor suppressor genes and play a role in tumorigenesis.
- Olga Troyanskaya, Princeton University, United States
- Hsuan-Lin Her, Taipei Medical University, Taiwan
- Yu-Wei Wu, Taipei Medical University, Taiwan
Presentation Overview:
Motivation: Antimicrobial resistance (AMR) is becoming a huge problem in both developed and developing countries, and identifying strains resistant or susceptible to certain antibiotics is essential in fighting against antibiotic-resistant pathogens. Whole-genome sequences have been collected for different microbial strains in order to identify crucial characteristics that allow certain strains to become resistant to antibiotics; however, a global inspection of the gene content responsible for AMR activities remains to be done.
Results: We propose a pan-genome-based approach to characterize antibiotic-resistant microbial strains and test this approach on the bacterial model organism Escherichia coli. By identifying core and accessory gene clusters and predicting AMR genes for the E. coli pan-genome, we not only showed that certain classes of genes are unevenly distributed between the core and accessory parts of the pan-genome but also demonstrated that only a portion of the identified AMR genes belong to the accessory genome. Application of machine learning algorithms to predict whether specific strains were resistant to antibiotic drugs yielded the best prediction accuracy for the set of AMR genes within the accessory part of the pan-genome, suggesting that these gene clusters were most crucial to AMR activities. Selecting subsets of AMR genes for different antibiotic drugs based on a genetic algorithm achieved better prediction performance than the gene sets established in the literature, hinting that the gene sets selected by the genetic algorithm may warrant further analysis to investigate in more detail how E. coli fights against antibiotic drugs.
- Bjoern Stade, Fabric Genomics Inc., United States
- Melanie Babcock, Fabric Genomics Inc., United States
- Marco Falcioni, Fabric Genomics Inc., United States
- Edward Kiruluta, Fabric Genomics Inc., United States
- Martin Reese, Fabric Genomics Inc., United States
- Mark Yandell, University of Utah School of Medicine, United States
- Francisco M. De La Vega, Fabric Genomics Inc., United States
Presentation Overview:
Analysis of next-generation sequencing of genomes/exomes (WGS) is important to the clinical diagnosis of Mendelian diseases. Previously, we developed a probabilistic approach, VAAST, to integrate variant impact and population prevalence. With Phevor we leverage disease phenotype ontology to combine the VAAST score with the patient phenotype to prioritize variants. Here we present a large-scale analysis of 2,408 clinical WGS diagnostics cases to quantify the performance of this approach. The Genomics England 100k genomes project recruited probands, and, if possible, family members, with rare diseases. We performed case analysis with Opal Clinical software and in 40.8% of the cases, clinical geneticists were able to identify a candidate causative variant. Using VAAST and Phevor, we identified 644 (31.4%) candidate causative variants. We show that the candidate causative variants reported back by geneticists reside in the top 10 VAAST ranks in 35% of cases, whereas they are found in 70% of the cases when ranked by Phevor. When the candidate causative variant fell outside the top 20 ranks, fewer and more ambiguous phenotypes were provided. Our results suggest that integration of phenotype priors improves causative variant prioritization 2-fold, indicating that phenotypes provide an important benefit for increasing diagnostic yield.