- Laura Furlong, Hospital del Mar Research Institute, Spain
Presentation Overview:
We are in an unprecedented moment for disease genomics. The maps of the genomic architecture of human diseases, both complex and rare, are starting to be delineated. These maps are built upon the results of large-scale GWAS and the widespread use of next-generation sequencing in clinical laboratories. To interpret the relevance of variants discovered by genomic studies, it is necessary to consider previous knowledge and link variants to other experimental results. The DisGeNET platform, by integrating genotype-phenotype association data from a wide variety of sources, aims at supporting this goal. DisGeNET contains a comprehensive catalog of human disease genes and variants, currently featuring over 117K variants associated with diseases and traits. It is supported by state-of-the-art text mining tools, allowing an accurate and comprehensive identification of diseases, genes and variants guided by standard ontologies and terminologies. The standardization of the information enables interoperability with other publicly available resources, allowing the exploration of the relationship between genomic features of variants (such as consequence type, allele frequency and functional impact) and the associated disease phenotypes. Scoring metrics allow variants to be ranked according to the evidence supporting their association with a phenotype, and allow the pleiotropy of a variant to be assessed. DisGeNET covers the whole spectrum of human diseases (complex, Mendelian and rare diseases), and includes information on genes and variants associated with disease symptoms and traits. As such, it enables exploration of the shared genetic architecture among diseases, and between diseases and traits. We will present examples of how DisGeNET can support the interpretation of analyses such as GWAS and PheWAS, and how it can be integrated with molecular interaction data to uncover the functional impact of disease variants.
Finally, a variety of tools and APIs are offered to allow users to interrogate the data, and to visualize the variant-disease associations in different formats.
- James Stephenson, EMBL-EBI, United Kingdom
- Roman Laskowski, EMBL-EBI, United Kingdom
- Matthew Hurles, Wellcome Sanger Institute, United Kingdom
- Janet Thornton, EMBL-EBI, United Kingdom
Presentation Overview:
The spatial distribution of disease-associated variants in proteins can suggest mechanisms of action and can help to differentiate benign from damaging candidates. However, in rare diseases the sparsity of variants per protein reduces the power of spatial distribution analysis. We overcome this by enriching rare variant data through the alignment of similar structural (CATH) domains from different proteins. We then analyse the domains together and uncover spatial patterns of any shape using DBSCAN spatial clustering.
Firstly, comparing the locations of clusters with those of known disease-associated variants and those considered benign can help to assign a pathogenicity probability to variants of unknown significance. Secondly, the location of the cluster in the protein or complex can suggest potential mechanisms of action such as the interruption of ligand binding, catalysis or structure destabilization/misfolding. Thirdly, by comparing members of the same spatial clusters with gene lists of known disease association, we can uncover new disease gene candidates.
Enrichment and spatial clustering of rare variants on protein structures can allow important regions of proteins to be uncovered, new disease associated genes to be found and can help to characterise variants of unknown significance.
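The clustering step described above can be illustrated with a minimal, pure-Python DBSCAN over 3D points. This is only a sketch under stated assumptions: the coordinates, the `eps` radius and the `min_pts` threshold are hypothetical, and the abstract does not say which implementation or parameters the authors used.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each 3D point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # All points (including i itself) within eps of point i.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:  # j is a core point: keep expanding
                queue.extend(nb)
    return labels

# Hypothetical variant positions pooled from aligned domains (arbitrary units).
points = [
    (0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0),   # dense patch 1
    (10, 10, 10), (11, 10, 10), (10, 11, 10),     # dense patch 2
    (50, 50, 50),                                 # isolated variant
]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Because DBSCAN grows clusters from dense neighbourhoods rather than fitting a fixed shape, it can recover elongated or irregular patches on a protein surface and leaves isolated variants unassigned.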
- Lambert Moyon, Institut de biologie de l’Ecole normale supérieure (IBENS), Ecole normale supérieure, CNRS, INSERM, PSL Université, France
- Camille Berthelot, Institut de biologie de l’Ecole normale supérieure (IBENS), Ecole normale supérieure, CNRS, INSERM, PSL Université, France
- Hugues Roest Crollius, Institut de biologie de l’Ecole normale supérieure (IBENS), Ecole normale supérieure, CNRS, INSERM, PSL Université, France
Presentation Overview:
Whole-genome sequencing is increasingly used in patients with genetic diseases to diagnose causal mutations. However, for a large proportion of sequenced patients, no genes associated with the phenotype have a coding mutation. In these cases, a non-coding mutation, located in a cis-regulatory region, may affect the expression of a disease-associated gene.
Despite the existence of methods for annotating and predicting regulatory sequences, it remains difficult to define objective criteria to effectively select mutations among the millions of non-coding variants found in each patient. Additionally, the target genes of these regulatory regions are generally not known, hampering the ability to link a non-coding mutation with a patient's phenotype.
We propose here FINSURF, a supervised-learning strategy to classify non-coding mutations that deregulate genes responsible for diseases. A notable innovation of our approach is that it takes into account data on associations between non-coding regions and target genes. We illustrate the potential of this method by analyzing 255,106 de novo mutations identified by whole-genome sequencing in 1,902 children with autism spectrum disorders.
Our method makes it possible to prioritize mutations in an informed way, and thus contributes to a better understanding of the mechanisms of gene expression regulation, and to an improved diagnosis for patients.
- Alexander Gress, Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), Germany
- Sebastian Keller, Max Planck Institute for Informatics, University of Saarland, Germany
- Vasily Ramensky, Federal State Institution "National Medical Research Center for Preventive Medicine", Russia
- Olga V. Kalinina, Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), Germany
Presentation Overview:
Prediction of phenotypes from organisms’ genotypes is one of the major challenges in biomedical science. Large amounts of clinical and genetic data now make it computationally approachable. We focus on the core problem: predicting the phenotypic impact of individual genetic variants. Numerous lines of evidence suggest that there is a correlation between the variant-carrying gene’s identity and certain phenotypes, introducing a statistical bias. This bias can artificially inflate the performance of methods in a setting where mutations, but not genes, are randomly split into the training and test sets. Methods trained in such a way are likely to misclassify benign variants in pathogenicity-prone genes and to fail when predicting the impact of variants in genes not seen in training.
We present a novel random forest-based machine learning method that employs features related to protein evolution and their three-dimensional structures. By applying it to deep mutational scan data and clinically annotated mutations from ClinVar, we demonstrate that including structure-related features and excluding features that may introduce protein bias, such as protein length or tendency to form homooligomeric complexes, improves the performance in a fair setting when genes as a whole, and not individual mutations, are split into the training and test sets.
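The "fair setting" described above, where whole genes rather than individual mutations are assigned to the training or test set, can be sketched as follows. This is an illustrative example only: the function name, the record layout and the toy variants are hypothetical and not taken from the authors' method.

```python
import random

def split_by_gene(variants, test_fraction=0.25, seed=0):
    """Split variants so that no gene contributes to both train and test."""
    genes = sorted({v["gene"] for v in variants})
    rng = random.Random(seed)
    rng.shuffle(genes)
    n_test = max(1, round(len(genes) * test_fraction))
    test_genes = set(genes[:n_test])
    train = [v for v in variants if v["gene"] not in test_genes]
    test = [v for v in variants if v["gene"] in test_genes]
    return train, test

# Hypothetical labelled variants (gene names and labels are placeholders).
variants = [
    {"gene": "GENE_A", "variant": "p.R10W", "label": 1},
    {"gene": "GENE_A", "variant": "p.G55D", "label": 1},
    {"gene": "GENE_B", "variant": "p.A7T", "label": 0},
    {"gene": "GENE_C", "variant": "p.L99P", "label": 1},
    {"gene": "GENE_C", "variant": "p.S12N", "label": 0},
    {"gene": "GENE_D", "variant": "p.V3I", "label": 0},
]
train, test = split_by_gene(variants)
train_genes = {v["gene"] for v in train}
test_genes = {v["gene"] for v in test}
print(sorted(train_genes), sorted(test_genes))
```

With a split like this, a classifier cannot score well simply by memorising which genes tend to carry pathogenic variants, because every test gene is unseen during training.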
- Alexandre Renaux, Interuniversity Institute of Bioinformatics in Brussels, Université libre de Bruxelles, Vrije Universiteit Brussel, Belgium
- Sofia Papadimitriou, Interuniversity Institute of Bioinformatics in Brussels, Université libre de Bruxelles, Vrije Universiteit Brussel, Belgium
- Nassim Versbraegen, Interuniversity Institute of Bioinformatics in Brussels, Université libre de Bruxelles, Belgium
- Charlotte Nachtegael, Interuniversity Institute of Bioinformatics in Brussels, Université libre de Bruxelles, Belgium
- Simon Boutry, Interuniversity Institute of Bioinformatics in Brussels, de Duve Institute - UCLouvain, Belgium
- Ann Nowé, Interuniversity Institute of Bioinformatics in Brussels, Vrije Universiteit Brussel, Belgium
- Guillaume Smits, Interuniversity Institute of Bioinformatics in Brussels, HUDERF, Center of Human Genetics - Hôpital Erasme, Belgium
- Tom Lenaerts, Interuniversity Institute of Bioinformatics in Brussels, Université libre de Bruxelles, Vrije Universiteit Brussel, Belgium
Presentation Overview:
The vast amount of DNA sequencing data collected from large patient cohorts has helped to identify a large number of disease-related mutations relevant for diagnosis and therapy. While existing bioinformatics methods and resources mainly focus on causal variants in Mendelian diseases, many difficulties remain in analysing more intricate genetic models involving variant combinations in different genes, an essential step for the discovery of the causes of oligogenic diseases. ORVAL (the Oligogenic Resource for Variant AnaLysis) tries to solve this problem by generating networks of pathogenic variant combinations in gene pairs, as opposed to isolated variants in single genes. This online platform integrates innovative machine learning methods for combinatorial variant pathogenicity prediction and offers several interactive and exploratory tools, such as predicted pathogenicity and protein-protein interaction networks, a ranking of pathogenic gene pairs, as well as visual mappings of the cellular location and pathway information. ORVAL is the first web-based exploration platform dedicated to identifying networks of candidate pathogenic variant combinations to help clinicians and researchers uncover oligogenic causes for more complex diseases. ORVAL is available at https://orval.ibsquare.be.
- Julien Gagneur, Technical University of Munich, Germany
- Jun Cheng, Technical University of Munich / QBM Graduate School, Germany
Presentation Overview:
Predicting the effects of genetic variants on splicing is highly relevant for human genetics. We describe the framework MMSplice (modular modeling of splicing) with which we built the winning model of the CAGI5 exon skipping prediction challenge. The MMSplice modules are neural networks scoring exons, introns, and splice sites, trained on distinct large-scale genomics datasets. These modules are combined to predict effects of variants on exon skipping, splice site choice, splicing efficiency, and pathogenicity, matching or exceeding state-of-the-art performance. Our models, available in the repository Kipoi, apply to variants including indels directly from VCF files. We foresee that MMSplice will be a useful tool to interpret variants of unknown significance in rare and common diseases.
- Saikat Banerjee, Max Planck Institute for Biophysical Chemistry, Germany
- Lingyao Zeng, German Heart Center, Munich, Germany
- Heribert Schunkert, German Heart Center, Munich, Germany
- Johannes Soeding, MPI BPC, Germany
Presentation Overview:
Multiple regression is widely used for post-GWAS analyses, especially for variant fine-mapping and re-ranking the nominally significant regions identified by GWAS. Bayesian multiple regression selects the variants using a sparsity-enforcing prior on the variant effect sizes to avoid over-training, and integrates out the effect sizes for posterior inference. For case-control GWAS with binary disease status, the logistic model should perform significantly better than the linear model. Regardless, existing multiple regression methods approximate the logistic model with a linear function because otherwise the integration requires costly and technically challenging MCMC sampling.
We introduced the quasi-Laplace approximation to solve the integral and developed software called Bayesian multiple LOgistic REgression (B-LORE). In extensive simulations, B-LORE outperformed existing methods whenever non-linearities were strong; for example, B-LORE could extract more information simply by adding controls to a GWAS while keeping the same number of cases. From a meta-analysis of five small GWAS for coronary artery disease (CAD), we applied B-LORE to the top 50 regions, which included 11 regions discovered by a 14-fold larger study (CARDIoGRAMplusC4D). B-LORE discovered all 11 regions with >95% causal probability, along with 12 novel regions, of which 9 are known to be associated with well-known CAD risk-related blood metabolic phenotypes.
- Janet Kelso, Max Planck Institute for Evolutionary Anthropology, Germany
Presentation Overview:
Recent technological advances have made it possible to recover genome sequences from a number of archaic and early modern humans. These genomes offer a unique opportunity to identify genetic changes that have come to fixation or reached high frequency in modern humans since the divergence from our common ancestor with Neandertals and Denisovans. We can also use ancient modern human genomes to directly track changes in allele frequency over time. Analyses of these archaic genomes have also provided direct evidence for interbreeding between early modern and archaic humans. As a result, all present-day people outside of Africa carry approximately 2% Neandertal DNA, and some populations, largely in Oceania, also carry DNA from Denisovans. This introgressed DNA has been shown to have both positive and negative outcomes for present-day carriers, underlying apparently adaptive phenotypes as well as influencing disease risk. In recent work we have identified Neandertal haplotypes that are likely of archaic origin and determined the likely functional consequences of these haplotypes using public genome, gene expression, and phenotype datasets. We have also used simulations, as well as the distribution of Neandertal DNA in ancient modern humans, to understand how selection has acted on Neandertal introgressed sequences over the last 45,000 years.
- Ardalan Naseri, University of Central Florida, United States
- Erwin Holzhauser, University of Central Florida, United States
- Degui Zhi, University of Texas Health Science Center at Houston, United States
- Shaojie Zhang, University of Central Florida, United States
Presentation Overview:
Motivation: With the wide availability of whole genome genotype data, there is an increasing need for conducting genetic genealogical searches efficiently. Computationally, this task amounts to identifying shared DNA segments between a query individual and a very large panel containing millions of haplotypes. The celebrated Positional Burrows-Wheeler Transform (PBWT) data structure is a precomputed index of the panel that enables constant time matching at each position between one haplotype and an arbitrarily large panel. However, the existing algorithm (Durbin’s Algorithm 5) can only identify set-maximal matches, the longest matches ending at any location in a panel, while in real genealogical search scenarios, multiple “good enough” matches are desired.
Results: In this work, we developed two algorithmic extensions of Durbin’s Algorithm 5 that can find all L-long matches (matches longer than or equal to a given length L) between a query and a panel. In the first algorithm, PBWT-Query, we introduce “virtual insertion” of the query into the PBWT matrix of the panel, and then scan up and down for the PBWT match block with length of at least L. In our second algorithm, L-PBWT-Query, we further speed up PBWT-Query by introducing additional data structures that allow us to avoid iterating through blocks of incomplete matches. The efficiency of PBWT-Query and L-PBWT-Query is demonstrated using simulated data and the UK Biobank data. Our results show that the proposed algorithms can efficiently detect related individuals for a given query in very large cohorts, which enables fast online query search.
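To make the target output concrete, the sketch below computes L-long matches by brute force: for each panel haplotype it reports every maximal run of identical alleles shared with the query that spans at least L sites. This is a naive reference that scans every haplotype at every site, not PBWT-Query or L-PBWT-Query; the function name and the toy haplotypes are hypothetical, and haplotypes are assumed to have equal length.

```python
def l_long_matches(query, panel, L):
    """Return (haplotype_index, start, end) for each maximal match of length >= L.

    `end` is exclusive. Assumes every haplotype has the same length as `query`.
    """
    matches = []
    for h, hap in enumerate(panel):
        start = None
        # Iterate one position past the end so a run reaching the last site is closed.
        for k in range(len(query) + 1):
            same = k < len(query) and hap[k] == query[k]
            if same and start is None:
                start = k                      # a new run of matching sites begins
            elif not same and start is not None:
                if k - start >= L:             # keep only runs spanning >= L sites
                    matches.append((h, start, k))
                start = None
    return matches

# Toy biallelic haplotypes encoded as strings of 0/1 (placeholders, not real data).
query = "0110100"
panel = ["0110011", "1110100", "0000100"]
result = l_long_matches(query, panel, L=3)
print(result)
```

A brute-force scan like this is useful mainly as a correctness check on small inputs; the point of the PBWT-based algorithms in the abstract is to avoid visiting every haplotype at every site.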
- Daniel Carlin, University of California San Diego, United States
- Dexter Pratt, University of California San Diego, United States
Presentation Overview:
We present an accessible, fast and customizable network propagation system for pathway boosting and interpretation of genome-wide association studies. This system – NAGA (Network Assisted Genomic Association) – taps the NDEx biological network resource to gain access to thousands of protein networks and select those most relevant and performant for a specific association study. The method works efficiently, completing genome-wide analysis in under five minutes on a modern laptop computer. We show that NAGA recovers many known disease genes from analysis of schizophrenia genetic data, and it substantially boosts associations with previously unappreciated genes such as amyloid beta precursor. On this and seven other gene-disease association tasks, NAGA outperforms conventional approaches in recovery of known disease genes and replicability of results. Protein interactions associated with disease are stored as networks in NDEx where they are readily visualized, annotated, and interpreted using desktop and web-based tools in the Cytoscape Cloud ecosystem, and where the data is programmatically accessible for further analysis.
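As an illustration of what network propagation does to association scores, the sketch below implements one common variant, random walk with restart, on a toy undirected network. The abstract does not state which propagation scheme NAGA uses, so the algorithm choice, the function name, the `restart` value and the four-node network are all assumptions made for the example.

```python
def propagate(adjacency, seed_scores, restart=0.5, iterations=50):
    """Random walk with restart on an undirected graph given as {node: [neighbours]}.

    Assumes every node has at least one neighbour and at least one seed score is > 0.
    """
    nodes = list(adjacency)
    degree = {n: len(adjacency[n]) for n in nodes}
    total = sum(seed_scores.values())
    p0 = {n: seed_scores.get(n, 0.0) / total for n in nodes}  # normalised seeds
    p = dict(p0)
    for _ in range(iterations):
        nxt = {}
        for n in nodes:
            # Each neighbour m passes on an equal share of its current score.
            incoming = sum(p[m] / degree[m] for m in adjacency[n])
            nxt[n] = (1 - restart) * incoming + restart * p0[n]
        p = nxt
    return p

# Toy chain network A - B - C - D with a single GWAS-implicated seed gene, A.
network = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = propagate(network, {"A": 1.0})
for gene, value in sorted(scores.items()):
    print(gene, round(value, 3))
```

Genes close to the seed in the network receive a larger share of its score than distant ones, which is how propagation can raise the rank of a gene with weak direct association but strongly associated neighbours.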
- Erwin Frise, Fabric Genomics, United States
- Sahar Nohzadeh-Malakshah, Fabric Genomics, United States
- Marco Falcioni, Fabric Genomics, United States
- Edward Kiruluta, Fabric Genomics, United States
- Francisco De La Vega, Fabric Genomics, United States
Presentation Overview:
The ACMG/AMP evidence-based guidelines for variant pathogenicity assessment define several criteria, each assessing a particular type of supporting evidence. Criteria are combined to classify a variant as either pathogenic (P), likely-pathogenic (LP), benign (B), likely-benign (LB), or of uncertain significance (VUS). Although widely adopted in the clinical interpretation of variants, this process has remained largely manual and time-consuming. Current informatics tools that aim to ease the application of the guidelines do not automate the entire process. Therefore, we developed a forward-chaining inference engine that implements the ACMG–AMP criteria and takes as input annotated variants and codified gene-condition curation to automatically infer the classification of variants. A natural language generation module in the engine provides explanatory text with the rationale of the classification for reference by clinical geneticists. Here we present a thorough performance evaluation of our method, analyzing a truth set of 37,491 previously classified variants for hereditary cancer risk (15-gene), newborn screening (30-gene), and incidental findings (59-gene) panels. We show automatic classification of up to 95% and 77% of P/LP and B/LB variants with essentially no misclassifications. Unclassified variants are annotated with resolved criteria for rapid manual classification. This advance would allow clinical labs to scale up and reduce effort in processing gene-panel tests.
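Forward chaining, the inference style named above, repeatedly fires any rule whose premises are already known facts until nothing new can be derived. The sketch below is a generic toy engine, not the authors' implementation: the fact names are hypothetical, and the three rules are a tiny illustrative subset (two criteria, PVS1 and PM2, plus the guideline combination of one very strong and one moderate criterion into likely pathogenic).

```python
def forward_chain(facts, rules):
    """Derive every fact reachable from `facts` using (premises, conclusion) rules."""
    derived = set(facts)
    changed = True
    while changed:               # repeat until a full pass adds nothing new
        changed = False
        for premises, conclusion in rules:
            if conclusion not in derived and premises <= derived:
                derived.add(conclusion)
                changed = True
    return derived

# Illustrative rules only; the fact names on the left are hypothetical annotations.
RULES = [
    (frozenset({"PVS1", "PM2"}), "likely_pathogenic"),
    (frozenset({"null_variant", "lof_is_known_mechanism"}), "PVS1"),
    (frozenset({"absent_from_population_databases"}), "PM2"),
]

annotated_variant = {
    "null_variant",
    "lof_is_known_mechanism",
    "absent_from_population_databases",
}
conclusions = forward_chain(annotated_variant, RULES)
print(sorted(conclusions - annotated_variant))
```

The classification rule is listed first on purpose: it cannot fire on the first pass, and only becomes applicable once the two criteria have been derived, which is the chaining behaviour the technique is named for. The set of fired rules also gives a natural trace from which explanatory text could be generated.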
- Alexander Kaplun, Variantyx Inc., United States
Presentation Overview:
WGS is well positioned to become the clinical diagnostic standard for rare genetic disorders due to the benefits inherent in PCR-free, genome-wide sequencing combined with its continually decreasing cost. Previously we’ve described components of our clinical WGS pipeline. We’ve presented our algorithms for detection of structural variants which use a combination of breakpoint analysis and read depth analysis. We’ve also presented our algorithms for detection of tandem repeat expansions which utilize de novo assembly of spanning reads, alignment and counting of repeats in anchored reads and counting and statistical normalization of reads containing only repeat sequence. Here we will provide a progress report on our complete clinical WGS pipeline, the first of its kind to be validated for small sequence changes, mitochondrial variants, structural variants and tandem repeat expansions. We will present validation data as well as discuss its latest features including: reconstruction of horizontal variants, pairing of compound heterozygous structural variants with regular variants, detection of insertions and utilization of tandem genome analysis for family planning.
- Iuliana Ionita-Laza, Columbia University, New York, United States
Presentation Overview:
Continuous advances in massively parallel sequencing technologies make large whole-genome sequencing studies increasingly feasible. The analysis of such data is challenging due to the large number of rare variants in noncoding regions of the genome and our limited understanding of their functional effects. In this talk I will discuss unsupervised and semi-supervised approaches to predict cell type/tissue specific regulatory function for variants in noncoding regions. I will also briefly discuss how to integrate a large number of functional predictions in sequence-based association tests for improved power to identify signals in noncoding regions. Throughout the talk I will show applications to several datasets.
- Bojian Yin, Centrum Wiskunde en Informatica, Netherlands
- Marleen Balvert, Centrum Wiskunde en Informatica/Utrecht University, Netherlands
- Rick A. A. van der Spek, UMC Utrecht, Netherlands
- Bas E. Dutilh, Utrecht University, Netherlands
- Sander Bohté, Centrum Wiskunde en Informatica, Netherlands
- Jan Veldink, UMC Utrecht, Netherlands
- Alexander Schoenhuth, Centrum Wiskunde en Informatica/Utrecht University, Netherlands
Presentation Overview:
Motivation: Amyotrophic lateral sclerosis (ALS) is a neurodegenerative disease caused by aberrations in the genome. While several disease-causing variants have been identified, a major part of heritability remains unexplained. ALS is believed to have a complex genetic basis in which non-additive combinations of variants give rise to disease; such combinations cannot be picked up using the linear models employed in classical genotype-phenotype association studies. Deep learning, on the other hand, is highly promising for identifying such complex relations. We therefore developed a deep-learning-based approach for the classification of ALS patients versus healthy individuals from the Dutch cohort of the Project MinE dataset. Based on the recent insight that regulatory regions harbour the majority of disease-associated variants, we employ a two-step approach: first, promoter regions that are likely associated with ALS are identified, and second, individuals are classified based on their genotype in the selected genomic regions. Both steps employ a deep convolutional neural network. The network architecture accounts for the structure of genome data by applying convolution only to parts of the data where this makes sense from a genomics perspective.
Results: Our approach identifies potentially ALS-associated promoter regions, and generally outperforms other classification methods. Test results support the hypothesis that non-additive combinations of variants contribute to ALS. Architectures and protocols developed are tailored towards processing population-scale, whole-genome data. We consider this a relevant first step towards deep learning assisted genotype-phenotype association in whole genome-sized data.
- Andrea Castro, University of California San Diego, United States
- Kivilcim Ozturk, UCSD, United States
- Hannah Carter, University of California San Diego, United States
Presentation Overview:
Major histocompatibility complex class I (MHC-I) is a protein complex that displays intracellular peptides to T-cells, allowing the immune system to recognize and destroy infected or cancerous cells. MHC-I is composed of a polymorphic HLA-encoded alpha chain and a Beta-2-microglobulin (B2M) protein that acts as a stabilizing scaffold. HLA mutations have been implicated as a mechanism of immune evasion during tumorigenesis, and B2M is considered a tumor suppressor. However, somatic HLA and B2M mutations have not been fully explored in the context of antigen presentation during tumor development. To understand the effect of MHC-I molecule mutations on mutagenesis, we analyzed the MHC-I molecule mutations in TCGA patients.
Somatic B2M and HLA mutations were associated with higher mutation burden and a larger fraction of HLA-binding neoantigens when compared to wild type tumors. B2M mutations occurred relatively early during patients’ respective tumor development, whereas HLA mutations were early or late events. B2M and HLA mutated patients had higher levels of immune infiltration and cytotoxicity.
We found that somatic B2M and HLA mutations are a mechanism of immune evasion by demonstrating that such mutations are associated with a higher load of neoantigens that should be presented via MHC-I.
- Michal Sadowski, Centre of New Technologies, University of Warsaw, Poland
- Yijun Ruan, The Jackson Laboratory for Genomic Medicine, United States
- Dariusz Plewczynski, Centre of New Technologies, University of Warsaw, Poland
Presentation Overview:
Background: The number of reported examples of chromatin architecture alterations involved in the regulation of gene transcription and in disease is increasing. However, no genome-wide testing has been performed to assess the abundance of these events and their importance relative to other factors affecting genome regulation. This genome-wide study attempts to fill this gap by analyzing the impact on chromatin spatial organization of genetic variants identified in individuals from 26 human populations and in genome-wide association studies.
Results: We assess the tendency of structural variants to accumulate in spatially interacting genomic segments and design a high-resolution computational algorithm to model chromatin conformational changes caused by structural variations. We show that differential gene transcription is closely linked to variation in chromatin interaction networks mediated by RNA polymerase II. We also demonstrate that CTCF-mediated interactions are well conserved across populations, but enriched with disease-associated SNPs. Moreover, we find that boundaries of topological domains are relatively frequent targets of duplications, which suggests that these duplications can be an important evolutionary mechanism of genome spatial organization.
Conclusions: Altogether, this study assesses the critical impact of genetic variants on the higher-order organization of chromatin folding and provides unique insight into the mechanisms regulating transcription at the population scale.