All Highlights and Proceedings Track presentations are listed by scientific area as part of the combined Paper Presentation schedule.
Presenting author: , ,
Sunday, July 21: 10:30 a.m. - 10:55 a.m.Room: Hall 9
Area Session Chair: Geoff Barton
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 10:30 a.m. - 10:55 a.m.Room: Hall 10
Area Session Chair: Dominic Clark
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 11:00 a.m. - 11:25 a.m.Room: Hall 9
Area Session Chair: Geoff Barton
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 11:00 a.m. - 11:25 a.m.Room: Hall 10
Area Session Chair: Dominic Clark
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 11:30 a.m. - 11:55 a.m. Room: Hall 9
Area Session Chair: Geoff Barton
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 11:30 a.m. - 12:25 p.m.Room: Hall 10
Area Session Chair: Dominic Clark
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 12:00 p.m. - 12:25 p.m.Room: Hall 9
Area Session Chair: Geoff Barton
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 2:10 p.m. - 2:35 p.m.Room: Hall 9
Area Session Chair: Rodrigo Lopez
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 2:10 p.m. - 2:35 p.m.Room: Hall 10
Area Session Chair: Dominic Clark
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 2:40 p.m. - 3:35 p.m.Room: Hall 9
Area Session Chair: Rodrigo Lopez
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 2:40 p.m. - 3:05 p.m.Room: Hall 10
Area Session Chair: Dominic Clark
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 3:10 p.m. - 3:35 p.m.Room: Hall 10
Area Session Chair: Dominic Clark
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 3:40 p.m. - 4:05 p.m.Room: Hall 9
Area Session Chair: Rodrigo Lopez
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 3:40 p.m. - 4:05 p.m.Room: Hall 10
Area Session Chair: Dominic Clark
Presentation Overview: TOP
Presenting author: , ,
Monday, July 22: 10:30 a.m. - 10:55 a.m.Room: Hall 9
Area Session Chair: Rodrigo Lopez
Presentation Overview: TOP
Presenting author: , ,
Monday, July 22: 10:30 a.m. - 10:55 a.m.Room: Hall 10
Area Session Chair: Christophe Blanchet
Presentation Overview: TOP
Presenting author: , ,
Monday, July 22: 11:00 a.m. - 11:25 a.m.Room: Hall 9
Area Session Chair: Rodrigo Lopez
Presentation Overview: TOP
Presenting author: , ,
Monday, July 22: 11:00 a.m. - 11:25 a.m.Room: Hall 10
Area Session Chair: Christophe Blanchet
Presentation Overview: TOP
Presenting author: , ,
Monday, July 22: 11:30 a.m. - 12:25 p.m.Room: Hall 9
Area Session Chair: Rodrigo Lopez
Presentation Overview: TOP
Presenting author: , ,
Monday, July 22: 11:30 a.m. - 11:55 a.m. Room: Hall 10
Area Session Chair: Christophe Blanchet
Presentation Overview: TOP
Presenting author: , ,
Monday, July 22: 12:00 p.m. - 12:25 p.m.Room: Hall 10
Area Session Chair: Christophe Blanchet
Presentation Overview: TOP
Presenting author: , ,
Monday, July 22: 2:10 p.m. - 2:35 p.m.Room: Hall 9
Area Session Chair: Rodrigo Lopez
Presentation Overview: TOP
Presenting author: , ,
Monday, July 22: 2:10 p.m. - 2:35 p.m.Room: Hall 10
Area Session Chair: Christophe Blanchet
Presentation Overview: TOP
Presenting author: , ,
Monday, July 22: 2:40 p.m. - 3:05 p.m.Room: Hall 9
Area Session Chair: Johannes Soedling
Presentation Overview: TOP
Presenting author: , ,
Monday, July 22: 2:40 p.m. - 3:35 p.m.Room: Hall 10
Area Session Chair: Christophe Blanchet
Presentation Overview: TOP
Presenting author: , ,
Monday, July 22: 3:10 p.m. - 3:35 p.m.Room: Hall 9
Area Session Chair: Johannes Soedling
Presentation Overview: TOP
Presenting author: , ,
Monday, July 22: 3:40 p.m. - 4:05 p.m.Room: Hall 9
Area Session Chair: Johannes Soedling
Presentation Overview: TOP
Presenting author: , ,
Monday, July 22: 3:40 p.m. - 4:05 p.m.Room: Hall 10
Area Session Chair: Christophe Blanchet
Presentation Overview: TOP
Presenting author: , ,
Tuesday, July 23: 10:30 a.m. - 10:55 a.m.Room: Hall 9
Area Session Chair: Dominic Clark
Presentation Overview: TOP
Presenting author: , ,
Tuesday, July 23: 11:00 a.m. - 11:25 a.m.Room: Hall 9
Area Session Chair: Dominic Clark
Presentation Overview: TOP
Presenting author: , ,
Tuesday, July 23: 11:30 a.m. - 12:25 p.m.Room: Hall 9
Area Session Chair: Dominic Clark
Presentation Overview: TOP
Presenting author: , ,
Tuesday, July 23: 2:10 p.m. - 3:05 p.m.Room: Hall 9
Area Session Chair: Rodrigo Lopez
Presentation Overview: TOP
Presenting author: , ,
Tuesday, July 23: 3:10 p.m. - 3:35 p.m.Room: Hall 9
Area Session Chair: Rodrigo Lopez
Presentation Overview: TOP
Presenting author: , ,
Tuesday, July 23: 3:40 p.m. - 4:05 p.m.Room: ICC Lounge 81
Presentation Overview: TOP
Presenting author: , ,
Tuesday, July 23: 3:40 p.m. - 4:05 p.m.Room: Hall 10
Area Session Chair: Johannes Soedling
Presentation Overview: TOP
Presenting author: , ,
Tuesday, July 23: 3:40 p.m. - 4:05 p.m.Room: Hall 9
Area Session Chair: Rodrigo Lopez
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 10:30 a.m. - 10:55 a.m.Room: ICC Lounge 81
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 10:30 a.m. - 10:55 a.m.Room: ICC Lounge 81
Presentation Overview: TOP
Cancelled
Presenting author: , ,
Sunday, July 21: 10:30 a.m. - 10:55 a.m.Room: ICC Lounge 81
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 11:00 a.m. - 11:25 a.m.Room: ICC Lounge 81
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 11:00 a.m. - 11:25 a.m.Room: ICC Lounge 81
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 11:00 a.m. - 11:25 a.m.Room: ICC Lounge 81
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 11:30 a.m. - 11:55 a.m.Room: ICC Lounge 81
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 11:30 a.m. - 11:55 a.m.Room: ICC Lounge 81
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 11:30 a.m. - 11:55 a.m.Room: ICC Lounge 81
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 12:00 p.m. - 12:25 p.m.Room: ICC Lounge 81
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 12:00 p.m. - 12:25 p.m.Room: ICC Lounge 81
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 12:00 p.m. - 12:25 p.m.Room: ICC Lounge 81
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 2:10 p.m. - 2:35 p.m.Room: ICC Lounge 81
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 2:10 p.m. - 2:35 p.m.Room: ICC Lounge 81
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 2:10 p.m. - 2:35 p.m.Room: ICC Lounge 81
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 2:40 p.m. - 3:05 p.m.Room: ICC Lounge 81
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 2:40 p.m. - 3:05 p.m.Room: ICC Lounge 81
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 2:40 p.m. - 3:05 p.m.Room: ICC Lounge 81
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 3:10 p.m. - 3:35 p.m.Room: ICC Lounge 81
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 3:10 p.m. - 3:35 p.m.Room: ICC Lounge 81
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 3:10 p.m. - 3:35 p.m.Room: ICC Lounge 81
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 3:40 p.m. - 4:05 p.m.Room: ICC Lounge 81
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 3:40 p.m. - 4:05 p.m.Room: ICC Lounge 81
Presentation Overview: TOP
Presenting author: , ,
Sunday, July 21: 3:40 p.m. - 4:05 p.m.Room: ICC Lounge 81
Presentation Overview: TOP
Presenting author: Gil Ast, Tel Aviv University, Israel
Sunday, July 21: 9:00 AM - 10:00 AM Room: Hall 1
Presentation Overview:
Presenting author: Gary Stormo, Washington University in St. Louis, United States
Monday, July 22: 4:35 PM - 5:35 PM Room: Hall 1
Presentation Overview:
Presenting author: Lior Pachter, University of California, Berkeley, United States
Monday, July 22: 9:00 AM - 10:00 AM Room: TBA
Presentation Overview:
Presenting author: Carole Goble, University of Manchester, United Kingdom
Tuesday, July 23: 9:00 AM - 10:00 AM Room: Hall 1
Presentation Overview:
Presenting author: David Eisenberg, UCLA, United States
Tuesday, July 23: 4:35 PM - 5:35 PM Room: Hall 1
Presentation Overview:
Presenting author: Hayssam Soueidan, NKI-AVL, United States
Monday, July 22: 11:00 a.m. - 11:25 a.m. Room: Hall 1
Presentation Overview:
Presenting author: Marcus Krantz, Humboldt University, United States
Monday, July 22: 12:00 p.m. - 12:25 p.m. Room: Hall 1
Presentation Overview:
Presenting author: Janet Thornton, EMBL-EBI, United Kingdom
Tuesday, July 23: 2:10 p.m. - 2:35 p.m. Room: Hall 1
Presentation Overview:
Presenting author: Niklas Blomberg
Tuesday, July 23: 2:10 p.m. - 2:35 p.m. Room: Hall 1
Presentation Overview:
Presenting author: Bengt Persson, Linköping University, Sweden
Tuesday, July 23: 2:10 p.m. - 2:35 p.m. Room: Hall 1
Presentation Overview:
Proceedings Track: Alternative polyadenylation
Presenting author: Dina Hafez, Duke University, United States
Sunday, July 21: 3:40 p.m. - 4:05 p.m. Room: Hall 14.2
Area Session Chair: Cenk Sahinalp
Presentation Overview: Motivation:
Pre-mRNA cleavage and polyadenylation is an essential step for 3' end maturation, and subsequent stability and degradation of mRNAs. This process is highly controlled by cis-regulatory elements surrounding the cleavage site (polyA site), which are frequently constrained by sequence content and position. More than 50% of human transcripts have multiple functional polyA sites, and the specific use of alternative polyA sites (APA) results in isoforms with varying 3'UTRs, thus affecting gene regulation. Elucidating the regulatory mechanisms underlying differential polyA preferences in multiple cell types has been hindered both by the lack of suitable data on the precise location of cleavage sites and by the lack of appropriate tests for determining APAs with significant differences across multiple libraries.
Results:
We applied a tailored paired-end RNA-seq protocol to specifically probe the position of polyA sites in three adult cell types. We specified a linear effects regression model to identify tissue-specific biases indicating regulated alternative polyadenylation; the significance of differences between cell types was assessed by an appropriately designed permutation test. This combination allowed us to identify highly specific subsets of APA events in the individual cell types. Predictive models successfully distinguished constitutive polyA sites from a biologically relevant background (auROC = 99.6%), as well as tissue-specific regulated sets from each other. We found that the main cis-regulatory elements described for polyadenylation are a strong, and highly informative, hallmark for constitutive sites only. Tissue-specific regulated sites were found to contain other regulatory motifs, with the canonical PAS signal being nearly absent at brain-specific polyA sites. Together, our results contribute to the understanding of the diversity of post-transcriptional gene regulation.
Proceedings Track: Applied Bioinformatics
Presenting author: Hector Corrada Bravo, University of Maryland, United States
Sunday, July 21: 2:40 p.m. - 3:05 p.m. Room: Hall 7
Area Session Chair: Ivo Hofacker
Presentation Overview:
Presenting author: Carlo Vittorio Cannistraci, King Abdullah University of Science and Technology (KAUST), Saudi Arabia
Sunday, July 21: 10:30 a.m. - 10:55 a.m. Room: Hall 7
Area Session Chair: Predrag Radivojac
Presentation Overview:
Presenting author: Livnat Jerby Arnon, Tel Aviv University, Israel
Sunday, July 21: 12:00 p.m. - 12:25 p.m. Room: Hall 14.2
Area Session Chair: Russell Schwartz
Presentation Overview:
Presenting author: Andrew Smith, University of Southern California, United States
Monday, July 22: 3:10 p.m. - 3:35 p.m. Room: Hall 4/5
Area Session Chair: Reinhard Schneider
Presentation Overview:
Presenting author: Chen-Hsiang Yeang, Academia Sinica, Taiwan
Sunday, July 21: 11:30 a.m. - 11:55 a.m. Room: Hall 7
Area Session Chair: Predrag Radivojac
Presentation Overview:
Presenting author: Yves Lussier, The University of Illinois, United States
Monday, July 22: 10:30 a.m. - 10:55 a.m. Room: Hall 14.2
Area Session Chair: Serafim Batzoglou
Presentation Overview:
Presenting author: Philippe Sanseau, GlaxoSmithKline, United Kingdom
Tuesday, July 23: 2:10 p.m. - 2:35 p.m. Room: ICC Lounge 81
Area Session Chair: Donna Slonim
Presentation Overview:
Presenting author: Stefan Kramer, Johannes Gutenberg University Mainz, Germany
Tuesday, July 23: 11:30 a.m. - 11:55 a.m. Room: Hall 7
Area Session Chair: Thomas Lengauer
Presentation Overview:
Presenting author: Richard Lathrop, University of California, Irvine, United States
Sunday, July 21: 10:30 a.m. - 10:55 a.m. Room: Hall 14.2
Area Session Chair: Russell Schwartz
Presentation Overview:
Proceedings Track: Bioimaging & Data Visualization
Presenting author: Manuel Corpas, The Genome Analysis Centre, United Kingdom
Sunday, July 21: 3:40 p.m. - 4:05 p.m. Room: Hall 7
Area Session Chair: Ivo Hofacker
Presentation Overview:
Presenting author: Nils Gehlenborg, Harvard Medical School, United States
Tuesday, July 23: 10:30 a.m. - 10:55 a.m. Room: Hall 7
Area Session Chair: Thomas Lengauer
Presentation Overview:
Presenting author: Nikolaus Schultz, Memorial Sloan-Kettering Cancer Center, United States
Tuesday, July 23: 11:00 a.m. - 11:25 a.m. Room: Hall 7
Area Session Chair: Thomas Lengauer
Presentation Overview:
Proceedings Track: Bioinformatics
Presenting author: Sarah Aerni, Stanford University, United States
Monday, July 22: 2:10 p.m. - 2:35 p.m. Room: Hall 7
Area Session Chair: Stefan Kramer
Presentation Overview: Motivation:
Advances in high-resolution microscopy have recently made possible the analysis of gene expression at the level of individual cells. The fixed lineage of cells in the adult worm C. elegans makes this organism an ideal model for studying complex biological processes like development and aging. However, annotating individual cells in images of adult C. elegans typically requires expertise and significant manual effort. Automation of this task is therefore critical to enabling high-resolution studies of a large number of genes.
Results:
In this paper, we describe an automated method for annotating a subset of 154 cells (including various muscle, intestinal, and hypodermal cells) in high-resolution images of adult C. elegans. We formulate the task of labeling cells within an image as a combinatorial optimization problem, where the goal is to minimize a scoring function that compares cells in a test input image with cells from a training atlas of manually annotated worms according to various spatial and morphological characteristics. We propose an approach for solving this problem based on reduction to minimum-cost maximum flow and apply a cross-entropy based learning algorithm to tune the weights of our scoring function. We achieve 84% median accuracy across a set of 154 cell labels in this highly variable system. These results demonstrate the feasibility of the automatic annotation of microscopy-based images in adult C. elegans.
Proceedings Track: Bioinformatics of Disease and Treatment
Presenting author: Matan Hofree, UCSD, United States
Monday, July 22: 12:00 p.m. - 12:25 p.m. Room: Hall 15.2
Presentation Overview: Many forms of cancer consist of multiple subtypes with different molecular causes and clinical outcomes. Somatic tumor genomes provide a rich new source of data for uncovering these subtypes, but have proven difficult to compare as two tumors rarely share the same mutations. Here, we introduce ‘network-based stratification’ (NBS), which integrates somatic tumor genomes with gene networks. This approach allows for stratification of cancer into informative subtypes by clustering together patients who have mutations within similar network regions. We demonstrate the validity of this approach in simulation. Next, we apply the method to somatic mutation data from three cancer patient cohorts collected as part of The Cancer Genome Atlas: ovarian cancer (OV), breast cancer (BRCA) and uterine cancer (UCEC). In each cohort we are able to discover a robust cluster assignment significantly associated with important clinical phenotypes. In BRCA we recover subtypes significantly correlated with known subtypes and other clinical markers. In UCEC subtypes segregate patients into distinct sets enriched for tumor grade and histology. In OV subtypes are associated with patient survival and acquired resistance to platinum chemotherapy. We use the OV subtypes to define a predictive signature based on gene expression which successfully recovers the somatic mutation derived subtypes in an independent expression cohort. Finally, we use the subtypes derived in each cohort to highlight potentially dysregulated subnetworks characteristic of each mutation-derived subtype. This study provides a proof of principle for the utility of combining somatic mutation genotypes with interaction networks, enabling the discovery of clinically meaningful mutation-based subtypes.
Presenting author: Maricel Kann, UMBC, United States
Sunday, July 21: 2:10 p.m. - 2:35 p.m. Room: Hall 15.2
Presentation Overview: The body of disease mutations with known phenotypic relevance continues to increase and is expected to do so even faster with the advent of new experimental techniques such as whole-genome sequencing coupled with disease association studies. However, genomic association studies are limited by the molecular complexity of the phenotype being studied and the population size needed to have adequate statistical power. One way around this problem, which is critical for the study of rare diseases, is to study the functional patterns of known disease mutations. We have previously shown that the functional patterns of known human disease mutations have a significant tendency to cluster at protein domain positions, namely position-based domain hotspots of disease mutations. However, the limited number of known disease mutations remains the main factor hindering the advancement of mutation studies at a functional level. In this paper, we address this problem by incorporating mutations known to be disruptive of phenotypes in other species. Focusing on two evolutionarily distant organisms, human and yeast, we describe the first inter-species analysis of mutations of phenotypic relevance at the protein domain level. Our results show that phenotypic mutations from yeast cluster at specific positions on protein domains, a characteristic previously revealed to be displayed by human disease mutations. This first-of-a-kind study of phenotypically relevant yeast mutations in relation to human disease mutations demonstrates the utility of a multi-species analysis for advancing the understanding of the relationship between genetic mutations and phenotypic changes at the organismal level.
Presenting author: Roland Schwarz, European Molecular Biology Laboratory, United Kingdom
Monday, July 22: 2:40 p.m. - 3:05 p.m. Room: Hall 15.2
Presentation Overview: Intra-tumour heterogeneity (ITH) is currently the focus of cancer research due to its implications for disease progression, resistance development and its impact on personalised medicine approaches. Understanding the aetiology of ITH involves reconstructing the evolutionary history of cancer within the patient. Especially with respect to genomic rearrangements, this is impeded by changing cellularity, unknown phasing of genomic variants and the fact that genomic rearrangement events cover large, often overlapping segments of the genome.
In this study we have assembled a novel clinical dataset of 170 copy number (CN) profiles from 20 patients undergoing neoadjuvant chemotherapy for high-grade serous ovarian cancer. Patients were sampled at multiple distinct sites at biopsy, interval debulking surgery and relapse. We have developed MEDICC, a novel phylogenetic method for reconstruction of evolutionary trees based on genomic rearrangements. Employing state-of-the-art machine learning techniques, we phase parental alleles, reconstruct trees and ancestral genomes and at the same time numerically quantify the degree of ITH and clonal expansion in each patient. Correlation of these indices with clinical endpoints such as progression-free survival shows how the amount of genomic change in the course of chemotherapy and the degree of clonal expansion determine patient survival times.
Our study is the first to combine rigorous evolutionary methodology with a novel clinical dataset of a large patient cohort to quantify ITH in a rigorous and unbiased manner. We combine insights from natural language processing with spatial statistics to quantify biologically meaningful indices of cancer progression in a coherent translational setting.
Proceedings Track: Cancer
Presenting author: Andrew J. Sedgewick, University of Pittsburgh, United States
Tuesday, July 23: 11:30 a.m. - 11:55 a.m. Room: Hall 14.2
Area Session Chair: Lonnie Welch
Presentation Overview: High-dimensional “-omics” profiling provides a detailed molecular view of individual cancers; however, understanding the mechanisms by which tumors evade cellular defenses requires deep knowledge of the underlying cellular pathways within each cancer sample. We extended the PARADIGM algorithm (Vaske et al., 2010), a pathway analysis method for combining multiple “-omics” data types, to learn the strength and direction of 9139 gene and protein interactions curated from the literature. Using genomic and mRNA expression data from 1936 samples in The Cancer Genome Atlas (TCGA) cohort, we learned interactions that provided support for and relative strength of 7138 (78%) of the curated links. Gene set enrichment found that genes involved in the strongest interactions were significantly enriched for transcriptional regulation, apoptosis, cell cycle regulation, and response to tumor cells. Within the TCGA breast cancer cohort we assessed different interaction strengths between breast cancer subtypes, and found interactions associated with the MYC pathway and the ER alpha network to be among the most differential between basal and luminal A subtypes. PARADIGM with the Naive Bayesian assumption produced gene activity predictions that, when clustered, found groups of patients with better separation in survival than both the original version of PARADIGM and a version without the assumption. We found that this Naive Bayes assumption was valid for the vast majority of co-regulators, indicating that most co-regulators act independently on their shared target. Availability: http://paradigm.five3genomics.com
Proceedings Track: Comparative Genomics
Presenting author: John Capra, Vanderbilt University, United States
Sunday, July 21: 11:00 a.m. - 11:25 a.m. Room: Hall 15.2
Presentation Overview: Interpreting patterns of DNA sequence variation between the genomes of closely related species is critically important to understanding the causes and functional effects of nucleotide substitutions. In addition to well-studied adaptive processes, like natural selection, other forces influence substitution patterns. GC-biased gene conversion (gBGC) is a recombination-associated evolutionary process that favors the fixation of strong (G/C) over weak (A/T) alleles. In mammals, gBGC is thought to promote variation in GC content, rapidly evolving sequences, and the fixation of deleterious mutations. It also has the potential to produce false positives in common tests for positive selection. However, because it is difficult to incorporate gBGC into existing statistical models of evolution, its genome-wide influence is poorly understood. In this work, we describe a new phylogenetic hidden Markov model that jointly models the effects of selection and gBGC and apply it to the human and chimpanzee genomes. We find that gBGC has influenced a small, but important fraction of these genomes. Fast evolving regions and disease-associated polymorphisms show significant enrichment for gBGC. Overall, our analyses indicate that gBGC has been an important force in recent human evolution, and our publicly available algorithms and predictions will enable other researchers to consider gBGC in their analyses.
Presenting author: Steven Brenner, University of California, Berkeley, United States
Monday, July 22: 2:10 p.m. - 2:35 p.m. Room: Hall 15.2
Presentation Overview: Drosophila melanogaster and Caenorhabditis elegans are two well-studied model organisms in developmental biology. Their morphological development differs greatly, yet we postulated that there may nonetheless be underlying shared developmental programs employing orthologous genes. We used modENCODE RNA-Seq data to perform a transcriptome-wide comparison of their developmental time courses to address this question. Our approach centers on using stage-associated orthologous genes to link the two organisms. For every stage in each organism, we select stage-associated genes, which are defined as relatively highly expressed at that stage compared with others. We tested the dependence of a pair of D. melanogaster and C. elegans stages in terms of orthologous gene expression, that is, the number of orthologous gene pairs associated with both stages.
We first carried out the test on pairs of stages within D. melanogaster and C. elegans respectively, and we found that temporally adjacent stages in both species exhibit high dependence in gene expression, supporting the validity of this approach. When comparing fly with worm, we observed a strong collinearity of their developmental time courses from early embryos to late larvae. Another parallel collinear pattern is found between fly white prepupae through adults and worm late embryos through adults. Investigating stage-associated genes that overlap between stages shows that many-to-one fly-worm orthologs are key factors leading to the two collinear patterns. Some orthologs are known to play similar roles in both organisms, and their mapping in this study may help inform their functions in the development of D. melanogaster and C. elegans.
Proceedings Track: Cryo-electron tomography
Presenting author: Min Xu, University of Southern California, United States
Tuesday, July 23: 3:10 p.m. - 3:35 p.m. Room: ICC Lounge 81
Area Session Chair: Donna Slonim
Presentation Overview: Motivation: Cryo-electron tomography allows the imaging of macromolecular complexes in near living conditions. To enhance the nominal resolution of a structure it is necessary to align and average individual subtomograms each containing identical complexes. However, if the sample of complexes is heterogeneous, it is necessary to first classify subtomograms into groups of identical complexes. This task becomes challenging when tomograms contain mixtures of unknown complexes extracted from a crowded environment. Two main challenges must be overcome. First, classification of subtomograms must be performed without knowledge of template structures. However, most alignment methods are too slow to perform reference-free classification of a large number (e.g. tens of thousands) of subtomograms. Second, subtomograms extracted from crowded cellular environments often contain fragments of other structures besides the target complex. However, alignment methods generally assume that each subtomogram only contains one complex. Automatic methods are needed to identify the target complex in a subtomogram even when its shape is unknown.
Results: In this paper, we propose an automatic and systematic method for the isolation and masking of target complexes in subtomograms extracted from crowded environments. Moreover, we also propose a fast alignment method using fast rotational matching in real space. Our experiments show that, compared to our previously proposed fast alignment method in reciprocal space, our new method significantly improves the alignment accuracy for highly distorted and especially crowded subtomograms. Such improvements are important for achieving successful and unbiased high-throughput reference-free structural classification of complexes inside whole cell tomograms.
Proceedings Track: Databases & Ontologies
Presenting author: Junwen Wang, The University of Hong Kong, China
Tuesday, July 23: 2:10 p.m. - 2:35 p.m. Room: Hall 4/5
Area Session Chair: Reinhard Schneider
Presentation Overview:
Proceedings Track: de Bruijn sequence
Presenting author: Yaron Orenstein, Tel-Aviv University, Israel
Tuesday, July 23: 11:30 a.m. - 11:55 a.m. Room: Hall 4/5
Area Session Chair: Debra Goldberg
Presentation Overview: Novel technologies can generate large sets of short double-stranded DNA sequences that can be used to measure their regulatory effects. Microarrays can measure in vitro the binding intensity of a protein to thousands of probes. Synthetic enhancer sequences inserted into an organism's genome allow us to measure in vivo the effect of such sequences on the phenotype. In both applications, by using sequence probes that cover all k-mers, a comprehensive picture of the effect of all possible short sequences on gene regulation is obtained. The value of k that can be used in practice is, however, severely limited by cost and space considerations. A key challenge is therefore to cover all k-mers with a minimal number of probes. The standard way to do this uses the de Bruijn sequence of length 4^k. However, since probes are double stranded, when a k-mer is included in a probe, its reverse complement k-mer is accounted for as well. Here we show how to efficiently create a shortest possible sequence with the property that it contains each k-mer or its reverse complement, but not necessarily both. The length of the resulting sequence approaches half that of the de Bruijn sequence as k increases. By reducing the total sequence length, experimental limitations can be overcome; alternatively, additional sequences with redundant k-mers of interest can be added.
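As a rough illustration of the counting argument in this abstract, the sketch below builds an ordinary de Bruijn sequence over the DNA alphabet and counts how many k-mers remain once a k-mer and its reverse complement are treated as one. It does not reproduce the authors' construction of the shortest sequence; the function names (`de_bruijn`, `linear_de_bruijn`, `canonical_classes`) are illustrative assumptions, and the generator is the standard Lyndon-word recursion rather than anything taken from the paper.

```python
from itertools import product

ALPHABET = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Reverse complement of a DNA string."""
    return kmer.translate(COMPLEMENT)[::-1]


def de_bruijn(k, alphabet=ALPHABET):
    """Cyclic de Bruijn sequence of order k (standard Lyndon-word recursion)."""
    n = len(alphabet)
    a = [0] * (n * k)
    out = []

    def db(t, p):
        if t > k:
            if k % p == 0:
                out.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, n):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in out)


def linear_de_bruijn(k):
    """Unroll the cyclic sequence so every k-mer appears as a plain substring."""
    cyclic = de_bruijn(k)
    return cyclic + cyclic[:k - 1]


def kmers_in(seq, k):
    """Set of all k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def canonical_classes(k):
    """One representative per {k-mer, reverse complement} pair."""
    return {min(km, reverse_complement(km))
            for km in map("".join, product(ALPHABET, repeat=k))}


if __name__ == "__main__":
    k = 3
    seq = linear_de_bruijn(k)
    print(len(seq), len(kmers_in(seq, k)), len(canonical_classes(k)))
```

Any sequence that covers each k-mer or its reverse complement must contain at least one k-mer from every class, so the number of classes plus k - 1 is a lower bound on its length. For odd k this is exactly half of 4^k; for even k the reverse-palindromic k-mers add a small excess, which is consistent with the abstract's statement that the length approaches half as k increases.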
Proceedings Track: Disease Models & Epidemiology
Presenting author: Wenzhong Xiao, Massachusetts General Hospital/Harvard Medical School and Stanford University, United States
Sunday, July 21: 11:00 a.m. - 11:25 a.m. Room: Hall 7
Area Session Chair: Predrag Radivojac
Presentation Overview:
Presenting author: Timothy Tickle, Harvard School of Public Health, United States
Tuesday, July 23: 2:10 p.m. - 2:35 p.m. Room: Hall 7
Area Session Chair: Alfonso Valencia
Presentation Overview:
Proceedings Track: DNA binding specificity
Presenting author: Fantine Mordelet, Duke University, United States
Sunday, July 21: 11:00 a.m. - 11:25 a.m. Room: Hall 4/5
Area Session Chair: Erik Bongcam-Rudloff
Presentation Overview: Motivation: The DNA binding specificity of a transcription factor (TF) is typically represented using a position weight matrix (PWM) model, which implicitly assumes that individual bases in a TF binding site contribute independently to the binding affinity, an assumption that does not always hold. For this reason, more complex models of binding specificity have been developed. However, these models have their own caveats: they typically have a large number of parameters, which makes them hard to learn and interpret.
Results: We propose novel regression-based models of TF-DNA binding specificity, trained using high resolution in vitro data from custom protein binding microarray (PBM) experiments. Our PBMs are specifically designed to cover a large number of putative DNA binding sites for the TFs of interest (yeast TFs Cbf1 and Tye7, and human TFs c-Myc, Max, and Mad2) in their native genomic context. These high-throughput, quantitative data are well suited for training complex models that take into account not only independent contributions from individual bases, but also contributions from di- and trinucleotides at various positions within or near the binding sites. To ensure that our models remain interpretable, we use feature selection to identify a small number of sequence features that accurately predict TF-DNA binding specificity. To further illustrate the accuracy of our regression models, we show that even in the case of paralogous TFs with highly similar PWMs, our new models can distinguish the specificities of individual factors. Thus, our work represents an important step towards better sequence-based models of individual TF-DNA binding specificity.
Availability: Our code is available at http://genome.duke.edu/labs/gordan/ISMB2013. The PBM data used in this paper are available in the Gene Expression Omnibus under accession number GSE44604.
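The idea of combining mono-, di-, and trinucleotide contributions at specific positions can be sketched as a feature encoding. The following Python snippet is only an illustration under stated assumptions: the feature naming scheme ("position:k-mer"), the function names, and the example weights are hypothetical, and the regression training and feature selection steps mentioned in the abstract are not shown.

```python
def positional_kmer_features(site: str, max_order: int = 3) -> dict[str, int]:
    """Encode a binding site as binary position-specific 1-, 2-, and 3-mer features."""
    features = {}
    for order in range(1, max_order + 1):
        for start in range(len(site) - order + 1):
            # Feature name is "<start position>:<k-mer>", e.g. "2:CG" (hypothetical scheme)
            features[f"{start}:{site[start:start + order]}"] = 1
    return features


def predict(site: str, weights: dict[str, float]) -> float:
    """Linear model: sum the weights of the features present in the site."""
    return sum(weights.get(name, 0.0) for name in positional_kmer_features(site))


# Hypothetical weights, as a sparse model might retain after feature selection
example_weights = {"0:C": 0.5, "2:CG": 1.0, "9:A": 3.0}
score = predict("CACGTG", example_weights)
```

A PWM corresponds to keeping only the order-1 features; the higher-order features let the model express dependencies between neighboring bases.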
TOP Proceedings Track: dose-response analysis
Presenting author: Russell Schwartz, Carnegie Mellon University, United States
Tuesday, July 23: 2:40 p.m. - 3:05 p.m.Room: ICC Lounge 81
Area Session Chair: Donna Slonim
Presentation Overview: Motivation: Development and progression of solid tumors can be attributed to a process of mutations, which typically includes changes in the number of copies of genes or genomic regions. Although comparisons of cells within single tumors show extensive heterogeneity, recurring features of their evolutionary process may be discerned by comparing multiple regions or cells of a tumor. A particularly useful source of data for studying likely progression of individual tumors is fluorescence in situ hybridization (FISH), which allows one to count copy numbers of several genes in hundreds of single cells. Novel algorithms for interpreting such data phylogenetically are needed, however, to reconstruct likely evolutionary trajectories from states of single cells and facilitate analysis of their evolutionary trajectories.
Results: In this paper, we develop phylogenetic methods to infer likely models of tumor progression using FISH copy number data and apply them to a study of FISH data from two cancer types. Statistical analyses of topological characteristics of the tree-based model provide insights into likely tumor progression pathways consistent with the prior literature. Furthermore, tree statistics from the resulting phylogenies can be used as features for prediction methods. This results in improved accuracy, relative to unstructured gene copy number data, at predicting tumor state and future metastasis.
Availability: A package of source code for FISH tree building (FISHtrees) and the data on cervical cancer and breast cancer examined here are publicly available at the site ftp://ftp.ncbi.nlm.nih.gov/pub/FISHtrees.
TOP Proceedings Track: Drug-Target Interaction
Presenting author: Jianyang Zeng, Tsinghua University, China
Monday, July 22: 11:30 a.m. - 11:55 a.m.Room: Hall 14.2
Area Session Chair: Serafim Batzoglou
Presentation Overview: Motivation:
In silico prediction of drug-target interactions plays an important role towards identifying and developing new uses of existing or abandoned drugs. Network-based approaches have recently become a popular tool for discovering new drug-target interactions. Unfortunately, most of these network-based approaches can only predict binary interactions between drugs and targets, and information about different types of interactions has not been well exploited for drug-target interaction prediction in previous studies. On the other hand, incorporating additional information about drug-target relationships or drug modes of action can improve prediction of drug-target interactions. Furthermore, the predicted types of drug-target interactions can broaden our understanding about the molecular basis of drug action.
Results:
We propose the first machine learning approach to integrate multiple types of drug-target interactions and predict unknown drug-target relationships or drug modes of action. We cast the new drug-target interaction prediction problem into a two-layer graphical model, called restricted Boltzmann machine (RBM), and apply a practical learning algorithm to train our model and make predictions. Tests on two public databases show that our RBM model can effectively capture the latent features of a drug-target interaction network, and achieve excellent performance on predicting different types of drug-target interactions, with the area under precision-recall curve (AUPR) up to 89.6. In addition, we demonstrate that integrating multiple types of drug-target interactions can significantly outperform other predictions either by simply mixing multiple types of interactions without distinction or using only a single interaction type. Further tests show that our approach can infer a high fraction of novel drug-target interactions that have been validated by known experiments in the literature or other databases. These results indicate that our approach can have highly practical relevance to drug-target interaction prediction and drug repositioning, and hence advance the drug discovery process.
Availability: Software and datasets are available upon request.
TOP Proceedings Track: Epigenetics
Presenting author: Harri Lähdesmäki , Aalto University, fi
Monday, July 22: 3:40 p.m. - 4:05 p.m.Room: Hall 15.2
Presentation Overview: Multipotent CD4+ T cells are central to the adaptive immune system. CD4+ T cells can differentiate to functionally distinct effector subtypes such as T helper 1 (Th1), Th2, Th17, and iTreg. In this study, we have focused on identification of histone modifications (H3K4me1, H3K27ac, H3K4me3) that define the cell-type specific functional cis-regulatory repertoire for early differentiating human Th1 and Th2 cells. Additionally, we have integrated genome-wide digital gene expression analysis from the Helicos platform to correlate epigenetic information with gene expression. We also overlay the identified enhancer regions with open chromatin sites (DNase-seq) from fully differentiated T cells to characterize whether early enhancers are active only during the early lineage specification or remain active in committed Th cells. By analyzing transcription factor binding sites at enhancers we are able to identify known and novel transcriptional regulators which drive the lineage determination. Lastly, under the principle that improper cell fate specification can lead to immunopathogenesis, we found within these lineage-specific enhancers a great number of SNPs from genome-wide association studies (GWAS) that were associated with various autoimmune disorders including T1D, rheumatoid arthritis, Crohn’s disease, and asthma. Several alter transcription factor binding site motifs, and using DAPA experiments we show for a subset of such SNPs within these predicted sites that they influence transcription factor binding. This study provides the first look at how enhancers can contribute to early human T cell lineage specification. Our results also provide insight into how regulatory SNPs may contribute to the disease pathogenesis.
TOP
Presenting author: Meromit Singer, UC Berkeley, United States
Monday, July 22: 3:10 p.m. - 3:35 p.m.Room: Hall 15.2
Presentation Overview: Genome-wide functional assays based on high-throughput sequencing now allow for experimental probing of a wide variety of molecular phenotypes. Among these is DNA methylation, which can be probed at all CpG sites in the genome using bisulfite sequencing. This has allowed for comparisons of methylation extent in different functional regions by first averaging methylation states within region types and then comparing averages between regions. Such comparisons have become commonplace in genome-wide DNA methylation studies. For example, it has been repeatedly reported that the methylation extent is significantly higher in coding regions as compared to introns or UTRs. We report and characterize a bias present in these seemingly straightforward comparisons that is a special case of the Yule-Simpson effect and show it has extensively altered the magnitude and significance of DNA methylation differences observed and reported from such comparative studies. The bias we discuss arises from the dependence of the sparsity of CpG sites on the extent of evolutionary pressure at a region, together with its overall methylation state. We present a correction utilizing a matrix completion algorithm that is based on a methylation model and show how it affects reported results regarding differences in DNA methylation across functional regions.
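The Yule-Simpson effect mentioned in this abstract can be made concrete with a small worked example. The Python sketch below uses entirely hypothetical numbers (two genes, two region types) and is not taken from the presented work. It shows how pooling CpG sites across genes can reverse a difference that holds within every gene when the number of CpG sites per region varies with methylation level.

```python
# (region_type, gene, number_of_CpG_sites, mean_methylation); all values hypothetical
observations = [
    ("coding", "gene_high", 2, 0.9),
    ("intron", "gene_high", 8, 0.8),
    ("coding", "gene_low", 8, 0.3),
    ("intron", "gene_low", 2, 0.2),
]


def pooled_mean(region_type: str) -> float:
    """Average methylation over all CpG sites of one region type, ignoring genes."""
    rows = [(n, m) for r, _, n, m in observations if r == region_type]
    total_sites = sum(n for n, _ in rows)
    return sum(n * m for n, m in rows) / total_sites


def within_gene_differences() -> dict[str, float]:
    """Coding minus intron methylation, computed separately for each gene."""
    by_gene: dict[str, dict[str, float]] = {}
    for region, gene, _, methylation in observations:
        by_gene.setdefault(gene, {})[region] = methylation
    return {g: v["coding"] - v["intron"] for g, v in by_gene.items()}


pooled_coding = pooled_mean("coding")  # (2*0.9 + 8*0.3) / 10
pooled_intron = pooled_mean("intron")  # (8*0.8 + 2*0.2) / 10
differences = within_gene_differences()
```

Within each gene the coding region is more methylated than the intron, yet the pooled averages point the other way, because the site counts are unevenly distributed across the two genes.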
TOP Proceedings Track: Evolution & Comparative Genomics
Presenting author: Yuval Tabach , Massachusetts General Hospital/ Harvard Medical School, us
Monday, July 22: 3:40 p.m. - 4:05 p.m.Room: ICC Lounge 81
Area Session Chair: Burkhard Rost
Presentation Overview: TOP
Presenting author: Milana Frenkel-Morgenstern , Spanish National Cancer Research Centre (CNIO), es
Monday, July 22: 12:00 p.m. - 12:25 p.m.Room: Hall 7
Area Session Chair: Alex Bateman
Presentation Overview: TOP
Presenting author: Michal Linial , The Hebrew University of Jerusalem, il
Monday, July 22: 2:40 p.m. - 3:05 p.m.Room: ICC Lounge 81
Area Session Chair: Burkhard Rost
Presentation Overview: TOP
Presenting author: David Juan , Spanish National Cancer Research Centre, es
Monday, July 22: 2:10 p.m. - 2:35 p.m.Room: ICC Lounge 81
Area Session Chair: Burkhard Rost
Presentation Overview: TOP Proceedings Track: Feature selection
Presenting author: Chloé-Agathe Azencott , Max-Planck-Institutes Tübingen, Germany
Monday, July 22: 12:00 p.m. - 12:25 p.m.Room: Hall 4/5
Area Session Chair: Russell Schwartz
Presentation Overview: As an increasing number of genome-wide association studies reveal the limitations of the attempt to explain phenotypic heritability by single genetic loci, there is a recent focus on associating complex phenotypes with sets of genetic loci. While several methods for multi-locus mapping have been proposed, it is often unclear how to relate the detected loci to the growing knowledge about gene pathways and networks. The few methods that take biological pathways or networks into account are either restricted to investigating a limited number of predetermined sets of loci, or do not scale to genome-wide settings.
We present SConES, a new efficient method to discover sets of genetic loci that are maximally associated with a phenotype, while being connected in an underlying network. Our approach is based on a minimum cut reformulation of the problem of selecting features under sparsity and connectivity constraints, which can be solved exactly and rapidly.
SConES outperforms state-of-the-art competitors in terms of runtime, scales to hundreds of thousands of genetic loci and exhibits higher power in detecting causal SNPs in simulation studies than other methods. On flowering time phenotypes and genotypes from Arabidopsis thaliana, SConES detects loci that enable accurate phenotype prediction and that are supported by the literature.
TOP Proceedings Track: Functional Genomics
Presenting author: Young-suk Lee, Princeton University, United States
Sunday, July 21: 10:30 a.m. - 10:55 a.m.Room: Hall 15.2
Presentation Overview: Directly dealing with multicellularity and heterogeneity of human gene expression samples is paramount for understanding human homeostasis, disease manifestation and pharmacokinetics/pharmacodynamics. However, leveraging gene expression data through large-scale integrative analyses is challenging because most samples are not fully annotated to their tissue/cell-type of origin. A computational method to classify samples using their entire gene expression profiles is needed. Such a method must be applicable across thousands of independent studies, hundreds of gene expression technologies, and hundreds of diverse human tissues and cell-types. We present URSA (Unveiling RNA Sample Annotation) that leverages the complex tissue/cell-type relationships and simultaneously estimates the probabilities associated with hundreds of tissues/cell-types for any given gene expression profile. URSA provides accurate and intuitive probability values for expression profiles across independent studies and outperforms other methods irrespective of data preprocessing techniques. Moreover, without re-training, URSA can be used to classify samples from diverse microarray platforms and even from next generation sequencing technology. Finally, we provide a molecular interpretation for the tissue and cell-type models as the biological basis for URSA’s classifications.
TOP
Presenting author: Daniela Boernigen , Harvard School of Public Health, Harvard University, us
Tuesday, July 23: 11:30 a.m. - 11:55 a.m.Room: Hall 10
Presentation Overview: Biological databases of high-throughput experimental results provide vast and growing resources for medical and bioinformatic research. Open questions remain in how best to maintain such resources, access them computationally, meta-analyze their contents from hundreds of experiments, and do so reproducibly while maintaining computational best practices.
We present ARepA, an extensible, modular Automated Repository Acquisition system for reproducible biological data acquisition and processing. ARepA allows configurable data access for any organism(s) from the GEO, IntAct, BioGRID, RegulonDB, STRING, Bacteriome, and MPIDB databases. A user can retrieve raw data and metadata from these repositories, normalize data files, and automatically process them in standardized ways (e.g. for network analysis). When retrieving data from six model organisms, ARepA currently produces more than 2M interactions (600K physical interactions, 4K regulatory interactions, 1.5M functional associations) and 2.7K gene expression data sets covering approx. 800K samples, accompanied by corresponding metadata and derived network data.
We include biological examples demonstrating the utility of ARepA for integrative analyses. When focusing on human data, ARepA's metadata database allowed us to identify and standardize 12 human prostate cancer gene expression datasets from GEO, which were subsequently meta-analyzed across six different platforms. A subsequent co-expression network analysis correctly recovered the NF-κB signaling pathway along with new candidate genes with roles in prostate cancer. A similar example in mouse integrates 11 gene expression datasets selected by querying ARepA for metadata indicating germ-free and intestinal tissue conditions. Finally, multiple data types from three model microbes were integrated to assess differences in peptide secretion systems.
TOP
Presenting author: Tamir Tuller , Tel Aviv University, il
Tuesday, July 23: 12:00 p.m. - 12:25 p.m.Room: Hall 10
Presentation Overview: One of the greatest challenges of functional genomics is to decipher the way information encoded in the transcript affects various aspects of its expression regulation. Since it is impossible to determine causality based on the analysis of endogenous sequence features and expression levels, we suggest a combined and novel computational-synthetic biology approach. The talk will survey large-scale synthetic biology experiments for understanding three aspects of gene expression: 1) splicing; 2) translation elongation; 3) translation initiation from out-of-frame codons. In each experiment a specific library including hundreds of heterologous genes has been tailored to tackle the corresponding question, expression levels of all the library genes have been measured in S. cerevisiae, and the results were computationally analyzed.
Among others, our analyses emphasize the contribution of local folding strength in different parts of the transcript, and the position and distribution of codons to splicing and translation efficiency and fidelity. In addition, we report novel sets of enhancer and silencer sequence motifs that contribute to various aspects of translation and splicing regulation.
I will also explain how the results inferred in the three studies are integrated, and compared to existing computational biophysical models of gene expression, and will compare the obtained results to the ones reported recently via an evolutionary systems biology analysis of endogenous genes.
TOP Proceedings Track: Gene Ontology
Presenting author: Wyatt Clark , Indiana University, United States
Tuesday, July 23: 2:40 p.m. - 3:05 p.m.Room: Hall 4/5
Area Session Chair: Reinhard Schneider
Presentation Overview: The development of effective methods for the prediction of ontological annotations is an important goal in computational biology, with protein function prediction and disease gene prioritization gaining wide recognition. While various algorithms have been proposed for these tasks, evaluating their performance is difficult due to problems caused both by the structure of biomedical ontologies and biased or incomplete experimental annotations of genes and gene products. In this work, we propose an information-theoretic framework to evaluate the performance of computational protein function prediction. We use a Bayesian network, structured according to the underlying ontology, to model the prior probability of a protein's function. We then define two concepts, misinformation and remaining uncertainty, that can be seen as information-theoretic analogs of precision and recall. Finally, we propose a single statistic, referred to as semantic distance, that can be used to rank or train classification models. We evaluate our approach by analyzing the performance of three protein function predictors of Gene Ontology terms and provide evidence that we address several weaknesses of currently used metrics. We believe this framework provides useful insights into the performance of protein function prediction tools.
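The two quantities named in this abstract can be illustrated with a short Python sketch. It assumes, as a reading that the abstract itself does not spell out, that remaining uncertainty sums the information content of true terms that were not predicted, that misinformation sums the information content of predicted terms that are not true, and that semantic distance is the smallest Euclidean combination of the two over decision thresholds. The per-term information values, term names, and scores are hypothetical.

```python
def remaining_uncertainty(true_terms: set, predicted: set, ia: dict) -> float:
    """Information content of true terms the predictor missed (recall-like)."""
    return sum(ia[t] for t in true_terms - predicted)


def misinformation(true_terms: set, predicted: set, ia: dict) -> float:
    """Information content of predicted terms that are not true (precision-like)."""
    return sum(ia[t] for t in predicted - true_terms)


def semantic_distance(scores: dict, true_terms: set, ia: dict,
                      thresholds: list, k: int = 2) -> float:
    """Minimum over thresholds of the k-norm of (remaining uncertainty, misinformation)."""
    best = float("inf")
    for tau in thresholds:
        predicted = {term for term, s in scores.items() if s >= tau}
        ru = remaining_uncertainty(true_terms, predicted, ia)
        mi = misinformation(true_terms, predicted, ia)
        best = min(best, (ru ** k + mi ** k) ** (1.0 / k))
    return best


# Hypothetical per-term information content (in bits) and predictor scores
ia = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 0.5}
true_terms = {"A", "B"}
scores = {"A": 0.9, "B": 0.4, "C": 0.6, "D": 0.2}
distance = semantic_distance(scores, true_terms, ia, thresholds=[0.3, 0.5, 0.8])
```

In this toy case the strictest threshold wins: predicting only term A misses 2 bits (term B) but adds no misinformation, giving a distance of 2.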
TOP Proceedings Track: Gene Regulation & Transcriptomics
Presenting author: Aviad Tsherniak , Broad Institute of MIT and Harvard, us
Tuesday, July 23: 11:00 a.m. - 11:25 a.m.Room: Hall 14.2
Area Session Chair: Lonnie Welch
Presentation Overview: TOP
Presenting author: Caroline Friedel , Ludwig-Maximilians-Universität München, de
Sunday, July 21: 2:10 p.m. - 2:35 p.m.Room: Hall 7
Area Session Chair: Ivo Hofacker
Presentation Overview: TOP
Presenting author: Hendrik Tiedemann , Helmholtz Center Munich, de
Tuesday, July 23: 10:30 a.m. - 10:55 a.m.Room: Hall 14.2
Area Session Chair: Lonnie Welch
Presentation Overview: TOP
Presenting author: Christopher Ng , Massachusetts Institute of Technology, us
Sunday, July 21: 11:30 a.m. - 11:55 a.m.Room: Hall 14.2
Area Session Chair: Russell Schwartz
Presentation Overview: TOP
Presenting author: Peter Glaus , University of Manchester, uk
Sunday, July 21: 2:10 p.m. - 2:35 p.m.Room: Hall 14.2
Area Session Chair: Cenk Sahinalp
Presentation Overview: TOP
Presenting author: Petr Nazarov , Centre de Recherche Public de la Sante, lu
Tuesday, July 23: 2:10 p.m. - 2:35 p.m.Room: Hall 14.2
Area Session Chair: Ralf Zimmer
Presentation Overview: TOP Proceedings Track: Genetic Variation Analysis
Presenting author: Karin Verspoor, NICTA, Australia
Tuesday, July 23: 10:30 a.m. - 10:55 a.m.Room: Hall 10
Presentation Overview: We assess a mutation extraction tool with respect to the task of curation of the literature for the purpose of populating a database of genetic variation information. Our analysis shows that the ability of text mining tools to recover the mutations catalogued in the databases is far less than what would be expected based on the typically excellent performance of such tools on intrinsic evaluation. While lack of access to the full text of publications has been argued to explain this phenomenon, we show that the effect persists even when the full text article that was indicated to be the direct source of a mutation in a curated resource is available for processing. We explore several possible explanations for these results, including difficulties in linking genetic variants to specific genes, and the inclusion of data from high-throughput experiments. The results of our work have implications for the future development of text mining systems for genetic variation.
TOP Proceedings Track: Genome Organization and Annotation
Presenting author: Andrzej Kudlicki , University of Texas Medical Branch, us
Sunday, July 21: 12:00 p.m. - 12:25 p.m.Room: Hall 15.2
Presentation Overview: Genome-wide chromatin conformation capture experiments allow characterizing the spatial structure of the genome; however, existing methods of data processing provide no means of appreciating the variability between the cells in the sample. We present a novel algorithmic framework that addresses this problem by analyzing the geometric and topological characteristics of an experimental DNA contact network. Our method, applied to the measurements of interactions in the yeast genome by Duan et al. (2010), proves that indeed no homogeneous conformation can agree with the observed 3C contacts, and that attempting to construct a homogeneous 3D model will lead to thousands of geometrically impossible structural motifs. The topological properties of the DNA contact network, along with the principle of Occam’s razor, are used to reconstruct the chromatin conformations characteristic of uniform subpopulations of cells confounding the experimental sample. Specifically, the individual chromatin states are inferred by analyzing and coloring a line graph representing geometrical conflicts within the DNA contact network, i.e., loci whose direct interpretation will lead to violation of the triangle inequality. We show that hundreds of thousands of conflicting interactions can be resolved by just a handful of chromatin states, and that the properties of these states point to different transcriptional programs being executed.
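The notion of a geometrical conflict, a set of contacts that cannot coexist in one 3D structure, can be illustrated by checking the triangle inequality on triples of loci. The Python sketch below is a minimal illustration with hypothetical loci and distances; it assumes a distance is available for every pair, and it does not implement the line-graph coloring step described in the abstract.

```python
from itertools import combinations


def triangle_violations(dist: dict) -> list:
    """Return triples of loci whose pairwise distances violate the triangle inequality.

    `dist` maps frozenset({locus_a, locus_b}) to a distance and must contain
    every pair of loci (an assumption of this sketch).
    """
    loci = sorted({locus for pair in dist for locus in pair})
    violations = []
    for a, b, c in combinations(loci, 3):
        sides = sorted([
            dist[frozenset((a, b))],
            dist[frozenset((a, c))],
            dist[frozenset((b, c))],
        ])
        # The longest side cannot exceed the sum of the other two in any real geometry
        if sides[2] > sides[0] + sides[1]:
            violations.append((a, b, c))
    return violations


# Hypothetical distances inferred from contact frequencies (arbitrary units)
distances = {
    frozenset(("A", "B")): 1.0,
    frozenset(("B", "C")): 1.0,
    frozenset(("A", "C")): 5.0,  # too far apart, given that both are close to B
    frozenset(("A", "D")): 3.0,
    frozenset(("B", "D")): 3.0,
    frozenset(("C", "D")): 3.0,
}
conflicts = triangle_violations(distances)
```

Here only the triple (A, B, C) is inconsistent, which suggests that the A-B/B-C contacts and the A-C distance come from different conformations in the cell population.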
TOP
Presenting author: Davide Bau , Centro Nacional de Analisis Genomica, es
Sunday, July 21: 11:30 a.m. - 11:55 a.m.Room: Hall 15.2
Presentation Overview: Advances in genomic technologies have allowed getting better insights into how the genome is organized inside the cell nucleus. Recently, it has been shown that chromatin is organized in Topologically Associating Domains (TADs), large interaction domains that appear to be conserved among different cell types. To determine whether these TADs have a functional role during the dynamic changes of gene expression in terminally differentiated cells, we studied the relationship between the spatial position of Progesterone (Pg) responsive genes and the TAD structure in breast cancer cells. Using Hi-C data, we found that the genome is organized into about 2,000 TADs. TADs were similarly positioned before and after hormone treatment; nonetheless the Pg induced some changes in the intra-TAD chromatin interactions. Unexpectedly, a large proportion of genes that responded similarly upon Pg treatment was clustered within individual TADs, indicating a topological segregation of Pg up- and down-regulation sites. Remarkably, hormone induced correlated epigenetic changes that spread over several 100kb, revealing regional remodeling of chromatin. Although consecutive TADs can be covered by one or more similar epigenetic changes, their combination differs among individual consecutive TADs, reflecting topologically restrained combinatory chromatin signatures. Integrative 3D modeling of the intra-TAD contacts before and after Pg stimulation further supports this hypothesis, showing dynamic structural changes correlated with the transcriptional response. Given the segregation of target genes in TADs and the fine-tuning of Pg induced chromatin changes, we propose that TADs behave as regulons enabling spatially proximal genes to be coordinately transcribed in response to hormone.
TOP Proceedings Track: Haematopoietic stem cell
Presenting author: Nicola Bonzanni , VU University Amsterdam, Netherlands
Tuesday, July 23: 12:00 p.m. - 12:25 p.m.Room: Hall 14.2
Area Session Chair: Lonnie Welch
Presentation Overview: Motivation:
Combinatorial interactions of transcription factors with cis-regulatory elements control the dynamic progression through successive cellular states and thus underpin all metazoan development. The construction of network models of cis-regulatory elements therefore has the potential to generate fundamental insights into cellular fate and differentiation. Haematopoiesis has long served as a model system to study mammalian differentiation, yet modelling based on experimentally informed cis-regulatory interactions has so far been restricted to pairs of interacting factors. Here we have generated a Boolean network model based on detailed cis-regulatory functional data connecting 11 haematopoietic stem/progenitor cell (HSPC) regulator genes.
Results:
Despite its apparent simplicity, the model exhibits surprisingly complex behaviour that we charted using strongly connected components and shortest-path analysis in its Boolean state space. This analysis of our model predicts that HSPCs display heterogeneous expression patterns and possess many intermediate states that can act as ‘stepping stones’ for the HSPC to achieve a final differentiated state. Importantly, an external perturbation or ‘trigger’ is required to exit the stem cell state, with distinct triggers characterising maturation into the various different lineages. By focussing on intermediate states occurring during erythrocyte differentiation, from our model we predicted a novel negative regulation of Fli1 by Gata1 which we confirmed experimentally thus validating our model.
In conclusion, we demonstrate that an advanced mammalian regulatory network model based on experimentally validated cis-regulatory interactions has allowed us to make novel, experimentally testable hypotheses about transcriptional mechanisms that control differentiation of mammalian stem cells.
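Analysis of a Boolean state space of the kind described above can be sketched on a toy network. The three genes and their update rules below are hypothetical and far simpler than the 11-gene model of the abstract, and the asynchronous update scheme (one gene changes per step) is an assumption of this sketch. The code enumerates all states, finds stable states, and computes shortest paths between states by breadth-first search.

```python
from collections import deque
from itertools import product

GENES = ("A", "B", "C")

# Hypothetical rules: A and B repress each other, C is activated by A
RULES = {
    "A": lambda s: not s["B"],
    "B": lambda s: not s["A"],
    "C": lambda s: s["A"],
}


def successors(state: tuple) -> set:
    """States reachable by updating exactly one gene according to its rule."""
    current = dict(zip(GENES, state))
    result = set()
    for i, gene in enumerate(GENES):
        new_value = int(bool(RULES[gene](current)))
        if new_value != state[i]:
            result.add(state[:i] + (new_value,) + state[i + 1:])
    return result


def fixed_points() -> list:
    """States in which no gene would change (candidate stable cell states)."""
    return sorted(s for s in product((0, 1), repeat=len(GENES)) if not successors(s))


def shortest_path_length(start: tuple, goal: tuple):
    """Fewest single-gene updates from start to goal, or None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if state == goal:
            return steps
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + 1))
    return None


stable_states = fixed_points()
steps_to_a_fate = shortest_path_length((0, 0, 0), (1, 0, 1))
```

In this toy network a stable state cannot be left without an external perturbation, which loosely mirrors the abstract's observation that a trigger is required to exit a stable state.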
TOP Proceedings Track: Haplotype
Presenting author: Derek Aguiar , Brown University, United States
Monday, July 22: 2:10 p.m. - 2:35 p.m.Room: Hall 14.2
Area Session Chair: Sean O'Donoghue
Presentation Overview: Motivation: Genome-wide haplotype reconstruction from sequence data, or haplotype assembly, is at the center of major challenges in molecular biology and life sciences. For complex eukaryotic organisms like humans, the genome is vast and the population samples are growing so rapidly that algorithms processing these high-throughput sequencing data must scale favorably in terms of both accuracy and computational efficiency. Furthermore, current models and methodologies for haplotype assembly (1) do not consider individuals sharing haplotypes jointly which reduces the size and accuracy of assembled haplotypes and (2) are unable to model genomes having more than two sets of homologous chromosomes (polyploidy). Particularly, polyploid organisms are becoming the target of many research groups interested in studying the genomics of disease, phylogenetics, botany, and evolution but there is an absence of theory and methods for polyploid haplotype reconstruction.
Results: In this work, we present a number of results, extensions, and generalizations of Compass graphs and our HapCompass framework (Aguiar et al. 2012). We prove the theoretical complexity of two haplotype assembly optimizations, thereby motivating the use of heuristics. We present graph theory-based algorithms for the problem of haplotype assembly from sequencing data using our previously developed HapCompass framework for (1) novel implementations of haplotype assembly optimizations (minimum error correction), (2) assembly of a pair of individuals sharing a tract identical by descent, and (3) assembly of polyploid genomes. We demonstrate the accuracy of each method on the 1000 Genomes Project, Pacific Biosciences, and simulated sequence data.
HapCompass is available for download at http://www.brown.edu/Research/Istrail_Lab/
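The minimum error correction (MEC) objective mentioned in the abstract can be illustrated by scoring a candidate haplotype against sequencing fragments. The Python sketch below covers only the diploid case with biallelic sites coded 0/1; the data layout and the example fragments are hypothetical, and it only evaluates a given haplotype rather than searching for the best one as an assembler would.

```python
def mismatches(fragment: dict, haplotype: list) -> int:
    """Count positions where a fragment disagrees with a haplotype."""
    return sum(1 for pos, allele in fragment.items() if haplotype[pos] != allele)


def mec_score(fragments: list, haplotype: list) -> int:
    """MEC score of a diploid phasing: each fragment is assigned to the closer
    of the two complementary haplotypes, and its remaining mismatches are summed."""
    complement = [1 - allele for allele in haplotype]
    return sum(
        min(mismatches(f, haplotype), mismatches(f, complement))
        for f in fragments
    )


# Hypothetical fragments: each maps a heterozygous site index to the observed allele
fragments = [
    {0: 0, 1: 1},        # agrees with the haplotype
    {1: 0, 2: 0, 3: 1},  # agrees with the complementary haplotype
    {2: 1, 3: 1},        # one error against either haplotype
]
candidate = [0, 1, 1, 0]
score = mec_score(fragments, candidate)
```

A haplotype assembler would look for the candidate that minimizes this score; the abstract's complexity results motivate doing so heuristically.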
TOP Proceedings Track: Host-pathogen protein interaction
Presenting author: Meghana Kshirsagar , Carnegie Mellon University , United States
Sunday, July 21: 2:40 p.m. - 3:05 p.m.Room: Hall 4/5
Area Session Chair: Olga Vitek
Presentation Overview: Motivation:
An important aspect of infectious disease research involves understanding the differences and commonalities in the infection mechanisms underlying various diseases. Systems biology based approaches study infectious diseases by analyzing the interactions between the host species and the pathogen organisms. This work aims to combine the knowledge from experimental studies of host-pathogen interactions in several diseases in order to build stronger predictive models. Our approach is based on a formalism from machine learning called 'multi-task learning', which considers the problem of building models across tasks that are related to each other. A 'task' in our scenario is the set of host-pathogen protein interactions involved in one disease. To integrate interactions from several tasks (i.e., diseases), our method exploits the similarity in the infection process across the diseases. In particular, we use the biological hypothesis that similar pathogens target the same critical biological processes in the host, in defining a common structure across the tasks.
Results:
Our current work on host-pathogen protein interaction prediction focuses on human as the host, and four bacterial species as pathogens. The multi-task learning technique we develop uses a task based regularization approach. We find that the resulting optimization problem is a difference of convex (DC) functions. To optimize, we implement a Convex-Concave procedure based algorithm. We compare our integrative approach to baseline methods that build models on a single host-pathogen protein interaction dataset. Our results show that our approach outperforms the baselines on the training data. We further analyse the protein interaction predictions generated by the models, and find some interesting insights.
TOP Proceedings Track: Identity-by-Descent
Presenting author: Dan He , IBM T.J. Watson, United States
Monday, July 22: 11:00 a.m. - 11:25 a.m.Room: Hall 4/5
Area Session Chair: Russell Schwartz
Presentation Overview: Detecting Identity-by-Descent (IBD) is a very important problem in genetics. Most existing methods focus on detecting pairwise IBDs and have relatively low power to detect short IBDs. Methods that detect IBDs among multiple individuals simultaneously, or group-wise IBDs, perform better at short IBD detection. Group-wise IBDs also have a wide range of applications, such as disease mapping and pedigree reconstruction. The existing group-wise IBD detection method is computationally inefficient and can only handle small data sets, such as 20 to 30 individuals with hundreds of SNPs. It also requires a priori specification of the number of IBD groups, which may not be realistic in many cases, and because of scalability issues it can only handle a small number of IBD groups, such as two or three. Moreover, it does not take linkage disequilibrium (LD) into consideration. In this work, we developed a very efficient method, IBD-Groupon, which detects group-wise IBDs based on pairwise IBD relationships and addresses all the drawbacks mentioned above. To our knowledge, our method is the first group-wise IBD detection method that scales to very large data sets, for example hundreds of individuals with thousands of SNPs, while remaining powerful at detecting short IBDs. Our method does not require the number of IBD groups to be specified; it is detected automatically. Our method also takes LD into consideration, as it is based on pairwise IBDs, where LD can be easily incorporated.
TOP Proceedings Track: Image processing
Presenting author: Saket Navlakha , Carnegie Mellon University , United States
Monday, July 22: 3:40 p.m. - 4:05 p.m.Room: Hall 7
Area Session Chair: Stefan Kramer
Presentation Overview: Motivation: Synaptic connections underlie learning and memory in the brain and are dynamically formed and eliminated during development and in response to stimuli. Quantifying changes in overall density and strength of synapses is an important pre-requisite for studying connectivity and plasticity in these cases or in diseased conditions. Unfortunately, most techniques to detect such changes are either low-throughput (e.g. electrophysiology), prone to error and difficult to automate (e.g. standard electron microscopy), or too coarse (e.g. MRI) to provide accurate and large-scale measurements.
Results: To facilitate high-throughput analyses, we used a 50-year-old experimental technique to selectively stain for synapses in electron microscopy (EM) images, and we developed a machine learning framework to automatically detect synapses in these images. To validate our method we experimentally imaged brain tissue of the somatosensory cortex in six mice. We detected thousands of synapses in these images and demonstrate the accuracy of our approach using cross-validation with manually labeled data and by comparing against existing algorithms and against tools that process standard EM images. We also used a semi-supervised algorithm that leverages unlabeled data to overcome sample heterogeneity and improve performance. Our algorithms are highly efficient and scalable and are freely available for others to use.
TOP Proceedings Track: Mass Spectrometry & Proteomics
Presenting author: Mathieu Clément-Ziza , Biotec, Technische Universitaet Dresden, de
Monday, July 22: 3:10 p.m. - 3:35 p.m.Room: Hall 14.2
Area Session Chair: Sean O'Donoghue
Presentation Overview: TOP
Presenting author: Michael Liam Tress , Centro Nacional de Investigaciones Oncologicas (CNIO), es
Tuesday, July 23: 12:00 p.m. - 12:25 p.m.Room: ICC Lounge 81
Area Session Chair: Janet Kelso
Presentation Overview:
TOP Proceedings Track: Metabolic network
Presenting author: Masaaki Kotera , Kyoto University, Japan
Monday, July 22: 11:30 a.m. - 11:55 a.m.Room: ICC Lounge 81
Area Session Chair: Hagit Shatkay
Presentation Overview: Motivation: The metabolic pathway is an important biochemical reaction network involving enzymatic reactions among chemical compounds. However, it is assumed that a large number of metabolic pathways remain unknown, and many reactions are still missing even in known pathways. Therefore, the most important challenge in metabolomics is the automated de novo reconstruction of metabolic pathways, which includes the elucidation of previously unknown reactions to bridge the metabolic gaps.
Results: In this paper we develop a novel method to reconstruct metabolic pathways from a large compound set in the reaction-filling framework. We define feature vectors representing the chemical transformation patterns of compound-compound pairs in enzymatic reactions using chemical fingerprints. We apply a sparsity-induced classifier to learn what we refer to as “enzymatic-reaction likeness”, i.e., whether or not compound pairs are possibly converted to each other by enzymatic reactions. The originality of our method lies in the search for potential reactions among many compounds at a time, in the extraction of reaction-related chemical transformation patterns, and in the large-scale applicability owing to the computational efficiency. In the results, we demonstrate the usefulness of our proposed method on the de novo reconstruction of 134 metabolic pathways in KEGG. Our comprehensively predicted reaction networks of 15,698 compounds enable us to suggest many potential pathways and to increase research productivity in metabolomics.
TOP Proceedings Track: Metabolic networks
Presenting author: John Pinney, Imperial College London, United Kingdom
Sunday, July 21: 10:30 a.m. - 10:55 a.m.Room: Hall 4/5
Area Session Chair: Erik Bongcam-Rudloff
Presentation Overview: Motivation: Misannotation in sequence databases is an important obstacle for automated tools for gene function annotation, which rely extensively on comparison to sequences with known function. To improve current annotations and prevent future propagation of errors, sequence-independent tools are therefore needed to assist in the identification of misannotated gene products. In the case of enzymatic functions, each functional assignment implies the existence of a reaction within the organism’s metabolic network; a first approximation to a genome-scale metabolic model can be obtained directly from an automated genome annotation. Any obvious problems in the network, such as dead-end or disconnected reactions, can therefore be strong indications of misannotation.
Results: We demonstrate that a machine learning approach using only network topological features can successfully predict the validity of enzyme annotations. The predictions are tested at 3 different levels. A random forest using topological features of the metabolic network and trained on curated sets of correct and incorrect enzyme assignments was found to have an accuracy of up to 86% in 5-fold cross-validation experiments. Further cross-validation against unseen enzyme superfamilies indicates that this classifier can successfully extrapolate beyond the classes of enzyme present in the training data. The random forest model was applied to several automated genome annotations, achieving an accuracy of 60% in most cases when validated against recent genome-scale metabolic models. When the model is applied to draft metabolic networks for multiple species, we also observe a clear negative correlation between predicted annotation quality and phylogenetic distance to the major model organism for biochemistry (Escherichia coli for prokaryotes and Homo sapiens for eukaryotes).
Contact: j.pinney@imperial.ac.uk
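To make the notion of a dead-end in a draft metabolic network concrete, here is a minimal sketch that flags metabolites which are only ever consumed or only ever produced. It is not the random forest classifier or the feature set from the paper; the function name `dead_end_metabolites`, the dictionary format and the toy reactions are assumptions for illustration.

```python
def dead_end_metabolites(reactions):
    """Flag metabolites that appear on only one side of the network.

    `reactions` maps a reaction id to a (substrates, products) pair
    (hypothetical input format; reactions are treated as irreversible).
    """
    produced, consumed = set(), set()
    for substrates, products in reactions.values():
        consumed.update(substrates)
        produced.update(products)
    return {
        "never_produced": sorted(consumed - produced),
        "never_consumed": sorted(produced - consumed),
    }


# Toy network: R3 is disconnected from the R1 -> R2 chain.
toy_network = {
    "R1": (["glucose"], ["g6p"]),
    "R2": (["g6p"], ["f6p"]),
    "R3": (["x"], ["y"]),
}
toy_result = dead_end_metabolites(toy_network)
print(toy_result)
```

In a real network, nutrients taken up from the environment and secreted end products would also show up in these lists, so such flags are at most a starting point for feature construction and not evidence of misannotation on their own.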
TOP Proceedings Track: Metabolic pathways
Presenting author: Cesim Erten, Kadir Has University
Tuesday, July 23: 2:40 p.m. - 3:05 p.m.Room: Hall 7
Area Session Chair: Alfonso Valencia
Presentation Overview: Given a pair of metabolic pathways, an alignment of the pathways corresponds to a mapping between similar substructures of the pair. Successful alignments may provide useful applications in phylogenetic tree reconstruction and drug design, and overall may enhance our understanding of cellular metabolism. We consider the problem of providing one-to-many alignments of reactions in a pair of metabolic pathways. We first provide a constrained alignment framework applicable to the problem. We show that the constrained alignment problem, even in a very primitive setting, is computationally intractable, which justifies efforts to design efficient heuristics. We present our Constrained Alignment of Metabolic Pathways (CAMPWays) algorithm designed for this purpose. Through extensive experiments involving a large pathway database we demonstrate that, when compared to a state-of-the-art alternative, the CAMPWays algorithm provides better alignment results on metabolic networks as far as measures based on same-pathway inclusion are concerned. The execution speed of our algorithm constitutes yet another important improvement over alternative algorithms.
TOP Proceedings Track: microRNA
Presenting author: Hai-Son Le , Carnegie Mellon, United States
Tuesday, July 23: 2:40 p.m. - 3:05 p.m.Room: Hall 14.2
Area Session Chair: Ralf Zimmer
Presentation Overview: Motivation: MicroRNAs (miRNAs) are small non-coding RNAs that regulate gene expression post-transcriptionally. MiRNAs were shown to play an important role in development and disease, and accurately determining the networks regulated by these miRNAs in a specific condition is of great interest. Early work on miRNA target prediction has focused on utilizing static sequence information. More recently, researchers have combined sequence and expression data to identify such targets in various conditions.
Results: Here we propose a regression-based probabilistic method that integrates sequence, expression and interaction data to identify modules of mRNAs controlled by small sets of miRNAs. We formulate an optimization problem and develop a learning framework to determine the module regulation and membership. Applying our method to cancer data we show that by adding protein interaction data and modeling combinatorial regulation our method can accurately identify both miRNA and their targets improving upon prior methods. We next used our method to jointly analyze a number of different types of cancers and identified both common and cancer type specific miRNA regulators.
TOP Proceedings Track: Network topology
Presenting author: Carlo Vittorio Cannistraci , King Abdullah University of Science and Technology, Saudi Arabia
Monday, July 22: 11:00 a.m. - 11:25 a.m.Room: ICC Lounge 81
Area Session Chair: Hagit Shatkay
Presentation Overview: Motivation: Most functions within the cell emerge thanks to protein-protein-interactions (PPIs), yet their experimental determination is both expensive and time consuming. PPI-networks present significant levels of noise and incompleteness. Prediction of interactions using solely PPI-network-topology (topological prediction) is difficult but essential when biological prior-knowledge is absent or unreliable.
Methods: Network-embedding emphasizes relations between network proteins embedded in a low-dimensional space, where protein-pairs closer to each other represent potential candidate interactions to predict. Network denoising, which boosts the prediction performance, is here achieved by minimum-curvilinear-embedding (MCE), combined with the shortest-path (SP) adopted in the reduced space for assigning likelihood scores to candidate interactions. Furthermore, we introduce: (i) a new valid variation of MCE named non-centred-MCE (ncMCE); (ii) two automatic strategies for the selection of the appropriate embedding-dimension; (iii) two new randomised procedures for prediction evaluation.
Results: We compared our method against several unsupervised and supervised embedding approaches, and node-neighbourhood techniques. Despite its computational simplicity, ncMCE-SP was the overall leader outperforming the current methods for topological link prediction.
Conclusion: Minimum curvilinearity is a valuable nonlinear framework, which we successfully applied in embedding of protein networks for unsupervised prediction of novel PPIs. The rationale is that biological and evolutionary prior-information is imprinted in the nonlinear patterns hidden behind the protein network topology, and can be exploited for prediction of new protein links. The predicted PPIs represent good candidates to test in high-throughput experiments or to exploit in systems biology tools such as those used for network-based inference and prediction of disease-related functional modules.
TOP Proceedings Track: Nonparametric sparse Bayesian factor analysis
Presenting author: Iulian Pruteanu-Malinici , Duke University, United States
Monday, July 22: 3:10 p.m. - 3:35 p.m.Room: Hall 7
Area Session Chair: Stefan Kramer
Presentation Overview: Motivation: Computational approaches for the annotation of phenotypes from image data have shown promising results across many applications, and provide rich and valuable information for studying gene function and interactions. While data are often available both at high spatial resolution and across multiple time points, phenotypes are frequently annotated independently, for individual time points only. In particular, for the analysis of developmental gene expression patterns, it is biologically sensible when images across multiple time points are jointly accounted for, such that spatial and temporal dependencies are captured simultaneously.
Methods: We describe a discriminative, undirected graphical model to label gene-expression time-series image data, with an efficient training and decoding method based on the junction tree algorithm. The approach is based on an effective feature selection technique, consisting of a nonparametric sparse Bayesian factor analysis model. The result is a flexible framework, which can handle large-scale data with noisy, incomplete samples, i.e. it can tolerate data missing from individual time points.
Results: Using the annotation of gene expression patterns across stages of Drosophila embryonic development as an example, we demonstrate that our method achieves superior accuracy, gained by jointly annotating phenotype sequences, when compared to previous models that annotate each stage in isolation. The experimental results on missing data indicate that our joint learning method successfully annotates genes for which no expression data are available for one or more stages.
TOP Proceedings Track: Parameter estimation
Presenting author: Xin Gao , King Abdullah University of Science and Technology, Saudi Arabia
Monday, July 22: 12:00 p.m. - 12:25 p.m.Room: ICC Lounge 81
Area Session Chair: Hagit Shatkay
Presentation Overview: Motivation:
Systematic and scalable parameter estimation is key to constructing complex gene regulatory models and, ultimately, to facilitating an integrative systems biology approach for quantitatively understanding the molecular mechanisms underpinning gene regulation.
Results:
Here, we report a novel framework for efficient and scalable parameter estimation that focuses specifically on modeling of gene circuits.
Exploiting the structure commonly found in gene circuit models, this framework decomposes a system of coupled rate equations into individual ones and efficiently integrates them separately to reconstruct the mean time evolution of the gene products. The accuracy of the parameters is refined by iteratively increasing the accuracy of numerical integration using the model structure. As a case study, we applied our framework to four gene circuit models with complex dynamics based on three synthetic data sets and one time-series microarray data set. We compared our framework to three state-of-the-art parameter estimation methods and found that our approach consistently generated higher quality parameter solutions efficiently.
While many general-purpose parameter estimation methods have been applied for modeling of gene circuits, our results suggest that the use of more tailored approaches to employ domain specific information may be a key to reverse-engineering of complex biological systems.
Availability:
Website: http://sfb.kaust.edu.sa/Pages/Software.aspx
TOP Proceedings Track: Pathway inference
Presenting author: Anthony Gitter , Carnegie Mellon University , United States
Monday, July 22: 2:10 p.m. - 2:35 p.m.Room: Hall 4/5
Area Session Chair: Reinhard Schneider
Presentation Overview: Several types of studies, including genome-wide association studies and RNA interference screens, strive to link genes to diseases. Although these approaches have had some success, genetic variants are often only present in a small subset of the population and screens are noisy with low overlap between experiments in different labs. Neither provides a mechanistic model explaining how identified genes impact the disease of interest or the dynamics of the pathways those genes regulate. Such mechanistic models could be used to accurately predict downstream effects of knocking down pathway members and allow comprehensive exploration of the effects of targeting pairs or higher-order combinations of genes.
We developed methods to model the activation of signaling and dynamic regulatory networks involved in disease progression. Our model, SDREM, integrates static and time series data to link proteins and the pathways they regulate in these networks. SDREM utilizes prior information about proteins' likelihood of involvement in a disease (e.g. from screens) to improve the quality of the predicted signaling pathways. We used our algorithms to study the human immune response to H1N1 influenza infection. The resulting networks correctly identified many of the known pathways and transcriptional regulators of this disease. Furthermore, they accurately predict RNA interference effects and can be used to infer genetic interactions, greatly improving over other methods suggested for this task. Applying our method to the more pathogenic H5N1 influenza allowed us to identify several strain-specific targets of this infection.
TOP Proceedings Track: PCA
Presenting author: Noa Liscovitch , Bar Ilan University, Israel
Monday, July 22: 2:40 p.m. - 3:05 p.m.Room: Hall 7
Area Session Chair: Stefan Kramer
Presentation Overview: High spatial resolution imaging datasets of mammalian brains have recently become available in unprecedented amounts. Images now reveal highly complex patterns of gene expression varying on multiple scales. The challenge in analyzing these images is both in extracting the patterns that are most relevant functionally, and in providing a meaningful representation that allows neuroscientists to interpret the extracted patterns.
Here we present FuncISH – a method to learn functional representations of neural in situ hybridization (ISH) images. We represent images using a histogram of local descriptors (SIFT) in several scales, and use this representation to learn detectors of functional (GO) categories for every image. As a result, each image is represented as a point in a low dimensional space whose axes correspond to meaningful functional annotations. The resulting representations define similarities between ISH images that can be easily explained by functional categories.
We applied our method to the genomic set of mouse neural ISH images available at the Allen Brain Atlas, finding that the majority of GO biological processes can be inferred from spatial expression patterns with high accuracy. Using functional representations, we predict several gene interaction properties such as protein-protein interactions and cell type specificity more accurately than competing methods based on global correlations. We used FuncISH to identify similar expression patterns of GABAergic neuronal markers that were not previously identified, and to infer new gene function based on image-image similarities.
TOP Proceedings Track: Poly(A) motif
Presenting author: Bo Xie , Georgia Institute of Technology, United States
Sunday, July 21: 3:10 p.m. - 3:35 p.m.Room: Hall 14.2
Area Session Chair: Cenk Sahinalp
Presentation Overview: Motivation:
Polyadenylation is the addition of a poly(A) tail to an RNA molecule. Identifying DNA sequence motifs that signal the addition of poly(A) tails is essential to improved genome annotation and better understanding of the regulatory mechanisms and stability of mRNA.
Existing poly(A) motif predictors demonstrate that information extracted from the surrounding nucleotide sequences of candidate poly(A) motifs can differentiate true motifs from the false ones to a great extent. A variety of sophisticated features has been explored, including sequential, structural, statistical, thermodynamic and evolutionary properties. However, most of these methods involve extensive manual feature engineering, which can be time-consuming and can require in-depth domain knowledge.
Results:
We propose a novel machine learning method for poly(A) motif prediction by marrying generative learning (hidden Markov models) and discriminative learning (support vector machines). Generative learning provides a rich palette on which the uncertainty and diversity of sequence information can be handled, while discriminative learning allows the performance of the classification task to be directly optimized. Here, we employed hidden Markov models for fitting the DNA sequence dynamics, and developed an efficient spectral algorithm for extracting latent variable information from these models. These spectral latent features were then fed into support vector machines to fine tune the classification performance.
We evaluated our proposed method on a comprehensive human poly(A) dataset that consists of 14,740 samples from 12 of the most abundant variants of human poly(A) motifs. Compared with one of the previous state-of-the-art methods in the literature (the random forest model with expert-crafted features), our method reduces the average error rate, false negative rate and false positive rate by 26%, 15% and 35%, respectively. Meanwhile, our method made about 30% fewer erroneous predictions relative to the other string kernels. Furthermore, our method can be used to visualize the importance of oligomers and positions in predicting poly(A) motifs, from which we can observe a number of characteristics in the surrounding regions of true and false motifs that have not been reported before.
Availability:
Website: http://sfb.kaust.edu.sa/Pages/Software.aspx
TOP Proceedings Track: Population genetics
Presenting author: Pier Francesco Palamara , Columbia University, United States
Monday, July 22: 11:30 a.m. - 11:55 a.m.Room: Hall 4/5
Area Session Chair: Russell Schwartz
Presentation Overview: Pairs of individuals from a study cohort will often share long-range haplotypes identical-by-descent (IBD). Such haplotypes are transmitted from common ancestors that lived tens to hundreds of generations in the past, and can now be efficiently detected in high-resolution genomic datasets, providing a novel source of information in several domains of genetic analysis. Recently, haplotype sharing distributions were studied in the context of demographic inference, and were used to reconstruct recent demographic events in several populations. Here we extend this framework to handle demographic models that contain multiple demes interacting through migration. We extensively test our formalism in several demographic scenarios, and provide a freely available software tool for demographic inference.
TOP Proceedings Track: Population Genomics
Presenting author: Yufeng Wu , University of Connecticut, us
Monday, July 22: 10:30 a.m. - 10:55 a.m.Room: Hall 4/5
Area Session Chair: Russell Schwartz
Presentation Overview:
TOP Proceedings Track: PPI-Network
Presenting author: Alex Lan , Ben Gurion University, Israel
Tuesday, July 23: 3:40 p.m. - 4:05 p.m.Room: Hall 7
Area Session Chair: Alfonso Valencia
Presentation Overview: A major challenge in systems biology is to reveal the cellular pathways that give rise to specific phenotypes and behaviours. Current techniques often rely on a network representation of molecular interactions, where each node represents a protein or a gene and each interaction is assigned a single static score. However, the use of single interaction scores fails to capture the tendency of proteins to favour different partners under distinct cellular conditions. Here we propose a novel context-sensitive network model, in which genes and protein nodes are assigned multiple contexts based on their gene ontology annotations, and their interactions are associated with multiple context-sensitive scores. Using this model we developed a new approach and a corresponding tool, ContextNet, based on a dynamic programming algorithm for identifying signalling paths linking proteins to their downstream target genes. ContextNet finds high-ranking context-sensitive paths in the interactome, thereby revealing the intermediate proteins in the path and their path-specific contexts. We validated the model using 18,348 manually-curated cellular paths derived from the SPIKE database. We next applied our framework to elucidate the responses of human primary lung cells to influenza infection. Top-ranking paths were much more likely to contain infection-related proteins, and this likelihood was highly correlated with path score. Moreover, the contexts assigned by the algorithm pointed to putative as well as previously known responses to viral infection. Thus context-sensitivity is an important extension to current network biology models and can be efficiently used to elucidate cellular response mechanisms.
ContextNet is publicly available at http://netbio.bgu.ac.il/ContextNet.
TOP Proceedings Track: PRM: Protein Recognition module
Presenting author: Kousik Kundu , University of Freiburg, Germany
Sunday, July 21: 11:30 a.m. - 11:55 a.m.Room: Hall 4/5
Area Session Chair: Erik Bongcam-Rudloff
Presentation Overview: State-of-the-art experimental data for determining binding specificities of peptide recognition modules (PRMs) is obtained by high-throughput approaches like peptide arrays. Most prediction tools applicable to this kind of data are based on an initial multiple alignment of the peptide ligands. Building an initial alignment can be error-prone, especially in the case of the proline-rich peptides bound by the SH3 domains. Here we present a machine learning approach based on an efficient graph-kernel technique to predict the specificity of a large set of 70 human SH3 domains, which are a very important class of PRMs. The graph-kernel strategy allows us to 1) integrate several types of physico-chemical information for each amino acid, 2) consider high order correlations between these features and 3) eliminate the need for an initial peptide alignment. We build specialized models for each human SH3 domain and achieve competitive predictive performance of 0.73 area under precision-recall curve (AUC PR), compared to 0.27 AUC PR for state-of-the-art methods based on position weight matrices. We show that better models can be obtained when we use information on the non-interacting peptides (negative examples), which is currently not used by the state-of-the-art approaches based on position weight matrices. To this end, we analyze two strategies to identify subsets of high confidence negative data. The techniques introduced here are more general and hence can also be used for any other protein domains which interact with short peptides (i.e., other PRMs).
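For readers unfamiliar with the area under the precision-recall curve, the sketch below computes average precision, one common estimator of that area, from binary labels and prediction scores. The abstract does not say how its AUC PR values were computed, so this estimator, the function name `average_precision` and the toy data are assumptions for illustration.

```python
def average_precision(labels, scores):
    """Average precision for binary labels (1 = binder, 0 = non-binder).

    Precision is evaluated at the rank of every positive example and
    averaged over all positives.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive example is required")
    # Rank examples from the highest to the lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this recall level
    return total / n_pos


# Toy example: positives at ranks 1 and 3.
toy_ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
print(round(toy_ap, 4))
```

Unlike ROC-based measures, this score depends on the proportion of negatives, which is one reason it is informative when non-interacting peptides greatly outnumber binders.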
TOP Proceedings Track: Protein contact map prediction
Presenting author: Jinbo Xu , Toyota Technological Institute at Chicago, United States
Monday, July 22: 10:30 a.m. - 10:55 a.m.Room: Hall 7
Area Session Chair: Alex Bateman
Presentation Overview: Motivation. A protein contact map describes the pairwise spatial and functional relationship of residues in a protein and contains key information for protein 3D structure prediction. Although studied extensively, it remains very challenging to predict contact maps using only sequence information. Most existing methods predict the contact map matrix element-by-element, ignoring correlation among contacts and physical feasibility of the whole contact map. A couple of recent methods predict contact maps by using mutual information (MI) and enforcing a sparsity restraint (i.e., the contact matrix shall be very sparse), but these methods demand a very large number of sequence homologs and the resultant contact map may still be physically infeasible.
Results. This paper presents a novel method for contact map prediction, integrating both evolutionary and physical restraints by machine learning and integer linear programming (ILP). The evolutionary restraints are much more informative than MI and the physical restraints specify more concrete relationships among contacts than the sparsity restraint. As such, our method greatly reduces the solution space of the contact map matrix and thus significantly improves prediction accuracy. Experimental results show that our method outperforms currently popular methods no matter how many sequence homologs are available for the protein under consideration.
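To show what the contact map matrix is, the sketch below derives one from known 3D coordinates by thresholding pairwise distances. It does not predict contacts from sequence and is unrelated to the ILP method in the paper; the 8 Å cutoff, the one-point-per-residue representation, the function name `contact_map` and the toy coordinates are assumptions for illustration.

```python
import math


def contact_map(coords, threshold=8.0, min_separation=1):
    """Binary contact matrix from one 3D point per residue.

    Residues i and j are in contact when their distance is at most
    `threshold` (in angstroms, assumed) and |i - j| >= min_separation.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= threshold:
                cmap[i][j] = cmap[j][i] = 1  # the matrix is symmetric
    return cmap


# Toy chain: three residues spaced 3.8 apart plus one distant residue.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_map = contact_map(toy_coords)
for row in toy_map:
    print(row)
```

Maps built this way from solved structures are the kind of reference against which predicted contacts can be scored.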
TOP Proceedings Track: Protein interaction evolution
Presenting author: Robert Patro , Carnegie Mellon University, United States
Monday, July 22: 3:10 p.m. - 3:35 p.m.Room: ICC Lounge 81
Area Session Chair: Burkhard Rost
Presentation Overview: Motivation: Reconstruction of the network-level evolutionary history of protein-protein interactions provides a principled way to relate interactions in several present-day networks. Here, we present a general framework for inferring such histories and demonstrate how it can be used to determine what interactions existed in the ancestral networks, which present-day interactions we should expect to exist based on evolutionary evidence, and what information extant networks contain about the order of ancestral protein duplications.
Results: Our framework characterizes the space of likely parsimonious network histories. It results in a structure that can be used to find probabilities for a number of events associated with the histories. The framework is based on a directed hypergraph formulation of dynamic programming that we extend to enumerate many optimal and near-optimal solutions. The algorithm is applied to reconstructing ancestral interactions among bZIP transcription factors, imputing missing present-day interactions among the bZIPs and among proteins from 5 herpes viruses, and determining relative protein duplication order in the bZIP family. Our approach more accurately reconstructs ancestral interactions compared with existing approaches. In cross-validation tests, we find that our approach ranks the majority of the left-out present-day interactions among the top 2% and 17% of possible edges for the bZIP and herpes networks, respectively, making it a competitive approach for edge imputation. It also estimates, from interaction data alone, relative bZIP protein duplication orders that are significantly correlated with sequence-based estimates.
Availability: The algorithm is implemented in C++, is open source, and available at http://www.cs.cmu.edu/~ckingsf/software/parana2.
Contact: robp@cs.cmu.edu and carlk@cs.cmu.edu
TOP Proceedings Track: Protein Interactions & Molecular Networks
Cancelled
Presenting author: Rohith Srivas , University of California, San Diego, us
Sunday, July 21: 3:10 p.m. - 3:35 p.m.Room: Hall 4/5
Area Session Chair: Olga Vitek
Presentation Overview: TOP
Presenting author: Jacques Colinge , CeMM, at
Sunday, July 21: 2:10 p.m. - 2:35 p.m.Room: Hall 4/5
Area Session Chair: Olga Vitek
Presentation Overview: TOP
Presenting author: Paula Petrone , Hoffmann-La Roche, ch
Monday, July 22: 12:00 p.m. - 12:25 p.m.Room: Hall 14.2
Area Session Chair: Serafim Batzoglou
Presentation Overview: TOP
Presenting author: Gabriele Sales , Università di Padova, it
Tuesday, July 23: 3:10 p.m. - 3:35 p.m.Room: Hall 7
Area Session Chair: Alfonso Valencia
Presentation Overview: TOP
Presenting author: Inna Kuperstein , Institut Curie, fr
Sunday, July 21: 3:40 p.m. - 4:05 p.m.Room: Hall 4/5
Area Session Chair: Olga Vitek
Presentation Overview: TOP
Presenting author: Alexey Stukalov , CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, at
Monday, July 22: 10:30 a.m. - 10:55 a.m.Room: ICC Lounge 81
Area Session Chair: Hagit Shatkay
Presentation Overview: TOP
Presenting author: Janusz Dutkowski , University of California, San Diego, us
Tuesday, July 23: 3:10 p.m. - 3:35 p.m.Room: Hall 4/5
Area Session Chair: Reinhard Schneider
Presentation Overview:
TOP Proceedings Track: Protein Structure & Function
Presenting author: Elisa Cilia , Université Libre de Bruxelles, be
Tuesday, July 23: 11:00 a.m. - 11:25 a.m.Room: ICC Lounge 81
Area Session Chair: Janet Kelso
Presentation Overview: TOP
Presenting author: Predrag Radivojac , Indiana University, us
Tuesday, July 23: 11:30 a.m. - 11:55 a.m.Room: ICC Lounge 81
Area Session Chair: Janet Kelso
Presentation Overview: TOP
Presenting author: Avner Schlessinger , Mount Sinai School of Medicine, us
Sunday, July 21: 11:00 a.m. - 11:25 a.m.Room: Hall 14.2
Area Session Chair: Russell Schwartz
Presentation Overview:
TOP Proceedings Track: Protein Structure and Function Prediction and Anal
Cancelled
Presenting author: John-Marc Chandonia , Berkeley National Lab, us
Sunday, July 21: 3:40 p.m. - 4:05 p.m.Room: Hall 15.2
Presentation Overview: The Structural Classification of Proteins (SCOP) database is a manually curated, near-comprehensive ordering of domains from proteins of known structure in a hierarchy according to their structural and evolutionary relationships. The ASTRAL compendium is a collection of software and databases, closely related to SCOP, that is used to aid research into protein structure and evolution. We released new versions of both SCOP and ASTRAL (1.75B) in January 2013. The new releases are the second in a series of stable SCOP and ASTRAL releases based on SCOP 1.75. New versions of both databases are presented to the public through a single, unified interface (http://scop.berkeley.edu/). New features include a SQL-based infrastructure and build procedure, a fully automated classification scheme for new PDB entries that are similar to previously classified entries, and periodic incremental releases to supplement the stable releases. More than 11,300 new PDB entries have been added since SCOP 1.75, without sacrificing the reliability that SCOP has accumulated through years of careful manual curation. We plan to introduce additional features in a series of stable releases, while a major reclassification (SCOP 2.0) is in progress.
TOP
Presenting author: Andrew Bordner , Mayo Clinic, us
Sunday, July 21: 2:40 p.m. - 3:05 p.m.Room: Hall 15.2
Presentation Overview: The discovery of which mutations contribute to a particular disease is an important biomedical problem with potential applications in drug discovery, disease diagnosis and prognosis, and the development of improved personalized therapies. To this end, we have developed a computational method that integrates complementary approaches for predicting the biochemical effects of missense mutations using genome-wide generation of homology models for human protein complexes. Mutations affecting diverse types of binding sites are identified by homology to available X-ray structures of complexes and machine learning classifiers while spatial clustering of mutations is used to detect other compact regions of the protein structure important for its function. A Random Forest classifier trained on results from these structure-based methods, as well as annotations from online databases, evolutionary conservation, and predicted stability changes was found to outperform current popular prediction methods. Finally, the predicted biochemical effects of mutations showed good agreement with experimental assays.
TOP
Presenting author: Argyris Politis , University of Oxford, uk
Sunday, July 21: 3:10 p.m. - 3:35 p.m.Room: Hall 15.2
Presentation Overview: In recent years, integrative structure determination of protein complexes has garnered great interest as a result of the vast amount of data obtained by different experiments. In particular, integrative approaches have gained attention for studying highly heterogeneous and dynamic systems which remain refractory to structure determination by conventional methods. Key developments in emerging mass spectrometry (MS)-based techniques, such as native MS and ion mobility (IM)-MS, have led to their integration into the structural biologist’s pipeline. Here we present an integrative approach for structure determination of protein assemblies by combining native mass spectrometry (MS), ion mobility-MS and chemical cross-linking MS. The accuracy and confidence levels of this approach are demonstrated by encoding data from MS techniques into restraints for assembling a set of known hetero-complexes from their building blocks. This method enabled us to characterize the structures of two unknown precursors acting en route to the assembly of the AAA-ATPase base subcomplex within the proteasome, a macromolecule responsible for the controlled degradation of intracellular proteins.
TOP
Presenting author: David Gfeller , Swiss Institute of Bioinformatics, ch
Tuesday, July 23: 3:10 p.m. - 3:35 p.m.Room: Hall 10
Presentation Overview: SH3 domains bind peptides to mediate protein-protein interactions that assemble and regulate dynamic biological processes. We surveyed the repertoire of SH3 binding specificity using peptide phage display in a metazoan, the worm Caenorhabditis elegans, and discovered that it structurally mirrors that of the budding yeast Saccharomyces cerevisiae. We then mapped the worm SH3 interactome using stringent yeast two-hybrid and compared it to the equivalent map for yeast. We found that the worm SH3 interactome resembles the analogous yeast network because it is significantly enriched for proteins with roles in endocytosis. Nevertheless, orthologous SH3 domain mediated interactions are highly rewired. Our results suggest a model of network evolution where general function of the SH3 domain network is conserved over its specific form.
TOP Proceedings Track: Protein structure prediction
Presenting author: Zhidong Xue , University of Michigan, United States
Monday, July 22: 11:00 a.m. - 11:25 a.m.Room: Hall 7
Area Session Chair: Alex Bateman
Presentation Overview: Motivation: Protein domains are subunits that can fold and function independently. Identification of domain boundary locations is often the first step in protein folding and function annotations. Most of the current methods deduce domain boundaries by sequence-based analysis where accuracy is low. There is no efficient method for predicting discontinuous domains that consist of segments from separated sequences. Since template-based methods are most efficient for protein 3D structure modeling, combining multiple threading alignment information should increase the accuracy and reliability of computational domain predictions.
Result: We develop a new domain predictor, ThreaDom, which deduces protein domain boundary locations based on multiple threading alignments. The core of the method development is the derivation of a domain conservation score that combines composite information from template domain structures and terminal and internal alignment gaps. Tested on 630 non-redundant sequences, without using homologous templates, ThreaDom generates correct single- and multi-domain classifications in 81% of cases, where 78% have the domain linker location assigned within 20 residues. In a second test on 486 proteins with discontinuous domains, ThreaDom achieves an average precision of 84% and a recall of 65% in domain boundary prediction. Finally, ThreaDom was examined on 56 targets from CASP8 and had domain overlap rates of 73%, 87% and 85% with the target structure for Free Modeling, Hard multiple-domain and discontinuous domain proteins, respectively, which are significantly higher than those of most of the domain predictors in the CASP8 experiment.
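As an illustration of how boundary precision and recall of the kind quoted above can be computed, here is a minimal Python sketch. It is not ThreaDom's evaluation code: the function name `boundary_precision_recall` and the choice of a 20-residue tolerance for matching are assumptions made for this example.

```python
def boundary_precision_recall(predicted, actual, tolerance=20):
    """Return (precision, recall) for predicted domain boundary positions.

    Assumption for this sketch: a predicted boundary counts as correct when it
    lies within `tolerance` residues of any true boundary, and a true boundary
    counts as recovered when any prediction lies within `tolerance` of it.
    """
    correct = sum(
        1 for p in predicted if any(abs(p - a) <= tolerance for a in actual)
    )
    recovered = sum(
        1 for a in actual if any(abs(p - a) <= tolerance for p in predicted)
    )
    precision = correct / len(predicted) if predicted else 0.0
    recall = recovered / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical boundary positions (residue indices), for illustration only.
precision, recall = boundary_precision_recall([100, 250], [110, 400, 600])
```

In this toy case one of the two predictions falls near a true boundary, and one of the three true boundaries is recovered.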
TOP Proceedings Track: Protein Threading
Presenting author: Sheng Wang, Toyota Technological Institute at Chicago, United States
Monday, July 22: 11:30 a.m. - 11:55 a.m.Room: Hall 7
Area Session Chair: Alex Bateman
Presentation Overview: Motivation: Template-based modeling (TBM) including homology modeling and protein threading is the most reliable method for protein 3D structure prediction. However, alignment errors and template selection are still the main bottleneck for current TBM methods, especially when proteins under consideration are distantly related.
Results: We present a novel context-specific alignment potential for protein threading including alignment and template selection. Our alignment potential measures the log odds ratio of one alignment being generated from two related proteins to being generated from two unrelated proteins, by integrating both local and global context-specific information. The local alignment potential quantifies how well one sequence residue can be aligned to one template residue based upon context-specific information of the residues. The global alignment potential quantifies how well two sequence residues can be placed into two template positions at a given distance, again based upon context-specific information. By accounting for correlation among a variety of protein features and making use of context-specific information, our alignment potential is much more sensitive than the widely used context-independent or profile-based scoring function. Experimental results confirm that our method generates significantly better alignments and threading results than the best profile-based methods on several very large benchmarks. Our method works particularly well for distantly-related proteins or proteins with sparse sequence profiles due to the effective integration of context-specific, structure and global information.
TOP Proceedings Track: Proteomics
Presenting author: Louis-Francois Handfield , University of Toronto, ca
Tuesday, July 23: 11:00 a.m. - 11:25 a.m.Room: Hall 10
Presentation Overview: The characterization of protein abundance and stochastic abundance has been systematically defined in budding yeast using fluorescently tagged proteins. Subcellular location can also be systematically uncovered using supervised machine learning approaches that have been trained to recognize predefined image classes based on statistical features. As an alternative, we capture cell stage dependence of protein spatial expression within automatically identified cells. We use the identified bud area as a cell-stage indicator. We show that similarities between the inferred expression patterns contain more information about protein function than can be explained by a previous manual categorization of subcellular localization. Further analysis reveals that such a characterization allows us to identify 12% of the 4004 proteins by finding the protein that is closest in expression pattern in a replicate experiment. This characterization includes stochasticity levels in measurement, which are correlated with previous reports in the case of stochasticity in protein abundance. Other stochasticity levels, such as in compactness for protein expression, are shown to be reproducible. Changes in cell morphology due to the alpha factor mating pheromone or changes of fluorescent markers required for segmentation also have a limited impact on the measured variability levels. Our results suggest that quantitative cell-stage dependent representations of protein spread discriminate protein spatial expressions without requiring predefined subcellular location classes. We show that some major quantified deviations, such as high spatial variability, are systematically detected under a spectrum of experimental conditions.
TOP
Presenting author: Michal Linial , The Hebrew University of Jerusalem, il
Tuesday, July 23: 2:40 p.m. - 3:05 p.m.Room: Hall 10
Presentation Overview: Translation must be tightly controlled for coping with the cell's demand and its limited resources. Energetically, translation is the most expensive operation in dividing cells. We applied a measure of tRNA adaptation index (tAI) as an indirect proxy for the translation rate. We tested the possibility that sequence determinants are encoded along the transcripts to govern translational efficiency. The secretory proteome comprises about 30% of the proteins in human and other multi-cellular model systems. Many of these proteins contain at their N’-terminal a segment that is called Signal Peptide (SP) which determines a translocation to the ER. Indeed, all SP-proteins are translated by ER-membrane bound ribosomes. We anticipated that proteins translated by free or bound ribosomes differ with respect to their overall translation speed. We demonstrate that clusters of poorly adapted codons followed by abundant codons specify the N’-terminal of secreted and SP-membranous proteins. The phenomenon is generalized to the proteomes of yeast, fly and worm despite a poor correlation among their codon tAI values. We propose that translation determinants are evolved to match the cellular needs for translational rate. The codons’ arrangement along transcripts is crucial for management of synaptic sites and poorly folded protein translation. The appearance of low tAI codons at the N'-terminal of SP proteins attenuates the elongation rate. We conclude that processes such as translocation through the ER membrane, processing, maturation and folding are dependent on a specific codon arrangement that dictates a delay in translational elongation.
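For readers unfamiliar with the tAI, the index of a coding sequence is conventionally the geometric mean of per-codon adaptiveness weights. The Python sketch below shows that calculation and a sliding-window profile that could expose a cluster of poorly adapted codons near the start of a transcript. The weights, the window size and the function names `tai` and `windowed_tai` are placeholders chosen for this example, not values or code from the study.

```python
import math


def tai(codons, weights):
    """Geometric mean of the per-codon adaptiveness weights (all weights > 0)."""
    logs = [math.log(weights[c]) for c in codons]
    return math.exp(sum(logs) / len(logs))


def windowed_tai(codons, weights, size):
    """tAI of every window of `size` consecutive codons, from 5' to 3'."""
    return [
        tai(codons[i:i + size], weights)
        for i in range(len(codons) - size + 1)
    ]


# Hypothetical weights for two alanine codons; real weights are derived
# from tRNA gene copy numbers and are species-specific.
weights = {"GCU": 1.0, "GCG": 0.25}
profile = windowed_tai(["GCG", "GCG", "GCU", "GCU"], weights, size=2)
```

With these placeholder weights the profile rises along the sequence, which is the shape the abstract describes for poorly adapted codons followed by abundant ones.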
TOP Proceedings Track: Pseudogene
Presenting author: Wei Wang, UCLA, United States
Sunday, July 21: 2:40 p.m. - 3:05 p.m.Room: Hall 14.2
Area Session Chair: Cenk Sahinalp
Presentation Overview: Motivation:
RNA-seq techniques provide an unparalleled means for exploring a transcriptome with deep coverage and base pair level resolution. Various analysis tools have been developed to align and assemble RNA-seq data, such as the widely used TopHat/Cufflinks pipeline. A common observation is that a sizable fraction of the fragments/reads align to multiple locations of the genome. These multiple alignments pose substantial challenges to existing RNA-seq analysis tools. Inappropriate treatment may result in reporting spurious expressed genes (false positives), and missing the real expressed genes (false negatives). Such errors impact the subsequent analysis, such as differential expression analysis. In our study, we observe that about 3.5% of transcripts reported by TopHat/Cufflinks pipeline correspond to annotated nonfunctional pseudogenes. Moreover, about 10.0% of reported transcripts are not annotated in the Ensembl database. These genes could be either novel expressed genes or false discoveries.
Results:
We examine the underlying genomic features that lead to multiple alignments and investigate how they generate systematic errors in RNA-seq analysis. We develop a general tool, GeneScissors, which exploits machine learning techniques guided by biological knowledge to detect and correct spurious transcriptome inference by existing RNA-seq analysis methods. In our simulated study, GeneScissors can predict spurious transcriptome calls due to misalignment with an accuracy close to 90%. It provides substantial improvement over the widely used TopHat/Cufflinks or MapSplice/Cufflinks pipelines in both precision and F-measurement. On real data, GeneScissors reports 53.6% fewer pseudogenes and 0.97% more expressed and annotated transcripts, when compared with the TopHat/Cufflinks pipeline. In addition, among the 10.0% unannotated transcripts reported by TopHat/Cufflinks, GeneScissors finds that more than 16.3% of them are false positives.
Availability:
The software can be downloaded at http://csbio.unc.edu/genescissors/
TOP Proceedings Track: RNA
Presenting author: Vladimir Reinharz , McGill University, Canada
Tuesday, July 23: 3:40 p.m. - 4:05 p.m.Room: Hall 14.2
Area Session Chair: Ralf Zimmer
Presentation Overview: Motivations: The design of RNA sequences folding into predefined secondary structures is a milestone for many synthetic biology and gene therapy studies. Most of the current software use similar local search strategies (i.e. a random seed is progressively adapted to acquire the desired folding properties) and more importantly do not allow the user to control explicitly the nucleotide distribution such as the GC-content in their sequences. However, the latter is an important criterion for large-scale applications as it could presumably be used to design sequences with better transcription rates and/or structural plasticity.
Results: In this paper, we introduce IncaRNAtion, a novel algorithm to design RNA sequences folding into target secondary structures with a predefined nucleotide distribution. IncaRNAtion uses a global sampling approach and weighted sampling techniques. We show that our approach is fast (i.e. running time comparable or better than local search methods), seed-less (we remove the bias of the seed in local search heuristics), and successfully generates high-quality sequences (i.e. thermodynamically stable) for any GC-content. To complete this study, we develop a hybrid method combining our global sampling approach with local search strategies. Remarkably, our glocal methodology outperforms both local and global approaches.
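To make the idea of steering GC-content through weighted sampling concrete, the following Python sketch draws a sequence compatible with a dot-bracket structure while multiplying the weight of each choice by a factor per G or C. This is a deliberately simplified illustration and not the IncaRNAtion algorithm, which also accounts for folding energy; the names `gc_content`, `sample_sequence` and `gc_weight` are assumptions of this example.

```python
import random

# Canonical and wobble pairs allowed at paired positions.
PAIRS = [("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")]


def gc_content(seq):
    """Fraction of G and C nucleotides in `seq`."""
    return sum(1 for b in seq if b in "GC") / len(seq) if seq else 0.0


def sample_sequence(structure, gc_weight, rng):
    """Sample a sequence compatible with a balanced dot-bracket `structure`.

    Every G or C multiplies the weight of a choice by `gc_weight`, so values
    above 1 favour GC-rich sequences and values below 1 favour AU-rich ones.
    """
    stack, partner = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            partner[stack.pop()] = i
    seq = [None] * len(structure)
    for i, ch in enumerate(structure):
        if ch == ".":
            bases = "ACGU"
            w = [gc_weight if b in "GC" else 1.0 for b in bases]
            seq[i] = rng.choices(bases, weights=w)[0]
        elif ch == "(":
            w = [gc_weight ** sum(b in "GC" for b in p) for p in PAIRS]
            seq[i], seq[partner[i]] = rng.choices(PAIRS, weights=w)[0]
    return "".join(seq)


designed = sample_sequence("((..))", gc_weight=3.0, rng=random.Random(0))
```

Tuning `gc_weight` until the average `gc_content` of the samples matches a target is one simple way to reach a predefined nucleotide distribution in this toy setting.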
TOP Proceedings Track: RNA structure prediction
Presenting author: Hamidreza Chitsaz, Wayne State University, United States
Tuesday, July 23: 3:10 p.m. - 3:35 p.m.Room: Hall 14.2
Area Session Chair: Ralf Zimmer
Presentation Overview: Motivation: Computational RNA structure prediction is a mature, important problem which has received a new wave of attention with the discovery of regulatory non-coding RNAs and the advent of high-throughput transcriptome sequencing. Despite nearly two score years of research on RNA secondary structure and RNA-RNA interaction prediction, the accuracy of the state-of-the-art algorithms is still far from satisfactory. So far, researchers have proposed increasingly complex energy models and improved parameter estimation methods, experimental and/or computational, in anticipation of endowing their methods with enough power to solve the problem. The output has disappointingly been only modest improvements, not matching the expectations. Even recent massively featured machine learning approaches were not able to break the barrier. Why is that?
Approach: The first step towards high accuracy structure prediction is to pick an energy model that is inherently capable of predicting each and every one of the known structures to date. In this paper, we introduce the notion of learnability of the parameters of an energy model as a measure of such an inherent capability. We say that the parameters of an energy model are learnable iff there exists at least one set of such parameters that renders every known RNA structure to date the minimum free energy structure. We derive a necessary condition for the learnability and give a dynamic programming algorithm to assess it. Our algorithm computes the convex hull of the feature vectors of all feasible structures in the ensemble of a given input sequence. Interestingly, that convex hull coincides with the Newton polytope of the partition function as a polynomial in energy parameters. To the best of our knowledge, this is the first approach towards computing the RNA Newton polytope and a systematic assessment of the inherent capabilities of an energy model. The worst-case complexity of our algorithm is exponential in the number of features. However, one could employ dimensionality reduction techniques to avoid the curse of dimensionality.
Results: We demonstrated the application of our theory to a simple energy model consisting of a weighted count of A-U, C-G, and G-U base pairs. Our results show that this simple energy model satisfies the necessary condition for more than half of the input unpseudoknotted sequence-structure pairs (55%) chosen from the RNA STRAND v2.0 database and severely violates the condition for about 13%, which provide a set of hard cases that require further investigation. From 1350 RNA strands, the observed three dimensional feature vector for 749 strands is on the surface of the computed polytope. For 289 RNA strands, the observed feature vector is not on the boundary of the polytope but its distance from the boundary is not more than one. A distance of one essentially means one base pair difference between the observed structure and the closest point on the boundary of the polytope, which need not be the feature vector of a structure. For 171 sequences, this distance is larger than 2, and for only 11 sequences, this distance is larger than 5.
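The three-dimensional feature vector used in this simple energy model can be sketched as follows: count the A-U, C-G and G-U base pairs of a structure, then take the energy as a weighted sum of those counts. The function names, the dot-bracket input format and the parameter values are assumptions made for this illustration, not the authors' code.

```python
def base_pair_features(seq, structure):
    """Return (A-U, C-G, G-U) base pair counts for a dot-bracket structure."""
    counts = {"AU": 0, "CG": 0, "GU": 0}
    stack = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            # Sort the two bases so that G-C and C-G map to the same key.
            pair = "".join(sorted((seq[j], seq[i])))
            if pair in counts:
                counts[pair] += 1
    return (counts["AU"], counts["CG"], counts["GU"])


def energy(features, params):
    """Energy as the weighted count of base pairs (dot product)."""
    return sum(f * p for f, p in zip(features, params))


features = base_pair_features("GGGAAAUCC", "(((...)))")
# Hypothetical parameters for A-U, C-G and G-U pairs, in arbitrary units.
total = energy(features, (-2.0, -3.0, -1.0))
```

Under this model a structure can be the minimum free energy structure for some parameter choice only if its feature vector lies on the boundary of the convex hull of all feasible feature vectors, which is the necessary condition the abstract describes.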
TOP Proceedings Track: Sequence Analysis
Presenting author: Noah Daniels , Tufts University, United States
Monday, July 22: 2:40 p.m. - 3:05 p.m.Room: Hall 4/5
Area Session Chair: Reinhard Schneider
Presentation Overview: Motivation: The exponential growth of protein sequence databases has increasingly made the fundamental question of searching for homologs a computational bottleneck. The amount of unique data, however, is not growing nearly as fast; we can exploit this fact to greatly accelerate homology search. Acceleration of programs in the popular PSI/DELTA-BLAST family of tools will not only speed up homology search directly, but also the huge collection of other current programs that primarily interact with large protein databases via precisely these tools.
Results: We introduce a suite of homology search tools, powered by compressively-accelerated protein BLAST (CaBLASTP), which are significantly faster than and comparably accurate to all known state-of-the-art tools including HHblits, DELTA-BLAST, and PSI-BLAST. Further, our tools are implemented in a manner that allows direct substitution into existing analysis pipelines. The key idea is that we introduce a local similarity-based compression scheme that allows us to operate directly on the compressed data. Importantly, CaBLASTP’s runtime scales almost linearly in the amount of unique data, as opposed to current BLASTP variants which scale linearly in the size of the full protein database being searched. Our compressive algorithms will speed up many tasks such as protein structure prediction and orthology mapping which rely heavily on homology search. Availability: CaBLASTP is available under the GNU Public License at http://cablastp.csail.mit.edu/
TOP
Presenting author: Xuejian Xiong , Hospital for Sick Children, ca
Sunday, July 21: 12:00 p.m. - 12:25 p.m.Room: Hall 7
Area Session Chair: Predrag Radivojac
Presentation Overview: TOP
Presenting author: Martin Weigt , Universite Pierre and Marie Curie, fr
Tuesday, July 23: 10:30 a.m. - 10:55 a.m.Room: ICC Lounge 81
Area Session Chair: Janet Kelso
Presentation Overview: TOP
Presenting author: Denisa Duma , University of California Riverside, us
Tuesday, July 23: 12:00 p.m. - 12:25 p.m.Room: Hall 4/5
Area Session Chair: Debra Goldberg
Presentation Overview: TOP
Presenting author: Rajeev Azad , University of North Texas, us
Monday, July 22: 3:40 p.m. - 4:05 p.m.Room: Hall 4/5
Area Session Chair: Reinhard Schneider
Presentation Overview: TOP
Presenting author: Steven Brenner , University of California, Berkeley, us
Sunday, July 21: 3:10 p.m. - 3:35 p.m.Room: Hall 7
Area Session Chair: Ivo Hofacker
Presentation Overview: TOP
Presenting author: Misook Ha , Samsung Advanced Institute of Technology, kr
Monday, July 22: 3:40 p.m. - 4:05 p.m.Room: Hall 14.2
Area Session Chair: Sean O'Donoghue
Presentation Overview: TOP
Presenting author: Michael Baym , Harvard Medical School, us
Monday, July 22: 11:00 a.m. - 11:25 a.m.Room: Hall 14.2
Area Session Chair: Serafim Batzoglou
Presentation Overview: TOP Proceedings Track: Sequencing
Presenting author: David Golan , Tel Aviv University, Israel
Monday, July 22: 2:40 p.m. - 3:05 p.m.Room: Hall 14.2
Area Session Chair: Sean O'Donoghue
Presentation Overview: Motivation:
The importance of fast and affordable DNA sequencing methods for current day life sciences, medicine and biotechnology is hard to overstate. A major player is IonTorrent, a pyrosequencing-like technology which produces flowgrams – sequences of incorporation values – which are converted into nucleotide sequences by a base-calling algorithm. Because of its exploitation of ubiquitous semiconductor technology and innovation in chemistry, IonTorrent has been gaining popularity since its debut in 2011. Despite the advantages, however, IonTorrent read accuracy remains a significant concern.
Results:
We present FlowgramFixer, a new algorithm for converting flowgrams into reads. Our key observation is that the incorporation signals of neighboring flows, even after normalization and phase correction, carry considerable mutual information and are important in making the correct base-call. We therefore propose that base-calling of flowgrams should be done on a read-wide level, rather than one flow at a time. We show that this can be done in linear time by combining a state machine with a Viterbi algorithm to find the nucleotide sequence that maximizes the likelihood of the observed flowgram. FlowgramFixer is applicable to any flowgram based sequencing platform. We demonstrate FlowgramFixer’s superior performance on Ion Torrent E. coli data, with a 4.8% improvement in the number of high-quality mapped reads and a 7.1% improvement in the number of uniquely mappable reads.
Availability:
Binaries and source code of FlowgramFixer are freely available at:
http://www.cs.tau.ac.il/~davidgo5/flowgramfixer.html
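The Viterbi step mentioned in this abstract can be illustrated with a generic decoder over a small hidden-state model. The Python sketch below is not FlowgramFixer's state machine: the two states ("zero" and "one" incorporated bases), the discretized signal levels and all probabilities are hypothetical values chosen only to show how the most likely state sequence is recovered in time linear in the number of flows.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden state path; assumes all probabilities are non-zero."""
    score = {
        s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        for s in states
    }
    back = []  # one back-pointer table per observation after the first
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: score[r] + math.log(trans_p[r][s]))
            new_score[s] = (
                score[prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][obs])
            )
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical toy model: hidden state = bases incorporated in a flow.
states = ("zero", "one")
start_p = {"zero": 0.6, "one": 0.4}
trans_p = {"zero": {"zero": 0.7, "one": 0.3}, "one": {"zero": 0.4, "one": 0.6}}
emit_p = {
    "zero": {"low": 0.5, "mid": 0.4, "high": 0.1},
    "one": {"low": 0.1, "mid": 0.3, "high": 0.6},
}
calls = viterbi(["low", "mid", "high"], states, start_p, trans_p, emit_p)
```

Because the ambiguous "mid" signal is resolved using its neighbors rather than in isolation, the toy example reflects the read-wide decoding idea, under the stated assumptions.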
TOP Proceedings Track: Short read alignment
Presenting author: Victoria Popic , Stanford University, United States
Tuesday, July 23: 11:00 a.m. - 11:25 a.m.Room: Hall 4/5
Area Session Chair: Debra Goldberg
Presentation Overview: The increasing availability of high throughput sequencing technologies has led to thousands of human genomes having been sequenced in the past years. Efforts such as the 1000 Genomes Project further add to the availability of human genome variation data. However, to date there is no method that can map reads of a newly sequenced human genome to a large collection of genomes. Instead, methods rely on aligning reads to a single reference genome. This leads to inherent biases and lower accuracy. To tackle this problem, a new alignment tool BWBBLE is introduced in this paper. We (1) introduce a new compressed representation of a collection of genomes, which explicitly tackles the genomic variation observed at every position, and (2) design a new alignment algorithm based on the Burrows-Wheeler transform that maps short reads from a newly sequenced genome to an arbitrary collection of 2 or more (up to millions of) genomes with high accuracy and no inherent bias to one specific genome.
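As background for the Burrows-Wheeler approach, the Python sketch below builds the transform of a single text naively and counts exact pattern occurrences by backward search. It shows only the classic single-reference technique, not BWBBLE's multi-genome representation, and the function names are chosen for this example. A production index would use suffix arrays and precomputed rank tables instead of sorting rotations and rescanning the string.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (quadratic, for clarity)."""
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)


def count_occurrences(bwt_str, pattern):
    """Count exact matches of `pattern` in the original text by backward search."""
    # first[c] = index of the first row starting with character c.
    first = {}
    for i, c in enumerate(sorted(bwt_str)):
        first.setdefault(c, i)
    lo, hi = 0, len(bwt_str)
    for c in reversed(pattern):
        if c not in first:
            return 0
        # Rank queries done by scanning; real indexes precompute them.
        lo = first[c] + bwt_str.count(c, 0, lo)
        hi = first[c] + bwt_str.count(c, 0, hi)
        if lo >= hi:
            return 0
    return hi - lo


transformed = bwt("banana")
hits = count_occurrences(transformed, "ana")
```

Each pattern character narrows a range of sorted rotations, so the number of search steps depends on the read length rather than on the size of the indexed text.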
TOP Proceedings Track: Systems Biology and Networks
Presenting author: Armaghan Naik, Carnegie Mellon University, United States
Monday, July 22: 11:00 a.m. - 11:25 a.m.Room: Hall 15.2
Presentation Overview: High throughput screening involves determination of the effect of many chemical compounds on a given cellular target. As currently practiced, a full set of measurements for all compounds for each new target is typically made, with little use of information from previous screens. To efficiently study compound effects on many targets, a means is needed for determining and exploiting similarities in the effects of compounds and/or behavior of targets such that measurements of all combinations of compounds and targets are not needed to achieve high accuracy. Here, we describe probabilistic models that can be used to predict results for unmeasured combinations, and active learning algorithms for selecting future informative batches of experiments. Through extensive simulated experiments we showed that our approaches can produce powerful predictive models and learn them significantly faster than can be done by random choice. We further characterized our method’s performance experimentally using a collection of 48 compounds and 48 NIH 3T3 cell clones expressing different GFP-tagged proteins; the learner’s task was to efficiently build a model of the effects of each compound on each clone. Since none of the effects were known prior to beginning the experiments, each clone and compound was silently duplicated to provide the ability to check how well duplicates were recognized. The learner could request acquisition of batches of image data for specific combinations of drugs and clones using liquid handling robotics and an automated microscope. Our method achieved a 92% accuracy having only sampled 28% of the experiment space.
TOP
Presenting author: Elisenda Feliu , University of Copenhagen, dk
Monday, July 22: 10:30 a.m. - 10:55 a.m.Room: Hall 15.2
Presentation Overview: The number of states in which a cell can be at any given time is linked to the flexibility in its decision making and to cell-to-cell variability. Particularly, bi- and multistable cellular systems provide mechanisms for rapidly switching between different responses. Identifying whether a system exhibits multistable behavior or not is, however, challenging. The theoretical determination of small motifs in gene regulatory networks and signaling pathways that can exhibit multistationarity has been the focus of several studies in the past. However, it remains unclear to what extent these motifs are actually highly represented in living cells.
We have developed a computational method that gives a necessary condition for a system to exhibit multistationarity. If a system is multistationary, we can screen all small subnetworks and determine the key components in multistationarity. We have applied the method to 365 models extracted from the publicly available database Biomodels with data precomputed in PoCab. In this way, we have obtained a catalog of small motifs responsible for multistationarity in real systems.
At the conference, the method will be briefly described and the exhaustive analysis of the Biomodels database, including the small structures causing multistationarity, will be presented.
TOP
Presenting author: Tiffany Chen , Stanford University, us
Monday, July 22: 11:30 a.m. - 11:55 a.m.Room: Hall 15.2
Presentation Overview: Most cell-based drug screening methods identify and evaluate potential drug candidates based on measurements of cell death or target inhibition. Using these approaches, the global impact of these drug candidates on cell cycle and signaling networks is greatly deemphasized, even though quantitative analysis of the cell cycle is fundamental to most anti-cancer drug development. Single-cell multiparameter flow cytometry can simultaneously measure intracellular proteins including those participating in the cell cycle and signaling pathways. To date, however, no automated, data-driven method exists for processing such biologically complex measurements. To address this need, we developed Tour-Recovered Automatic models for Cellular Continuums (TRACC), a computational methodology for automatically reconstructing the cell cycle de novo from flow cytometry data. TRACC reconstructs cell cycle progression without prior expert knowledge, thus setting a foundation for automated cell cycle analysis.
TOP
Presenting author: Jacques Colinge , CeMM, at
Tuesday, July 23: 2:10 p.m. - 2:35 p.m.Room: Hall 10
Presentation Overview: Interactions between proteins and nucleic acids (NAs) play a pivotal role in a wide variety of essential biological processes. Transcription factors that recognize specific DNA motifs only constitute part of the NA-binding proteins (NABPs). In this study, we present the first large-scale effort to systematically map human NABPs with generic classes of nucleic acids. Using 25 carefully designed synthetic DNA and RNA oligonucleotides as baits and affinity purification mass spectrometry (AP-MS), we performed pulldowns in three cell lines that yielded 10,000+ protein-NA interactions and involved 900+ proteins. Bioinformatic analysis allowed us to identify 139 new NABPs, to provide first experimental evidence for another 98, and to determine 513 specificities for 219 distinct NABPs for different subtypes of NAs.
Successful validation of 7/8 chosen new specificities confirmed the affinity of YB-1 for methylated cytosine. YB-1 is over-expressed in tumors and is associated with multiple drug resistance. Network analysis of YB-1 ChIP-seq peak nearest genes identified a subnetwork of 73 genes strongly associated with cancer pathways, thereby suggesting a potential epigenetic role of YB-1 in resistant tumors.
We could also show that non-sequence-specific DNA-binding proteins interact with nucleic acid chains through an interface that is more constrained in its geometry than that of proteins binding mRNA, which are known to contain more disordered regions.
To extend the experimental data we undertook a machine learning approach to derive a method of automatically inferring nucleic acid binding. We employed a family of support vector machines (SVMs) to predict NA binding de novo.
TOP Proceedings Track: Text mining
Presenting author: Sophia Ananiadou, The University of Manchester
Tuesday, July 23: 3:40 p.m. - 4:05 p.m.Room: Hall 4/5
Area Session Chair: Reinhard Schneider
Presentation Overview: Motivation: In order to create, verify and maintain pathway models, curators must discover and assess knowledge distributed over the vast body of biological literature. Methods supporting these tasks must understand both the pathway model representations and the natural language in the literature. These methods should identify and order documents by relevance to any given pathway reaction. No existing system has addressed all aspects of this challenge.
Method: We present novel methods for associating pathway model reactions with relevant publications. Our approach extracts the reactions directly from the models, and then turns them into queries for three text-mining-based MEDLINE literature search systems. These queries are executed, and the resulting documents are combined and ranked according to their relevance to the reactions of interest. We manually annotate document-reaction pairs with the relevance of the document to the reaction and use this annotation to study several ranking methods, using various heuristic and machine learning approaches.
Results: Our evaluation shows that the annotated document-reaction pairs can be used to create a rule-based document ranking system, and that machine learning can be used to rank documents by their relevance to pathway reactions. We find a Support Vector Machine-based system outperforms several baselines and matches the performance of the rule-based system. The successful query extraction and ranking methods are used to update our existing pathway search system, PathText.
Availability: An online demonstration of PathText 2 and the annotated corpus are available for research purposes at http://www.nactem.ac.uk/pathtext2/.
Contact: makoto.miwa@manchester.ac.uk
Proceedings Track: Transcriptome assembling
Presenting author: Henry C.M. Leung, The University of Hong Kong
Tuesday, July 23: 10:30 a.m. - 10:55 a.m. Room: Hall 4/5
Area Session Chair: Debra Goldberg
Presentation Overview: Motivation: RNA sequencing based on next-generation sequencing technology is an effective approach for analyzing transcriptomes. Like de novo genome assembly, de novo transcriptome assembly does not rely on a reference genome or additional annotated information. It is well known that transcriptome assembly is the more difficult problem. In particular, isoforms can have very uneven expression levels (e.g. 1:100), which makes it very difficult to identify low-expressed isoforms. Technically, a core issue is to remove erroneous vertices/edges with high multiplicity (produced by high-expressed isoforms) in the de Bruijn graph without removing correct ones with lower multiplicity that correspond to low-expressed isoforms. Failing to do so results either in the loss of low-expressed isoforms or in complicated subgraphs in which transcripts of different genes are mixed together because of the erroneous vertices and edges.
Contributions: Unlike existing tools, which usually remove erroneous vertices/edges if their multiplicities are lower than a global threshold, we developed a probabilistic progressive approach with local thresholds to iteratively remove those erroneous vertices/edges. This enables us to decompose the graph into disconnected components, each of which contains a few genes, if not a single gene, while retaining many correct vertices/edges of low-expressed isoforms. Combined with existing techniques, IDBA-Tran can assemble both high-expressed and low-expressed transcripts and outperforms existing assemblers in terms of sensitivity and specificity for both simulated and real data.
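The difference between a global and a local threshold can be sketched in Python as follows. This is a deliberately simplified, non-probabilistic illustration and not IDBA-Tran's algorithm: each edge is compared against the strongest edge leaving the same vertex, and the function names, the 0.1 ratio, and the toy multiplicities are all assumptions.

```python
from collections import defaultdict

# Toy de Bruijn-like graph: (source, target) -> multiplicity (hypothetical values).
edges = {
    ("A", "B"): 1000,  # correct edge of a high-expressed isoform
    ("A", "X"): 15,    # erroneous edge produced by the high-expressed isoform
    ("C", "D"): 10,    # correct edge of a low-expressed isoform
    ("D", "E"): 9,     # correct edge of a low-expressed isoform
}


def filter_global(edges, threshold):
    """Keep edges whose multiplicity reaches one graph-wide threshold."""
    return {edge: m for edge, m in edges.items() if m >= threshold}


def filter_local(edges, ratio):
    """Keep edges whose multiplicity is at least `ratio` times the
    strongest edge leaving the same source vertex (a local threshold)."""
    strongest_out = defaultdict(int)
    for (source, _target), m in edges.items():
        strongest_out[source] = max(strongest_out[source], m)
    return {
        (source, target): m
        for (source, target), m in edges.items()
        if m >= ratio * strongest_out[source]
    }


kept_global = filter_global(edges, threshold=12)
kept_local = filter_local(edges, ratio=0.1)
print(sorted(kept_global))
print(sorted(kept_local))
```

In this toy graph no single global threshold separates the erroneous edge (multiplicity 15) from the correct low-expressed edges (10 and 9), whereas the local ratio removes only the erroneous edge.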
Availability: http://www.cs.hku.hk/~alse/idba_tran
Proceedings Track: Other
Presenting author: Jennifer Cham, European Bioinformatics Institute, UK
Tuesday, July 23: 12:00 p.m. - 12:25 p.m. Room: Hall 7
Area Session Chair: Thomas Lengauer
Presentation Overview: