11th Annual Rocky Mountain Bioinformatics Conference
POSTER PRESENTATIONS
Subject: Qualitative modeling and simulation
Presenting Author: Chaitanya Acharya, Duke University
Abstract:
Expression quantitative trait loci (eQTL) analysis associates putative regulatory variants (SNPs) with gene expression levels, which are treated as quantitative traits. Until recently, eQTL analysis was performed on a tissue-by-tissue basis, followed by an examination of the overlap of eQTLs across all tissues. However, most of those methods fall short in their ability to jointly analyze data across multiple tissues. Such joint analyses of tissue types have been shown to improve power to identify eQTLs that have similar effects across tissues. We propose a variance component score test approach in a mixed-effects framework to jointly analyze multiple tissue types, and we assess the power of such tests. Using Monte Carlo simulations, we show that the new score test performs much better than the traditional likelihood ratio method in terms of statistical power. Using real data sets, we show that the new score test not only preserves power but is also computationally very efficient. We think that this method will be particularly useful for prioritizing variants when analyzing heterogeneous disease model systems, especially for downstream genomic analyses including, but not restricted to, next-generation sequencing analysis.
Subject: Optimization and search
Presenting Author: Maryam Bagher Oskouei, University of Otago
Author(s):
Peter Dearden, University of Otago, New Zealand
Abstract:
Drosophila and honeybee embryos both develop a segmented body plan during early development. The basic body plan consists of distinct segments along the anterior-posterior axis, established via a segmentation process. This process, which subdivides the embryo into segments, is controlled by interactions between segmentation genes. Many experimental and computational studies have sought to reveal which interactions drive this process in Drosophila embryos, but few have been done for honeybee embryos. The honeybee genome has some aspects that make it worth studying. Honeybees are excellent comparative model systems for understanding the evolutionary pathways behind the segmentation process, considering that the two insects diverged ~350 million years ago. Here, we present a method using ordinary differential equations (ODEs) to model segmentation genes in honeybee embryos. The initial and target models for the ODEs were configured with data collected in Peter Dearden's lab. The computational modeling was carried out to explore how likely each gene is to be regulated positively or negatively by other genes. The simulations were performed in two phases, first for the pre-stripe networks and then for the stripe-pattern-forming networks. The main findings predict gene networks that are more likely to pattern different parts of the embryo along its anterior-posterior axis during early developmental stages. These results are comparable with those for Drosophila embryos. Importantly, the predicted networks provide hypotheses that can be tested experimentally.
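To illustrate the kind of ODE model the abstract describes, here is a minimal sketch that integrates a small gene network with the forward Euler method. The sigmoid regulation term, the two-gene network, the interaction weights, and the rate constants are all hypothetical choices made for illustration; they are not taken from the poster.

```python
import math


def simulate(weights, production, decay, x0, dt=0.01, steps=2000):
    """Forward-Euler integration of
    dx_i/dt = production_i * sigmoid(sum_j weights[i][j] * x_j) - decay_i * x_i.

    weights[i][j] > 0 means gene j activates gene i; < 0 means repression.
    """
    x = list(x0)
    n = len(x)
    for _ in range(steps):
        dx = []
        for i in range(n):
            signal = sum(weights[i][j] * x[j] for j in range(n))
            activation = 1.0 / (1.0 + math.exp(-signal))
            dx.append(production[i] * activation - decay[i] * x[i])
        x = [x[i] + dt * dx[i] for i in range(n)]
    return x


# Hypothetical network: gene 0 represses gene 1's regulator role is reversed,
# i.e. gene 1 represses gene 0 (weight -4) and gene 0 activates gene 1 (weight +4).
W = [[0.0, -4.0],
     [4.0, 0.0]]
final_state = simulate(W, production=[1.0, 1.0], decay=[1.0, 1.0], x0=[0.5, 0.5])
print(final_state)
```

In a fitting setting, the signs and magnitudes in `W` would be the unknowns, scored by how closely the simulated expression approaches a target pattern.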
Subject: Networking, web services, remote applications
Presenting Author: Janani Subbiah, Arizona State University
Author(s):
Janaka Balasooriya, Arizona State University, United States
Abstract:
Biological computations involving genes and proteins often require a tremendous amount of computing resources, and discovering the required services is very time consuming. One of our earlier papers, titled “Cloud Computing Infrastructure for Biological Echo-Systems”, proposed a framework for a cloud environment. This poster provides a proof-of-concept application as a step towards realizing that framework. Web services for gene analysis are combined in such a way that the output of one web service serves as the input of another. When deployed on a cloud infrastructure, the web services would facilitate scalability, improved response time and virtualization of required resources (which could be in the form of storage elements like databases or computing elements like servers).
The proof-of-concept application provides functionalities corresponding to Single Nucleotide Polymorphisms (SNPs), proteins and genes. SNP functions include finding the consequence type, regulatory feature, etc. Protein functions include retrieval of protein features and protein interactions.
Subject:
Presenting Author: David Baltrusaitis, Loyola University Chicago
Author(s):
Catherine Putonti, Loyola University Chicago, United States
Abstract:
Advances in next-generation sequencing technologies have led to increased numbers of metagenomic studies for a wide variety of environmental niches. Although first characterized in culturable species, the clustered regularly interspaced short palindromic repeats (CRISPR) system is just starting to be studied in these complex data sets. The CRISPR system, shown to provide bacteria with adaptive immunity to foreign genomic material, consists of loci of associated genes that code for pertinent proteins in conjunction with arrays of spacer and direct repeat sequences. These spacer sequences match a subsequence within the invading virus/plasmid and thus confer immunity. A handful of tools have been developed to detect these arrays, typically within long contig sequences or assembled genomes, with varied success. Furthermore, as previous research has shown, these tools are ill-equipped to examine shorter sequencing reads. We recently extended our existing tool, SpacerSeeker, which is optimized for unassembled short-read metagenomic data, to evaluate its performance in array detection. Instead of only detecting spacer sequences, the program will now detect repeat and spacer units. This was performed in conjunction with a phylogenetic analysis of the direct repeats thus far observed in nature. Our own molecular work has indicated a high prevalence of the CRISPR system within freshwater environments. Although previous studies have detected CRISPR arrays within salt-water environments, little research has been devoted to exploring the prevalence of the CRISPR system in freshwater communities. As such, we tested this new functionality on reads from bacteria isolated from freshwater samples.
Subject: Qualitative modeling and simulation
Presenting Author: Sven Bilke, National Cancer Institute
Abstract:
Evidence for a non-random spatial 3D organization of the cell's DNA content and its relevance for gene regulation has been accumulating in recent years. In a recent study [1], Dekker and co-workers introduced a novel method, HiC, allowing for an unbiased genome-wide study of 3D conformations, producing a "probability map" of DNA-DNA contacts in an ensemble of cells. Here we aim to identify genomic parameters correlating with the 3D structure described in [1].
We developed a model based on DNA sequence and chromatin structure related observables and a set of mixing parameters. Using Monte Carlo optimization techniques, we identify features correlating with the contact matrix. The resulting model reproduces the empirical consensus contact probability map described in [1] with Pearson's correlation r > 0.71.
[1] Lieberman-Aiden,E., Dekker,J. et al, Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science, 326(5950) 289-293 (2009).
Subject: Qualitative modeling and simulation
Presenting Author: Daniel Cuevas, San Diego State University
Author(s):
Daniel Garza, Evandro Chagas Institute, Brazil
Savannah Sanchez, San Diego State University, United States
Abstract:
Advances in large-scale genomic sequencing allow researchers to create accurate computational models of organisms through the use of gene annotation software, such as RAST (Rapid Annotation using Subsystem Technology). Such software deduces gene function through homology-based distinctions that are dependent on previously verified information; thus new discoveries cannot be easily extrapolated from current analysis tools without experimental examination. Recent developments using phenotype microarrays (PMs) provide a high-throughput, large-scale technique for profiling bacterial characteristics and their phenotypes. PMs have the potential to experimentally test various growth conditions and then provide bacterial yield in real time. By coupling PM experiments with the advances of genomic sequencing and annotation, more robust and accurate computational models can be developed and confirmed.
Here we present a combined biological and computational approach that (1) uses optical density data from a PM system as input to evaluate various growth curves, and (2) optimizes the flux-balance analysis (FBA) models by using the PM results as a base for in silico growth simulations. The bacterium Citrobacter sedlakii was sequenced and studied in the PM-FBA pipeline to assess the capabilities of our approach. RAST annotations produced a base computational model consisting of 1,367 enzymatic reactions. After PM-FBA optimization, a total of 44 reactions were added to, or modified within, the model. The model correctly predicted the outcome of 89% of growth experiments.
Subject: Other
Presenting Author: Rebecca Davidson, National Jewish Health
Author(s):
Benjamin Garcia, University of Colorado Denver, United States
Paul Reynolds, National Jewish Health, United States
Eveline Farias-Hesson, National Jewish Health, United States
Rafael Silva Duarte, Universidade Federal do Rio de Janeiro, Brazil
Mary Jackson, Colorado State University, United States
Michael Strong, National Jewish Health, United States
Abstract:
Multiple isolates of Mycobacterium abscessus subsp. bolletii, collectively called BRA100, were associated with outbreaks of post-surgical skin infections across various regions of Brazil from 2003 to 2009. To investigate the genome content of these clinically important isolates, we sequenced, assembled and annotated the genome of one BRA100 strain called CRM0020 that was isolated from a patient in Rio de Janeiro, Brazil in 2006. The 4.8 Mb draft genome contains 4794 predicted genomic features including 3224 (67.2%) genes with functional annotations, 1524 (31.8%) genes classified as hypothetical proteins and 46 tRNAs. CRM0020 also contains a 56.4 kb plasmid sequence that encodes 63 predicted proteins and shows 99% sequence identity to the previously described plasmids, pMAB01 and BRA100, which were derived from distinct Brazilian M. abscessus subsp. bolletii strains. Given the recent report of an outbreak of M. abscessus subsp. bolletii strains infecting cystic fibrosis patients in the United Kingdom (UK), we used a phylogenomics approach to compare the CRM0020 genome to multiple UK outbreak isolates as well as other globally diverse strains with publicly available genomes. Analyses of genome-wide single nucleotide polymorphisms (SNPs) revealed that the Brazilian-derived CRM0020 strain is more closely related to UK outbreak isolates than to strains derived from patients in the United States, Europe or Malaysia. Our study merges new genome sequence data with existing genomic information to explore the global diversity of infectious M. abscessus isolates and to compare outbreak strains from different continents.
Subject: Other
Presenting Author: Shane Dorden, University of Tampa
Abstract:
Adenoviruses are double-stranded DNA viruses that have a genome size of approximately 35 kb. These viruses infect all vertebrates, and human adenoviruses are associated with various illnesses such as acute respiratory disease, conjunctivitis, and gastroenteritis. Human adenoviruses are divided into seven species, A through G. Each of these species is further divided into numbered types. Numerous adenovirus genomes have been sequenced and are available in GenBank. While most of the proteins in these adenovirus genomes have been annotated, there are several hypothetical proteins whose functions are unknown. Assignment of function to these proteins will yield greater insight into adenovirus pathogenicity and epidemiology. We extracted these hypothetical proteins from these genomes and used sequence analysis and structure prediction to infer their functions. We found that some of these hypothetical proteins can be annotated with a greater degree of confidence than others, for which only broader functional predictions can be assigned.
Subject: System integration
Presenting Author: Mikhail Dozmorov, Oklahoma Medical Research Foundation
Author(s):
Jonathan Wren, Oklahoma Medical Research Foundation, United States
Abstract:
The success of genome-wide association studies (GWASs) in finding causative SNPs for Mendelian phenotypes is contrasted with their inability to accurately elucidate complex patterns and biological roles of mutations underlying non-Mendelian inheritance. Our motivation was to find common epigenomic elements enriched with sets of disease-specific SNPs, and to systematically classify the diseases by their epigenomic background.
Human disease-specific SNPs were extracted from the UCSC GWAS catalog. We used our method, GenomeRunner (http://www.genomerunner.org), to test them for statistically significant associations with epigenomic data from the UCSC genome database. Disease-specific epigenomic associations were compared with random associations, obtained by testing random sets of SNPs. P-values of enriched associations were calculated using Fisher’s exact test and corrected for multiple testing using the Benjamini-Hochberg procedure.
In total, 212 disease-associated and 363 trait/phenotype-associated sets of SNPs were tested for associations with more than 4,000 genome annotation datasets. We found that SNPs for diseases/traits of similar origin (immunological, neurological, metabolic) tend to be located within similar epigenomic features. Our results suggest that alterations of specific epigenomic regulators may underlie disease susceptibility, guiding future epigenomic drug design and therapeutic targets.
The vast and growing amount of genome annotation data contains enormous potential to interpret sets of disease-associated mutations within a common, unifying theme of epigenomic regulators. Considering these themes will empower us to interpret the results of GWASs in terms of unifying mechanisms, complementing SNP-gene-pathway approaches. Conversely, similarities and differences in epigenomic context of disease- and trait-associated SNPs provide a new means to classify phenotypes and understand their common epigenomic denominators.
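The statistical step named in this abstract (Fisher's exact test followed by Benjamini-Hochberg correction) can be sketched with the standard library alone. This is an illustrative sketch, not GenomeRunner's implementation: the one-sided (enrichment) alternative, the layout of the 2x2 table, and the function names are assumptions.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for enrichment in the 2x2 table
        [[a, b],   a = disease SNPs inside the feature, b = disease SNPs outside
         [c, d]]   c = background SNPs inside,          d = background SNPs outside
    Returns P(X >= a) under the hypergeometric null.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    denom = comb(total, row1)
    upper = min(row1, col1)
    tail = sum(comb(col1, k) * comb(total - col1, row1 - k)
               for k in range(a, upper + 1))
    return tail / denom


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts for one SNP set tested against three annotation tracks.
tables = [(12, 8, 30, 150), (3, 17, 40, 140), (5, 15, 45, 135)]
raw = [fisher_exact_greater(*t) for t in tables]
print(benjamini_hochberg(raw))
```

With thousands of annotation tracks per SNP set, the correction step matters: each adjusted value is the smallest false discovery rate at which that association would still be called significant.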
Subject:
Presenting Author: Bas Dutilh, Radboud University Medical Centre
Abstract:
Determining the interrelationships between metagenomes from different biomes or different time points is important to understand the microbial world around us. Mapping metagenomic sequences to a reference database of known genes is a feasible approach to transfer taxonomical and functional annotations to sequence reads. However, it can limit the amount of data that can be analyzed because the majority of the sequencing reads in difficult-to-annotate datasets, such as viral metagenomes from biomes other than the human microbiome, lack known homologs. A promising alternative is reference-independent comparative metagenomics by cross-assembly.
Cross-assembly of different metagenomes is a fast and insightful way to obtain information about sequences that are shared between the samples, represented by cross-contigs. Importantly, cross-assembly is independent of an annotated reference database, providing a way to also handle unknown sequences. The cross-assembly tool crAss allows a rapid analysis of these cross-contigs. First, it provides cross-contig-based similarity scores between all metagenome pairs. Second, crAss creates insightful images displaying the inter-relationships between samples. Third, it generates occurrence profiles of the cross-contig sequences across metagenomes that can be used to discover related sequences, aiding further assembly and interpretation.
Subject: Networking, web services, remote applications
Presenting Author: Douglas Fenger, Dart NeuroScience
Author(s):
Philip Cheung, Dart NeuroScience, United States
Tim Tully, Dart NeuroScience, United States
Abstract:
Homologous relationships facilitate drug discovery by mapping gene/protein function between and within species, allowing functional predictions of novel or unknown genes. Additional benefits of cross-species mapping include the following: use of paralogs for selectivity/specificity screens to eliminate drug side effects, translation of pathway information from model organisms to humans, and allowing comparison and combination of data from different species.
GeneSeer (http://geneseer.com) is a publicly available tool that leverages public sequence data, gene metadata information, and other publicly available data to calculate and display orthologous and paralogous gene relationships for all genes from several species, including yeast, insects, worms, vertebrates, mammals, and primates including humans. GeneSeer calculates homology relationships and its interface is designed to help scientists quickly predict important attributes such as additional closely related family members and paralogous relationships. It is a useful tool for cross-species translational mapping and enables scientists to easily translate hypotheses about gene identity and function from one species to another. We have validated GeneSeer versus Homologene, the homolog prediction tool from NCBI. The results show that GeneSeer is as good as, if not better than, Homologene. Finally, a comparison of features shows GeneSeer to be the most feature rich when compared to alternative homology tools.
Subject: Graph Theory
Presenting Author: Suzanne Gallagher, University of Colorado Boulder
Abstract:
Biological networks are usually modeled using graphs where nodes represent molecules (e.g., genes, proteins, metabolites), and pairs are connected by an edge to indicate an association (e.g., co-expression, regulation, binding). However, this model is insufficient for some types of data, such as affinity purification protein interaction data, which captures the interaction amongst many proteins rather than just two. To model this non-binary data, an extension of graphs known as hypergraphs has been proposed. However, due to the relative newness of the study of large complex hypergraphs, many of the statistics we use to study biological networks have not been well defined.
We examine a commonly-used network statistic, the clustering coefficient, in the context of protein interaction networks. We examine previous suggestions on how to extend this statistic to hypergraphs and look at the physical meaning of these in terms of protein interactions. We also present several novel definitions for the clustering coefficient that may better capture the intent of the clustering coefficient in binary graphs. We determined how well the various statistics can predict proteins in complexes or co-complexed pairs of proteins. Our results show that many of the hypergraph clustering coefficients perform better at these tasks than the clustering coefficient on the usual binary graph representation. We conclude that hypergraphs do represent an improvement over graphs and provide recommendations on which clustering coefficient definitions perform best.
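For reference, the binary-graph baseline that the abstract compares against can be sketched as follows. Each hyperedge (for example, one affinity purification) is projected onto pairwise edges by clique expansion, and the standard local clustering coefficient is then computed. Clique expansion is only one possible projection and is an assumption here; the poster's hypergraph-specific definitions are not reproduced, and the protein names are hypothetical.

```python
from itertools import combinations


def clique_expansion(hyperedges):
    """Project a hypergraph onto a binary graph: every pair of nodes that
    share a hyperedge becomes an edge. Returns an adjacency dict of sets."""
    adj = {}
    for edge in hyperedges:
        members = set(edge)
        for u in members:
            adj.setdefault(u, set())
        for u, v in combinations(members, 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj


def local_clustering(adj, node):
    """Standard local clustering coefficient: the fraction of a node's
    neighbor pairs that are themselves connected."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


# Hypothetical purifications: one three-protein complex and one pair.
purifications = [{"A", "B", "C"}, {"C", "D"}]
graph = clique_expansion(purifications)
for protein in sorted(graph):
    print(protein, local_clustering(graph, protein))
```

The projection discards the information that A, B and C were observed together in a single experiment, which is the kind of loss that motivates defining the statistic directly on the hypergraph.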
Subject: Text Mining
Presenting Author: Benjamin Good, The Scripps Research Institute
Author(s):
Andrew Su, The Scripps Research Institute, United States
Abstract:
Identifying concepts and relationships in biomedical text enables knowledge to be applied in computational analyses, such as gene set enrichment evaluations, that would otherwise be impossible. As such, there is a long and fruitful history of BioNLP projects that apply natural language processing to address this challenge. However, the state of the art in BioNLP still leaves much room for improvement in terms of precision, recall and the complexity of knowledge structures that can be extracted automatically. Expert curators are still vital to the process of knowledge extraction but are in short supply.
Recent studies have shown that workers on microtasking platforms such as Amazon’s Mechanical Turk (AMT) can, in aggregate, generate high-quality annotations of biomedical text. In addition, several recent volunteer-based citizen science projects have demonstrated the public’s strong desire and ability to participate in the scientific process even without any financial incentives. Based on these observations, the mark2cure initiative is developing a Web interface for engaging large groups of people in the process of manual literature annotation. The system will support both microtask workers and volunteers. These workers will be directed by scientific leaders from the community to help accomplish ‘quests’ associated with specific knowledge extraction problems. In particular, we are working with patient advocacy groups such as the Chordoma Foundation to identify motivated volunteers and to develop focused knowledge extraction challenges. We are currently evaluating the first prototype of the annotation interface using the AMT platform.
Subject: Text Mining
Presenting Author: Negacy Hailu, University of Colorado Computational Bioscience Program
Abstract:
The number of publications in the biomedical domain is increasing exponentially. Searching for papers specific to a researcher’s interest in this domain is difficult. PubMed allows keyword search, but it does not rank results by document relevance. We present a recognizer for temporal expressions related to Cell Cycle Phase (CCP) concepts in biomedical literature. This task is one of the fundamental steps towards building a search engine for queries with temporal components. Our ultimate goal is to build a specialized search engine for CCP-related searches using genes and small molecules. We seek to improve search accuracy by allowing searches using semantic indexing instead of keywords. We identified 11 cell cycle related temporal expressions, for which we made extensions to TIMEX3, arranging them in an ontology derived from the Gene Ontology. We annotated 310 abstracts from PubMed. We developed annotation guidelines that are consistent with existing time-related annotation guidelines such as TimeML. Two annotators participated in the annotation. We computed inter-annotator agreement (IAA) and achieved an IAA of 0.79 for exact span match and 0.82 for relaxed constraints. Our approach is a hybrid of machine learning to recognize the temporal expressions and a rule-based approach to classify them. We trained a named entity recognizer using Conditional Random Field (CRF) models, with an off-the-shelf implementation of the linear-chain CRF model. We obtained an F-score of 0.77 for temporal expression recognition. We achieved macro and micro average F-scores of 0.79 and 0.78, respectively, for classification.
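The macro and micro average F-scores reported above differ in how per-class results are combined. The following sketch shows both computations from per-class true positive, false positive, and false negative counts. The class labels and counts are hypothetical and are not the poster's data.

```python
def f1(tp, fp, fn):
    """F1 score from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(counts):
    """counts maps a class label to a (tp, fp, fn) tuple.

    Macro average: mean of the per-class F1 scores (every class weighs the same).
    Micro average: F1 of the pooled counts (every instance weighs the same).
    """
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return macro, f1(tp, fp, fn)


# Hypothetical classification counts for two temporal-expression classes.
example = {"G1_phase": (8, 2, 2), "S_phase": (1, 1, 3)}
macro, micro = macro_micro_f1(example)
print(round(macro, 3), round(micro, 3))
```

When the two averages are close, as in the abstract, performance is not dominated by a few frequent classes.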
Subject: Machine learning, inference and pattern discovery
Presenting Author: Michael Hinterberg, University of Colorado-Denver
Author(s):
David Kao, University of Colorado, United States
Abstract:
The increasing size and availability of large clinical datasets provide an opportunity for discovery of novel, complex phenotypes in patients. Some of these phenotypes, such as drug responsiveness, are important for differential treatment modalities. Complex phenotypes may also be associated with arrays of diagnostic biomarkers; for example, differential expression of mRNA as well as microRNA can segregate different classes of patients.
In datasets with thousands of clinical features, testing hypotheses for associations between clinical phenotype and gene expression can be a tedious process. Furthermore, slight modifications in patient stratification may have dramatic effects on biomarker association, but these differences may not be readily apparent. In ongoing work, we present a novel web-based visualization tool that allows the user to view and modify tree-based representations of clinical phenotypes and examine associations with microRNA and mRNA expression, with visible transitions that show the effect of modifying the phenotype definition. A specific motivating application to drug responsiveness in non-ischemic dilated cardiomyopathy is presented as well.
Subject: Graph Theory
Presenting Author: Barrett Hostetter-Lewis, California State University, Chico
Abstract:
The end of the 20th century marked the beginning of the era of large-scale studies identifying protein interactions. These large data sets catalyzed a renaissance in protein interaction network research. The mainstay of this research has been in elucidating the biological and evolutionary factors that affect the network's topological features. As the quantity of data increases and the quality of data improves we have been regularly refining our understanding of these biological and evolutionary influences. Now, as interaction data becomes available for recently sequenced organisms, there is a great opportunity for research on today's nascent interactomes to benefit from the analytical steps and missteps taken on fledgling interactome analysis 10-15 years ago.
Here we describe how the researcher's view of the Saccharomyces cerevisiae protein interaction network has changed since the first publication of large-scale yeast data in late 1999. By creating network snapshots using increasing amounts of interaction data constrained by date-of-publication and various quality criteria, we identify trends in the researcher's view of network topology, and compare this to the interactomes of organisms which still have early, incomplete interaction data sets.
Subject: Qualitative modeling and simulation
Presenting Author: Maheen Kibriya, Chapman University
Author(s):
Louis Ehwerhemuepha, Chapman University, United States
Abstract:
The importance of computing in biological and life sciences cannot be overemphasized. It is imperative for students in biological sciences to be introduced to computing at the undergraduate level, and our work presents the view of undergraduates toward this shift to an interdisciplinary field. We developed a simple nucleotide sequence analysis program written in Python and discuss our experience in learning and using Python to solve simple biological problems. The program was tested using sequence data from the tuberculosis database (www.tbdb.org), and some high-level functions freely available in Biopython are briefly discussed.
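As an indication of what a simple nucleotide sequence analysis program in Python might contain, here is a short sketch using only the standard library. The abstract does not list the program's features, so the three functions below (base counts, GC content, reverse complement) and the example sequence are assumptions chosen for illustration.

```python
from collections import Counter

# Translation table mapping each DNA base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def base_counts(seq):
    """Count how often each character occurs in the sequence."""
    return Counter(seq.upper())


def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(seq):
    """Complement every base, then reverse the strand."""
    return seq.upper().translate(COMPLEMENT)[::-1]


dna = "ATGCCGTA"  # hypothetical example sequence
print(base_counts(dna))
print(gc_content(dna))
print(reverse_complement(dna))
```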
Subject: Metagenomics
Presenting Author: Dan Knights, University of Minnesota
Author(s):
Rinse Weersma, University Medical Center Groningen, Netherlands
Dirk Gevers, Broad Institute of Harvard and MIT, United States
Gerard Dijkstra, University Medical Center Groningen, Netherlands
Hailiang Huang, Massachusetts General Hospital, United States
Andrea Tyler, Mount Sinai Hospital, Canada
Suzanne von Sommeren, University Medical Center Groningen, Netherlands
Floris Imhann, University Medical Center Groningen, Netherlands
Joanne Stempak, Mount Sinai Hospital, Canada
Caitlin Russell, Massachusetts General Hospital, United States
Jenny Sauk, Massachusetts General Hospital, United States
Jo Knight, University of Toronto, Canada
Mark Daly, Massachusetts General Hospital, United States
Curtis Huttenhower, Harvard School of Public Health, United States
Ramnik Xavier, Massachusetts General Hospital, United States
Abstract:
Human genetics and host-associated microbiomes have each been associated with inflammatory bowel disease (IBD); however, IBD risk cannot be fully explained by either factor alone. Recent findings implicate genotype-enterotype crosstalk as a contributor to IBD pathogenesis. However, there has been no large study of complex genome-microbiome interactions in humans. We have performed such a study using bacterial 16S ribosomal RNA enterotyping and Immunochip genotyping from intestinal mucosal biopsies in three independent cohorts totalling more than 500 individuals. We present methodology, validated internally between cohorts, to test for host genetic locus interaction with taxonomic and functional components of the microbiome. In a targeted analysis integrating fine mapping of causal variants, we find nucleotide oligomerization domain 2 (NOD2)-specific risk associated with known IBD-related imbalances in bacterial taxa, including increased Gammaproteobacteria and Escherichia. NOD2 has known roles in management of commensal bacteria and a strong genetic signal for increased IBD risk. Using imputed bacterial metagenomes, we also find NOD2 risk linked to increased sulfur reduction and lipopolysaccharide biosynthesis. These findings point to pathobiont expansion and bacterial production of the genotoxic agent hydrogen sulfide, both involved in inflammation and IBD pathogenesis. In a novel omnibus test, we demonstrate links between host innate and adaptive immune pathways and broad enterotype composition. Our analysis demonstrates the ability to uncover novel interactions from paired genotype-enterotype data and shows that host genetics is linked to microbial dysbiosis in IBD.
Subject: Simulation and numeric computing
Presenting Author: David Knox, University of Colorado Anschutz Medical Campus
Abstract:
Transcriptional regulation is the complex system behavior arising from the interaction of numerous regulators with DNA. Experimental efforts have unraveled the function of many individual components of the process, but the systems-level behavior remains unpredictable. Growing evidence indicates that the transcriptional response of the system emerges not solely from the individual components, but rather from their collective behavior -- including competition and cooperation. The environment surrounding DNA undergoes millions of molecular interactions every second, resulting in continuous changes to the configuration of physically bound molecular components. It is from these stochastic, temporal, and spatial interactions of regulatory components that transcriptional regulation arises within each cell. Encapsulating our understanding of these interactions into computational models is essential for a full understanding of transcriptional regulation.
Our goal was to create biologically realistic computational models of transcriptional regulation that not only capture the behavior of several individual components, but also describe the dynamic and stochastic behavior of competing components. To this end, we have developed an automated rule builder that creates not only stochastic simulation rule sets but also basic visualizations of the resultant simulations. Our modeling framework captures the competition between regulatory proteins and, more importantly, the dynamics of regulatory events occurring within individual cells.
Subject: Optimization and search
Presenting Author: Irena Lanc, University of Notre Dame
Author(s):
Scott Emrich, University of Notre Dame, United States
Abstract:
The ability to extract meaningful results from genomic data depends on access to a well-constructed genome assembly. However, due to limitations of time and money, manual finishing and validation are sometimes neglected. This issue is compounded by the fallibility of assemblers, which struggle with repeats, chimeric reads and contaminants and can vary widely in the caliber of their assemblies. In addition, some assemblers fare better than their competitors on specific datasets. Consequently, multiple assemblies are often generated to compensate for these issues. These assemblies can be the result of using different software, varying parameters, or incorporating new libraries. Deciding which of them is best can be daunting, as heuristics like N50 or number of contigs are too broad to adequately capture the overall quality of the resulting assemblies. HIPPO is designed to alleviate these problems by automatically merging multiple assemblies based on their sequence quality. The measure of quality derives from our previous work in assembly validation, where we gauged the correctness of an assembled sequence by generating a vector of quantifiable values for consecutive windows across the assembly. The two-step approach begins with a full genome alignment, which is fed into a bipartite matching to identify candidate sections for improvement. These sections are woven together by a simplified de Bruijn path process to produce a meta-assembly consisting of only the highest-quality sections. The tool can extend sequences, fill gaps, and replace dubious sections of sequence. HIPPO allows users to combine multiple assemblies and incorporate high-quality supplemental regions such as fosmids.
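The N50 heuristic mentioned above is easy to compute, which is part of why it is widely used despite being a coarse summary. The sketch below uses the common definition (the length of the contig at which the running total of contigs, sorted from longest to shortest, first reaches half of the total assembly length). The contig lengths are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly: the length L such that contigs of
    length >= L together cover at least half of the total assembled bases.
    Returns 0 for an empty assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0


# Two hypothetical assemblies with the same N50 but very different tails.
assembly_a = [100, 100, 100, 100]
assembly_b = [100, 100] + [10] * 20
print(n50(assembly_a), n50(assembly_b))
```

Both example assemblies report an N50 of 100 even though the second is far more fragmented, and the statistic says nothing about whether any contig is correct. That gap is what a quality-based comparison aims to close.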
Subject: Graph Theory
Presenting Author: Ryan Langendorf, University of Colorado, Boulder
Abstract:
Network alignment has proven insightful in settings ranging from protein interaction networks to ontology matching. In ecology, the ability of networks to holistically describe systems and their dynamics has proven useful to community ecology as well as its applications to conservation biology and political ecology. However, analyses of networks have historically entailed correlating theoretical properties with ecological ones, such as the relationship between complexity and stability. While conceptually informative, such approaches have a limited ability to mechanistically account for these relationships and only a superficial means by which to compare networks. In light of this, a flow algorithm was developed to align networks by conceptually spreading water along interactions, so that flow accumulates within species. Through the distributions of accumulated flow at nodes, networks of varying size and constitution can be directly and quantitatively compared, addressing the relationship between a system’s structure and its emergent ecological properties. Compared to singular, often static metrics, this technique allows indirect interactions and network dynamics to be captured. Moreover, systems can be compared to themselves through time, allowing questions of complexity, scale-dependence, and thresholds to be addressed more robustly by quantifying the structural differences underlying temporal changes in a system. This technique allows the functional importance of a system’s structure, be it ecological or otherwise, to be approached more mechanistically.
Subject: Optimization and search
Presenting Author: Mario Latendresse, SRI International
Abstract:
RouteSearch is a new Web-accessible metabolic engineering tool available as part of BioCyc since March 2013. It enables searching for optimal linear metabolic routes between a start compound and a goal compound. The optimality criteria are the weighted sum of the costs of the reactions used and the weighted sum of the costs of atoms that are lost in the transformation from the start compound to the goal compound. These costs and the number of minimum-cost routes to find and display are user-selectable. The routes are displayed as a series of connected enzymatic reactions including chemical structures of the substrates, where the conserved moieties within each metabolite are shown using colors. By using a graphical interface, the user can also easily identify each atom conserved or lost along each route. RouteSearch uses two algorithms to search for optimal routes: the Bellman-Ford algorithm, which finds the least-cost route, and a more general branch-and-bound search algorithm, which can find several minimum-cost routes. RouteSearch also uses a preferred organism to search -- a chassis in metabolic engineering terms, such as E. coli -- and a library of additional reactions, which is the MetaCyc database. The cost of using a reaction from MetaCyc is usually set higher than using a reaction from the chassis. In this way, new and more productive metabolic routes can be found for the chassis by adding reactions from MetaCyc. We will also briefly describe the computation of atom mappings for MetaCyc. Atom mappings are used by RouteSearch to track the atoms conserved and lost in a route.
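The least-cost search can be sketched with a textbook Bellman-Ford implementation in Python. This is not RouteSearch's code: the compound names, the reaction list, and the costs (1 for a chassis reaction, 2 for a library reaction) are hypothetical, and the atom-loss term of the cost is omitted.

```python
def bellman_ford_route(reactions, start, goal):
    """Least-cost route from start to goal.

    reactions: list of (substrate, product, cost) tuples with
    non-negative costs (an assumption of this sketch).
    Returns (total_cost, [compounds along the route]) or None.
    """
    nodes = {start, goal}
    for substrate, product, _ in reactions:
        nodes.update((substrate, product))
    dist = {n: float("inf") for n in nodes}
    prev = {}
    dist[start] = 0.0
    # Relax every reaction up to |V| - 1 times; stop early once stable.
    for _ in range(len(nodes) - 1):
        changed = False
        for substrate, product, cost in reactions:
            if dist[substrate] + cost < dist[product]:
                dist[product] = dist[substrate] + cost
                prev[product] = substrate
                changed = True
        if not changed:
            break
    if dist[goal] == float("inf"):
        return None
    # Rebuild the route by following predecessors back to the start.
    route = [goal]
    while route[-1] != start:
        route.append(prev[route[-1]])
    route.reverse()
    return dist[goal], route


CHASSIS_COST, LIBRARY_COST = 1, 2  # hypothetical cost settings
reactions = [
    ("S", "X", CHASSIS_COST),
    ("X", "Y", CHASSIS_COST),
    ("Y", "Z", CHASSIS_COST),
    ("Z", "G", CHASSIS_COST),
    ("X", "G", LIBRARY_COST),  # shortcut borrowed from the reaction library
]
result = bellman_ford_route(reactions, "S", "G")
print(result)
```

In this toy network the route that borrows one higher-cost library reaction is still cheaper overall than the four-step chassis-only route, mirroring how added reactions can yield more productive routes.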
Subject: Machine learning, inference and pattern discovery
Presenting Author: Tiffany Liang, San Diego State University
Author(s):
Jason Rostron, San Diego State University, United States
Jeremy Frank, San Diego State University, United States
Daniel Cuevas, San Diego State University, United States
Anca Segall, San Diego State University, United States
Forest Rohwer, San Diego State University, United States
Robert Edwards, San Diego State University, United States
Daniel Garza, Evandro Chagas Institute, Brazil
Abstract:
Viruses are the most diverse biological entities on earth. However, they also have the least characterized genetic, taxonomic, and functional diversity. In metagenomic analyses of viral communities from various environments, most sequences are unrelated to any known sequences; for example, about 90% of the viral sequences found in marine environments are unknown. The goal of this study is to characterize the function of unknown viral genes and identify those that alter host metabolism.
Viral metagenomes were collected from filtered seawater from Pacific coral reefs and sequenced by Roche 454 technology, and open reading frames were predicted from those sequences. Genes were synthesized and cloned into E. coli. These clones have been characterized in several different ways. To investigate the clones that affected metabolic processes, the metabolites were identified by gas chromatography-coupled time-of-flight (GC/TOF) mass spectrometry. In total, 423 metabolites were found; however, only 15% of those matched known compounds. We are identifying the specific metabolites produced or affected by the overexpression of phage proteins to predict physiological roles for these proteins that can then be tested experimentally. We have also analyzed metabolic changes associated with expression of proteins with known functions that are involved in central metabolism; clustering of these changes allows us to predict functions for other proteins. We are building a systematic analysis pipeline that can process metabolomics data for downstream analysis of metabolomics and related data sets.
Subject: Other
Presenting Author: Ettie Lipner, University of Colorado-Denver/National Jewish Health
Author(s):
Yaron Tomer, Mount Sinai, United States
Janelle Noble, Children's Hospital Oakland Research Institute, United States
Cristina Monti, University of Pavia, Italy
Barbara Corso, University of Pavia, Italy
John Lonsdale, National Disease Research Interchange, United States
Abstract:
We conducted a linkage analysis to identify susceptibility loci for microvascular complications of type 1 diabetes (T1D). Using 402 SNP markers, our analysis used the phenotypes: 1) any microvascular complication, 2) retinopathy, 3) nephropathy, 4) neuropathy. When using “any complication” as the phenotype, we identified two linkage peaks: one located at HLA (HLOD=2.90) and another, novel locus telomeric to HLA (HLOD=3.13). These peaks were also evident when retinopathy was the phenotype (HLODs of HLA=2.69, telomeric locus=3.30). We did not find evidence for linkage for nephropathy or neuropathy. Because previously published evidence suggests that DRB1 locus alleles affect the expression of complications, we stratified on families whose probands were positive for DRB1*03:01 and DRB1*04:01. Using the phenotype “any complication” and including only DRB1*03:01-positive families, the HLA peak decreased (HLOD=1.82) from the unstratified analysis (HLOD=2.90), and a peak centromeric to HLA appeared (HLOD=1.27). When stratifying on DRB1*04:01-positive families, the linkage evidence for HLA (HLOD=3.83) and the telomeric locus (HLOD=3.69) went up, despite the drop in sample size with stratification. When using the phenotype retinopathy, we observed the same increase in linkage peaks (HLOD at HLA=3.62, telomeric locus=3.76). These observations suggest that DRB1*04:01 interacts with the telomeric locus to produce susceptibility to complications. Simultaneously, the drop in linkage evidence for DRB1*03:01 confirms a protective effect seen in our previously reported analysis (Lipner et al, 2013). Based on large differences in the HLOD scores, we argue that the DRB1*03:01-positive and DRB1*04:01-positive groups are genetically distinct, a finding in accordance with the observation that DRB1*03:01 is protective for retinopathy.
Subject: Machine learning, inference and pattern discovery
Presenting Author: Fayyaz Minhas, Colorado State University
Author(s):
Asa Ben-Hur, Colorado State University, United States
Abstract:
We have developed a novel partner-specific protein-protein interaction site prediction method called PAIRpred that uses the sequences and unbound structures of two proteins in a complex, and is based on support vector machines (SVMs). Unlike most existing machine learning methods for this problem, PAIRpred uses information extracted from both proteins in a complex using pairwise kernels to predict inter-residue contacts. Due to its partner-specific nature, PAIRpred presents a more accurate model of protein binding and is able to generate more detailed predictions. In order to better model the problem, we present an extension of SVMs that can capture the pairwise constraints that two distant residues in a protein cannot simultaneously interact with the other protein in a complex. We demonstrate PAIRpred's performance on Docking Benchmark 4.0 and recent CAPRI targets. We have compared PAIRpred's performance to existing methods such as ZDOCK, PPiPP and PredUS. PAIRpred offers state of the art accuracy in predicting binding sites at the protein level as well as inter-protein residue contacts at the complex level. We have studied the contribution of different sequence and structure features along with the effect of binding-associated conformational change on prediction accuracy. As an illustration of potential applications of PAIRpred, we have used it to analyze the nature and specificity of the interface in the interaction of human ISG15 protein with NS1 protein from influenza A virus. More information on PAIRpred is available at: http://combi.cs.colostate.edu/supplements/pairpred/.
Subject: Machine learning, inference and pattern discovery
Presenting Author: Alexandra Mirina, Albert Einstein College of Medicine of Yeshiva University
Author(s):
Lirong Wei, Albert Einstein College of Medicine of Yeshiva University, United States
Matthew Scharff, Albert Einstein College of Medicine of Yeshiva University, United States
Aviv Bergman, Albert Einstein College of Medicine of Yeshiva University, United States
Abstract:
B-cell Chronic Lymphocytic Leukemia (CLL) is the most common adult leukemia in the western hemisphere. It is characterized by an excessive proliferation of one B-cell clone over others. Previous studies suggest that antibody selection by some unknown antigen(s) plays a significant role in CLL. Therefore, the analysis of mutations in immunoglobulin (Ig) variable region sequences derived from B-cells of CLL patients may give us valuable insights into the disease mechanism. However, this is not a trivial task, as it requires methods to detect selection pressure in antibodies. For this purpose we developed a novel approach for detecting selection in Ig. Our methodology is based on comparing the in vivo data in question, obtained from B-cells of a CLL patient, to a reference dataset. We construct such a dataset from in vitro data obtained from a biochemical assay, which allows us to exclude the possibility of selection in the reference sequences. We applied our method to datasets of two different human immunoglobulin heavy chain (IGHV) variable regions: IGHV4-34 and IGHV3-23. Comparison of CLL B-cell sequences of these regions to sequences of healthy donors’ B-cells, which were under antigen selection, indicated a correlation, suggesting the presence of antibody selection in CLL. Interestingly, different IGHV regions of CLL B-cells correlate with different subtypes of healthy donors’ B-cells, which suggests that the mechanism of the disease may vary between patients depending on the IGHV regions in the prevailing B-cell clones.
Subject: Machine learning, inference and pattern discovery
Presenting Author: Sean Mooney, Buck Institute for Research on Aging
Author(s):
Iddo Friedberg, Miami University, United States
Abstract:
A major challenge of the post-genomic era is understanding the function and disease associations of gene products. We are discovering new proteins far faster than we can characterize them experimentally. Most genome projects and derived databases rely fully on automated functional annotations, making the increase in annotation accuracy and coverage a prime goal for annotation algorithms. Understanding the accuracy of these function prediction algorithms is of primary importance to the process of translating sequence data into biologically meaningful information. Here we present the results of the first Critical Assessment of Function Annotations (CAFA) held during 2010-2011 and the challenge of the second CAFA experiment underway now. Thirty-four research groups worldwide participated in the first experiment, employing over 50 function annotation algorithms. The prediction methods were assessed using ROC curves, precision/recall curves, and variations on semantic similarity as applied to the Gene Ontology. During this presentation, I will discuss the results of the first CAFA experiment, the challenges we faced in assessing the results, and the future of CAFA. I will also describe the new experiment which will include biological process, molecular function, cellular component and human disease prediction tracks. Finally, I will describe ways in which you, the community, can participate.
Subject: Machine learning, inference and pattern discovery
Presenting Author: Taj Morton, Oregon State University
Abstract:
High-throughput sequencing protocols are now able to provide vast and detailed quantitative data on RNA polymerase II transcription start sites (TSSs). Combining these datasets with machine learning techniques can provide valuable new insights into the regulation and production of these mRNA transcripts. Additionally, these models can be used to build high-accuracy predictive models which rely solely on sequence content. The use of sequence content-based models is attractive because they can predict transcriptional events even in species with sparsely annotated genomes or those which lack genome-wide experimental data. The resulting models can be used to suggest or infer regulatory mechanisms which may control transcript production. Here we provide an overview of a sequence content-centric approach to the use of machine learning tools in TSS data analysis. We present a machine learning model for highly accurate TSS prediction, present preliminary results on an mRNA recapped product classifier in Arabidopsis thaliana, and show how the Elastic Nets technique can be used to improve the interpretability of TSS prediction models.
Subject: Metagenomics
Presenting Author: Amir Muhammadzadeh, University of Saskatchewan
Author(s):
Vanessa Pittet, University of Saskatchewan, Canada
Stephen Johnson, University of Saskatchewan, Canada
Anthony Kusalik, University of Saskatchewan, Canada
Abstract:
The main goal of metagenomics is to characterize the structure and dynamics of communities of non-clonal microorganisms. One step in metagenomic analysis is reconstruction of genomes by assembling sequence reads. Unlike a traditional sequencing project, which aims to determine the complete genome sequence of a single organism, metagenomic analyses require thousands of (partial) genomes from a microbial community to be sequenced and assembled simultaneously. Over the past few years, different methods have been developed or revised specifically for the de novo assembly of next generation sequencing data; however, there are only a few tools that specifically focus on metagenomic data. Given the considerable difficulties involved in assembling such data, including inadequate and partial sampling of some genomes, different organism compositions and relationships, and the presence of repetitive fragments, reconstructing the full metagenome is a very demanding task.
Here we provide an evaluation of current de novo short read assembly tools on metagenomic data. We test a number of state-of-the-art assemblers that were designed specifically for metagenomic data, as well as some that were not. The accuracy, performance, and computational requirements of these assemblers were evaluated using three datasets of simulated sequence reads, each having a different community complexity (low, medium, or high), as well as real reads obtained from the sequencing of environmental samples using Ion Torrent technology. Our evaluation of assemblers suggested that although no single assembler performed best on all of our criteria, MIRA slightly outperformed the other programs.
Subject: Qualitative modeling and simulation
Presenting Author: Nirvana Nursimulu, University of Toronto
Author(s):
Stacy Hung, University of Toronto, Canada
Melissa Chiasson, NIAID, National Institutes of Health, United States
James Wasmuth, University of Calgary, Canada
Michael Grigg, NIAID, National Institutes of Health, United States
John Parkinson, University of Toronto, Canada
Abstract:
Estimated to infect at least a third of the world’s population, the Apicomplexan parasite Toxoplasma gondii represents a major threat to immunocompromised individuals and pregnant women, especially due to the limited efficacy of current therapeutic interventions. Since metabolism plays an essential role in providing energy and the basic building blocks required for growth, drug-development programs are now focussing more on targeting metabolic enzymes. We hypothesize that metabolic potential plays a key role in determining the virulence of different strains. Given the often nonintuitive relationships between enzymes and pathways, constraint-based models such as flux balance analysis (FBA) have emerged as indispensable tools to study the organization and operation of metabolic networks. Here we present a novel application of FBA that leverages microarray data to explore the impact of differential enzyme expression observed between virulent and avirulent strains of T. gondii. Our model correctly predicts the increased growth rate of the more virulent type I strain, relative to type II; further analysis predicts the increase in growth rate to result from increased energy production via upregulation of the glycolytic, pentose phosphate and TCA-cycle pathways. These findings highlight a regulatory route which, in addition to conferring growth rate plasticity, may impact the parasite’s outstanding ability to infect a broad range of hosts. Moreover, drug assays confirm strain-specific sensitivities of several reactions, as predicted by in silico single knock-out experiments. This study demonstrates how expression data can be integrated into a model to give robust strain-specific predictions.
Subject: Metagenomics
Presenting Author: Anelia Horvath, McCormick Genomics and Proteomics Center
Author(s):
Mercedeh Movassagh, McCormick Genomics and Proteomics Center, United States
Kamran Kowsari, McCormick Genomics and Proteomics Center, United States
Ali Seyfi, McCormick Genomics and Proteomics Center, United States
Maria Kokkinaki, Georgetown University, United States
Nady Golestaneh, Georgetown University, United States
Abstract:
Among the major mechanisms that affect the splicing process are nucleotide changes, which disrupt or create binding sites for the spliceosome components. Most of the currently available tools for splice variants annotation employ preexisting knowledge, and may miss variants acting through unknown mechanisms. We have developed a novel, experimentally based, directly observational approach to identify potential cis-acting splice modulating variants from second generation sequencing RNA datasets. Our approach is based on screening for co-existence of SNP-call and junction abrogation within a single uninterrupted sequencing read, which represents a copy of the original RNA molecule. Our strategy employs the assumption that SNPs frequently found on reads spanning into the intron (vs continuing in the next exon) can indicate splice-modulating potency of the nucleotide change. Applying our pipeline on five in-house transcriptomes highlighted known and novel splice-modulating SNPs, located both within and outside canonical splice sites. Selected alternatively spliced alleles have been confirmed through wet lab studies.
This is the first experimentally based, high-throughput pipeline to identify cis-acting splice-modulating SNPs; it highlights novel splice-implicated genetic variants and provides an innovative strategy to revisit the splice-modulating potential of SNPs located in consensus sequences and traditionally considered critical for splicing regulation.
Subject: Text Mining
Presenting Author: Natalya Panteleyeva, University of Colorado
Author(s):
Kevin Cohen, University of Colorado, United States
Abstract:
A corpus of clinical data was used to investigate the hypothesis that there are correlations between pairs of event types and the temporal links between them. A corpus of about 98,000 words that had been annotated with events, event types, TIMEX3 expressions, and temporal links was examined for such associations. It was found that in fact most pairs of event types show a strong preference for or against a particular type of temporal link. It was also noted that all possible pairs of event types occur even in this relatively small corpus. The preference of specific pairs of event types for particular types of temporal links has implications for natural language processing systems, including establishing baselines for their performance and providing a priori knowledge that can be used to inform the construction of both rule-based and machine-learning-based systems for labeling temporal links in clinical documents. More basic questions about the linguistic expression of temporal relations in clinical text are examined, such as the extent to which they are sequential or not and the extent to which they are intersentential versus intrasentential. Whether surface linguistic cues from morphology, syntax, and lexicon enhance accuracy in establishing temporal link types is addressed.
Subject: Metagenomics
Presenting Author: John Parkinson, Hospital for Sick Children
Abstract:
Whole-microbiome gene expression profiling (‘metatranscriptomics’ or ‘RNA-seq’) has emerged as a powerful means of gaining a mechanistic understanding of the complex inter-relationships that exist in microbial communities. However, due to the inherent complexity of microbial communities and the lack of a comprehensive set of reference genomes, currently available computational tools for metatranscriptomic analysis are limited in their ability to functionally classify and organize these sequence datasets. To meet this challenge we have been developing methods that combine accurate transcript annotation with systems-level functional interrogation of metatranscriptomic datasets. As part of these methods, we present GIST (Generative Inference of Sequence Taxonomy), which combines several statistical and machine learning methods for compositional analysis of both nucleotide and amino acid content with the output from the Burrows-Wheeler Aligner to produce robust taxonomic assignments of metatranscriptomic RNA reads. In addition to identifying taxon-specific pathways within the context of a pan-microbial functional network, linking taxa with specific functions in a microbiome will produce a deeper understanding of how their loss or gain alters microbiome functionality. Applied to real as well as synthetic datasets, the latter generated using an in-house simulation tool termed GENEPUDDLE, GIST demonstrates improved performance in taxonomic assignments over existing methods.
Subject: Metagenomics
Presenting Author: Robin Paul, Arizona State University
Author(s):
Jason Steel, Arizona State University, United States
Kristina Buss, Arizona State University, United States
Petra Fromme, Arizona State University, United States
Abstract:
Leptolyngbya Heron Island (L.HI) is a newly isolated strain of cyanobacteria from the Great Barrier Reef in Australia. This strain exhibits chromatic acclimation, in which the cyanobacteria selectively express light-harvesting proteins according to the wavelength of light they are exposed to. To explore this phenomenon, sequencing the genome of the cyanobacteria is essential. However, this strain has a strong symbiosis with other bacteria in its natural habitat, making it impossible to obtain axenic cultures. The L.HI genome was sequenced using Illumina, and reads were assembled using the ABySS software package. Initial %GC analysis of the assembled scaffolds showed multiple peaks, confirming the presence of heterotroph scaffolds. Tetranucleotide frequencies were calculated, followed by principal component analysis, in which the cyanobacterial scaffolds clustered together. The selection of the cyanobacterial scaffolds was further confirmed with a BLAST-based procedure that selected only scaffolds containing a gene that matched some gene in Leptolyngbya sp. PCC 7375, a closely related cyanobacterium. This method yielded exactly the same set of scaffolds as the tetranucleotide analysis, and %GC analysis of the selected scaffolds yielded only a single peak. Genome annotation was carried out using the NCBI Prokaryotic Genomes Annotation Pipeline, followed by validation against the NCBI non-redundant database. This study shows that a complete genome sequence can be obtained for a prokaryote that has a very strong symbiotic relationship with other, contaminating bacterial species in nature by combining tetranucleotide frequency analysis, BLAST, and %GC results. NCBI scaffold accession nos. AWNH01000001-AWNH01000119.
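The two composition measures used for binning can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not the authors' pipeline: the function names and the toy sequences are hypothetical, windows containing non-ACGT characters are simply skipped, and the principal component analysis step is left out.

```python
from collections import Counter
from itertools import product

# All 256 possible tetranucleotides in a fixed order, so that vectors
# computed for different scaffolds are directly comparable.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def gc_percent(seq):
    """Percentage of G and C among the unambiguous bases of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def tetranucleotide_frequencies(seq):
    """Relative frequency of each tetranucleotide (256-element list)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[t] for t in TETRAMERS)  # ignores windows with N etc.
    return [counts[t] / total if total else 0.0 for t in TETRAMERS]


# Toy scaffolds (hypothetical sequences, far shorter than real ones).
scaffolds = {
    "scaffold_1": "ATGCGCGGCCGCATGCGGCC",
    "scaffold_2": "ATATTTAAATTATAAATTTA",
}
for name, seq in scaffolds.items():
    vector = tetranucleotide_frequencies(seq)
    print(name, round(gc_percent(seq), 1), len(vector))
```

Each scaffold is reduced to one %GC value and one 256-dimensional frequency vector; the vectors could then be passed to any PCA implementation to look for clusters.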
Subject: Graphics and user interfaces
Presenting Author: Megan Pirrung, University of Colorado
Abstract:
Sequencing technologies are getting cheaper and producing vast amounts of data, especially in the field of microbial ecology. Proper visualization of biological data is key to informative analysis and insight. Data analysis by users through informative, useful, and responsive visualizations is key to harnessing the true potential of big data. We propose an experiment that will help to determine which types of visualization techniques are most informative for microbial ecology data in both public and expert scientific audiences. To perform this experiment on a large number of subjects in a systematic way, we have created a modular system with easily substituted visualization methods. We have created dynamic visualizations that parallel the visualizations available in QIIME using the d3 (Data Driven Documents) JavaScript library, and a visualization-testing framework that will be used to display the results of the American Gut Project. A visualization technique for one particular analysis as performed in QIIME will be randomly selected from the set of visualizations appropriate for the data selected and shown to the user. The user will also be presented with a questionnaire that will let us determine which visualizations allow users to answer the most questions correctly. We expect that the scores will indicate that certain visualization techniques are more appropriate for certain types of data, and that certain visualizations may be more informative for one audience (public or scientific) than for the other.
Subject: Networking, web services, remote applications
Presenting Author: Natapol Pornputtapong, Chalmers University of Technology
Author(s):
Jens Nielsen, Chalmers University of Technology, Sweden
Abstract:
In recent years, human tissue-specific genome-scale metabolic (h-tGEM) modeling has provided much new information about human metabolism, in particular in connection with disease development. In order to efficiently manage and utilize this kind of data, we built the Human Metabolic Atlas (HMA) website as an online resource to provide comprehensive human metabolic information as models and a database for further specific analysis, as well as to communicate with the wider research community. The metabolic models are illustrated using a web-based metabolic map visualization system and provided in SBML file formats, which can be opened in most pathway and analysis software. With the visualization system, a summary of the provided h-tGEMs is overlaid on KEGG metabolic pathway maps with a zoom/pan user interface. Besides the models, users can easily access human reaction data, gathered from all h-tGEMs, through a data query interface. The reaction data are standardized and organized by an internally developed object-oriented database management system. Connecting to the database, users can use the provided web interface to easily retrieve reaction data in JSON and CSV format with specific keywords or by gene, protein, compound, and cross reference. This online resource is a useful tool for studying human metabolism at the specific cell level, the organ level, and for the overall human body.
Subject: Networking, web services, remote applications
Presenting Author: Steven Reisman, Loyola University of Chicago
Author(s):
George Thiruvathukal, Loyola University of Chicago, United States
Konstantin Läufer, Loyola University of Chicago, United States
Abstract:
The rapid mutation rates of retroviruses such as HIV prove challenging when developing molecular therapies. RNA interference (RNAi) has recently been developed as a means to destroy a known targeted sequence, and shows potential as an HIV-1 therapy. The designed small interfering RNA (siRNA) molecule used for the interference mechanism must be highly accurate, as it will only bind to a target with near-perfect complementarity, which current research suggests must be 18-25 nucleotides in length. Therefore, in order to avoid viral escape, the siRNAs must target the most highly conserved, non-variant regions. Identifying such regions necessitates a multifaceted approach that considers functional and structural constraints. We have developed a data repository to facilitate such analyses. Publicly available sequence data are parsed and exposed through a RESTful web service, which lets users query the data by several parameters, including country of isolation, collection date, and gene. We are now able to observe how sequence conservation varies with respect to distribution throughout the world. Scripts have been developed to align user-selected sequences so that the most non-variant regions can be identified by a Longest Common Subsequence (LCS) algorithm. In doing so, we have provided a method for future research to identify potential RNAi targets for specific subpopulations rather than attempting to find a non-specific global solution. While focusing here on HIV, the tools developed can be applied to any viral species of interest.
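For readers unfamiliar with the Longest Common Subsequence step, a minimal Python sketch of the classic pairwise dynamic-programming algorithm follows. It is a textbook version, not the authors' scripts: the function name and the toy sequences are hypothetical, and how the abstract's pipeline extends the idea to many aligned sequences is not specified here.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of strings a and b."""
    m, n = len(a), len(b)
    # table[i][j] = LCS length of a[:i] and b[:j]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Trace back from the bottom-right corner to recover the characters.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


# Toy isolates (hypothetical sequences, not real HIV data).
isolate_1 = "AGGTAB"
isolate_2 = "GXTXAYB"
print(longest_common_subsequence(isolate_1, isolate_2))
```

Note that a subsequence need not be contiguous, so a candidate siRNA target would still have to be checked for an uninterrupted conserved stretch of suitable length.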
Subject: Data management methods and systems
Presenting Author: Diego Salvanha, University of Sao Paulo and Institute for Systems Biology
Author(s):
Ricardo Vencio, University of Sao Paulo, Brazil
Nitin Baliga, Institute for Systems Biology, United States
Abstract:
Many analysis and visualization tools have emerged as a result of the “Big Data” revolution in the biological sciences. Unfortunately, the enthusiasm to mine these data sets led to the development of highly-specialized software with few resources dedicated to linking diverse tools and datasets. In systems biology, investigators typically want to observe biological systems across scales (e.g. molecular types or time scales). Since each analysis in this workflow typically involves its own custom-designed software, a centralized data integration tool would be invaluable. Here we describe the WebGoose, a data manager designed to integrate existing software and experimental data on the web.
WebGoose is a browser-independent, HTML5-compliant data manager integrated into the Java-based Gaggle framework (http://gaggle.systemsbiology.net/). Like Gaggle, WebGoose is a light-weight data service that provides interoperability between web applications. WebGoose implements two distinct but related modules: a front-end interface that allows data source/target selection and a back-end module that is responsible for transferring data to Gaggle. Once an independent web resource is integrated into Gaggle using WebGoose, it becomes a full-fledged Gaggle goose -- automatically receiving the capability to share data with all other developed geese. This allows third-party web applications to access Gaggle-enabled databases (such as KEGG or STRING) as well as the suite of Gaggle-enabled software (such as R and MeV) with relatively little configuration.
The WebGoose makes it easy to integrate diverse software applications on the web. The application is open-source and can be downloaded at http://labpib.fmrp.usp.br/~dmartinez/webgoose
Subject:
Presenting Author: Swapna Seshadri, Research Institute, Hospital for Sick Children
Author(s):
Tim Gilberger, McMaster University, Canada
Abstract:
Protein palmitoylation is the only reversible post-translational mechanism utilising a hydrophobic anchor that is known to dynamically regulate a protein’s function by influencing its subcellular localization, stability, and interactions. This process is ubiquitous in eukaryotes, and a recent study uncovered hundreds of palmitoylated proteins in P. falciparum. Therefore, characterizing the suite of enzymes catalyzing this process (Palmitoyl Acyl Transferases, PATs) in apicomplexan parasites is essential for understanding various aspects of parasite biology. We conducted a comprehensive survey to identify and classify PATs from the complete genomes of 16 parasitic apicomplexans and 2 closely related free-living protists (ciliates). Using HMMER, 159 and 138 PATs were identified in apicomplexans and ciliates, respectively. Classification is confounded by the low resolution stemming from the short (~50 aa) conserved catalytic domain, combined with the presence of ankyrin repeats in many sequences. Analysis revealed a ~170 aa region with sufficient information to resolve the PATs into 7 major clades and 14 sub-clades, using Bayesian and maximum likelihood phylogenetic methods. The sub-clades demonstrate distinct patterns of sequence conservation and indels, providing molecular signatures for possible sub-functionalisation. A structural model of the catalytic domain was generated, providing a molecular perspective on these signatures. Overall, 5 sub-clades are Apicomplexa-specific, containing members localized to the rhoptries and the inner membrane complex, organelles unique to Apicomplexa that are involved in host cell invasion. Further, 2 clades and 2 sub-clades contain yeast and human orthologs, indicating a role in the secretory pathway. In summary, apicomplexans have evolved PATs to serve as an integral part of the biological machinery required to facilitate their parasitic lifestyle.
Subject: Other
Presenting Author: Meenakshi Sharma, University of Houston
Author(s):
Yuriy Fofanov, University of Texas Medical Branch, United States
Abstract:
Disruption of methylation patterns has been associated with genomic instability and is a hallmark of cancer. To identify “methylation signatures” in a genome, affinity-based methods such as immunoprecipitation (IP) combined with high-throughput sequencing (HTS) are used in comparative and genome-wide studies. Mapping the reads (subsequences) generated by HTS to the human genome results in large numbers of alignments in repeat-rich regions. These repetitive elements introduce bias and interfere with the accurate identification of differentially methylated regions (DMRs). We present novel computational approaches to detect DMRs in both unique and repetitive segments of the entire genome.
To eliminate or minimize the bias introduced by repetitive DNA regions, which cause coverage spikes (pile-ups) that affect coverage analysis, we created “sequence length specific maps” of all the repeatable and unique locations in the human genome. This was accomplished by the “disassembly” of the human genome into all possible n-mers equal to the read length and the alignment (mapping) of all the subsequences present in more than one copy in the reference genome. All the identified unique locations can now be used to estimate the differences in read coverage among the samples.
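The "disassembly" idea above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions that are not from the abstract: it counts exact n-mer matches on the forward strand only (no reverse complement, no mismatches), and the function name is hypothetical.

```python
from collections import Counter


def unique_positions(genome: str, read_len: int) -> list[bool]:
    """Flag each start position whose read_len-mer occurs exactly once.

    Simplifying assumption: exact forward-strand matches only.
    """
    n_starts = len(genome) - read_len + 1
    # Disassemble the genome into every overlapping n-mer and count copies.
    counts = Counter(genome[i:i + read_len] for i in range(n_starts))
    # A position is "unique" when its n-mer was seen exactly once.
    return [counts[genome[i:i + read_len]] == 1 for i in range(n_starts)]
```

Coverage comparisons between samples could then be restricted to positions flagged `True`, while repeated positions are handled separately.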
Log-transformations, MA-plots, and Z-scores were used to identify several genomic regions, containing at least 44 genes, that were hyper-methylated in dexamethasone (dex)-resistant and hypo-methylated in dex-sensitive Acute Lymphoblastic Leukemia cell lines. Our novel approach enables researchers to detect methylation alterations on a global scale and to select candidate genes for locus-specific studies.
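As a rough illustration of the statistics named above, the sketch below computes MA-plot coordinates and Z-scores for per-region coverage in two samples. The pseudocount, the use of base-2 logarithms, and the function names are assumptions of this example; the abstract does not give the exact transformation or thresholds.

```python
import math
from statistics import mean, pstdev


def ma_values(cov_a, cov_b, pseudocount=1.0):
    """Return (M, A) lists: log2 ratio and mean log2 coverage per region."""
    m_vals, a_vals = [], []
    for x, y in zip(cov_a, cov_b):
        # The pseudocount (assumed) avoids log(0) in uncovered regions.
        lx = math.log2(x + pseudocount)
        ly = math.log2(y + pseudocount)
        m_vals.append(lx - ly)
        a_vals.append((lx + ly) / 2.0)
    return m_vals, a_vals


def z_scores(values):
    """Standardize values; returns zeros when there is no variation."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]
```

Regions whose M value has a large positive or negative Z-score would be candidates for hyper- or hypo-methylation, subject to whatever cutoff the analyst chooses.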
Subject: Machine learning, inference and pattern discovery
Presenting Author: Evgeny Shmelkov, New York University School of Medicine
Author(s):
Emeline Maillet, New York University School of Medicine, United States
Arsen Grigoryan, New York University School of Medicine, United States
Timothy Cardozo, New York University School of Medicine, United States
Abstract:
Historically, the mechanism of drug action has been conceptualized via the drug’s interaction with a single cognate receptor, agnostic to the genetic expression of the latter. However, the entire pharmacologic activity of a drug, including both its beneficial and adverse effects, derives also from its “off-target” actions (polypharmacology). Additionally, the expression pattern of all the drug’s receptors is an essential factor that localizes the effect of a drug to a particular tissue. Thus, the true molecular signature of a drug consists of a complete vector of all its physiologically relevant receptor interactions across the spectrum of all receptors expressed in various tissues (histoReceptOmic signature). Here, we define a novel histoReceptOmic signature for the atypical pharmacologic action (“atypia”) of the antipsychotic drug clozapine, i.e. its beneficial effects that the typical antipsychotic drug chlorpromazine does not exhibit. Specifically, we derived the atypia signature by subtracting the signatures of chlorpromazine and clozapine, each obtained by integrating drug:receptor affinities with receptor gene-expression data. The generalized extreme Studentized deviate test was used to identify only physiologically significant tissue-specific drug:receptor interactions. Our results suggest that the common antipsychotic effects of clozapine and chlorpromazine are mediated through the 5-HT2a and 5-HT2c receptors in the prefrontal cortex and caudate nucleus, respectively, histamine H1 receptors in the superior cervical ganglion, and muscarinic acetylcholine M3 receptors in the prefrontal cortex. In contrast, targets exclusive to clozapine are dopamine D4 receptors in the pineal gland and muscarinic acetylcholine M1 receptors in the prefrontal cortex. These results provide novel perspectives on the mechanisms of action of antipsychotics and on drug discovery in schizophrenia.
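The signature construction described above might be sketched as follows. The abstract does not say how affinities and expression levels are integrated, so this example assumes a simple product per receptor and tissue; the function names are hypothetical, and the outlier-testing step (generalized ESD) is not shown.

```python
def signature(affinity, expression):
    """Build a {(receptor, tissue): score} signature for one drug.

    affinity:   {receptor: affinity score for the drug}
    expression: {receptor: {tissue: expression level}}
    Assumption: score = affinity * expression level.
    """
    return {
        (receptor, tissue): affinity[receptor] * level
        for receptor, tissues in expression.items()
        if receptor in affinity
        for tissue, level in tissues.items()
    }


def subtract_signatures(sig_a, sig_b):
    """Element-wise difference sig_a - sig_b over the union of entries."""
    keys = set(sig_a) | set(sig_b)
    return {k: sig_a.get(k, 0.0) - sig_b.get(k, 0.0) for k in keys}
```

Entries with a large difference would then be the candidates passed to a significance test such as the one named in the abstract.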
Subject: Metagenomics
Presenting Author: Genivaldo Silva, SDSU
Author(s):
Abstract:
Microbes are more abundant than any other organisms, and it is important to understand who they are and what they are doing. In many environments, a large majority of the members of the microbial community cannot be cultured. Metagenomics uses high-throughput sequencing, a fast and cheap sequencing method provided by next-generation sequencing technologies. One of the major goals in metagenomics is to identify the organisms present in a microbial community from a huge set of unknown DNA sequences. This profiling has valuable applications in multiple important areas of medical research, such as disease diagnostics. Nevertheless, it is not a simple task, and many of the approaches that have been developed are slow and depend on the read length of the DNA sequences. We designed FOCUS, an innovative and agile composition-based model that uses non-negative least squares to profile and report the organisms present in metagenomic samples and their relative abundance, without sequence length dependencies. The program was tested with simulated and real metagenomes, and the results show that our approach accurately predicts the organisms present in random communities faster than the available tools. The code and web server of FOCUS are freely available at http://edwards.sdsu.edu/FOCUS.
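To make the non-negative least squares idea concrete, here is a small standard-library sketch. It is not the FOCUS implementation: it assumes each reference organism is summarized by a composition profile (for example, k-mer frequencies), and it solves the constrained problem with plain projected gradient descent. All names are hypothetical.

```python
def nnls_projected_gradient(columns, target, iterations=500):
    """Approximately minimize ||A x - b||^2 subject to x >= 0.

    columns: list of reference profiles (the columns of A)
    target:  the sample profile b
    """
    k = len(columns)
    # Precompute the Gram matrix A^T A and the vector A^T b.
    gram = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns]
            for ci in columns]
    atb = [sum(p * q for p, q in zip(ci, target)) for ci in columns]
    trace = sum(gram[i][i] for i in range(k))
    if trace == 0:
        return [0.0] * k
    # The trace bounds the largest eigenvalue, so 1/trace is a safe step.
    step = 1.0 / trace
    x = [0.0] * k
    for _ in range(iterations):
        grad = [sum(gram[i][j] * x[j] for j in range(k)) - atb[i]
                for i in range(k)]
        # Gradient step followed by projection onto x >= 0.
        x = [max(0.0, xi - step * g) for xi, g in zip(x, grad)]
    return x


def relative_abundance(references, sample_profile):
    """Normalize the non-negative weights so they sum to 1."""
    names = list(references)
    weights = nnls_projected_gradient([references[n] for n in names],
                                      sample_profile)
    total = sum(weights)
    return {n: (w / total if total else 0.0) for n, w in zip(names, weights)}
```

Because the profiles are frequencies rather than alignments, the fit does not depend on read length, which matches the motivation given in the abstract. A production tool would more likely call an optimized solver than this simple loop.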
Subject: Other
Presenting Author: Charlotte Siska, University of Colorado Denver
Author(s):
Abstract:
Different types of molecular features, such as transcripts, proteins and metabolites, can be measured using various –omics platforms and techniques. As it becomes more common to generate several types of –omics data on the same samples, methods are being developed to integrate the different data types. We propose the use of differential correlation to identify pairs of molecular features (e.g. a protein and a transcript) whose correlation differs between disease groups. Molecular features that are differentially correlated between groups are assumed to be involved in biological processes associated with disease status. We apply differential correlation to –omics data from NCI-60 cell lines to investigate cancer types, and to data from human blood samples to study Chronic Obstructive Pulmonary Disease (COPD). Results are validated using pathway-finding algorithms, on the assumption that pairs of molecular features with significant differential correlation will be close to each other in a biological network. We also evaluate differential correlation using experimentally validated miRNA-mRNA interactions. We find that pairs of molecules that show differential correlation are closer in biological networks than unrelated, randomly chosen pairs. We also find that differentially correlated pairs are enriched for experimentally validated interactions. In summary, we demonstrate how differential correlation can be used to predict novel molecular interactions associated with disease status, in addition to confirming the role of previously known molecular interactions.
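A minimal sketch of differential correlation for one pair of features is given below. The abstract does not specify the test, so this example assumes Pearson correlation within each group and a Fisher z-transformation to compare the two coefficients; the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def differential_correlation(x1, y1, x2, y2):
    """Compare the x-y correlation in group 1 against group 2.

    Returns (r1, r2, z, p). Assumes more than 3 samples per group and
    uses the Fisher z-test, which is an assumption of this sketch.
    """
    r1, r2 = pearson(x1, y1), pearson(x2, y2)
    # Clamp so atanh stays finite for (near-)perfect correlations.
    clamp = lambda r: max(-0.999999, min(0.999999, r))
    z1, z2 = math.atanh(clamp(r1)), math.atanh(clamp(r2))
    se = math.sqrt(1.0 / (len(x1) - 3) + 1.0 / (len(x2) - 3))
    z = (z1 - z2) / se
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return r1, r2, z, p
```

In practice this would be applied to every feature pair of interest, followed by a multiple-testing correction.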
Subject: Graphics and user interfaces
Presenting Author: Corinna Vehlow, Visualization Research Center, University of Stuttgart
Author(s):
David Kao, University of Colorado Anschutz Medical Campus, United States
Abstract:
Biologists commonly analyze experimental data using biological networks, such as gene-expression correlation networks, to explain disease-specific patterns and identify genotype-phenotype relationships. Biomedical knowledge from various databases and the literature can be integrated with these data networks to allow analysts to interpret experimental data in the context of existing knowledge. While these combined networks provide a rich resource and a sound basis for data analysis, they are difficult to explore and understand because they are very dense. Using current static visualization approaches, it takes time and expertise to “untangle the hairball” and manually extract sub-networks that can explain a phenomenon or tell a meaningful biological story. To improve this analytical workflow, we developed a visualization approach that applies the concept of degree-of-interest (DOI) functions to highlight or filter particular parts of a network that are relevant for a specific question or task. We also use these DOI functions to automatically extract and lay out sub-networks in such a way that DOI-based groups and their intersections become visually apparent, e.g., extracting a sub-network that includes all nodes involved in a set of pathways of interest and visually arranging these nodes based on their pathway information. To facilitate the analysis of extracted sub-networks in the context of the complete network, the network visualizations are linked through a brushing and linking feature. DOI functions can model various analytical facets, including an analyst’s background and interest, properties of the experimental data, and phenotype information. Hence, they provide a generic and powerful approach for analyzing biological networks.
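The pathway example above can be sketched as a simple DOI function with sub-network extraction. The scoring rule (number of pathways of interest a node belongs to), the threshold, and all names are assumptions of this illustration, not the authors' definitions.

```python
def pathway_doi(node_pathways, pathways_of_interest):
    """DOI of a node = how many pathways of interest it belongs to."""
    wanted = set(pathways_of_interest)
    return {node: len(wanted & set(pws))
            for node, pws in node_pathways.items()}


def extract_subnetwork(edges, doi, threshold=1):
    """Keep nodes whose DOI reaches the threshold and the edges among them."""
    keep = {node for node, score in doi.items() if score >= threshold}
    sub_edges = [(u, v) for u, v in edges if u in keep and v in keep]
    return keep, sub_edges


def group_by_membership(node_pathways, keep, pathways_of_interest):
    """Group kept nodes by the exact set of pathways of interest they hit.

    Nodes that share a key with several pathways form the intersections
    a layout could place between the single-pathway groups.
    """
    wanted = set(pathways_of_interest)
    groups = {}
    for node in keep:
        key = frozenset(wanted & set(node_pathways[node]))
        groups.setdefault(key, set()).add(node)
    return groups
```

Other analytical facets mentioned in the abstract, such as expression values or phenotype data, could be modeled by swapping in a different scoring function with the same node-to-score shape.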
Subject: Other
Presenting Author: Kwanjeera Wanichthanarak, Chalmers University of Technology
Author(s):
Abstract:
Unicellular organisms such as the model yeast Saccharomyces cerevisiae, like other cells, have to develop stress-response strategies in order to deal with the various stresses they may encounter in a dynamically changing environment. In the last decade, many groups have studied stress responses in yeast at the level of the genome-wide transcriptional response using DNA microarray technology, mostly focusing on environmental changes such as aeration, temperature, pH, nutrients and osmolarity. These data are interesting and useful; however, they are scattered and difficult to compare, so there remains a need for a unifying bioinformatics resource in which data from numerous sources can be integrated and effectively queried. Here we present yStress, a yeast stress microarray database that aims to facilitate the exploration of cross-platform and cross-laboratory stress microarray data. In addition, our platform supports meta-analyses that combine microarray data from related studies to identify differentially expressed genes, which can enhance the statistical power, reliability and generalizability of the results. The database collects the results of differential expression analysis and gene set analysis for both single-microarray analysis and meta-analysis. A user-friendly web interface and interactive visualization are provided to display the queried data and results.
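One common way to combine evidence for a gene across related studies is Fisher's method for combining p-values. The abstract does not state which meta-analysis method yStress uses, so the sketch below is only an illustration of the general idea, with a hypothetical function name.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values (each in (0, 1]) with Fisher's method.

    The statistic -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom under the null; for even degrees of freedom
    the survival function has the closed form used here.
    """
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # statistic / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total
```

For example, a gene with p = 0.05 in each of two studies gets a combined p-value of about 0.017, which shows how pooling related studies can increase statistical power.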
Subject: Other
Presenting Author: Jonathan Wilkes, University of Glasgow
Author(s):
Richard McCulloch, University of Glasgow,
Abstract:
The protozoan parasite Trypanosoma brucei utilises an RNA interference (RNAi) pathway that is widely conserved among eukaryotes. This pathway can be adapted to regulate expression of the polycistronically transcribed genes of T. brucei by placing gene-specific sequences within a tetracycline-inducible cassette, allowing RNAi 'knockdown', which is now an important research tool. RIT-seq has been developed to enable the parallel analysis of >8000 genes in T. brucei across life-cycle and differentiation stages (1). The original RIT-seq methodology has a number of shortcomings that compromise its potential: the semi-specific PCR yields only small enrichment of the inserted sequences, produces inconsistently amplified sequences, and generates significant genomic sequence unrelated to the inducible fragments.
We have designed an adaptation of the methodology that uses a specific PCR to amplify sequences between common primer sites flanking the inserted genomic fragments in the RNAi cassette. Preparing the sequencing library from this amplified material requires 10-fold less material (500 ng of DNA), produces a higher proportion (3- to 10-fold) of reads unequivocally derived from the cassettes, utilises standard protocols for library preparation, and permits sample multiplexing. To validate this RIT-seq approach, we screened for T. brucei genes that act in DNA damage repair by measuring read abundance after RNAi in the presence or absence of the SN2 alkylator methyl methanesulphonate. A number of previously characterised T. brucei DNA repair genes are revealed, as are several novel pathways that have not been examined to date. The system was also adapted to produce a comprehensive panel of protein kinase (kinome) probes.
1) Alsford et al. Genome Res. 2011 Jun;21(6):915-24.
Subject: Metagenomics
Presenting Author: Tangjie Zhang, Yangzhou University
Abstract:
To investigate the relationship between genetic variation and the life-history variables of Actinopterygii (common length, maximum length, maximum weight and longevity), as well as environmental variation (three different living environments), we applied independent regression analysis and phylogenetically independent contrasts to evaluate the correlation of life-history variables with rps7 neutral genetic diversity. Polymorphism datasets of the rps7 gene, covering 48 genera, 25 families and 9 orders of Actinopterygii, were obtained from Polymorphix and the PopSet database of GenBank. Life-history variables were obtained from the AnAge database and FishBase. The results showed that the neutral genetic diversity of fishes is significantly negatively associated with common length. No strong correlation was found between neutral genetic diversity and maximum size, maximum weight or maximum longevity. Nor was any correlation found between neutral genetic diversity and habitat (marine, freshwater and marine-freshwater).