13th Annual Rocky Mountain Bioinformatics Conference
ORAL PRESENTATIONS
OP01
Automatic mining of user reviews might reveal potentially unsafe nutritional products
Presenting Author: Graciela Gonzalez Hernandez, Arizona State University
Abstract:
Although dietary supplements are widely used and generally considered safe, some supplements have been identified as causative agents for adverse reactions, some of which may even be fatal. The Food and Drug Administration (FDA) is responsible for monitoring supplements and ensuring that they are safe. However, current surveillance protocols are not always effective. Leveraging user-generated textual data, in the form of Amazon.com reviews for nutritional supplements, we use natural language processing techniques to develop a system for the monitoring of dietary supplements. We use topic modeling techniques, specifically a variation of Latent Dirichlet Allocation (LDA), and background knowledge in the form of an adverse reaction dictionary to score products based on their potential danger to the public. Our approach generates topics that semantically capture adverse reactions from a document set consisting of reviews posted by users of specific products, and based on these topics, we propose a scoring mechanism to categorize products as “high potential danger”, “average potential danger” and “low potential danger.” We evaluate our system by comparing its categorization with that of human annotators, and we find that our system agrees with the annotators 69.4% of the time. With these results, we demonstrate that our methods show promise and that our system represents a proof of concept for a viable low-cost, active approach to dietary supplement monitoring.
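As an illustration of the scoring idea (not the authors' actual LDA-based system), a minimal sketch can score a product by the fraction of its reviews that mention a term from an adverse-reaction dictionary, then bucket the score into the three danger categories. The term list and cutoffs below are hypothetical:

```python
# Minimal sketch of dictionary-based danger scoring for product reviews.
# The adverse-reaction terms and the category cutoffs are illustrative,
# not the authors' actual lexicon or LDA-derived topic scores.
ADVERSE_TERMS = {"nausea", "vomiting", "rash", "dizziness", "palpitations"}

def danger_score(reviews):
    """Fraction of reviews mentioning at least one adverse-reaction term."""
    if not reviews:
        return 0.0
    hits = sum(1 for r in reviews if ADVERSE_TERMS & set(r.lower().split()))
    return hits / len(reviews)

def categorize(score, high=0.5, low=0.1):
    """Bucket a product into the three danger categories (cutoffs hypothetical)."""
    if score >= high:
        return "high potential danger"
    if score >= low:
        return "average potential danger"
    return "low potential danger"
```

For example, `danger_score(["caused severe nausea", "great product"])` yields 0.5, which `categorize` maps to "high potential danger" under these illustrative cutoffs.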
OP02
Regulatory network inference: use of whole brain- vs brain region-specific gene expression data in the mouse
Presenting Author: Ronald Taylor, Pacific Northwest National Laboratory
Abstract:
The incidence of Alzheimer’s disease (AD) is rising. However, its etiology remains only partially resolved. Transcriptional Regulatory Networks (TRNs) have tremendous potential to offer insights. There is a view that brain region specificities of gene expression regulation [1,2] limit the usefulness of TRNs derived from whole brain gene expression data. Here, we test that assumption. We determine what fraction of the regulatory connections seen for a selected functionally related group of genes, inferred from expression data from a small subset of brain regions, are also inferred when expression data from across the entire brain are used. We report which brain region-specific transcription factor–target relationships are so strong as to be discernible when whole brain data are used for network inference. TRNs derived from the hippocampus, a relevant brain region because of its association with learning and memory, are of particular interest. We generate, explore, and compare TRNs of genes relevant to neurodegeneration, generated using in situ hybridization data from Allen Brain Atlas hippocampus regions (“hippocampal_region”, “hippocampal_formation”, “dentate_gyrus”, “ammon's_horn” and “subiculum”) on the one hand, and all brain regions together on the other. TRNs were learned using a high-performing network inference algorithm, Context Likelihood of Relatedness (CLR).
References
[1] W. Sun, et al., Transcriptome atlases of mouse brain reveals differential expression across brain regions and genetic backgrounds, G3 (Bethesda). 2 (2012) 203-211.
[2] M. Caracausi, et al., A quantitative transcriptome reference map of the normal human hippocampus, Hippocampus. (2015).
OP03
MyVariant.info: community-aggregated variant annotations as a service
Presenting Author: Jiwen Xin, The Scripps Research Institute
Co-Author(s):
Adam Mark, Avera McKennan Hospital & University Health Center
Sean Mooney, University of Washington
Ben Ainscough, Washington University in Saint Louis
Ali Torkamani, The Scripps Research Institute
Chunlei Wu, The Scripps Research Institute
Andrew Su, The Scripps Research Institute
Abstract:
The accumulation of genetic variant annotations has been increasing explosively with recent advances in high-throughput sequencing technology. These annotations are sprinkled across dozens of data repositories like dbSNP, dbNSFP, ClinVar, and hundreds of other specialized databases. While the volume and breadth of annotations is valuable, their fragmentation across many data silos is often frustrating and inefficient. Bioinformaticians everywhere must continuously and repetitively engage in data wrangling in an effort to comprehensively integrate knowledge from all these resources, and these uncoordinated efforts represent an enormous duplication of work. The problem of fragmentation is exacerbated (perhaps even fundamentally caused) by the inability of data providers to efficiently contribute to existing repositories. As a result, annotation providers must generate new resources in order to host newly-generated annotations that are unavailable in the central repositories. To tackle this problem, we created a platform, called MyVariant.info, to aggregate variant-specific annotations from community resources and provide high-performance programmatic access. Annotations from each resource are first converted into JSON-based objects with their id fields as the canonical names following HGVS nomenclature (genomic DNA based). This scheme allows merging of all annotations relevant to a unique variant into a single annotation object. A high-performance and scalable query engine was built to index the merged annotation objects and provide programmatic access to developers.
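The merge-by-canonical-id scheme can be sketched as follows: each source contributes JSON-like records keyed by a genomic-DNA HGVS identifier, and records sharing an id are folded into one annotation object. The source names, example HGVS string, and fields here are illustrative, not MyVariant.info's actual content:

```python
# Sketch of merging per-source variant annotations under a shared HGVS _id.
# Each source's fields are namespaced under its own key so they never collide.
from collections import defaultdict

def merge_annotations(sources):
    """sources: {source_name: {hgvs_id: annotation_dict}} -> merged objects."""
    merged = defaultdict(dict)
    for name, records in sources.items():
        for hgvs_id, annot in records.items():
            merged[hgvs_id]["_id"] = hgvs_id
            merged[hgvs_id][name] = annot  # namespace each source's fields
    return dict(merged)

# Two hypothetical sources annotating the same variant id
sources = {
    "dbsnp": {"chr1:g.35366C>T": {"rsid": "rs1234"}},
    "clinvar": {"chr1:g.35366C>T": {"clinical_significance": "benign"}},
}
merged = merge_annotations(sources)
```

Because both records share the canonical id, `merged` contains a single object carrying both the dbSNP and ClinVar annotations.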
OP04
Flowr: Robust and efficient workflows using a simple language agnostic approach
Presenting Author: Sahil Seth, MD Anderson Cancer Center
Co-Author(s):
Jianhua Zhang, MD Anderson
Xingzhi Song, MD Anderson
Xizeng Mao, MD Anderson
Huandong Sun, MD Anderson
Andrew Futreal, MD Anderson
Samir Amin, MD Anderson
Abstract:
Motivation: Bioinformatics analyses have increasingly become a compute-intensive process, with lowering costs and increasing numbers of samples. Each lab spends time creating and maintaining a set of pipelines, which may not be robust, scalable or efficient. Further, different compute environments across institutions hinder collaboration and portability of analysis pipelines.
Results: Flowr is a robust and scalable framework for designing and deploying computing pipelines in an easy-to-use fashion. It implements a scatter-gather approach on computing clusters, reducing the concept to the use of five simple terms (in submission and dependency types). Most importantly, it is flexible: customizing existing pipelines is easy, and since it works across several computing environments (LSF, SGE, Torque and SLURM) it is portable. Using flowr, analysis from raw reads to somatic variants can be achieved in about 2 hours (down from about 24).
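The scatter-gather idea behind such pipelines can be illustrated with a toy sketch (this is not flowr's actual R API; step names and functions are hypothetical): a "scatter" step runs once per input chunk, a "gather" dependency consumes all outputs of the previous step, and a "serial" step follows in sequence.

```python
# Toy illustration of scatter/gather/serial dependency types in a pipeline.
# In a real cluster scheduler the scatter step would run its jobs in parallel;
# here the fan-out is simulated with a list comprehension.
def run_pipeline(chunks):
    # scatter: align each chunk of reads independently
    aligned = [f"aligned({c})" for c in chunks]
    # gather: merge all per-chunk alignments into one result
    merged = "merged[" + ", ".join(aligned) + "]"
    # serial: a single downstream step consuming the merged output
    return f"variants({merged})"
```

Running `run_pipeline(["r1", "r2"])` traces the dependency structure: two independent alignment jobs, one merge that waits for both, and one final serial step.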
OP05
In-silico identification of prognostically inversely correlated miRNA-mRNA pairs in multiple cancers
Presenting Author: Chirayu Goswami, Thomas Jefferson University
Co-Author(s):
Zi-Xuan Wang, Thomas Jefferson University
Abstract:
Despite numerous methods available to identify potential mRNA targets for miRNAs, the prognostic relationship of these molecules in diseases like cancers, where deregulation of gene expression is a major pathogenic factor, has not yet been emphasized. We performed in-silico identification of prognostically inversely correlated miRNA-mRNA pairs (PICs) in multiple cancers using expression data from The Cancer Genome Atlas. Partners in a PIC show inverse correlation of expression and opposite hazard implication. Using a three-step approach, we identified a total of 1,201,301 PICs from 23 cancer types, several of which have previously been shown to have a predicted or experimentally validated relationship. A maximum of 375,621 PICs were identified in Lower Grade Gliomas, while a minimum of 300 PICs were identified in Prostate adenocarcinoma. Four miRNA-mRNA pairs were identified as PICs in 7 different cancer types. Two miRNA-mRNA pairs were identified as PICs in 5 different cancer types where the mRNA is also a validated target of the miRNA. Organ-specific analysis was performed to identify PICs common to cancers from the same or related tissue of origin. We have also developed a database, PROGTar, for hosting our analysis results. PROGTar is available freely for non-commercial use at www.xvm145.jefferson.edu/progtar. We believe our method and analysis results will provide a novel prognostically relevant, pan-cancer perspective on the study of miRNA-mRNA interactions and miRNA target validation.
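The PIC criterion described above can be sketched in a few lines (a simplification of the authors' three-step approach; the correlation cutoff is illustrative): a pair qualifies when miRNA and mRNA expression are inversely correlated and their hazard ratios fall on opposite sides of 1.

```python
# Sketch of the PIC criterion: inverse expression correlation plus opposite
# hazard implication. Pearson's r is computed from scratch; the r cutoff is
# an illustrative choice, not the authors' published threshold.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def is_pic(mirna_expr, mrna_expr, mirna_hr, mrna_hr, r_cutoff=-0.5):
    inverse = pearson(mirna_expr, mrna_expr) <= r_cutoff
    opposite_hazard = (mirna_hr - 1) * (mrna_hr - 1) < 0  # one HR > 1, one < 1
    return inverse and opposite_hazard
```

A perfectly anti-correlated pair with hazard ratios 1.5 and 0.6 would qualify; the same expression profile with concordant hazard ratios would not.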
OP06
A framework for reproducible computational research
Presenting Author: Apua Paquola, Salk Institute for Biological Studies
Co-Author(s):
Jennifer Erwin, Salk Institute for Biological Studies
Fred Gage, Salk Institute for Biological Studies
Abstract:
Analysis of high-throughput biological data often involves the use of many software packages and in-house written code. For each analysis step there are multiple software tools available, each with its own capabilities, limitations and assumptions about the input and output data. The development of bioinformatics pipelines involves a great deal of experimentation with different tools and parameters, considering how each would fit into the big picture and the practical implications of their use. Without structure and good practices, combining experimentation with reproducibility can prove challenging. In this work we present a set of methods and tools that enable the user to experiment extensively while keeping analyses reproducible and organized. The framework centers on a container structure designed to organize analysis steps with the goal of reproducibility, explicitly separating user-generated data from computer-generated data. It also provides version control of code, documentation and data, dependency tracking, logging and automated execution of multiple analysis steps.
OP07
GraDe-SVM: Graph-Diffused Classification for the Analysis of Somatic Mutations in Cancer
Presenting Author: Morteza Chalabi, University of Southern Denmark (SDU)
Co-Author(s):
Fabio Vandin, University of Padova
Abstract:
Recent advances in next generation sequencing have allowed the collection of somatic mutations from a large number of patients across several cancer types. One of the main challenges in analyzing such large datasets is the identification of the few driver genes having mutations that are related to the disease. This task is complicated by the fact that genes and mutations do not act in isolation, but are related by a complex interaction network. A related but unexplored challenge is the classification of the cancer type using somatic mutations, which may be relevant for early cancer detection from circulating tumor DNA or circulating tumor cells.
We propose a graph-diffused SVM (GraDe-SVM) approach for cancer type classification using somatic mutations. Our approach effectively integrates somatic mutations and information from a large-scale interaction network using a graph diffusion process. We tested our method on a cohort of 3424 cancer samples from 11 cancer types from The Cancer Genome Atlas (TCGA) project, using both single nucleotide variants (SNVs) and copy number variants (CNVs). Our results show that our method improves the classification of the cancer type using somatic mutations compared to approaches that ignore the interaction network or consider the network but do not use the diffusion process. Moreover, our approach highlights a number of known driver genes and genes with mutations that distinguish different cancer types.
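The graph diffusion step can be sketched as standard network propagation (a generic formulation, not necessarily GraDe-SVM's exact update rule): a sparse binary mutation profile is smoothed over the interaction network by iterating F ← α·W·F + (1−α)·F0 on a row-normalized adjacency matrix, so that unmutated neighbors of mutated genes acquire nonzero feature values before classification. The 3-gene network and α below are illustrative.

```python
# Sketch of network propagation ("graph diffusion") of a mutation profile.
# f0 is the raw per-gene mutation indicator; the returned vector is the
# smoothed feature vector an SVM would consume. Network and alpha are toy.
def diffuse(adj, f0, alpha=0.5, iters=50):
    n = len(f0)
    # row-normalize the adjacency matrix (isolated nodes keep zero rows)
    w = [[adj[i][j] / max(sum(adj[i]), 1) for j in range(n)] for i in range(n)]
    f = f0[:]
    for _ in range(iters):
        f = [alpha * sum(w[i][j] * f[j] for j in range(n)) + (1 - alpha) * f0[i]
             for i in range(n)]
    return f

# gene 0 is mutated; genes 1 and 2 are its unmutated network neighbors
adj = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
scores = diffuse(adj, [1.0, 0.0, 0.0])
```

After diffusion the mutated gene retains the highest score, while its two symmetric neighbors receive equal, smaller, nonzero scores.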
OP08
High-grade serous ovarian cancer subtypes are similar across populations
Presenting Author: Gregory Way, University of Pennsylvania
Co-Author(s):
James Rudd, Geisel School of Medicine at Dartmouth College
Chen Wang, Mayo Clinic
Brooke Fridley, University of Kansas Medical Center
Gottfried Konecny, David Geffen School of Medicine at UCLA
Ellen Goode, Mayo Clinic
Casey Greene, University of Pennsylvania
Jennifer Doherty, Geisel School of Medicine at Dartmouth College
Abstract:
Three to four gene expression-based subtypes of high-grade serous ovarian cancer (HGSC) have been previously reported. We sought to determine the similarity of HGSC subtypes between populations by performing k-means clustering (k = 3 and k = 4) in five publicly-available HGSC mRNA expression datasets with >130 tumors. We applied a unified bioinformatics analysis pipeline to the five distinct datasets and summarized differential expression patterns for each cluster as moderated t statistic vectors using Significance Analysis of Microarrays. We calculated Pearson’s correlations of these cluster-specific vectors to determine similarities and differences in expression patterns. Defining strongly correlated clusters across datasets as “syn-clusters”, we associated their expression patterns with biological pathways using geneset overrepresentation analyses. Across populations, for k = 3, correlations for clusters 1, 2 and 3, respectively, ranged between 0.77-0.85, 0.80-0.90, and 0.65-0.77. For k = 4, correlations for clusters 1-4, respectively, ranged between 0.77-0.85, 0.83-0.89, 0.51-0.76, and 0.61-0.75. Results are similar using non-negative matrix factorization. Syn-cluster 1 corresponds to previously-reported mesenchymal-like subtypes, syn-cluster 2 to proliferative, syn-cluster 3 to immunoreactive, and syn-cluster 4 to differentiated. The ability to robustly identify correlated clusters across numbers of centroids, populations, and clustering methods provides strong evidence that at least three different HGSC subtypes exist. The mesenchymal-like and proliferative-like subtypes are the most consistent and could be uniquely targeted for treatment.
OP09
Machine learning and genomic analysis to predict drug resistance in Mycobacterium tuberculosis
Presenting Author: Gargi Datta, University of Colorado School of Medicine, National Jewish Health
Co-Author(s):
Rebecca Davidson, National Jewish Health
Sonia Leach, University of Colorado School of Medicine, National Jewish Health
Michael Strong, University of Colorado School of Medicine, National Jewish Health
Abstract:
Tuberculosis, caused by Mycobacterium tuberculosis, is the second leading cause of death due to an infectious disease. While the incidence of TB cases is declining, an upsurge of drug-resistant strains of M. tuberculosis is a global cause for concern. Understanding the mechanisms associated with TB drug resistance development and quick recognition of resistant strains is critical to limiting the spread of drug-resistant disease. We hypothesize that a combination of genotyping and machine learning provides an accurate and efficient way to identify drug resistance. We have created a fully automated sequence analysis and mutation identification pipeline to identify mechanisms associated with the development of drug resistance. Our training set includes 3502 M. tuberculosis genome sequences with phenotypic susceptibility information from publicly available sources. To characterize existing mutations, we feed these sequences through our mutation analysis pipeline. We have created a diverse feature set that includes different feature types and data types. To combine these different feature and data types into a non-redundant and informative feature set, we are working on a novel way to handle feature selection for mixed data with an existing simultaneous feature selection and classification algorithm for sparse and imbalanced genomic data, which uses a combination of model-based and instance-based methods for classification. Finally, we aim to create a publicly available web and mobile application to facilitate fast delivery of drug-resistance profiles to researchers and clinicians.
OP10
Venom Peptides as Therapeutic Agents: Can we use Phylogenetics to Inform Drug Discovery?
Presenting Author: Joseph Romano, Columbia University
Co-Author(s):
Nicholas Tatonetti, Columbia University
Abstract:
INTRODUCTION
Animal venoms have been used for therapeutic purposes since the dawn of recorded history, and present-day researchers view them as a largely untapped resource for drug discovery. Techniques for discovery and prediction of biological therapeutic agents utilize phylogenetic methods in various ways, but there are conflicting reports as to whether venom peptide phylogenies actually reflect speciation (and, consequently, whether venom peptide phylogenies are informative). In this study, we use phylogenetic distance metrics and modern informatics techniques to demonstrate that venom peptide sequences do not recapitulate speciation.
METHODS
Our methods include various high-throughput computational techniques that are known collectively as “phylogenetic simultaneous analysis”. We constructed phylogenetic trees for families of orthologous peptide sequences, and computed the aforementioned simultaneous analysis metrics between each protein phylogeny and a “reference tree” from mitochondrial proteins, grouping venom peptide and non-venom peptide families together. Finally, we performed both parametric and empirical hypothesis tests to determine whether the computed values were substantially different between the two peptide classes (venom and non-venom).
RESULTS
The distributions of simultaneous analysis values across all venom peptide families are substantially different (p-value < 0.05 in all cases) from those generated using non-venom peptide families. Therefore, we can infer that venom phylogenies do not recapitulate accepted patterns of speciation.
CONCLUSIONS
Our results strongly suggest that venom-based drug discovery and repurposing should not rely on evolutionary lineages of venomous species until the causes of this observation are better understood. We also offer a number of potential explanations for this peculiar behavior.
OP11
Abstract Withdrawn
OP12
Visualizing Robustness of Complex Phenotype and Biomarker Associations
Presenting Author: Michael Hinterberg, University of Colorado-Denver
Co-Author(s):
David Kao, University of Colorado-Denver
Larry Hunter, University of Colorado-Denver
Carsten Görg, University of Colorado-Denver
Abstract:
Biomarker discovery in clinical medicine is important for distinguishing populations, as well as understanding the etiology of disease. Increasingly finer-grained complex disease phenotypes and patient subpopulations are being defined through combinations of multiple clinical variables. Any set of biomarkers associated with complex phenotypes should not only be biologically plausible, but also sufficiently robust so as to be replicated in further study in additional populations. These requirements create a challenge in simultaneously optimizing statistical association, model complexity, and robustness. In previous work, we developed PEAX (Phenotype-Expression Association eXplorer), an open-source tool that allows rapid visual exploration and analysis of complex clinical phenotype and gene expression association. Through use-case studies and feedback, we identified visualizations and algorithms to guide the user toward additional insight of meaningful biological associations. Notably, when a user defines a complex, multivariate phenotype, the search space surrounding individual clinical features may be interesting, and is often manually explored. We have extended PEAX to calculate and display additional association models; we use small multiples to represent incremental refinements to the user-defined model, and thus guide the analyst through the process of refining models and understanding the overall robustness of models. We demonstrate the utility of our approach for modeling robustness through examples of visualizations using real and simulated data.
OP13
Computational methods to analyze HNSCC samples with immune response
Presenting Author: Ashok Sivakumar, Johns Hopkins University
Co-Author(s):
David Masica, Johns Hopkins University
Abstract:
This work encompasses head and neck squamous cell carcinoma (HNSCC), the sixth most frequent cancer worldwide. Our objective is to evaluate the potential of combining DNA sequencing and computational analysis to quickly and inexpensively indicate whether immune checkpoint inhibitor (ICI) therapy will be effective for a particular cancer patient. While ICIs are a promising therapy for many cancer types, the response rate is limited to ~20% in HNSCC. Therefore, our analysis is targeted toward examining missense mutations that are statistically correlated with immune-cell-infiltrated tumors. More specifically, our approach is built upon software and statistical methods to determine which sequence-specific epitopes are among the most abundant in infiltrating T-cells. We examine and compare several statistical results and approaches from a machine learning perspective. Identifying responders accurately using such methods could be of immense financial value for patient costs and confirm the benefit of precision medicine. Finally, we present a computational framework from this HNSCC population analysis for future development, which can subsequently be scaled to other cancers that are part of The Cancer Genome Atlas (TCGA).
OP14
Abstract Withdrawn
OP15
REPdenovo: Inferring de novo repeat motifs from short sequence reads
Presenting Author: Yufeng Wu, University of Connecticut
Co-Author(s):
Chong Chu, University of Connecticut
Rasmus Nielsen, University of California, Berkeley
Abstract:
Repeat elements are important components of eukaryotic genomes. Sequencing data from many species are now available, providing opportunities for finding and comparing genomic repeat activity among species. One limitation in our understanding of repeat elements is that most analyses rely on reference genomes that are incomplete and often missing data in highly repetitive regions that are difficult to assemble. To overcome this problem we developed a new method, REPdenovo, which assembles repeat sequences directly from raw shotgun sequencing data. We show that REPdenovo is substantially better than existing methods both in terms of the number and the completeness of the repeat sequences that it recovers. We apply the method to human data and discover a number of new repeat sequences that have been missed by previous repeat annotations. By aligning these discovered repeats to PacBio long reads, we confirm the existence of these novel repeats. REPdenovo is a powerful new computational tool for annotating genomes and for addressing questions regarding the evolution of repeat families.
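The core intuition behind reference-free repeat detection can be sketched simply (a toy version only; REPdenovo's actual pipeline involves k-mer filtering and assembly stages): repeat elements leave highly over-represented k-mers in the raw reads, and those k-mers can seed assembly of repeat consensus sequences. The values of k and the frequency cutoff below are illustrative.

```python
# Toy sketch: find over-represented k-mers in raw reads as repeat candidates.
# Real tools use much larger k and coverage-normalized frequency thresholds.
from collections import Counter

def repeat_kmers(reads, k=4, min_count=3):
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return {kmer for kmer, c in counts.items() if c >= min_count}

# "ACGT" repeats across both reads, so its k-mers exceed the cutoff
reads = ["ACGTACGTACGT", "TTACGTACGTAA"]
seeds = repeat_kmers(reads)
```

Here the tandem "ACGT" unit and its rotations clear the frequency cutoff, while k-mers occurring once or twice are discarded.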
OP16
Molecular modeling, dynamics and virtual screening studies to identify potent CLDN-4 inhibitors
Presenting Author: Jayanthi Sivaraman, VIT University
Abstract:
Claudins are small integral membrane proteins present in tight junctions (TJs). The claudin family comprises 24 subtypes, and their distribution and expression levels are known to vary in different organs and tissues. Among them, Claudin-4 (CLDN-4) was found to be up-regulated in various malignancies including colon, prostate, breast, gastric, ovarian and pancreatic cancers, and is a potential target for antitumor therapy.
In the present work, the 3D structure of Claudin-4 was predicted using homology modeling (template PDB ID: 4P79) in Schrodinger. The modeled structure of CLDN-4 was further validated and conforms to the basic structure of claudin family proteins, with four transmembrane domains, two extracellular loops and a cytosolic loop. Molecular dynamics (MD) simulations suggest that the modeled CLDN-4 structure is stable. This CLDN-4 model was used for virtual screening against the NCI database containing 265,241 compounds. Virtual screening was performed using Schrodinger. Based on the GLIDE docking energy scores, the top three ligands, namely NCI110039, NCI344682 and NCI661251, had the lowest energy scores (-9.3, -9.0 and -9.6 kcal/mol, respectively), which reveals higher binding affinity towards the active site of CLDN-4. MD studies of the protein-ligand complexes are in progress. These ligands might act as potential inhibitors of CLDN-4, which is up-regulated in various cancers. However, pharmacological studies are required to confirm the inhibitory activity of these ligands against CLDN-4.
OP17
A comparison of genetically matched cell lines reveals the equivalence of human iPSCs and ESCs
Presenting Author: Soohyun Lee, Harvard Medical School
Co-Author(s):
Jiho Choi, Cancer Center and Center for Regenerative Medicine, Massachusetts General Hospital
Kendell Clement, Broad Institute
William Mallard, Broad Institute
Guidantonio Malagoli Tagliazucchi, University of Modena and Reggio Emilia
Hotae Lim, Johns Hopkins University School of Medicine
In Young Choi, Johns Hopkins University School of Medicine
Francesco Ferrari, Harvard Medical School
Alex Tsankov, Broad Institute
Romona Pop, Harvard Stem Cell Institute
Gabsang Lee, Johns Hopkins University School of Medicine
John Rinn, Broad Institute
Alexander Meissner, Harvard Stem Cell Institute
Peter Park, Harvard Medical School
Konrad Hochedlinger, Cancer Center and Center for Regenerative Medicine, Massachusetts General Hospital
Abstract:
The equivalence of human induced pluripotent stem cells (hiPSCs) and human embryonic stem cells (hESCs) remains controversial. Here we use genetically matched hESC and hiPSC lines to assess the contribution of cellular origin (hESC vs. hiPSC), the Sendai virus (SeV) reprogramming method and genetic background to transcriptional patterns while controlling for cell line clonality and sex. We find that transcriptional variation originating from genetic background dominates over variation due to cellular origin or SeV infection. Moreover, the 49 differentially expressed genes we detect between genetically matched hESCs and hiPSCs neither predict functional outcome nor distinguish an independently derived, larger set of unmatched hESC and hiPSC lines. We conclude that hESCs and hiPSCs are molecularly and functionally equivalent and cannot be distinguished by a consistent gene expression signature. Our data further imply that genetic background variation is a major confounding factor for transcriptional comparisons of pluripotent cell lines, explaining some of the previously observed expression differences between genetically unmatched hESCs and hiPSCs.
OP18
Examination of risk factors for nontuberculous mycobacterial infections among National Jewish Health hospital patients in the United States
Presenting Author: Ettie Lipner, University of Colorado Denver
Co-Author(s):
David Knox, University of Colorado, Denver
Joshua French, University of Colorado, Denver
Jordan Rudman, Colorado College
Michael Strong, National Jewish Health
Abstract:
Nontuberculous mycobacterial (NTM) disease has emerged as an increasingly prevalent infectious disease across the United States, particularly over the last two decades. Prevalence of NTM varies by geographic region, but the geospatial factors influencing this variation remain unknown. The objective of our study is to identify spatial clusters of NTM disease at the zip code level and to identify associated variables of interest. NTM case data were obtained from National Jewish Health’s (NJH) Electronic Medical Records (EMR). Zip code level NTM case counts and age-adjusted population data were uploaded to SaTScan to identify high-risk spatial clusters of NTM disease across the US. Poisson regression models are used to estimate associations of climatic, environmental, and socio-economic variables with NTM infection risk.
OP19
Detection and interpretation of extrachromosomal microDNAs from next-generation sequencing data
Presenting Author: Mark Maienschein-Cline, University of Illinois at Chicago
Co-Author(s):
Pinal Kanabar, University of Illinois at Chicago
Stefan Green, University of Illinois at Chicago
Chunxiang Zhang, Rush University
Abstract:
Extrachromosomal microDNAs are short, circular DNA molecules derived from genomic DNA. They are typically hundreds of nucleotides long and appear to be omnipresent in mammalian cells. However, their mechanism of formation and function in cells is only beginning to be understood. A major roadblock in the development of microDNA studies is the lack of a robust computational methodology for detecting them from next-generation sequencing (NGS) data, and a clear path to interpreting their presence in cells. Confounding these problems is the extremely low molecular reproducibility observed for microDNAs, where biologically replicated experiments turn up very few identical microDNAs. I will present solutions to both challenges: first, I will describe a systematic, flexible bioinformatics pipeline for detecting microDNAs in NGS data. Second, I will address the low molecular reproducibility by showing how a systems-based interpretation of microDNAs can both substantially increase the concordance between biological replicates and distinguish different conditions from each other.
OP20
A genetic analysis of a complex trait in a “genetically intractable” gut microbe
Presenting Author: Sena Bae, Duke University
Abstract:
Microbes mediate immune and nutrient homeostasis in the vertebrate gastrointestinal tract. The molecular basis for these host-microbe interactions is poorly understood, as many gut microbes are not amenable to molecular genetic manipulation. We combined phenotypic selection after chemical mutagenesis with population-based whole genome sequencing to identify genes that are required for motility in the firmicute Exiguobacterium, a component of the vertebrate gut microbiota that contributes to lipid uptake. We derived strong associations between the loss of motility and mutations in predicted Exiguobacterium motility genes and genes of unknown function. We confirmed the genetic linkage between the predicted causative mutations and loss of motility by identifying suppressor mutations that restored motility. These results indicate that genetic dissection of complex traits in microbes can be readily accomplished without the need to develop molecular genetic tools.
TOP
OP21
iSeGWalker: an easy-to-handle de novo genome reconstruction tool dedicated to small sequences
Presenting Author: Benjamin Saintpierre, Institut Pasteur
Co-Author(s):
Johann Beghain, Institut Pasteur
Eric Legrand, Institut Pasteur
Anncharlott Berglar, Institut Pasteur
Deshmukh Gopaul, Institut Pasteur
Frédéric Ariey, Institut Cochin
Abstract:
Most de novo assembly programs are global assemblers: they assemble all of the reads in a sequencing file. They are not well suited to recovering a short target sequence, such as the non-nuclear DNA of a pathogen. Here we present a Perl program, iSeGWalker (in silico Seeded Genome Walker), developed to accomplish a quick de novo reconstruction of a region by "genome walking" on Next Generation Sequencing (NGS) data.
The first step of the process is to choose an initial seed, which must be unique and specific to the targeted region. The second step is a cycle consisting of a seed search (an exact-matching read selection), the alignment of all selected reads, and the generation of a consensus sequence. Once the consensus is obtained, a new seed, composed of the last 30 consecutive nucleotides of the consensus, is derived and a new cycle is performed.
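The walking cycle described above can be sketched as follows. This is a minimal illustration assuming in-memory reads, exact substring matching, and majority-vote consensus; all function names and the test parameters are illustrative, not iSeGWalker's actual implementation (which is written in Perl and handles real Fastq data).

```python
from collections import Counter

SEED_LEN = 30  # per the abstract, each new seed is the last 30 nt of the consensus

def extend_consensus(consensus, reads, seed_len=SEED_LEN):
    """One walking cycle: select reads that exactly contain the current seed,
    anchor them at the seed, and extend the consensus by majority vote."""
    seed = consensus[-seed_len:]
    columns = []  # observed bases at each position beyond the seed
    for read in reads:
        pos = read.find(seed)
        if pos == -1:
            continue  # exact-matching read selection, as in the abstract
        for i, base in enumerate(read[pos + seed_len:]):
            if i == len(columns):
                columns.append(Counter())
            columns[i][base] += 1
    extension = "".join(col.most_common(1)[0][0] for col in columns)
    return consensus + extension

def walk(seed, reads, seed_len=SEED_LEN, max_cycles=1000):
    """Repeat the cycle until no matching read extends the consensus."""
    consensus = seed
    for _ in range(max_cycles):
        new = extend_consensus(consensus, reads, seed_len)
        if new == consensus:
            break
        consensus = new
    return consensus
```

Starting from a unique seed, the consensus grows by up to one read length per cycle until the reads run out, which is why the approach is fast for small targets such as organellar genomes.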
We tested our software using an apicoplast seed on a Fastq file obtained from Illumina sequencing of the Plasmodium falciparum 3D7 reference strain. We were able to reconstruct the complete genome of the apicoplast, including a previously unpublished region harboring a balanced polymorphism that may play a role in the regulation and/or division of the P. falciparum apicoplast genome.
TOP
OP22
Spatial modeling of drug delivery routes for treatment of disseminated ovarian cancer
Presenting Author: Kimberly Kanigel Winner, University of Colorado
Co-Author(s):
Mara Steinkamp, University of New Mexico
Rebecca Lee, University of New Mexico
Maciej Swat, Indiana University
Carolyn Muller, University of New Mexico
Melanie Moses, University of New Mexico
Yi Jiang, Georgia State University
Bridget Wilson, University of New Mexico
Abstract:
In ovarian cancer, metastasis is typically confined to the peritoneum; treatment requires surgical removal of the primary tumor and macroscopic secondary tumors. More effective strategies are needed to target microscopic spheroids that persist after debulking surgery. To treat this residual disease, therapeutic agents can be administered by either intravenous (IV) or intraperitoneal (IP) infusion. We use a cellular Potts model to compare tumor penetration of two classes of drugs (small- and large-molecule) when delivered by these two alternative routes. Experimental measures included fluorescence recovery after photobleaching (FRAP) to measure penetration of non-specific antibodies into cultured human ovarian cancer (SKOV3.ip1) spheroids, as well as two-photon imaging of explanted tumors from orthotopic xenografts in nude mice. Stereology analysis was used to estimate the range of vascular densities in disseminated tumors from patients. The model considers the primary route when drug is administered either IV or IP, as well as the subsequent exchange into the other delivery volume as a secondary route. By accounting for these dynamics, the model shows that IP infusion is the markedly better route for delivery of both small-molecule and antibody therapies for the microscopic, avascular tumors typical of patients with ascites. Small tumors attached to peritoneal organs, with vascularity ranging from 2-10%, also show superior drug delivery via the IP route. Use of both delivery routes may provide the best total coverage of tumors, particularly when there is a significant burden of avascular spheroids suspended in the peritoneal fluid as well as larger, neo-vascularized secondary tumors.
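The primary-route/secondary-route idea can be illustrated with a toy two-compartment exchange model. This is not the authors' cellular Potts model: it is a deliberately simple sketch in which drug infused into one compartment (plasma for IV, peritoneal fluid for IP) passively exchanges into the other while being cleared, and all rate constants are invented for illustration rather than fitted to data.

```python
def simulate(route, k_exchange=0.1, k_clear=0.05, dose=1.0, dt=0.01, t_end=24.0):
    """Toy two-compartment simulation; returns cumulative peritoneal exposure
    (the drug 'seen' by avascular spheroids suspended in peritoneal fluid)."""
    # primary route determines where the dose lands
    plasma, peritoneum = (dose, 0.0) if route == "IV" else (0.0, dose)
    t, peritoneal_exposure = 0.0, 0.0
    while t < t_end:
        flux = k_exchange * (plasma - peritoneum)  # secondary route: passive exchange
        plasma += (-flux - k_clear * plasma) * dt
        peritoneum += (flux - k_clear * peritoneum) * dt
        peritoneal_exposure += peritoneum * dt     # accumulate AUC in the peritoneum
        t += dt
    return peritoneal_exposure
```

Even this crude sketch reproduces the qualitative point of the abstract: for the same dose, peritoneal exposure is higher under IP delivery, which matters most for avascular spheroids reached only through the peritoneal fluid.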
TOP
OP23
Hypothesis-independent test development using mass spectrometry data from patient clinical groups reveals underlying biological pathways
Presenting Author: Krista Meyer, Biodesix, Inc
Co-Author(s):
Heinrich Roder, Biodesix, Inc
Julia Grigorieva, Biodesix, Inc
Kevin Sayers, Biodesix, Inc
Senait Asmellash, Biodesix, Inc
Abstract:
Identification of biomarkers for stratifying patient subgroups based on diagnosis, treatment selection, measuring response, etc., is an unmet need in many if not most diseases. Seemingly limitless publications report potential biomarkers based on our current understanding of the biological pathways involved in a disease, yet few are ever validated and translated into clinical practice. Our method does not rely on a molecular understanding of the disease state or on a particular hypothesis about which biomarkers participate in the disease. We use mass spectrometry data collected from patient serum samples in combination with robust analytical methods, the so-called Diagnostic Cortex™ data analytics platform, to enable the discovery and validation of multivariate tests without any preliminary hypothesis. This approach has many benefits, including a streamlined and rapid process from discovery to test commercialization. Indeed, it could be argued that high clinical utility is the most important requirement. However, this argument does not satisfy the desire to understand how and why a test works.
Through several approaches, we have discovered that mass spectral data and classification labels assigned by a test can enrich our understanding of the proteins and biological pathways that allow a test to distinguish patient groups. We will present examples of these methods and the results. While test development is hypothesis-independent, the same data can be useful for the generation of new hypotheses that lead to a greater understanding of the disease states.
TOP
OP24
Alignment-free approach for reads classification within a single metagenomic dataset
Presenting Author: Lusine Khachatryan, Leiden University Medical Centre
Co-Author(s):
Seyed Yahya Anvar, Leiden University Medical Centre
Peter de Knijff, Leiden University Medical Centre
Johan den Dunnen, Leiden University Medical Centre
Jeroen Laros, Leiden University Medical Centre
Abstract:
Due to advances in sequencing technologies, the human microbiome is becoming an increasingly informative source for scientific research. A proper comparison by deep analysis of skin bacterial communities would make a valuable contribution to many fields, notably forensics. The analysis of metagenomic samples usually involves mapping reads to known genes or pathways and comparing the obtained profiles. However, microbial communities are usually complex and contain mixtures of hundreds to thousands of unknown bacteria, which affects the accuracy and completeness of alignment-based approaches.
We developed an alignment-free approach (kPal) that is based on k-mer frequencies to assess the level of similarity between and within metagenomic datasets. Previously, our method was successfully applied to the comparison of different metagenomic samples. Recently, we have shown that the k-mer based approach can also be used to classify reads within a single metagenomic dataset. To illustrate this, we used simulated and real data from a single-molecule real-time sequencer (PacBio), which provides long reads (500–20,000 bp). We have shown that k-mer frequencies can reveal the relationships between reads within a single metagenome, leading to a clustering per bacterium. This approach may potentially be used to estimate the complexity of a metagenome and to simplify the subsequent assembly procedure.
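The core idea — reads from the same genome have similar k-mer frequency profiles — can be sketched as below. This is a minimal illustration using normalized 3-mer frequency vectors and Euclidean distance; kPal itself uses its own profile format and distance measures, so treat the function names and parameters here as assumptions for illustration.

```python
from itertools import product

def kmer_profile(seq, k=3):
    """Normalized k-mer frequency vector over the 4**k possible DNA k-mers."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {km: 0 for km in kmers}
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        if km in counts:  # skip k-mers containing N or other ambiguity codes
            counts[km] += 1
    total = sum(counts.values()) or 1
    return [counts[km] / total for km in kmers]

def distance(seq_a, seq_b, k=3):
    """Euclidean distance between two reads' k-mer profiles; small distances
    suggest the reads originate from the same organism."""
    pa, pb = kmer_profile(seq_a, k), kmer_profile(seq_b, k)
    return sum((x - y) ** 2 for x, y in zip(pa, pb)) ** 0.5
```

A pairwise distance matrix built this way can be fed to any standard clustering method to group reads per organism; long PacBio reads help because longer sequences give more stable k-mer frequency estimates.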
TOP
OP25
The Virome of Red Sea Brine Pool Sediments
Presenting Author: Sherry Aziz, American University in Cairo
Co-Author(s):
Mustafa Adel, American University in Cairo
Ramy K. Aziz, Faculty of Pharmacy, Cairo University
Rania Siam, American University in Cairo
Abstract:
Egypt’s Red Sea brine pools are unique environments owing to their high temperature, salinity and heavy metal levels. Although the microbiomes of the brine pool sediments have been well-characterized at the bacterial level, their viromes remain unexplored. Previous viral metagenomic analyses revealed tremendous diversity that needs more sampling on a global scale. Thus, we sought to determine the Red Sea brine sediment viromes and compare them to previously studied marine and sediment viromes. Since different viral analysis approaches lead to different results, and each has its strengths and limitations, we implemented three different bioinformatic tools: MEGAN, MG-RAST and GAAS. The combination of the PhAnToMe database with GAAS analysis reduced the number of unassigned sequences, increased the numbers of assigned phages from 200-250 to 850-2000, and controlled sampling bias via genome length normalization. Analysis of the viromes of 14 Red Sea brine pools and two non-brine control sediments showed a universal marine signature reflected by a dominance of different phages of Prochlorococcus, Synechococcus and several Mediterranean phages. The deepest two layers of the Red Sea (ATII-1 and ATII-2) were the most divergent, with the lowest alpha-diversity (low richness and evenness). Yet, these two metaviromes have their own signature, characterized by a higher abundance of gokushoviruses (e.g., Gokushovirus isolate GOM and Gokushovirinae Fen672_3) than other sections. Further comparative analysis will be performed to investigate how the extreme conditions of these brine pools impacted their viromes, as previous studies showed that different ecosystems’ stressors result in different genome lengths.
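The genome-length normalization mentioned above compensates for the fact that, at equal genome copy number, longer genomes contribute more reads. A minimal sketch of that correction is shown below; the counts and genome sizes are invented for illustration, and GAAS's actual normalization additionally accounts for alignment statistics.

```python
def relative_abundance(hits, genome_lengths):
    """Length-normalized relative abundances.

    hits: taxon -> raw read (hit) count
    genome_lengths: taxon -> genome size in bp
    Dividing counts by genome length converts 'reads observed' into a
    quantity proportional to genome copy number, removing the sampling
    bias toward large genomes.
    """
    weighted = {t: hits[t] / genome_lengths[t] for t in hits}
    total = sum(weighted.values())
    return {t: w / total for t, w in weighted.items()}
```

With equal raw hit counts, a 5 kb gokushovirus-sized genome ends up ten times more abundant than a 50 kb phage genome after normalization.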
TOP
OP26
Genomic Big Data: scalability challenges and solutions
Presenting Author: Faraz Faghri, University of Illinois at Urbana-Champaign
Co-Author(s):
Roy Campbell, University of Illinois at Urbana-Champaign
Sayed Hadi Hashemi, University of Illinois at Urbana-Champaign
Mohammad Babaeizadeh, University of Illinois at Urbana-Champaign
Abstract:
Genomics plays a role in nine of the ten leading causes of death in the United States. For people who are at increased risk for hereditary breast and ovarian cancer, or hereditary colorectal cancer, genetic testing may reduce illness risks by guiding evidence-based interventions. Such interventions involve the emergent practice of precision medicine, which uses an individual’s genetic profile to guide decisions made in regard to the prevention, diagnosis, and treatment of disease. At the nexus of precision medicine and computer science – cloud computing and machine learning – lie many research challenges for adapting and optimizing data-driven analytics to change the medical care delivered to patients in the US and beyond. Focused on high-speed data analytics on large clusters for genomic data, our research applies scalable algorithms and new storage and computation designs, aiming to realize the possibilities of precision medicine with significant improvements in performance.
In this work we visit four major challenges facing big data genomics: data acquisition, data storage, data distribution, and data analysis. We present our solutions for privacy-preserving data distribution and scalable data analytics.
TOP
OP27
Identification of chromatin accessibility from nucleosome occupancy and methylome sequencing
Presenting Author: Yongjun Piao, Chungbuk National University
Co-Author(s):
Seongkeon Lee, Sungshin Women's University
Keith D. Robertson, Center for Individualized Medicine, Mayo Clinic
Huidong Shi, Georgia Regents University
Keun Ho Ryu, Chungbuk National University
Jeong-Hyeon Choi, Georgia Regents University
Abstract:
Chromatin is a fundamental structure for compactly packaging a genome and reducing its volume in eukaryotic cells. The nucleosome is the basic repeating unit of chromatin, composed of ~145 bp of DNA wrapped around histone proteins. The positioning of nucleosomes throughout the genome, also known as nucleosome occupancy, plays a crucial role in epigenetic regulation of gene activation and silencing. It is well known that nucleosome positioning influences DNA methylation and histone modifications such as methylation, acetylation, and phosphorylation. Recently, nucleosome occupancy and methylome sequencing (NOMe-seq) has been developed to allow simultaneous profiling of chromatin accessibility and DNA methylation on single molecules. However, to the best of our knowledge, there is no standard method for de novo identification of nucleosome occupancy from NOMe-seq data. In this paper, we present a novel algorithm for identifying nucleosome-occupied regions (NORs) from NOMe-seq data based on a seed-extension approach. The proposed algorithm first identifies seeds — GCH sites that are very likely within NORs — then extends each seed as long as the average of GCH methylation scores is smaller than a threshold, and finally decides the end point of the extended seed using the predicted mean and standard deviation of methylation scores based on a Gaussian mixture model. It also conducts statistical tests to assess the significance of identified NORs. The efficiency and effectiveness of the proposed algorithm were tested on simulated datasets, and the experimental results showed that the proposed method outperformed the existing methods and achieved sensitivity > 0.97 and specificity > 0.99.
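The seed-and-extend idea can be sketched as follows. Here each position is a GCH site with a methylation score in [0, 1], where low scores suggest protection by a nucleosome; the thresholds are illustrative, and the published algorithm additionally applies a Gaussian mixture model and significance tests, both omitted from this sketch.

```python
def find_nors(scores, seed_thresh=0.2, extend_thresh=0.3):
    """Identify nucleosome-occupied regions as (start, end) index pairs.

    Seeds are GCH sites with very low methylation scores; each seed is
    extended rightward while the running mean score of the region stays
    below extend_thresh.
    """
    nors, i, n = [], 0, len(scores)
    while i < n:
        if scores[i] <= seed_thresh:          # seed: very likely inside a NOR
            start = end = i
            while end + 1 < n:
                window = scores[start:end + 2]
                if sum(window) / len(window) < extend_thresh:
                    end += 1                  # extend while the mean stays low
                else:
                    break
            nors.append((start, end))
            i = end + 1
        else:
            i += 1
    return nors
```

A real implementation would extend in both directions and replace the fixed thresholds with the model-derived mean and standard deviation described in the abstract.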
TOP
OP28
Charting the human genome’s regulatory landscape with transcription factor binding site predictions
Presenting Author: Xi Chen, New York University
Co-Author(s):
Richard Bonneau, New York University/Simons Foundation
Abstract:
Transcription factor (TF) binding is an essential step in the regulation of gene expression. Differential binding of multiple TFs at key cis-regulatory loci allows the specification of progenitor cells into various cell types, tissues and organs. ChIP-Seq is a technique that can reveal genome-wide patterns of TF binding. However, it lacks the scalability to cover the range of factors, cell types and dynamic conditions a multicellular eukaryotic organism exhibits. Thus, charting the regulatory landscape spanning multi-lineage differentiation requires computational methods to predict TF binding sites (TFBS) in an efficient and scalable manner.
We develop a method to predict binding sites for over 800 human TFs using a rich collection of DNA binding motifs. We integrate genomic features — including chromatin accessibility, motif scores, TF footprints, CpG/GC content, evolutionary conservation and the proximity of TF motifs to transcription start sites — in sparse logistic regression classifiers. We label candidate motif sites with ChIP-Seq data and apply a correlation-based filter and L1 regularization to select relevant features for each trained TF. Our models perform favorably in comparison to the current best TFBS prediction methods. Further, we map TFs based on feature distance to the nearest trained TF neighbor. This allows us to scale and expand the repertoire of putative TFBS to any TF for which motif data are available and to any cell type for which accessibility data are obtainable. Our method has the potential to be applied in previously intractable domains to reveal the regulatory complexity of multicellular higher eukaryotes.
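The feature-integration step can be sketched as a small L1-penalized logistic regression trained by gradient descent. The data below are synthetic (feature 0 stands in for chromatin accessibility, feature 1 for an uninformative feature), the subgradient treatment of the L1 term is a simplification, and the authors' actual feature set, labeling, and regularization path differ.

```python
import math

def train_l1_logistic(X, y, lam=0.01, lr=0.1, epochs=500):
    """Fit a logistic classifier with an L1 penalty via (sub)gradient descent.
    X: list of feature vectors; y: list of 0/1 labels (bound / not bound)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))     # predicted binding probability
            err = p - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        for j in range(d):
            # L1 subgradient drives weights of irrelevant features toward zero
            sign = 1 if w[j] > 0 else -1 if w[j] < 0 else 0
            w[j] -= lr * (gw[j] / n + lam * sign)
        b -= lr * gb / n
    return w, b

def predict(w, b, x):
    return 1.0 / (1.0 + math.exp(-(sum(wj * xj for wj, xj in zip(w, x)) + b)))
```

After training on sites where high accessibility coincides with ChIP-Seq-labeled binding, the classifier assigns most of its weight to the informative feature, which is exactly the behavior L1 selection is meant to produce.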
TOP
OP29
Recursive Indirect Paths Modularity (RIP-M) for Detecting Uniform Community Structure in RNA-Seq Co-Expression Networks
Presenting Author: Bahareh Rahmani, University of Tulsa
Co-Author(s):
Bill White, University of Tulsa
Brett McKinney, University of Tulsa
Abstract:
Clusters of genes in co-expression networks are commonly used as functional units for gene set enrichment detection, and increasingly they are used as features (attribute construction) for statistical inference and classification. One of the practical challenges of using clusters for these purposes is finding cluster sizes that are neither too large nor too small. Newman modularity automatically finds the number of communities, but for some networks, such as sparse networks, the modules are sub-optimal. For RNA-Seq co-expression networks, we show that modularity sometimes yields module sizes that are too extreme to resolve biological function. We develop a merging and splitting algorithm around Newman spectral modularity that allows the user to constrain the range of module sizes, preventing clusters from being so small that they do not capture the genes in the relevant pathway and from being so large that they do not resolve distinct pathways. Our algorithm uses indirect path information to automatically re-assort genes between small clusters that may not have captured sufficient gene expression variation in the relevant pathway. We investigate the properties of this recursive indirect paths modularity (RIP-M) and compare it with other clustering methods using RNA-Seq data from an influenza vaccine study. We compare methods based on enrichment of modules for immunologically relevant functional pathways and based on the association of modules with immune response phenotypes, using gene set variation analysis to construct attributes from the modules.
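The size-constraint idea — split modules that are too large, merge those that are too small — can be sketched as below. This is deliberately naive: RIP-M wraps Newman spectral modularity and uses indirect-path information to decide which genes move where, whereas this sketch splits by position and merges into the smallest surviving module, purely to illustrate the constraint.

```python
def constrain_modules(modules, min_size, max_size):
    """Enforce a user-specified module size range on a list of modules
    (each module is a list of gene indices)."""
    # split oversized modules into chunks of at most max_size
    sized = []
    for m in modules:
        for i in range(0, len(m), max_size):
            sized.append(m[i:i + max_size])
    # merge undersized modules into the smallest module that can absorb them
    small = [m for m in sized if len(m) < min_size]
    keep = [m for m in sized if len(m) >= min_size]
    for m in small:
        if keep:
            min(keep, key=len).extend(m)
        else:
            keep.append(m)
    return keep
```

In RIP-M, the splitting step re-runs spectral modularity recursively on oversized modules, and the merge step re-assorts genes among small clusters using indirect paths — both far better informed than the positional choices above.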
TOP
OP30
Automatic Recovery of Toulmin Explanations from Full Text Papers
Presenting Author: Elizabeth White, UC Denver, AMC
Abstract:
Explanation is a central feature of scientific writing, whether it is used to construct a hypothesis for testing, to justify the use of one experimental method over another, or to place a conclusion into context with existing work in the field. One framework for describing arguments in general, and explanations in particular, is the Toulmin model, which enlists pieces of evidence, connected by explicit or implicit warrants, to provide a justification for a particular thesis. This model also incorporates the notion of posing contrary evidence for rebuttal, and it permits qualification of the scope and certainty of the statements it contains. These qualities make Toulmin’s framework well suited to scientific writing, where inconsistencies and disagreement surface as a matter of course. I begin by showing how Toulmin’s argumentation model maps to arguments in full text papers from the CRAFT dataset.
Using a set of papers from the CRAFT corpus, I then present features that software can exploit to recognize the evidence, claims, scopes, and rebuttals of the Toulmin model. I assess these features’ effectiveness in allowing the code to accurately recover and classify Toulmin components automatically, and discuss where implicit parts of the argument complicate this recovery. Finally, I show how accurately the software can combine the argument components it has found to recover the entire explanation. This work is intended to be a pilot study for a framework that uses Toulmin’s model in summarization or question answering systems.
TOP
OP31
Tracking Cell Cycle Progression with Single Cell Resolution during T Cell Responses
Presenting Author: Andrey Kan, The Walter and Eliza Hall Institute
Co-Author(s):
Simone Oostindie, Wageningen University
Susanne Heinzel, The Walter and Eliza Hall Institute
Philip Hodgkin, The Walter and Eliza Hall Institute
Abstract:
In response to pathogenic stimuli, T lymphocytes undergo several rounds of division before they stop and return to quiescence. Deregulation of the response can lead to severe consequences for the host organism, from autoimmunity to immunodeficiency. It is increasingly evident that the study of such a complex biological phenomenon requires an interdisciplinary approach. We have recently used mathematical modelling to characterise T cell responses at the population level and demonstrated that the number of divisions before returning to quiescence (termed division destiny) is a key regulator of the response magnitude. Hence, the mechanisms of inheritance of division destiny become a major question that is difficult to address using bulk culture experiments.
Here we investigate T cell responses at the single cell level. We use time lapse microscopy to follow anti-CD3 stimulated CD8+ T cells from fluorescent ubiquitination cell cycle indicator (FUCCI) reporter mice for several days without interruption. During imaging these cells are cultured in different concentrations of interleukin-2 (IL-2). Individual cells are tracked, and lineage trees are reconstructed and annotated with cell cycle phases based on FUCCI fluorescence. We found a high clonal correlation in division destiny. Further, the data revealed differential effects of IL-2, whereby division destiny was strongly affected while cell cycle duration remained unchanged. We used mathematical modelling to quantify the observed results. Our findings will support accurate predictions of lymphocyte expansion kinetics, which in turn can form the basis for next generation screening platforms for a range of drug therapies, including cancer immunotherapy.
TOP
OP32
Detection and disambiguation of geospatial locations for phylogeography
Presenting Author: Davy Weissenbacher, Arizona State University
Co-Author(s):
Tasnia Tahsin, Arizona State University
Rachel Beard, Arizona State University
Matthew Scotch, Arizona State University
Graciela Gonzalez, Arizona State University
Abstract:
Summary: Diseases caused by zoonotic viruses (viruses transmittable between humans and animals) are a major threat to public health throughout the world. By studying virus migration and mutation patterns, the field of phylogeography provides a valuable tool for improving their surveillance. A key component in phylogeographic analysis of zoonotic viruses involves identifying the specific locations of relevant viral sequences. This is usually accomplished by querying public databases such as GenBank and examining the geospatial metadata in the record. When sufficient detail is not available, a logical next step is for the researcher to conduct a manual survey of the corresponding published articles.
Motivation: In this article, we present a system for detection and disambiguation of locations (toponym resolution) in full-text articles to automate the retrieval of sufficient metadata. Our system has been tested on a manually annotated corpus of journal articles related to phylogeography using integrated heuristics for location disambiguation including a distance heuristic, a population heuristic and a novel heuristic utilizing knowledge obtained from GenBank metadata (i.e. a ‘metadata heuristic’).
Results: For detecting and disambiguating locations, our system performed best using the metadata heuristic (0.54 Precision, 0.89 Recall and 0.68 F-score). Precision reaches 0.88 when examining only the disambiguation of location names. Our error analysis showed that a noticeable increase in the accuracy of toponym resolution is possible by improving the geospatial location detection. By improving these fundamental automated tasks, our system can be a useful resource to phylogeographers that rely on geospatial metadata of GenBank sequences.
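Of the heuristics mentioned above, the population heuristic is the simplest to illustrate: when a detected place name matches several gazetteer entries, prefer the most populous candidate. The gazetteer records below are invented for illustration, and the actual system combines this heuristic with the distance and GenBank-metadata heuristics.

```python
# Toy gazetteer: toponym -> candidate interpretations (illustrative data only)
GAZETTEER = {
    "Manhattan": [
        {"admin": "New York, US", "population": 1_600_000},
        {"admin": "Kansas, US", "population": 55_000},
    ],
}

def disambiguate_by_population(toponym, gazetteer=GAZETTEER):
    """Resolve an ambiguous place name to its most populous candidate,
    or None if the name is not in the gazetteer."""
    candidates = gazetteer.get(toponym, [])
    if not candidates:
        return None
    return max(candidates, key=lambda c: c["population"])
```

The metadata heuristic that performed best in the evaluation replaces raw population with knowledge mined from GenBank records — for instance, which country the submitting lab and related sequences point to — to re-rank the same candidate list.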
TOP
OP33
Reconstructing chromosome conformation by fluorescence microscopy
Presenting Author: Brian Ross, University of Colorado
Co-Author(s):
Fabio Anaclerio, University of Bari
Andrew Laszlo, University of Washington
Abstract:
The in vivo conformation of chromosomes is increasingly recognized as an important regulator of gene expression. For example, enhancers contact promoters to increase gene expression; genes may physically move to transcription factories to produce mRNA; and structures in the nucleus such as the lamina and nuclear porins have been variously reported to down- or upregulate gene expression. Unfortunately, chromosome conformation has been difficult to measure directly; most present-day information comes indirectly from cell-averaged DNA-DNA interaction assays (3C and derivatives).
In principle, a chromosome could be straightforwardly reconstructed if a large number of genetic loci could be identified and localized using a microscope, simply by 'connecting the dots'. The problem is that whereas thousands of loci are required to capture the complex structure of a chromosome, microscopes can uniquely identify only a handful of different fluorescent labels by color, so there will be many indistinguishable loci having the same labeling color. Our solution to this problem is to use the known genetic separations of the labeled loci, along with a calibration between genetic distance and spatial distance, to help identify each spot in the microscope image. The conformation is encoded in pairwise mapping probabilities from imaged spots to genetic loci. Crucially, certain statistics of the mapping probabilities can blindly assess the quality of the reconstruction in comparison to color-scrambled control mappings. Aside from experimentally validating the reconstruction procedure and quality metrics, we demonstrate a methodology which will be needed in larger experiments involving hundreds or thousands of loci.
TOP
OP34
NGSCheckMate: Software for ensuring sample identity in next-generation sequencing studies, with or without alignment
Presenting Author: Soohyun Lee, Harvard Medical School
Co-Author(s):
Sejoon Lee, Samsung Medical Center
Woong-Yang Park, Samsung Medical Center
Peter Park, Harvard Medical School
Eunjung Lee, Harvard Medical School
Abstract:
Next generation sequencing (NGS) has been widely adopted in biology and medicine. We often need to confirm that different batches of NGS data were derived from the same individual for accurate downstream analyses, such as identification of somatic mutations specifically present in a certain tissue (e.g. cancer). Among all quality control steps, comparison of sequencing reads is the most direct and effective way to ensure sample identity. Here, we developed NGSCheckMate, a freely available, easy-to-use software tool that identifies NGS data from the same individual. From each sample's sequencing reads, before or after reference alignment, it extracts a genetic profile based on known population-polymorphic single nucleotide variants (SNVs). It distinguishes samples from the same individual by comparing correlations of genetic profiles against its pre-built classification model. Our performance evaluation demonstrates that NGSCheckMate is generally applicable across diverse NGS study contexts and sequencing scopes (whole genome (WGS), whole exome, RNA-seq, targeted sequencing and ChIP-seq). The method is applicable to low sequencing depth (> 0.2X). We also provide a module that can be run on fastq files without alignment. Given that alignment may take days for large sequencing data, our alignment-free module can be useful for a quick initial check. It requires minimal memory and takes less than a minute for standard RNA-seq data using a single core. We recommend using NGSCheckMate as a first step in any NGS study that requires sample identity quality control.
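The comparison step can be sketched conceptually as follows: reduce each sample to a vector of variant allele fractions (VAFs) at shared polymorphic SNV positions, then match samples by the correlation of these vectors. The VAF values in the test are synthetic, and the real tool builds profiles from reads and applies a depth-dependent classification model rather than the fixed cutoff assumed here.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def same_individual(vaf_a, vaf_b, threshold=0.7):
    """Call two samples as coming from the same individual if their VAF
    profiles at shared SNV positions correlate strongly.
    The 0.7 cutoff is illustrative; NGSCheckMate chooses its cutoff as a
    function of sequencing depth."""
    return pearson(vaf_a, vaf_b) >= threshold
```

Because germline genotypes (VAF near 0, 0.5, or 1) are stable across tissues and assay types, the same comparison works across WGS, exome, RNA-seq and ChIP-seq data from one individual.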
TOP
OP35
Improved network community structure improves function prediction
Presenting Author: Jooyoung Lee, Korea Institute for Advanced Study
Co-Author(s):
Juyong Lee, Korea Institute for Advanced Study
Steven Gross, University of California, Irvine
Abstract:
We are overwhelmed by experimental data, and need better ways to understand large interaction datasets. While clustering related nodes in such networks — known as community detection — appears a promising approach, detecting such communities is computationally difficult. Further, how best to use such community information has not been determined. Here, within the context of protein function prediction, we address both issues. First, we apply a novel method that generates better modularity solutions than the current state of the art. Second, we develop a better method for using this community information to predict protein function. We discuss when and why this community information is important. Our results should be useful for two distinct scientific communities: first, those using various cost functions to detect community structure, where our new optimization approach will improve solutions, and second, those working to extract novel functional information about individual nodes from large interaction datasets.
TOP