Posters
Poster numbers will be assigned May 30th.
If you cannot find your poster below, it probably means you have not yet confirmed that you will be attending ISMB/ECCB 2015.
To confirm your poster, find the poster acceptance email; it contains a confirmation link. Click on it and follow the instructions.
If you need further assistance, please contact submissions@iscb.org and provide your poster title or submission ID.
Category B - 'Comparative Genomics'
B01 - A Clinical Information Management Platform for Semantic Exploitation of Clinical Data
Short Abstract: We present a platform which combines an approach to semantic extraction of medical information from clinical free-text documents with the processing of structured information from HIS records. The information extraction uses a fine-grained linguistic analysis, and maps the preprocessed terms to the concepts of domain-specific ontologies. These domain ontologies comprise knowledge from various sources, including expert knowledge and knowledge from public medical ontologies and taxonomies. The processes of ontology engineering and rule generation are supported by a semantic workbench that enables an interactive identification of those linguistic terms in clinical texts that denote relevant concepts. This supports incremental refinement of semantic information extraction.
B02 - Biotea: RDFizing PubMed Central in Support for the Paper as an Interface to the Web of Data
Short Abstract: Why should scholarly communication preserve, so conservatively, practices that were conceived for a different time and technology? In this paper, we present our approach to the generation of self-describing, machine-readable scholarly documents. We understand the scientific document as an entry point and interface to the Web of Data. We have semantically processed the full-text, open-access subset of PubMed Central. Our RDF model and resulting dataset make extensive use of existing ontologies and semantic enrichment services. We expose our model, services, prototype, and datasets at http://biotea.idiginfo.org/. Central to our work is the premise that scholarly data and documents are of most value when they are interconnected rather than independent.
B03 - Inference of the average fragment length in ChIP-seq data
Short Abstract: The ChIP-seq experimental procedure provides high-throughput genome-wide occupancy data for DNA-interacting proteins. The occupancy signal is obtained by isolating, sequencing and mapping millions of protein-bound DNA fragments. In the proximity of a DNA binding site, the signal exhibits two peaks of comparable height, on the reference and reverse strands respectively, that surround the exact binding location. The shape of the peaks and the distance between them depend on the length distribution of the genomic fragments subject to sequencing, and they differ from protein to protein and from experiment to experiment. Even though these parameters are important for several downstream analyses, their estimation is non-trivial and calls for the development of specialized tools. We present an algorithm based on Expectation Maximization for inferring the average fragment length in single-end ChIP-seq experiments. Despite its simplicity, we show that the tool is very accurate and can deal effectively with poor data quality.
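As an illustration of the quantity being estimated (this is a simple strand cross-correlation sketch, not the Expectation Maximization algorithm presented in the poster), the shift that best aligns the reverse-strand read-start profile onto the forward-strand profile approximates the average fragment length; all inputs below are simulated:

    import numpy as np

    def estimate_fragment_length(fwd_starts, rev_starts, max_shift=500):
        # Rough estimate of the average fragment length via strand cross-correlation:
        # the shift that best aligns reverse-strand read starts onto forward-strand
        # read starts approximates the fragment length.
        fwd = fwd_starts - fwd_starts.mean()
        rev = rev_starts - rev_starts.mean()
        scores = []
        for shift in range(1, max_shift + 1):
            a, b = fwd[:-shift], rev[shift:]
            denom = np.sqrt((a * a).sum() * (b * b).sum())
            scores.append((a * b).sum() / denom if denom > 0 else 0.0)
        return int(np.argmax(scores)) + 1

    # toy data: 5000 fragments of length ~200 give an estimate near 200
    rng = np.random.default_rng(0)
    n, frag = 100_000, 200
    fwd = np.zeros(n)
    rev = np.zeros(n)
    for s in rng.integers(0, n - frag, size=5000):
        fwd[s] += 1
        rev[s + frag - 1] += 1
    print(estimate_fragment_length(fwd, rev))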
B04 - RAMPART (a robust automatic multiple assembler toolkit)
Short Abstract: The de novo assembly of genomes from modern sequencing devices is a computationally intensive, multi-stage task. It is typical, after sequencing and quality assessment, that a pipeline involving read analysis, quality trimming, contig assembly, scaffolding and gap closing is executed in order to build a first pass assembly. Each step in the pipeline requires careful analysis and decision making before proceeding to the next step.
This poster presents RAMPART, a pipeline for automating the production of first pass assemblies. RAMPART supports a variety of assemblers and scaffolding tools, can assist with assembly validation and, if requested, make decisions automatically. Each step in the pipeline produces statistics and plots to help interpret, compare and visualise results, which helps the user explain and justify the decisions that were taken. An assembly validation step, involving read alignment, feature response curves and other novel validation techniques, helps to assess the quality of the final assembly. Finally, RAMPART can produce a final report describing the assembly process across all stages.
RAMPART is developed at TGAC and is built on top of a modified version of EBI’s Conan pipeline. TGAC’s modifications to Conan enable third party tools to be executed with or without a scheduling system, such as LSF or PBS. RAMPART is currently used at TGAC to run production jobs on TGAC’s computing cluster.
B05 - Unipro UGENE - pipelines for NGS data analysis
Short Abstract: The World Map of High-throughput Sequencers (http://omicsmaps.com/) clearly shows that NGS platforms are now widely spread. NGS data require multistep computational analysis, so tools for such analysis are in high demand. High-quality programs such as MACS, TopHat and SAMtools can process NGS data, but these instruments cover only part of a complex data-processing workflow. Moreover, they bind biologists to a specific operating system and require certain computer skills. Web-based platforms such as Galaxy gather these instruments into pipelines with a friendly interface, significantly simplifying computations. However, in web-based solutions data uploading can slow down the whole pipeline, additional expenses might be needed for servers or clouds to deploy such instruments, and if the analysis is performed outside the laboratory the Internet connection may be an obstacle. Unipro UGENE (http://ugene.unipro.ru/) is an alternative desktop solution for biological data analysis. It is a free, multiplatform, open-source toolkit that integrates popular algorithms and visualizations to provide a unified workspace. UGENE Workflow Designer runs different pipelines on desktop computers, so it requires neither an Internet connection nor additional hardware. Variant calling, RNA-seq and ChIP-seq pipelines based on popular tools, and many others, are integrated. These pipelines are available as samples and do not require special skills to run or customize. The NGS pipelines are drawn as self-describing schemes and supplied with wizards to set the parameters. Results such as assemblies, sequences and alignments are shown in a corresponding window. Users can pause a computational process and check intermediate data.
B06 - Novel algorithm to impute missing values and construct zero-recombinant haplotypes from genetic/allelic databases
Short Abstract: In this poster, we present a novel method to impute missing alleles from family-based, genetic databases with missing family members. The algorithm is based on a preliminary analysis of all possible combinations that may exist in the genotyping of a family, considering that each member should unequivocally have two alleles, one from each parent. The analysis was founded on the differentiation of seven cases, some of them divided into a maximum of three variants, each of the latter representing a different combination of family members’ alleles.
Our algorithm also allows the construction of haplotypes without any limitation on the number of genes, i.e. it enables the construction of haplotypes of more than three genes. This could be a useful instrument for information retrieval and knowledge discovery in genetics, since it would allow epidemiological specialists to discover new intergenic patterns by studying zero-recombinant haplotypes with a larger number of genes.
As long as one child was genotyped, results reveal an unequivocal imputation of three possible parent haplotypes in 92.3% of theoretical cases even when one parent was missing. When neither parent was genotyped, in 36.4% of cases at least two haplotypes were constructed. Regarding offspring allele imputation with both parents fully genotyped, a minimum of one haplotype for each child was successfully reconstructed in 6.1% of possible cases/variants.
A priori knowledge of allele frequency or family size is not required to start up our algorithm. It was tested by simulations and against the Type 1 Diabetes Genetics Consortium (T1DGC) database.
B07 - Managing computational models and associated simulation descriptions in standard formats
Short Abstract: Re-usability and reproducibility are essential, both for comparing scientific results in the life sciences and for reproducing the outcomes of partners' research collaborations. In computational biology, standards for the storage and exchange of computational models, as well as for the encoding of simulation experiments associated with the models, have already been established.
Efficient model reuse requires concepts for managing the increasing amount of models and associated data. Database and Information Systems science provides well-proven methods for data management in general. These should now be made available to model repositories in computational biology.
In our poster we present our methods for the management of simulation models and associated information. Important aspects of our work are (1) fine-grained model storage; (2) sophisticated model retrieval and ranking; (3) a detailed model version control approach; and (4) the incorporation of simulation experiments performed on the models. We show how the developed methods can be integrated into a model and simulation management system. Our work targets models and simulation descriptions in standard formats (SBML, SED-ML). A further prerequisite is the models' annotation following the MIRIAM guidelines, which provides a better understanding of the biology encoded in the model and thereby enables tasks such as model display, merging or comparison.
The application of Database and Information Systems methods to models and associated simulations improves the handling of such models and enhances the reuse of models and the reproducibility of results. All solutions are applicable to SBML and other XML-based standard formats.
B08 - GEMBASSY: an EMBOSS associated package for genome analysis using G-language SOAP/REST web services
Short Abstract: The popular European Molecular Biology Open Software Suite (EMBOSS) currently contains over 400 tools used in various areas of bioinformatics research, and the Ajax Command Definition (ACD) interface of EMBOSS provides sophisticated interoperability and discoverability of tools along with rich documentation and various user interfaces. In order to further strengthen EMBOSS in the field of genomics, we here present a novel EMBASSY package named GEMBASSY, which adds more than 50 analysis tools from the G-language Genome Analysis Environment (G-language GAE) and its Representational State Transfer (REST) and SOAP web services. GEMBASSY primarily contains wrapper programs of G-language REST/SOAP web services to provide intuitive and easy access to various annotations within complete genome flatfiles, as well as tools for analyzing nucleic composition, calculating codon usage, and visualizing genomic information. For example, analysis methods for calculating distances between sequences by genomic signatures and for predicting gene expression levels from codon usage bias are effective in the interpretation of meta-genomic and meta-transcriptomic data. GEMBASSY tools can be used seamlessly with other EMBOSS tools and UNIX command line tools. The source code written in C is available from GitHub (https://github.com/celery-kotone/GEMBASSY/) and the distribution package is available from the GEMBASSY website (http://www.g-language.org/gembassy/) under the terms of GNU GPL version 2 for non-commercial and educational use only (limited due to wsdl2h).
B09 - Developing Next-Generation Sequencing Data Analysis S/W on HPC System
Short Abstract: We are developing genome analysis software optimized for performance on a High Performance Computing (HPC) system using heterogeneous computing resources: CPUs (Central Processing Units), GPGPUs (General-Purpose Graphics Processing Units) and Intel Xeon Phi. High-throughput sequencing (or next-generation sequencing) technologies produce enormous volumes of sequence data inexpensively, and HPC systems are needed to tackle such huge sequence data. We are therefore parallelizing the genome data analysis pipeline and developing novel applications for genome data analysis using heterogeneous computing resources. Our genome data analysis software on the HPC system provides the following features:
- Sequence read data mapping using parallel computing resources (GPGPU, Intel Xeon Phi, multi-core processor)
- Genome variation detection using large memory and parallel computing resources (GPGPU, Intel Xeon Phi, multi-core processor).
In order to parallelize the genome analysis pipeline efficiently in an HPC environment, we analyzed the CPU utilization pattern of each pipeline step. We found that sequence read data mapping, and in particular sequence alignment, is compute-intensive and well suited to parallelization. We also found that manipulating SAM/BAM (Sequence Alignment Map/Binary sequence Alignment Map) files requires very large system memory resources. Finally, we were able to parallelize the genome variation detection step of the pipeline by taking into account the characteristics of data partitioning for genome-wide data.
B10 - 3D Computer Reconstruction of the Developing Zebrafish Vasculature
Short Abstract: We analyze three-dimensional images of the developing vasculature in the zebrafish Danio rerio and build a 3D computer model (“virtual fish”) of its geometry and developmental dynamics. Images are acquired in live fish using Selective Plane Illumination Microscopy (SPIM). These images are then computationally segmented using a model-based, discrete, unsupervised multi-region algorithm, in which the evolving object contours are represented by computational particles [Cardinale et al. IEEE Trans. Image Proc. 2012]. Images of the vasculature are segmented into two regions: the background and a connected three-dimensional network of vessels as the sole foreground region. In order to quantify the development of the vascular system, the segmentation is done at different embryonic ages. The resulting segmentations can then be used to represent the vasculature as a three-dimensional graph, and to quantify the formation of new connections between the vessels during development. The resulting network models can be used in future work to simulate the fluid mechanics of the blood flow and to study the interplay of fluid forces and angiogenesis. Reconstructing the developing vasculature from different individuals will provide valuable insight into the phenotypic variability of the process.
B11 - Reconstruction of 3D images of Arabidopsis thaliana to monitor growth
Short Abstract: Live-imaging of plant development and growth is advancing rapidly, as the experimental and imaging technologies have improved greatly in recent years. Current growth measurements are mainly based on imprecise 2D imaging systems lacking information about the different angles of moving plant parts. For example, the rapid change of leaf angles follows the circadian clock and the growth stages. To avoid invasive laser scanning systems that could potentially affect growth, we implemented a novel light-field 3-dimensional imaging system. We employ this non-interfering single-camera setup for analyzing plant growth at a high spatial and temporal resolution. The light-field camera uses a standard planar macrolens and an array of microlenses with various focus points. The latter provide additional information about the distance and angle of a photographed object. The combined information (image and depth data) is sufficient to reconstruct a 3D image of a plant. To enable recording at night we use infrared light (850 - 1050 nm) to avoid stimulation of plant photoreceptors that would affect growth. Plants are recorded 24 hours a day over several weeks. Thus, we obtain highly accurate, dynamic and continuous long-period time-lapse 3D image sequences. The described imaging system is used to analyze Arabidopsis thaliana ecotypes and mutants. This cost-effective system allows us to correlate growth behavior to metabolic changes and gene functions with high precision.
B12 - What is bioinformatics made from - a survey of database and software usage through full-text mining
Short Abstract: Database and software usage defines bioinformatics and computational biology. Identifying best/common practice, and the computational methods used, in different domains could aid in understanding what resources are available, which are used, how much they are used, and what for.
Using various text-mining techniques, we have developed bioNerDS (http://bionerds.sourceforge.net/), a named entity recogniser for database and software name mentions from the primary literature. High ambiguity in resource naming, in combination with the on-going introduction of new resources, resulted in bioNerDS achieving an F-measure ranging from 63-91% at the mention level, and 63-78% at the document level.
We have analysed over 500,000 full-text articles from the entire open-access PubMed Central corpus for database and software name mentions. The resulting data generated by bioNerDS (several million extracted resources) allows us to systematically explore database and software usage on a large scale and with minimal bias. Through an analysis of both document and mention level resource usage, we highlight interesting trends, including detailing the most mentioned resources, and evaluating temporal changes (and rate of change) in resource usage. Specifically, variation in resource usage between journals, and between differing bioinformatics sub-domains, in part reflects the nature of each area (e.g., for statistical analysis, R is preferred in biology and bioinformatics, whereas SPSS is preferred in the (bio-)medical domain). Many well-established resources exhibit the expected high usage counts (e.g., BLAST, Swiss-Prot, CLUSTAL W, R and GO), but there are many emerging resources (e.g., Galaxy and MUSCLE).
B14 - TGAC Browser: visualisation solutions for big data in the genomic era
Short Abstract: We present a web-based genomic browser with novel rendering, annotation and analysis capabilities designed to overcome the shortcomings in available approaches. It is often the case that genomic browser customisations are required between different research groups, with the need to tailor tracks and features on a frequent basis. Many popular browsers use on-the-fly server-side track rendering which is not efficient in terms of performance, scalability or browsing experience. They often rely on specific library dependencies, where writing plugins or modifying existing code can be troublesome and resource-expensive. We focus on functionality which, although potentially available in other browsers more suited to internet architectures, concentrates on improved, more productive interfaces and analytical capabilities:
User-friendly: Live data searching, track modification, and drag and drop selection; actions that are seamlessly powered by modern web browsers
Responsiveness: Client-side rendering and caching, based on JSON fragments generated by server logic, helps decrease the server load and improves user experience
Analysis Integration: The ability to carry out heavyweight analysis tasks, using tools such as BLAST, via a dedicated extensible daemon
Annotation: Users can edit annotations which can be persisted on the server, reloaded, and shared at a later date
Off-the-shelf Installation: The only prerequisites are a web application container, such as Jetty or Tomcat, and a standard Ensembl database to host sequence features
Extensible: Adaptable modular design to enable interfacing with other databases, e.g. GMOD
For more information, or to try out a demo instance, please visit http://tgac-browser.tgac.ac.uk
B15 - NetworkTrail - A Web Server for Finding Deregulated Networks
Short Abstract: The deregulation of biochemical pathways plays a central role in many diseases, such as cancer. In silico tools for identifying these deregulated pathways may help to gain new insights into pathogenic mechanisms and may open novel avenues for therapy stratification in the sense of personalized medicine. However, employing computational methods, which are often exclusively available as command line tools, can require a disproportionate amount of technical knowledge.
Here, we present NetworkTrail, a web server offering a user-friendly interface to a state-of-the-art ILP approach for the detection of deregulated subnetworks. Starting from an expression dataset or a supplied score file, NetworkTrail guides the user through the process of setting up an analysis. Since a common problem of scientific software is the wealth of available parameters, special care was taken to provide sensible defaults and understandable inline documentation; to this end, automatic parameter detection is leveraged. The computed results can be downloaded as an archive or directly visualized through a Java WebStart based interface to BiNA or an embedded viewer built on top of Cytoscape Web. In addition to links to external databases, NetworkTrail provides a list of known drug targets extracted from DrugBank, sorted by the degree of deregulation.
NetworkTrail is a JavaServer Faces-based web service and employs AJAX for client-server communication. The algorithm for finding deregulated subnetworks has been implemented in C++. The service currently supports the regulatory networks for human, mouse, rat and Arabidopsis from the KEGG database and is available at http://networktrail.bioinf.uni-sb.de/.
B16 - Use of ISOMAP algorithm in the analysis of spatial gene expression data within mouse embryos.
Short Abstract: Spatial gene expression patterns within mouse embryos present researchers with a wealth of information. They enable researchers to investigate the expression of genes within specific spatial regions during various developmental stages of the embryo [1]. Given that a mouse has in the order of 20,000 genes and 5100 anatomical structures across its development cycle, this is a potentially huge combinatorial data-set even before we consider the spatial annotation of said data. Given the sheer quantity, the analysis of these data is a difficult task for human researchers to undertake; within this research we therefore seek a novel computational solution to aid this analysis.
Within this research we seek to apply the non-linear dimensionality-reduction algorithm ISOMAP [2] to the analysis of this data set. ISOMAP enables efficient sorting and clustering of data-sets, especially data-sets composed of images such as the spatial expression of genes within mouse embryos, and enables a quick visual analysis of a sub-set of the data.
We have had initial success with this algorithm, in terms of being able to visually sort and cluster spatial gene expression at a given Theiler stage, which enables the clear identification of related gene expression patterns.
This algorithm may also have a wider application to areas such as analysis of radiography images and the analysis of flocking behaviour.
1. Venkataraman S, et al. EMAGE - Edinburgh Mouse Atlas of Gene Expression: 2008.
2. Tenenbaum J, de Silva V, Langford JC. Science 290, 2319 (2000).
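As a rough illustration of how ISOMAP might be applied to image data of this kind (a sketch only; the image size, parameter values and random input below are hypothetical and unrelated to the EMAGE data):

    import numpy as np
    from sklearn.manifold import Isomap

    # hypothetical stack of spatial expression images, one per gene,
    # each flattened into a feature vector (e.g. 64 x 64 pixels)
    images = np.random.rand(500, 64 * 64)  # placeholder for real image data

    # embed the images into 2 dimensions using geodesic distances
    embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(images)

    # genes whose embedded coordinates lie close together can then be
    # inspected as candidates for related spatial expression patterns
    print(embedding.shape)  # (500, 2)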
B17 - Semantic Data Annotation and RDF Extraction with RightField Spreadsheets
Short Abstract: Consistent metadata and uniform annotation is a necessary requirement for the interpretation and integration of biological data. However, there are many barriers to its acquisition, including the complexity of community ontologies and the lack of support for ontology annotation in the most commonly used data management tools, namely spreadsheets. To lower these barriers, we have created RightField, a Java application that provides both ontology annotation support in spreadsheets and a mechanism for extracting and querying RDF data from these enriched spreadsheets.
RightField allows ranges of ontology terms to be embedded into particular spreadsheet cells. Multiple ontologies can be used per spreadsheet, and ontology properties (describing the relationship between the term list and the data) can also be applied. The selected ontology terms appear as simple drop-down lists, but the term IRIs and ontology version information are retained in hidden sheets for tracking versioning and provenance.
RightField has been developed for systems biologists, and is used as part of SEEK - a web-based platform for sharing and managing Systems Biology data, models and processes. However, RightField is domain independent. It has been adopted by other biomedical disciplines, such as physiology, translational medicine and clinical data collection; and is starting to be adopted by other disciplines, including Egyptology and astronomy. It is also gaining interest as a modelling playground, for designing and experimenting with data models for annotation. RightField is open source (BSD License) and available from http://www.rightfield.org.uk. SEEK is also open source (BSD License) and is available from http://www.sysmo-db.org/seek.
B18 - Taverna Components: Simplifying the Construction of Workflows from Distributed Services
Short Abstract: Scientific workflows provide a powerful mechanism to create analysis pipelines using distributed tools and data resources. For the analysis and annotation of omics data, for example, such workflows are becoming increasingly important in bioinformatics. The ability to connect to distributed tools, like WSDL or RESTful Web Services, means that scientists do not have to install or maintain local copies of tools and can make use of distributed computational power.
However, one of the major drawbacks of using and combining distributed resources is their incompatibility. Workflows which are conceptually simple linear pipelines can become more complicated by the addition of necessary data formatting and transformation steps. In addition, individual web services are often poorly annotated, which means that understanding how to execute and combine them is difficult.
The Taverna workflow project has been addressing this issue by providing a framework for combining native, distributed services with data transformation steps. The result is a set of standard, interchangeable components with a uniform structure. A family of components shares a profile, with uniform annotation using common ontologies. These profiles can be used as workflow definitions, or as templates to create new workflows. Like ordinary Taverna workflows, components can be published and shared through myExperiment, allowing other scientists to discover and re-use them.
Components allow easier workflow construction, annotation and sharing. This makes workflow analyses an accessible option for Life Science researchers outside of bioinformatics.
B19 - Generation Of Expression Calls For Rna-Seq Data
Short Abstract: The Bgee database (database for Gene Expression Evolution, http://bgee.unil.ch) provides information about genes that are expressed in different organs, tissues and developmental stages. In order to introduce RNA-seq results into Bgee, we had to develop a methodology for deriving expressed/unexpressed calls for genes. Such detection calls can be used for the characterization of the tissue gene expression profile. Additionally, detection calls are widely used in transcriptomic studies for filtering the genes used for differential expression analysis, for clustering samples or for building more reliable classifiers. If we took, as a criterion of transcription, the presence of at least one uniquely mapped read, then many intergenic regions would be classified as expressed, which we believe would be uninformative. The goal of our work was to find the best way to define the cut-off value on the transcription level that allows discrimination between biologically relevant expression of genes and the transcription level coming from experimental noise or background activity of the transcription machinery. In our methodology we estimate the unspecific expression level on the basis of randomly selected intergenic fragments in each library individually. Despite up to four-fold differences in the number of aligned reads between libraries, the proportion of genes called expressed by our algorithm remained consistent among different samples. Less than 15% and 20% of intergenic regions for mouse (n=17) and human (n=16) data, respectively, were ever called “expressed”. In contrast, more than 80% of mouse and 90% of human protein-coding genes were called “expressed” at least once.
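A minimal sketch of the underlying idea, choosing a per-library expression cutoff from the distribution of intergenic signal, might look as follows (the percentile, input values and variable names are illustrative assumptions, not the Bgee implementation):

    import numpy as np

    def call_expressed(gene_signal, intergenic_signal, quantile=0.95):
        # Call genes 'expressed' if their signal exceeds most of the intergenic signal.
        # gene_signal:       per-gene expression values (e.g. RPKM) for one library
        # intergenic_signal: values for randomly selected intergenic regions
        #                    in the same library (hypothetical inputs)
        cutoff = np.quantile(intergenic_signal, quantile)
        return gene_signal > cutoff, cutoff

    genes = np.array([0.0, 0.3, 2.5, 14.0, 0.9])
    intergenic = np.random.exponential(scale=0.2, size=10_000)
    calls, cutoff = call_expressed(genes, intergenic)
    print(cutoff, calls)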
B20 - HTS fingerprints in hit finding and lead discovery
Short Abstract: Virtual screening using bioactivity profiles is a major component of currently applied hit finding methods in the pharmaceutical industry. At our institution, small molecules are compared via their HTS fingerprints (HTS-FPs), i.e., their activity outcomes in more than 200 biochemical and cellular primary assays, which have been run historically in the company. Viewed globally, the HTS-FP of a compound represents its interaction with the proteome. If a small molecule perturbs a biological system in a desired manner, compounds with similar HTS-FPs are identified and tested in the hope of finding additional compounds with the same mode of action. In exhaustive benchmark trials, the effectiveness of HTS-FPs in retrieving structurally novel molecules in virtual screening and hit expansion has been demonstrated. However, hit identification using HTS-FPs suffers from the same limitation as all other bioactivity-based methods: they can only be applied to compounds that have been profiled in the past and are biologically annotated. To overcome this limitation, we have developed a virtual screening approach that integrates chemical and biological similarity and allows us to exploit the power of HTS-FPs for compounds that are not part of the corporate screening collection and have not been assayed previously. The approach was benchmarked on primary screening data sets and also applied to a set of natural products, which demonstrated the ability of the approach to depart from complex reference structures and retrieve synthetically more accessible chemical matter that exhibits the desired biological activity.
B21 - Improving Fragmentation Trees
Short Abstract: Tandem mass spectrometry is a key technology for sensitive, automated and high-throughput analysis of small molecules such as metabolites. Because manual interpretation of mass spectra is tedious and time-consuming, fragmentation trees have been introduced for their automated analysis (Böcker and Rasche, Bioinformatics 2008). This approach uses fragmentation patterns in MS2 spectra to identify the molecular formula and annotates the fragments as well as the fragmentation reactions without the use of spectral databases.
We greatly improved the quality of fragmentation trees by learning the parameters of our method from the spectral data of reference compounds. We use a dataset which contains 3529 MS2 spectra covering a wide range of different classes of small biomolecules. We split the data into a training and a test set. We compute fragmentation trees for all spectra in the training set. For the trees which identify the correct molecular formula we count the annotated losses and fragments and adjust the parameters of the loss mass distribution with respect to the observed losses. Formulas of losses and fragments which occur more often than other formulas with the same mass are then favoured in the scoring. This analysis is repeated in an iterative manner. With increasing quality of the scoring function more molecular formulas are correctly identified. To avoid overfitting of the model, we evaluate the learned parameters on the test set.
The analysis leads to an improved scoring function which results in better quality of fragmentation trees and a higher probability to identify the correct formula.
B22 - Automated Patent Categorization and Guided Patent Search using IPC as Inspired by MeSH and PubMed
Short Abstract: Document search on PubMed, the pre-eminent database for biomedical literature, relies on the annotation of its documents with relevant terms from the Medical Subject Headings ontology (MeSH) for improving recall through query expansion. Patent documents are another important information source, though they are considerably less accessible. One option to expand patent search beyond pure keywords is the inclusion of classification information: Since every patent is assigned at least one class code, it should be possible for these assignments to be automatically used in a similar way as the MeSH annotations in PubMed.
In order to develop a system for this task, it is necessary to have a good understanding of the properties of both classification systems. This report describes our comparative analysis of MeSH and the main patent classification system, the International Patent Classification (IPC). We investigate the hierarchical structures as well as the properties of the terms/classes respectively, and we compare the assignment of IPC codes to patents with the annotation of PubMed documents with MeSH terms.
Our analysis shows a strong structural similarity of the hierarchies, but significant differences of terms and annotations. The low number of IPC class assignments and the lack of occurrences of class labels in patent texts imply that current patent search is severely limited. To overcome these limits, we evaluate a method for the automated assignment of additional classes to patent documents, and we propose a system for guided patent search based on the use of class co-occurrence information and external resources.
B23 - Isotope cluster based compound matching in Gas Chromatography/Mass Spectrometry for non- targeted metabolomics
Short Abstract: Gas chromatography coupled to mass spectrometry (GC/MS) has emerged as a powerful tool in metabolomics studies. A major bottleneck in current data analysis of GC/MS-based metabolomics studies is compound matching and identification, as current methods generate high rates of false-positive and false-negative identifications. This is especially true for data sets containing a high amount of noise. In this work, a novel spectral similarity measure based on the specific fragmentation patterns of electron impact mass spectra is proposed. An important aspect of these algorithmic methods is the handling of noisy data. The performance of the proposed method was evaluated on a complex biological dataset against the gold standard, the dot product. The analysis results showed significant improvements of the proposed method in compound matching and chromatogram alignment compared to the dot product.
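For reference, the "gold standard" dot-product (cosine) similarity between two centroided mass spectra, computed on a shared m/z binning, can be sketched as follows (the bin width and example peak lists are hypothetical; the poster's proposed similarity measure is not reproduced here):

    import numpy as np

    def dot_product_similarity(spec_a, spec_b, bin_width=1.0, max_mz=1000.0):
        # Cosine similarity between two (m/z, intensity) peak lists.
        n_bins = int(max_mz / bin_width)

        def to_vector(spec):
            v = np.zeros(n_bins)
            for mz, intensity in spec:
                idx = int(mz / bin_width)
                if idx < n_bins:
                    v[idx] += intensity
            return v

        a, b = to_vector(spec_a), to_vector(spec_b)
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom > 0 else 0.0

    spec1 = [(73.0, 100.0), (147.1, 40.0), (217.1, 15.0)]
    spec2 = [(73.1, 90.0), (147.0, 55.0), (290.2, 5.0)]
    print(dot_product_similarity(spec1, spec2))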
B24 - From CABRI to MIRRI: the evolution of an ICT infrastructure for microbial resources
Short Abstract: Microbial resource centers (MRCs) have offered services to the scientific community for centuries. In the Internet era, many efforts have been undertaken to integrate their data and services.
The EU project Common Access to Biological Resources and Information (CABRI) implemented unified access to culture collection catalogues, also guaranteeing a common level of quality of material and related information. Contents of partner catalogues were compared and both Minimum and Recommended data sets were defined. A 'one-stop-shop' for biological resources was finally achieved: researchers can search, select and pre-order strains. The CABRI platform is based on SRS (see www.cabri.org).
In the context of the ESFRI programme, the Microbial Resource Research Infrastructure (MIRRI) started its preparatory phase in 2012 aiming to provide a wealth of microbial resources, associated data, taxonomic methods, and expertise to serve users' needs. MIRRI “Data Resources Management” activity serves to improve the quantity, quality, interoperability, and usage of data associated with biological material.
No standardized protocols are presently available for the submission of strain-specific metadata to collections, resulting in heterogeneous and incomplete datasets. MIRRI will address this by developing concepts and standards for data acquisition. Common strategies for evaluation, curation, integration, and interoperability of data across MRCs will be developed. Moreover, the requirements for data access and for a user-friendly interface will be investigated. An assessment of existing data integration tools, platforms, standards, and projects, including CABRI, will be part of the activities to avoid redundancy and duplication and to harness existing know-how.
B25 - Piggeldat: a data management, integration and publication website for a multidisciplinary research consortium.
Short Abstract: Pigs are important model animals for issues related to nutritional principles, and they are of great relevance to the food industry. The German Sonderforschungsbereich 852 (SFB 852), a collaborative research consortium, performs several animal trials in order to analyze the effects of feed additives on the development of piglets. The working groups involved perform measurements from various fields and obtain quantitative and qualitative information about the intestinal microbiota and physiology of piglets. The SFB 852 uses the website Piggeldat for management, integration and publication of these diverse data-sets. We created Piggeldat using the open-source content management system Drupal in combination with our customized module Catria. Piggeldat stores measurement data in the file system of the server and associates them with related meta-data, which are stored in a PostgreSQL database. SFB 852 members are able to grant the scientific community access to their own measurement data-sets. The website provides several tools for browsing, illustration and analysis of data.
B26 - GenomeSpace: An environment for frictionless bioinformatics
Short Abstract: GenomeSpace, www.genomespace.org, is an environment that brings together diverse computational tools, enabling scientists without programming skills to easily combine their capabilities. It aims to offer a common space to create, manipulate and share an ever-growing range of genomic analysis tools. GenomeSpace features support for cloud-based data storage and analysis, multi-tool analyses, automatic conversion of data formats, and ease of connecting new tools to the environment. A repository of analysis “recipes” provides a growing collection of useful multi-tool protocols for common analysis tasks.
A set of six “GenomeSpace-enabled” seed tools developed by collaborating organizations provides a comprehensive platform for the analysis of genomic data: Cytoscape (UCSD), Galaxy (Penn State University), GenePattern (Broad Institute), Genomica (Weizmann Institute), Integrative Genomics Viewer (Broad Institute), and the UCSC Genome Browser (UCSC). The extensible format of the system has empowered a wider range of bioinformatics analyses through the addition of ArrayExpress (European Bioinformatics Institute), InSilico DB (University of Brussels), geWorkbench (Columbia University), Cistrome (Dana-Farber Cancer Institute), gitools (Pompeu Fabra University), and ISAcreator (Oxford University), with additional tools added to the GenomeSpace environment on an ongoing basis. GenomeSpace is freely available open source software.
B27 - From Short to Long Reads: Benchmarking Assembly Tools
Short Abstract: An increasing number of DNA de novo assembly tools are being developed, each claiming to produce better results in some aspect than their competition. It is, however, interesting that not enough attention has been paid to their comparative evaluation. Even in cases where the quality of their results has been tested, it is hard to find information on their execution performance.
We designed a benchmarking methodology and applied it to several DNA de novo assembly tools. Unlike other comparative studies, our primary goal was to focus on the assemblers' resource consumption as a function of varying lengths and coverages of input read sequences. Since such a study is very time-consuming, we have currently performed benchmarking on a limited number of assemblers, and report preliminary results here.
We have defined a collection of 77 datasets of simulated read sequences of E. coli, designed to cover the space of varying read lengths and coverages. Benchmarking was performed on two de Bruijn graph (DBG) based assemblers, Velvet and SOAPdenovo, and two overlap graph (OG) based assemblers, SGA and Minimus.
Preliminary results show that DBG-based assemblers generally perform faster than OG-based ones. Additionally, the DBG assemblers' memory consumption reaches a plateau at some point. The two tested OG assemblers produce differing memory results, presumably because of different underlying alignment algorithms. However, the DBG assemblers seem to produce much lower N50 and maximal contig lengths than the OG assemblers, especially for longer reads. We conclude that OG is the approach of preference for the upcoming sequencing technologies that will produce longer reads.
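A minimal sketch of how the wall-clock time and peak memory of a single assembler run might be recorded on Linux (the command line shown is a placeholder, not the exact invocation used in this study):

    import resource
    import subprocess
    import time

    def benchmark(cmd):
        # Run a command and report wall-clock time and peak RSS of its children (Linux).
        start = time.time()
        subprocess.run(cmd, check=True)
        wall = time.time() - start
        peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss  # kilobytes on Linux
        return wall, peak_kb / 1024.0  # seconds, MiB

    # hypothetical Velvet run on one of the simulated E. coli read sets
    wall_s, peak_mib = benchmark(["velveth", "out_dir", "31", "-fastq", "reads.fq"])
    print(f"wall time: {wall_s:.1f} s, peak memory: {peak_mib:.0f} MiB")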
B28 - Data publication and interoperability with nanopublications
Short Abstract: Published data, whether in traditional publication formats such as research articles or in databases, often lack a consensus structure (which slows search and reasoning) and provenance and citation models (which lowers the incentive for publication). Furthermore, in some disciplines the growing rate of data production exceeds the capacity of human comprehension. Together, these trends lead to the loss of valuable data from scientific discourse. Nanopublication is a data publication model built on top of existing Semantic Web technologies to counter these data dissemination and management trends. A nanopublication represents the smallest unit of publishable information and consists of (i) an assertion and (ii) provenance. The assertion takes the form of one or more semantic triples (subject-predicate-object combinations). The provenance describes how the assertion `came to be', and includes supporting information (e.g., context, parameter settings, a description of methods) and attribution to the authors (of content) and creators (of the nanopublication), institutions supporting the work, funding sources and other information such as date and time stamps and certification. Creating a nanopublication requires a one-time effort to model the assertion and provenance as RDF named graphs. After submission to an open, decentralized nanopublication store, the nanopublication will be available both to humans and to automated inference and discovery engines. Nanopublications can be used to expose quantitative and qualitative data, experimental data as well as hypotheses, novel or legacy data and negative data that usually goes unpublished. Nanopublications are meant to augment (not replace) traditional long-form narrative.
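A toy nanopublication, with the assertion and its provenance kept in separate RDF named graphs, could be assembled with rdflib roughly as follows (the graph URIs, namespaces and example triple are hypothetical, and the full nanopublication schema also defines additional graphs linking the parts):

    from rdflib import Dataset, Literal, Namespace, URIRef
    from rdflib.namespace import XSD

    EX = Namespace("http://example.org/")
    PROV = Namespace("http://www.w3.org/ns/prov#")

    ds = Dataset()

    # (i) the assertion: a single subject-predicate-object statement
    assertion = ds.graph(URIRef("http://example.org/np1/assertion"))
    assertion.add((EX.GeneA, EX.upRegulates, EX.GeneB))

    # (ii) the provenance: who asserted it and when
    provenance = ds.graph(URIRef("http://example.org/np1/provenance"))
    provenance.add((URIRef("http://example.org/np1/assertion"),
                    PROV.wasAttributedTo, EX.someResearcher))
    provenance.add((URIRef("http://example.org/np1/assertion"),
                    PROV.generatedAtTime,
                    Literal("2013-05-01T12:00:00", datatype=XSD.dateTime)))

    print(ds.serialize(format="trig"))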
B29 - PopGenome: An efficient Swiss army knife for population genetic & genomic data in R
Short Abstract: PopGenome is a new package for population genomic analysis and method development, based on the powerful, open-source, statistical computing environment R. R is available for all major operating systems and has built-in high-level scientific graphics capabilities. PopGenome includes, for example, a wide range of polymorphism and neutrality statistics and FST estimates, which are applicable to sequence data stored in alignment format, as well as whole genome SNP data from the 1000/1001 Genome projects. The full range of methods can be applied to sliding windows based on either the whole genome or only the SNPs. PopGenome is also able to handle GFF/GTF annotation files and automatically specifies the SNPs located in, e.g., exon or intron regions. Those subsites can be analyzed at once (e.g., all introns together) or each region separately (e.g., one value per intron).
The PopGenome framework is linked to Hudson’s MS program for significance tests using coalescent simulations.
We envision the open source PopGenome project to form a basic framework for the implementation of new methods by population geneticists, much as the Bioconductor project provides a framework for, e.g., new microarray analysis methods.
PopGenome is freely available under the GNU General Public License from CRAN, and its components can be freely extended and reused.
B30 - MOLE 2.0: Improved Approach for Analysis of Biomacromolecular Channels and Pores
Short Abstract: Biomacromolecular channels and pores play significant biological roles, e.g., in molecular recognition and enzyme substrate specificity; this information can be further utilized in biotechnology applications aimed at designing more effective and selective enzymes. Unquestionably, the identification and characterization of channels is fundamental to the understanding of numerous biologically relevant processes and provides a starting point for rational drug design, protein engineering and biotechnological applications.
Here we present MOLE 2.0, a novel tool capable of rapid calculation of empty voids in proteins, with a strong emphasis on the detection of substrate/product access/egress channels. This new version builds on its predecessor, which was well received by the scientific community, overcomes its shortcomings and significantly improves the speed of the calculation, making MOLE 2.0 suitable even for analyses of large biomacromolecular channel systems that are beyond the capabilities of any other comparable up-to-date software. On top of that, MOLE 2.0 introduces the novel concept of physicochemical channel properties, providing further information about the analyzed data. Our software is available both with a user-friendly graphical interface offering immediate result visualization and as a standalone binary for processing large datasets such as the outputs of molecular dynamics or whole families of proteins.
MOLE 2.0 is available free of charge from http://mole.chemi.muni.cz
B31 - Discovery of survival gene markers for cancer prognosis using genome-wide expression profiles
Short Abstract: The identification of gene signatures that discriminate between several cancer types using the gene expression profiles derived from genome-wide analyses has been addressed by many authors. Although gene marker discovery linked to patient prognosis and survival has become increasingly important, it has received less attention and remains a challenging task.
In this work, we present a method to identify prognostic gene markers that allow us to partition the sample into groups that maximize the separability between their Kaplan-Meier curves. The separability is evaluated through a trimmed log-rank test. The p-value provided by this test allows us to rank the genes according to their prognostic power. Next, we have developed a methodology to study the association between the best prognostic markers and several clinical variables considered relevant for patient outcome. Finally, the feature selection method has been extended to identify groups of two genes (binary markers) that relate to the patient prognosis. In particular, the algorithm looks for pairs of genes such that both of them change their states in patients of poor outcome.
The method proposed has been applied to discover genes associated with response and survival following neoadjuvant taxane-anthracycline chemotherapy for HER2-negative invasive breast cancer. An independent validation cohort of 198 breast cancer patients was used to rigorously evaluate the algorithm's performance. The method is also validated with other cancer series. The experimental results suggest that the feature selection method proposed is able to recover relevant genes in breast cancer prognosis that are missed by other methods.
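As a rough sketch of the ranking step (using an ordinary log-rank test rather than the trimmed variant described above), genes could be scored by splitting patients at the gene's median expression; the lifelines call, toy data and variable names below are illustrative assumptions:

    import numpy as np
    from lifelines.statistics import logrank_test

    def rank_genes(expression, times, events):
        # Rank genes by an ordinary log-rank p-value (median-expression split).
        # expression: array of shape (n_genes, n_patients)
        # times:      follow-up times, shape (n_patients,)
        # events:     1 if the event (e.g. relapse/death) was observed, else 0
        pvalues = []
        for gene_expr in expression:
            high = gene_expr > np.median(gene_expr)
            result = logrank_test(times[high], times[~high],
                                  event_observed_A=events[high],
                                  event_observed_B=events[~high])
            pvalues.append(result.p_value)
        return np.argsort(pvalues)  # gene indices, most prognostic first

    # toy data: 50 genes x 100 patients
    rng = np.random.default_rng(1)
    expr = rng.normal(size=(50, 100))
    times = rng.exponential(scale=60, size=100)
    events = rng.integers(0, 2, size=100)
    print(rank_genes(expr, times, events)[:5])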
B32 - Accelerating Hybrid Error Correction and Assembly of Single-Molecule Sequencing Reads
Short Abstract: Multikilobase sequences from the PacBio RS have the potential to span repetitive regions, thereby simplifying and improving genome and transcriptome assembly and finishing. However, error rates for these long reads somewhat limit their utility. Koren et al. recently developed an assembly strategy that uses short, high-fidelity sequences to correct the error in single-molecule sequences. This algorithm for PacBio corrected reads (PBcR) achieves >99.9% base-call accuracy, leading to better assemblies than other sequencing strategies. In one example, the median contig size quintupled relative to high-coverage, second-generation assemblies. From a computational perspective, PBcR requires as much time to run as the subsequent assembly with Celera Assembler (CA); in the case of the parrot genome, each step requires about 20K core hours to run.
In this work, we describe optimizing PBcR to take advantage of the highly parallel processing architecture of a Convey Hybrid-Core (HC) server in order to run much more quickly. The overlap subroutine of PBcR requires the most compute time; the all-versus-all overlaps between the long- and short-read sequences use a seed-and-extend approach based on the Smith-Waterman algorithm. Smith-Waterman searches on an HC-2ex server are 14.5x faster than the best implementation on a standard server, making the overlap subroutine a great candidate for optimization on HC. The overlap subroutine is also a significant step in CA's overlap consensus assembler, so optimizations of the subroutine on the HC server improve the runtime performance of both PBcR and CA.
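For context, the Smith-Waterman recurrence at the core of the seed-and-extend overlap step can be written in a few lines; this is a plain scalar reference version with assumed scoring parameters, not the hardware-accelerated HC implementation:

    def smith_waterman(a, b, match=2, mismatch=-2, gap=-3):
        # Return the best local alignment score between sequences a and b.
        rows, cols = len(a) + 1, len(b) + 1
        H = [[0] * cols for _ in range(rows)]
        best = 0
        for i in range(1, rows):
            for j in range(1, cols):
                diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
                best = max(best, H[i][j])
        return best

    print(smith_waterman("ACACACTA", "AGCACACA"))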
B33 - Cross-platform Evaluation of the Performance of Microarrays and RNA-seq
Short Abstract: BACKGROUND
The vast majority of high-throughput transcriptome studies to date use either microarray or RNA-sequencing platforms. However, their performance has not been systematically compared. In this study we utilized a comprehensive RNA-seq dataset of four titration pools from two human reference RNA samples, and evaluated the performance of several commercially available microarrays (Affymetrix Hu133plus2, PrimeView, HuGene2.0 and the new exon-junction array HTA2.0).
RESULTS
The performance of microarrays in differential expression profiling is compared with RNA-Seq. The results show that, in terms of reproducibility, accuracy and detection power, different microarrays are comparable to RNA-Seq of different read depths. The new human exon-junction array HTA2.0 performs better than the HuGene2.0, Hu133plus2 and PrimeView for both absolute and relative quantification. Analysis of the coefficient of variation (CV) indicates that microarrays are reproducible across their dynamic range, with HTA2.0 having the smallest median CV. Further analyses of the titration order and the linear relationship among mixture samples suggest that microarrays can recover the ground truth correctly, with HTA2.0 performing the best at both the gene and exon level.
CONCLUSIONS
Microarrays are comparable to RNA-Seq for differential expression profiling. The new exon-junction array HTA2.0 gives the best performance for all the metrics examined. The high-density tiling design of HTA2.0, representing an in silico sampling of reads, makes its results more similar to those of RNA-Seq.
B34 - Adaptable Probabilistic Short Read Alignment Using Position Specific Scoring Matrices
Short Abstract: This poster presents an adaptable probabilistic alignment algorithm yielding an improved sensitivity for the mapping of short reads obtained from experiments with a known substitution bias. While the ability and necessity for modern alignment methods to cope with the longer reads produced by modern high-throughput sequencing methods has been at the forefront of read mapping technology, an equally important segment is often left out of the spotlight. A multitude of experimental methods and endeavours such as PAR-CLIP, ancient DNA, and bisulfite sequencing yield reads which contain an excess of one particular type of substitution over another. The presence of these supplementary base changes can significantly impair the ability of existing aligners to confidently find the true origin of a read in a genome.
The method presented here employs position-specific scoring matrices (PSSMs) to calculate a score and associated probability for the mapping of a read to a particular position in a genome, given prior knowledge about the potential modifications or substitutions the read may have undergone prior to sequencing. Combining an algorithm developed for rapid alignment of short reads with a method for motif finding, we present a sensitive, adaptable probabilistic approach to read mapping which greatly improves the sensitivity of aligning reads with a proclivity for particular base-pair substitutions. Our approach is fast, general and extensible, allowing for easy experimentation with different data sets and downstream analysis methods.
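The scoring idea can be sketched as summing, over read positions, the log-odds of observing each read base given the reference base, with the substitution expected from the protocol (e.g. C-to-T in bisulfite data) given an elevated probability. For simplicity this sketch uses a single substitution matrix rather than position-specific matrices, and all probabilities are made-up values:

    import math

    # illustrative substitution probabilities P(read base | reference base) for a
    # bisulfite-style protocol in which reference C frequently appears as read T
    SUBST = {
        "A": {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
        "C": {"A": 0.01, "C": 0.58, "G": 0.01, "T": 0.40},   # elevated C->T
        "G": {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
        "T": {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},
    }
    BACKGROUND = 0.25  # uniform base composition assumed for the null model

    def map_score(read, reference):
        # Log-odds score of the read originating from this reference window.
        return sum(math.log(SUBST[ref][obs] / BACKGROUND)
                   for ref, obs in zip(reference, read))

    # a read with two C->T conversions still scores highly against its true origin
    print(map_score("ATTGTA", "ATCGCA"))
    print(map_score("ATTGTA", "GGGGGG"))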
B35 - Analysis of Resting-State fMRI using Pearson-VII Mixture Modeling
Short Abstract: In the past two decades, functional magnetic resonance imaging (fMRI) has been widely used to research and characterize neural activity in the brain based on measuring the hemodynamic response correlated to neuronal firing. A main goal of fMRI analysis is to characterize functional connectivity – correlation in activation pattern – between different regions of the brain for different task- or disease-related events. Here, we present a novel, data-driven method for identifying functionally connected regions of the brain using a Pearson-VII mixture model learning algorithm (p7-means). This method is complementary to independent component analysis (ICA), which derives underlying source signals, rather than probabilistic distributions, to characterize brain components. The p7-means algorithm is powerful in its ability to model a range of leptokurtic to Gaussian components, which makes it robust in identifying core functional components in noisy images. Additionally, p7-means has the advantage of learning the number of components from the data set, rather than returning a fixed number of components matching dimensionality. We apply the algorithm to resting-state monkey fMRI and compare the discovered components to those found in ICA. Correlational analysis shows consistent activation components between the two methods, although p7-means results in more spatially localized groups.
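For reference, each Pearson Type VII mixture component has a density of the form below (up to normalization); as the shape parameter m grows the component approaches a Gaussian, while m = 1 gives the heavy-tailed Cauchy/Lorentzian case. The parameter names here are generic, not those of the p7-means implementation:

    f(x \mid \lambda, \alpha, m) \propto \left[ 1 + \left( \frac{x - \lambda}{\alpha} \right)^{2} \right]^{-m}, \qquad \alpha > 0,\ m > \tfrac{1}{2}

where \lambda is the location, \alpha the scale and m the shape parameter.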
B36 - Privacy-preserving search for a chemical compound database
Short Abstract: Searching for similar compounds in a database is among the most important approaches in the process of drug discovery. Since a query compound is an important starting point for a new drug, the query compound is usually treated as secret information. The most popular method for a client to avoid information leakage is to download the whole database and use it in a closed network; however, this naive approach cannot be used if the database side also wants to keep its privacy. Therefore, a new method is needed that enables database search while both sides preserve their privacy. In this study, we address the problem of searching for similar compounds in a privacy-preserving manner, and propose a novel cryptographic protocol for solving the problem. The proposed protocol is based on semi-homomorphic encryption and is quite efficient in both computational cost and communication size. We implemented our protocol and compared it to general-purpose multi-party computation (MPC) on a simulated data set. We confirmed that the CPU time of the proposed protocol was around 1000 times faster than that of MPC. The protocol can be used for any database in which data are represented as bit vectors; we therefore expect our protocol to be applied to various kinds of problems that arise in the field of bioinformatics.
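In the clear, the similarity typically computed between fingerprint bit vectors is the Tanimoto (Jaccard) coefficient, as sketched below; the actual protocol evaluates this kind of quantity under additively homomorphic encryption so that neither side reveals its fingerprint. The fingerprints shown are made up:

    def tanimoto(fp_a, fp_b):
        # Tanimoto coefficient between two fingerprints given as sets of 'on' bit positions.
        a, b = set(fp_a), set(fp_b)
        if not a and not b:
            return 0.0
        return len(a & b) / len(a | b)

    # hypothetical fingerprints: indices of set bits in a binary substructure fingerprint
    query_fp = {3, 17, 42, 128, 511}
    database_fp = {3, 17, 99, 128, 256, 511}
    print(tanimoto(query_fp, database_fp))  # 4 shared bits / 7 distinct bits ~ 0.571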
B37 - Parallelizing the Indexing Stage in the ABySS Genome Assembler
Short Abstract: Currently, the ABySS algorithm operates in three stages: sequence assembly, paired-end assembly and scaffolding. The sequence assembly portion accounts for only approximately 20% of the pipeline. Generating the indexes of unitigs and contigs to be used for aligning the reads is a costly operation in memory use and time. For the human genome assembly, we have observed this single-threaded portion taking up to 180 GB of RAM. Aligning the reads accounts for about 80% of the last two stages of ABySS, thus constituting a significant portion, an overall 64%, of the whole assembly process. For the human genome this can take up to 16 hours on modern processors.
In this work, we aim to significantly improve the performance of the indexing stage by parallelizing the abyss-index portion of the ABySS assembly pipeline. We do this by partitioning the assembly graph created in the first phase of ABySS and by indexing the individual partitions in parallel on low-memory machines. We also align reads in parallel and finally recreate a BAM-compliant output file by merging all partial alignments. This process poses a few challenges, starting with a balanced and efficient partitioning of the graph such that adjacent unitigs remain in the same partition. We also identify which reads fall into each partition for faster alignment; we address this with the aid of Bloom filters constructed on each partition to test membership.
In conclusion, this work promises considerable performance improvements, especially for larger genomes.
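To illustrate the read-routing idea with a minimal sketch (toy unitigs and reads, not the actual ABySS data structures): one Bloom filter per partition, built from the partition's unitig k-mers, answers "might this read belong here?" so that each read is only aligned against the partitions it can match.

```python
# Sketch: route reads to graph partitions using per-partition Bloom filters
# built from unitig k-mers. Toy data; the real pipeline partitions the ABySS
# assembly graph and indexes each partition separately.
import hashlib

K = 5  # k-mer size (toy value)

class BloomFilter:
    def __init__(self, size_bits=8192, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def kmers(seq, k=K):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

# Toy partitions: each holds a few unitig sequences.
partitions = {
    "part0": ["ACGTACGTGG", "TTGCAACGTA"],
    "part1": ["GGGTTTCCCA", "CCATGGTACC"],
}
filters = {}
for name, unitigs in partitions.items():
    bf = BloomFilter()
    for u in unitigs:
        for km in kmers(u):
            bf.add(km)
    filters[name] = bf

# Route each read to the partitions sharing at least one k-mer with it.
# Bloom filters may give occasional false positives but never false negatives.
reads = ["ACGTACG", "CATGGTA", "AAAAAAA"]
for read in reads:
    hits = [name for name, bf in filters.items()
            if any(km in bf for km in kmers(read))]
    print(read, "->", hits or ["unassigned"])
```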
TOP
B38 - Light-weight modular constraint enforcement for scientific workflow systems
Short Abstract: With the rapidly growing amount of data generated in the modern life sciences and the ever-increasing complexity of state-of-the-art analyses, the quality control and supervision of individual computational analysis steps has become a considerable challenge in bioinformatics and related disciplines. While software systems have been developed to support data analysis pipelines, they typically provide a useful platform for the automation of routine data analysis steps as commonly found in industrial or facility settings. The integration of new tools and data, however, is often complicated, and the generation of new workflows can be very time-consuming and limited by graphical user interfaces. With little support for rapid prototyping and revision control, most systems lack the flexibility required for the iterative, explorative analyses common in scientific research. We present a light-weight modular system for development cycle control and policy-based specification of rules and requirements that supports transparent data provenance and in-flow enforcement of consistency constraints. Its application over the life-cycle of typical analyses is demonstrated on multiple use-cases, including sequence feature detection in comparative genomics and sequence analysis, and model-based optimization of experimental design for genome-scale transcript expression profiling experiments.
TOP
B39 - Parallel-QC: Parallel computational engine for NGS data quality control
Short Abstract: Next-generation sequencing (NGS) technologies have become common practice in many areas of the life sciences. However, raw NGS reads include different types of sequencing artifacts, such as low-quality reads and contaminating reads, which can compromise downstream analyses. Therefore, quality control (QC) of raw NGS data is crucial. Key steps in NGS data QC include sequencing quality assessment and contamination screening, both of which are currently time-consuming and highly dependent on pre-defined information such as the source of contamination. Unfortunately, most current NGS data QC tools cannot identify contaminating sources de novo, and this information is usually not available in advance. More importantly, the processing speed of current QC tools has become a major bottleneck in handling large amounts of NGS data. Therefore, a general computational engine that conducts both sequencing quality assessment and contamination screening is desirable.
Here we report Parallel-QC, a parallel NGS data quality-control computational engine, which performs both (1) read-quality assessment and trimming, and (2) identification of unknown contamination in raw NGS data. With Parallel-QC, low-quality reads can be trimmed and all possible contaminating sources can be identified de novo, without any a priori information about the species. Moreover, Parallel-QC is optimized for parallel computation, so its processing speed is several-fold faster than that of other QC methods. The program is developed in C++ for Linux using multi-threading on x86-64 multi-core CPU platforms. As a computational engine, it can also be easily embedded in other pipelines.
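The trimming step can be illustrated with a minimal sketch (the real engine is a multi-threaded C++ program; here toy in-memory records with Phred+33 qualities stand in for FASTQ input, and reads are spread over worker processes):

```python
# Sketch: trim reads at the first low-quality base, distributed over CPU cores.
# Assumes Phred+33 quality strings; toy records instead of real FASTQ parsing.
from multiprocessing import Pool

QUAL_CUTOFF = 20   # Phred quality threshold
MIN_LENGTH = 5     # discard reads shorter than this after trimming

def trim(record):
    name, seq, qual = record
    for i, q in enumerate(qual):
        if ord(q) - 33 < QUAL_CUTOFF:
            seq, qual = seq[:i], qual[:i]
            break
    return (name, seq, qual) if len(seq) >= MIN_LENGTH else None

if __name__ == "__main__":
    reads = [
        ("read1", "ACGTACGTAC", "IIIIIIIIII"),   # all high quality
        ("read2", "ACGTACGTAC", "IIIII#IIII"),   # quality drops at base 6
        ("read3", "ACGTACGTAC", "##########"),   # entirely low quality
    ]
    with Pool(processes=4) as pool:
        kept = [r for r in pool.map(trim, reads, 1000) if r]
    for name, seq, qual in kept:
        print(name, seq)
```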
TOP
B40 - Comparison of the DNA amplification methods on SNP identification for single-cells
Short Abstract: Over the past decades, researchers have sequenced thousands of genomes from all kinds of species. Almost all sequenced DNA was extracted from millions of cells, in which a great deal of cell-to-cell variability is hidden behind the population profiles. To better understand the features of single cells at the genome level, it is necessary to isolate, amplify, sequence and analyze the genomic data of single cells.
Single-cell genomic data extraction and analysis depend on DNA amplification, for which two techniques are currently in common use. MDA (multiple displacement amplification) is an isothermal amplification method that uses random primers and Phi29 DNA polymerase to generate fairly large fragments (10-20 kb) with high fidelity. MALBAC (multiple annealing and looping-based amplification cycles) was developed to improve amplification evenness compared with MDA; in MALBAC, DNA from a single cell is isolated and short DNA molecules called primers are then added.
In this study, three datasets are used to compare the effects of MDA and MALBAC amplification on single-cell analysis. Raw reads of human sperm cell genomes are mapped to the reference genome with SOAP, BWA and Bowtie, and SAMtools and SOAPsnp are then used to detect SNPs. The consensus results across these methods are considered SNP candidates for single cells. Examining the coverage of mapped reads across the reference genome, the results show that although more SNPs are discovered from the MALBAC data, the distribution of SNPs across the genome is largely even for both amplification techniques.
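A minimal sketch of the consensus step, assuming each mapper/caller combination's variant calls have already been reduced to (chromosome, position, reference, alternate) tuples; the toy data and the rule "keep only sites reported by every combination" are illustrative choices, not necessarily the exact rule used in the study:

```python
# Sketch: keep only SNP calls reported by every mapper/caller combination.
# Calls are assumed to be pre-parsed into (chrom, pos, ref, alt) tuples.
calls = {
    ("BWA", "SAMtools"):    {("chr1", 101, "A", "G"), ("chr1", 250, "C", "T"),
                             ("chr2", 333, "G", "A")},
    ("BWA", "SOAPsnp"):     {("chr1", 101, "A", "G"), ("chr2", 333, "G", "A")},
    ("SOAP", "SOAPsnp"):    {("chr1", 101, "A", "G"), ("chr2", 333, "G", "A"),
                             ("chr3", 777, "T", "C")},
    ("Bowtie", "SAMtools"): {("chr1", 101, "A", "G"), ("chr2", 333, "G", "A")},
}

consensus = set.intersection(*calls.values())
for chrom, pos, ref, alt in sorted(consensus):
    print(f"{chrom}\t{pos}\t{ref}>{alt}")
```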
TOP
B41 - Update on model management techniques and tools: Model version control and extended model search
Short Abstract: Simulation models have become a standard tool for many bioinformaticians to test and evaluate their hypotheses. Consequently, model management is necessary to handle the increasing number of computational models of biological systems. However, the reuse of existing models is hindered by the fact that models can be hard to find, that model changes are not clearly propagated to users, or that a model - when downloaded from a model repository - cannot readily be reused and its simulation results cannot be reproduced.
The SEMS project (http://sems.uni-rostock.de) focuses on strategies for model and simulation management. Recent developments include ranked retrieval for SBML and CellML models, graph-based, integrated storage of models and simulations, and a version control method for SBML- and CellML-encoded models.
In this presentation we give an overview of the most important management tasks and discuss how the development of suitable methods and tools will improve model reuse and result reproducibility. We outline our methods for ranked retrieval, present our graph-based storage approach, and explain our methods for model version control and difference detection. All methods are demonstrated in our prototypic software implementations.
TOP
B42 - Meta-Mesh: Metagenome Database and Data Analysis System
Short Abstract: Scientists have long sought to efficiently find similar microbial communities in a large repository, to access the meta-information of these samples and to examine how similar the samples are. In this study, we propose a novel system, Meta-Mesh (http://www.meta-mesh.org/), comprising a database and a companion analysis system that can systematically and efficiently search for similar metagenomic samples. In the database part, we have collected more than 7,000 high-quality and well-annotated metagenomic samples from the public domain and in-house facilities. The analysis part includes a list of tools which accept metagenomic samples, build taxonomical annotations, and then search for similar samples against the carefully selected, well-organized and annotated database using a fast scoring function. It has a multi-thread submission portal and a well-designed data management client for easy submission of large and complex data sets, and it integrates a variety of viewers to provide a visualization solution for result analysis. Users can also use Meta-Mesh to compare their own samples and obtain a similarity score matrix. In the Meta-Mesh online service, user access is protected to ensure data privacy.
TOP
B43 - NGS Logistics: Data infrastructure for efficient analysis of NGS sequence variants
Short Abstract: This poster is based on Proceedings Submission 25973.
Next-Generation Sequencing (NGS) is quickly becoming a key tool in research and diagnostics of human Mendelian, oligogenic, and complex disorders. As the price and turnaround time of sequencing have dramatically decreased over the past decade, large amounts of human sequencing data are now available. Consequently, major challenges have arisen in terms of data storage, management, exchange, and federated analysis, which also raise substantial ethical and privacy issues. To tackle these challenges we developed NGS-Logistics, an online tool that processes Next-Generation Sequencing data from multiple sources, inclusively and comprehensively, while guaranteeing privacy and security. A key feature is that queries are executed across multiple centers without moving the primary data (BAM files, VCF files, etc.) around. NGS-Logistics has significantly reduced the effort and time needed to evaluate the significance of mutations from whole-genome and exome sequencing, in a safe and confidential environment. The platform also gives operators the flexibility to expand their queries and carry out further analyses.
TOP
B44 - Improving the EMBL-EBI online experience: user-centred design and inspiration from the BBC
Short Abstract: It is recognised that bioinformatics websites often suffer from usability problems; for example, they can be too complex for the infrequent user to navigate, and they can “lack sophistication” compared to other websites that people use in their daily lives. With these problems in mind, we show how a user-centred design process can be applied to the development of bioinformatics services.
Specifically, our poster showcases the techniques and processes we used to systematically redesign the European Bioinformatics Institute's website. We also explain how and why we took inspiration from the BBC's Global Experience Language (GEL), and information architecture, to improve the online user experience.
TOP
B45 - A fast whole-genome detection of LD-based haplotype blocks
Short Abstract: Investigation of linkage-disequilibrium (LD) patterns across large genomic segments is a key element in genetic association analysis. LD-based haplotype block modeling can be used in genome-wide haplotype association studies, set-based analyses where adjacent single nucleotide polymorphisms (SNPs) are grouped together, interpretation of genome-wide association studies, and assessment of the structure of high density segments processed with the new sequencing technologies.
The recognition of haplotype blocks in such segments is limited by the poor computational scalability of available software. We developed MIG++, a memory- and time-efficient implementation of the Gabriel et al. (2002) algorithm. MIG++ has linear memory complexity and a >80% improved runtime compared to the original algorithm, which has quadratic memory and runtime complexities. Runtime can be further improved by approximate estimation of the variance of the LD statistic. MIG++ incrementally constructs haplotype blocks and reduces the search space. We theoretically proved that the reduction preserves all blocks and experimentally showed that it omits >80% of computations genome-wide. MIG++ avoids restrictions on the maximal block length, considers SNP pairs at any distance, and can handle any number of SNPs.
MIG++ processed the HapMap II data (120 CEPH haplotypes, 2.5M SNPs) in ~1 hour (~150 MB) and the 1000 Genomes data (170 CEPH haplotypes, 10M SNPs) in ~44 hours (~3.6 GB) using the approximate LD estimation. The longest haplotype blocks detected in both datasets span >1 Mbp.
The memory efficiency and runtime improvements of MIG++ facilitate the integration of LD pattern recognition into the analysis of genome-wide studies and next-generation sequencing datasets.
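The block definition rests on pairwise LD statistics. A minimal sketch of the basic |D'| calculation from phased haplotypes is shown below; MIG++ additionally estimates confidence intervals for |D'| and applies the Gabriel et al. block criteria incrementally, which is not shown here.

```python
# Sketch: pairwise |D'| from phased biallelic haplotypes (0/1 coded).
# Only the basic LD statistic is shown; the Gabriel et al. (2002) block
# rules and the confidence-interval estimation used by MIG++ are omitted.
def d_prime(hap_a, hap_b):
    n = len(hap_a)
    p_a = sum(hap_a) / n                    # allele frequency at locus A
    p_b = sum(hap_b) / n                    # allele frequency at locus B
    p_ab = sum(1 for x, y in zip(hap_a, hap_b) if x == 1 and y == 1) / n
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max if d_max > 0 else 0.0

# Toy haplotypes: rows are SNPs, columns are chromosomes.
snps = [
    [0, 0, 1, 1, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 1, 0, 1],   # identical to SNP 1 -> |D'| = 1
    [1, 0, 0, 1, 1, 0, 0, 1],
]
for i in range(len(snps)):
    for j in range(i + 1, len(snps)):
        print(f"SNP{i+1}-SNP{j+1}: |D'| = {d_prime(snps[i], snps[j]):.2f}")
```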
TOP
B46 - An integrative rare diseases research portal
Short Abstract: The latest advances in modern life sciences hardware and software technologies have brought rare diseases research back from the sidelines. Whereas in the past these diseases were seldom considered relevant, in the era of whole-genome sequencing the direct connections between rare phenotypes and a reduced set of genes are of vital relevance.
The increased interest in rare diseases research is pushing forward investment and effort towards the creation of software in the field, leveraging the wealth of available life sciences data. Alas, most of these tools target one or a few rare diseases, including only the most relevant scientific breakthroughs in their specific niches. Hence, there is a clear interest in new strategies to deliver a holistic perspective over the entire rare diseases research domain.
This is the rationale behind Diseasecard: to build a true knowledge base for all rare diseases. Built using the latest semantic web technologies included in the COEUS framework (http://bioinformatics.ua.pt/coeus/), the Diseasecard portal delivers unified access to a comprehensive rare diseases network for researchers, clinicians, patients and bioinformatics developers.
Connecting over 20 distinct heterogeneous resources, Diseasecard's web workspace provides a direct endpoint to the most relevant scientific knowledge regarding a given disease, through a navigation hyper-tree, LiveView interactions and full-text search, enabling in-context browsing. Diseasecard is publicly available online at http://bioinformatics.ua.pt/diseasecard/.
TOP
B47 - Estimating the unsequenced with non-parametric empirical Bayes Poisson models
Short Abstract: In the early days of genomic sequencing, researchers could rely on the Lander-Waterman model to guide their sequencing efforts. Unfortunately, this simple Poisson model does not hold for next-generation sequencing due to biases present in the sequencing process or library preparation, both technical (e.g. uneven PCR amplification) and natural (e.g. untranslated regions in RNA-seq). More general models, allowing arbitrary bias when sampling from an arbitrary number of molecules, are necessary to achieve accurate predictions. Questions of interest include how many distinct molecules or loci can be expected, how many total molecules or loci are contained in the library, and what the relative frequencies of the remaining molecules are. We present a non-parametric empirical Bayes Poisson model of Good & Toulmin, previously used to answer analogous questions in the theory of random sampling known as capture-recapture, and apply it to genomic sequencing. Previously this model was not suitable for long-range predictions due to the large variations inherent in the power series representation of the estimates, a problem exacerbated by the size and scale of experiments typical of genomic sequencing. This scale, however, allows us to apply rational function approximations to the Good & Toulmin model. This transformation lets us achieve stable, long-range predictions far more accurate than existing methods, including the Lander-Waterman (simple Poisson) and negative binomial models.
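A minimal sketch of the underlying Good-Toulmin power series, assuming counts-of-counts n_k (the number of distinct molecules seen exactly k times) from an initial run; the toy counts are invented, and the rational-function stabilization described above is not shown:

```python
# Sketch: Good-Toulmin estimate of the number of NEW distinct molecules
# expected if the experiment were scaled up by a further factor t.
# counts_of_counts[k] = number of distinct molecules observed exactly k times.
# The raw alternating series becomes unstable for t > 1, which is why the
# authors replace it with rational-function approximations (not shown).
def good_toulmin(counts_of_counts, t):
    return sum(((-1) ** (k + 1)) * (t ** k) * n_k
               for k, n_k in counts_of_counts.items())

# Toy counts-of-counts from a hypothetical initial sequencing run.
counts = {1: 5000, 2: 1800, 3: 700, 4: 300, 5: 120}

for t in (0.25, 0.5, 1.0):
    print(f"t = {t}: expected new distinct molecules ~ {good_toulmin(counts, t):.0f}")
```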
TOP
B48 - Allele specific expression and RNA editing analysis with eXpress
Short Abstract: We present eXpress 2, an updated software package for efficient probabilistic assignment of ambiguously mapping sequenced fragments. eXpress uses a streaming algorithm with linear run time and constant memory use. It can determine the abundances of sequenced molecules in real time and can be applied to RNA-seq, ChIP-seq, metagenomics and other large-scale sequencing data.
The original eXpress algorithm has been modified to provide better performance on personal transcriptomes for improved allele-specific expression analysis. The model has also been extended to probabilistically model the reference sequences, allowing it to adapt to inexact references as well as to identify RNA-DNA differences such as editing.
We demonstrate eXpress 2 on RNA-seq data and show that it achieves greater efficiency than other quantification methods while providing novel functionality for biological discovery.
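The streaming assignment idea can be illustrated with a small online-EM-style sketch: each ambiguously mapping fragment is split across its candidate targets in proportion to the current abundance estimates, which are updated immediately. This is a simplification for illustration only, without the fragment-length, bias and error terms of the actual eXpress model.

```python
# Sketch: online (streaming) assignment of ambiguous fragments to targets.
# Each fragment is seen once: its posterior assignment is computed from the
# current abundance estimates, and the estimates are updated on the spot.
transcripts = ["tA", "tB", "tC"]
weights = {t: 1.0 for t in transcripts}        # pseudo-counts / priors

# Each fragment maps ambiguously to a set of candidate transcripts.
fragments = [
    {"tA", "tB"}, {"tA"}, {"tA", "tB"}, {"tA", "tC"},
    {"tA"}, {"tB", "tC"}, {"tA", "tB"}, {"tA"},
]

for candidates in fragments:
    total = sum(weights[t] for t in candidates)
    for t in candidates:
        # E-step: posterior probability that the fragment came from t;
        # online M-step: add the fractional count straight away.
        weights[t] += weights[t] / total

norm = sum(weights.values())
for t in transcripts:
    print(f"{t}: relative abundance ~ {weights[t] / norm:.2f}")
```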
TOP
B49 - OnTheFly 2.0: Automated Annotation of Scientific Documents
Short Abstract: Motivation: Retrieving information and relevant knowledge about biological entities mentioned in a set of documents such as those covering the daily literature on a particular biological topic is not an easy task.
Results: OnTheFly 2.0, like its previous version (OTF) (http://onthefly.embl.de), is capable of annotating proteins, genes and chemicals in commonly used document files such as TXT, PDF and Microsoft Office Word/Excel files. At-a-glance summaries of relevant knowledge are attached to the recognized entities, and by querying quality data integration platforms, OnTheFly (OTF) supports the generation of interaction networks and informative summaries of all the entities mentioned in a set of documents. New features of OnTheFly 2.0 include a new front-end to improve user-friendliness, recognition of proteins/genes for more than 600 species, integration with the bioCompendium (http://biocompendium.embl.de) knowledge-summary resource, and richer information summaries.
Methods: The OnTheFly 2.0 front-end is based on JavaScript libraries for file drag-and-drop, while the server-side components consist of a local annotation database (summary generation), a set of document converters, and modules that invoke the Reflect service (http://reflect.ws) (document annotation), STRING (network generation) and bioCompendium (knowledge collection).
Future directions: OnTheFly 2.0 will include richer entity recognition for species names, functional traits, ontologies and ecosystem types, as well as document clustering, document prioritization, co-occurrence analysis and data integration with a plethora of other services.
TOP
B50 - Comparison of Scatterplot Clustering for the Integrative Analysis of Methylation and Expression Data
Short Abstract: DNA methylation is one of the main systems of epigenetic regulation. To study how methylation acts, one can look at the expression of a gene and the methylation percentage of the associated locus in a scatterplot whose shape is characteristic of the degree of regulation: if a gene is regulated by methylation, it can show either negative correlation with gene expression or an L-shape, usually associated with low or null correlation. These situations may appear in a variety of patterns, which makes it difficult to classify or even to decide clearly whether a gene is regulated by methylation. In previous work the authors developed a heuristic method to classify the scatterplots. Although the method performed relatively well, it could not run in a completely automatic way and had difficulties defining a threshold for calling a gene "regulated by methylation". In this work we extend this investigation by applying other available methods, such as clustering based on data-depth measures and regression-based clustering. The results of the study show that any of these methods can be a good strategy for the analysis and that, overall, clustering the scatterplots is a good approach to identify regulated genes and types of regulation, although no method performs definitively better than the others, and performance depends on a variety of factors ranging from the available samples and sample size to the disease type.
TOP
B51 - Extracting how protein interaction networks improve classification performance in gene expression data analysis
Short Abstract: Classification of samples based on microarray data is a well-studied problem in computational biology. With several thousand features and, at best, only hundreds of samples, one of the challenges in this area is how to extract the important genes that distinguish one class from the other. Studies have incorporated protein interaction networks into the feature selection and/or classification procedure to find better classifiers. What has not been studied is how exactly the different parts and connections of the incorporated network improve and affect the classifier. We propose an approach that extends the "network-induced classification kernel" method of Lavi et al. to extract the parts of the network that play an important role in improving classification performance and to find the genes which have an impact on the classification. Based on the weights of the trained classifier, we extract a score for gene pairs that are close in the given network to infer the parts of the network that have a higher influence on the classification task. Using these pair scores, we propose a scoring metric to order genes. On synthetic data, we show that this scoring metric ranks network-related implanted genes higher for the classifier that uses the network than for the classifier that does not. Additionally, on publicly available expression data from a cancer study, we find that the top-ranked genes from the classifier that uses the network tend to have more support in the biological literature.
TOP
B52 - BROOMM: an application to aid curation of biomass reactions within metabolic models
Short Abstract: Genome-scale metabolic models can now be automatically created using various databases and published software. Nevertheless, their use is limited because they usually contain generic biomass reactions with incorrect stoichiometries. Manual curation of these biomass reactions is required in order to obtain accurate results when carrying out quantitative analyses of these models, such as flux balance analysis (FBA); however, researchers with the required expert knowledge of cellular biochemistry are often unfamiliar with editing models in the Systems Biology Markup Language (SBML) format. Here we present BROOMM (Biomass Reaction Operations On Metabolic Models), a desktop application that allows users to build or manipulate biomass reactions within genome-scale metabolic models without needing to access the XML files directly. Furthermore, it provides a much-needed way to transfer information between different models in an efficient, semi-automated process for fast model curation.
TOP
B53 - KNU Genotyping Tool for Genome Sequencing Data for GWAS
Short Abstract: The KNU Genotyping Tool is a program for conveniently extracting single nucleotide polymorphism (SNP) genotype data from multiple samples of genome sequencing data for genome-wide association studies (GWAS).
Usually, several steps are needed to extract genotype data from genome sequence data. At each step, various methods can be used, leading to different results depending on which method is chosen. Furthermore, it is not easy to apply such methods properly.
Although several genotyping tools for genome sequencing data exist, they usually do not start from the raw sequencing data.
We therefore introduce the KNU Genotyping Tool, which creates a series of modular tasks (called a process stream) to produce genotype data from genome sequencing data within a unified framework. The process stream helps users obtain genotypes at known SNP positions as reliably as possible within a reasonable time. In addition, the tool can handle multiple samples at a time.
Our implementation consists of two groups of modules. One group converts genome sequence data to variant call format (VCF), and the other converts VCF to pedigree format, which is compatible with many GWAS tools.
Consequently, the KNU Genotyping Tool lets users convert multiple samples of genome sequencing data into SNP genotype data by following the recommended procedure (i.e., the process stream) with reference to known SNP positions.
TOP
B54 - DaGO-Fun: Tool for Gene Ontology-based functional analysis enhanced through semantic similarity measures
Short Abstract: The use of Gene Ontology (GO) data in protein analyses has largely contributed to the improved outcomes of these analyses. Several GO semantic similarity measures have been proposed in recent years, providing tools that allow the integration of the biological knowledge embedded in the GO structure into different biological analyses. There is a need for a unified tool that gives the scientific community the opportunity to explore these different GO similarity measures and their biological applications. We have developed DaGO-Fun, an online tool which integrates topology- and annotation information content (IC)-based GO similarity measures for exploring, analyzing and comparing GO terms and proteins within the context of GO. It uses GO data and UniProt proteins with their GO annotations, as provided by the Gene Ontology Annotation (GOA) project, to precompute GO term information content, making it fast in executing user queries. The DaGO-Fun online tool, available at http://web.cbio.uct.ac.za/ITGOM, has the advantage of integrating all the relevant IC-based GO similarity measures, including topology- and annotation-based approaches, allowing users to choose the relevant approach for their applications. Furthermore, the tool includes several biological applications related to GO semantic similarity scores, namely the retrieval of genes based on their GO annotations, the clustering of functionally related genes within a set, and GO term enrichment analysis. DaGO-Fun provides a user-friendly interface that is easy to use, flexible and customizable, taking into account user preferences.
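As an illustration of one annotation-based measure of the kind integrated in such tools, a Resnik-style similarity scores two terms by the information content of their most informative common ancestor. A minimal sketch on an invented GO fragment with made-up annotation counts (DaGO-Fun itself implements a much broader family of topology- and IC-based measures):

```python
# Sketch: Resnik-style semantic similarity on a toy GO fragment.
# IC(term) = -log p(term), where p(term) is the fraction of annotations on
# the term (counts assumed already propagated up the is-a hierarchy);
# similarity of two terms = IC of their most informative common ancestor.
import math

# Toy is-a hierarchy: child -> parents.
parents = {
    "binding": ["molecular_function"],
    "protein_binding": ["binding"],
    "dna_binding": ["binding"],
    "kinase_activity": ["molecular_function"],
    "molecular_function": [],
}
# Toy annotation counts, already propagated to ancestors.
annotation_counts = {
    "molecular_function": 100,
    "binding": 60,
    "protein_binding": 35,
    "dna_binding": 20,
    "kinase_activity": 25,
}
TOTAL = annotation_counts["molecular_function"]

def ancestors(term):
    result = {term}
    for p in parents[term]:
        result |= ancestors(p)
    return result

def ic(term):
    return -math.log(annotation_counts[term] / TOTAL)

def resnik(term1, term2):
    common = ancestors(term1) & ancestors(term2)
    return max(ic(t) for t in common)

print(resnik("protein_binding", "dna_binding"))      # MICA = binding
print(resnik("protein_binding", "kinase_activity"))  # MICA = root -> 0.0
```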
TOP
B55 - Galaxy LIMS for next-generation sequencing
Short Abstract: We have developed a laboratory information management system (LIMS) for a next-generation sequencing (NGS) laboratory within the existing Galaxy platform. The system provides lab technicians with standard and customizable sample information forms, barcoded submission forms, tracking of input sample quality, multiplex-capable automatic flow cell design, and automatically generated sample sheets to aid physical flow cell preparation. In addition, the platform provides the researcher with a user-friendly interface to create a request, submit the accompanying samples, upload sample quality measurements and access the sequencing results. As the LIMS is within the Galaxy platform, the researcher has access to all Galaxy analysis tools and workflows. The system reports requests and associated information to a message queuing system, such that information can be posted to and stored in external systems, such as a wiki. Through an API, raw sequencing results can be automatically pre-processed and uploaded to the appropriate request folder. Developed for the Illumina HiSeq 2000 instrument, many of its features are directly applicable to other instruments.
TOP
B56 - Pinv: Protein Interaction Network Visualizer
Short Abstract: As of April 2013, Homo sapiens had over 120,000 interactions registered in the Interologous Interaction Database (http://ophid.utoronto.ca/ophidv2.204/statistics.jsp). This exemplifies the volume of data that research on this topic deals with.
Visualization of protein-protein interaction data is a well-known strategy for making sense of this information. Several software tools provide this functionality, for example Cytoscape (http://www.cytoscape.org/) and NAVIGaTOR (http://ophid.utoronto.ca/navigator/).
Each of these tools provides different methods to visualize protein interaction networks, and their pros and cons have been discussed in several earlier reviews. However, there is still no native web tool that allows these data to be explored online.
We consider that, given the recent advances in web technologies, it is time to bring these functionalities to the web, providing an easily accessible way to visualize interactions.
We have developed PINV (http://biosual.cbio.uct.ac.za/pinv), an open source, native web application that facilitates the visualization of protein-protein interactions.
PINV components follow the protocol defined in BioJS (http://www.ebi.ac.uk/Tools/biojs/) and have been shared with its community. The main component is the graphic that represents the network; it makes use of D3 (Data-Driven Documents), a JavaScript library for web-based data visualization (http://d3js.org/).
The resultant tool provides an attractive view of complex, fully interactive networks with a set of tools that allow the querying, filtering and manipulation of the visible subset.
TOP
B57 - SIMtoEXP: software for comparing simulations to experimental scattering data
Short Abstract: In recent years, high-power neutron and X-ray scattering experiments have become important tools for investigating lipid bilayer systems found in biological membranes. These methods provide structural information for the bilayers by converting the scattering results into structure factors, but they are unfortunately limited in scope and scale. The limitations arise from the fluid nature of lipid bilayers, which, unlike crystalline materials, do not have well-defined periodic lattices and thus only produce broad-peaked structure factors with essentially no long-range order. A number of theoretical models have been developed to convert these structure factors into electron density and neutron scattering density functions across the bilayer, but these models rest on numerous assumptions and there is no way to confirm their correctness. To overcome the limitations of these models, the Simulation to Experiment (SIMtoEXP) software was developed: it converts simulation probability densities into structure factors for direct comparison with experiment, thus providing atomic-level detail for the scattering results. Here we present an extended version of SIMtoEXP, rewritten in C++ with the Qt GUI library in lieu of the original C/Tcl combination. A major extension has been added that reads molecular dynamics trajectories directly and calculates atomic probability densities across the bilayer. This eliminates much work for the user and removes possible errors introduced through the calculation of the probability densities.
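A minimal sketch of the kind of density calculation the new version automates: histogram the z-coordinates of a chosen atom selection over all trajectory frames and normalize to a probability density along the bilayer normal. Synthetic coordinates stand in for a real trajectory reader, and the binning choices are illustrative only.

```python
# Sketch: number-density profile of a group of atoms along the bilayer
# normal (z axis), averaged over trajectory frames. Synthetic coordinates
# stand in for an MD trajectory; SIMtoEXP reads trajectories directly.
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_atoms = 200, 500
box_z, bin_width = 60.0, 1.0          # box height and bin size in Angstrom

# Fake z coordinates: two leaflets of headgroup atoms around z = +/-18 A.
z = np.concatenate([
    rng.normal(+18.0, 2.5, size=(n_frames, n_atoms // 2)),
    rng.normal(-18.0, 2.5, size=(n_frames, n_atoms // 2)),
], axis=1)

edges = np.arange(-box_z / 2, box_z / 2 + bin_width, bin_width)
hist, _ = np.histogram(z.ravel(), bins=edges)

# Probability density per Angstrom along z, averaged over frames and atoms.
density = hist / (n_frames * n_atoms * bin_width)
centers = 0.5 * (edges[:-1] + edges[1:])
for zc, d in zip(centers, density):
    if d > 0.01:
        print(f"z = {zc:6.1f} A   p(z) = {d:.3f}")
```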
TOP
B58 - Web 2.0 Electronic Laboratory Notebook (Elegance) for Biomedical Research Community on Sharing, Co-working and Inspiriting
Short Abstract: Elegancy, a web 2.0 electronic lab notebook (ELN) developed by our team since 2007, enables scientists without profound computer skills to quickly build electronic laboratory records, sort notes, compare results and share information with others via various browsers on different platforms and mobile devices. Based on Drupal, an open-source content management system, and our own modules, we use Apache, MySQL, PHP, Java and Ajax to build the web-based ELN for Linux, Windows and Mac. Its essential functions include 1) friendly installation with a few clicks, 2) note creation with attached experimental digital outputs, 3) full-text search with an image gallery, 4) succinct user management with digital signatures, 5) automatic backup of the whole ELN, 6) a calendar with notification of upcoming events, 7) a personalized interface with privacy settings, 8) data sharing and exchange via the web, and 9) easy function extension. More details about these functions can be found in our help pages, which include five minutes of screencasts. Once users install Elegancy, two websites, one public and one for intra-group use, are automatically ready for use. Most importantly, we have designed a content-exchange mechanism that allows two independent ELNs to communicate, deepening collaborations.
In brief, we believe this system will help the research community support interventions, share information, re-organize knowledge, and document actual laboratory work.
Screencast: http://eln.iis.sinica.edu.tw/eln/?q=help
Download site (for Windows/Mac) of the ELN: http://eln.iis.sinica.edu.tw/eln/?q=home
TOP
B59 - Integrated reconstruction of rearrangement and copy number state of highly aneuploid tumor samples
Short Abstract: The copy number and rearrangement states of cancer genomes are usually estimated separately. Copy number state is inferred through statistical segmentation of the read depth of whole-genome sequence data, while rearrangements are identified through paired-end and split-read mapping. Previous methods for integrating these two signals rely on the (usually false) assumptions that tumor samples consist purely of cancer cells (e.g. cell lines) and that malignant cells are diploid. As a result, these approaches are not robust to the biological and technical features of complex tumor data, in particular data obtained from epithelial cancers. We introduce a mixed integer programming framework for joint inference of purity, ploidy, and absolute copy number on the nodes (signed reference intervals) and edges (reference and aberrant genomic adjacencies) of a tumor genome graph. We demonstrate how our integrated approach improves both copy number and rearrangement estimates relative to methods that consider these signals separately. We also show that our framework is robust to the noise modes seen in complex cancer tissues, relative to previously reported integrated approaches. Finally, we show how a simple extension of our framework can be used to identify candidate rearrangements that "complete" a candidate reconstruction. This is useful because somatic rearrangement analyses often suffer from false negatives (e.g. due to low coverage, tumor impurity, or limited short-read mappability of rearrangement breakpoints), yielding gaps in the cancer genome reconstruction. We apply our approach to published WGS data from several tumor types.
TOP
B60 - Tools for the correction, analysis and visualization of next generation sequencing data
Short Abstract: Currently, post-processing of aligned next-generation sequencing data for visualization or correction is inefficiently carried out using a combination of numerous shell, Perl and R scripts. To make this process more reproducible, efficient and also accessible to less experienced users, we have developed a suite of software tools offering easy-to-use programs that take advantage of the multicore nature of today's compute servers. We have integrated published methods to offer the following functionalities: estimation and correction of GC bias; generation of wiggle files (bedGraph or bigWig) using different normalization methods; computation of the difference, ratio or log2 ratio between two samples, including different scaling approaches; generation of average signal profiles and heatmaps for large numbers of regions (e.g. genes, repetitive elements, transcription start sites); and computation of the correlation between multiple files. These tools have been instrumental in the discovery and visualization of ChIP-seq binding patterns and nucleosome positioning, and they are used on a daily basis by bioinformaticians in our institute as well as by experimental researchers. They have thus been thoroughly and continuously tested, and are available either as standalone programs or as integrated Galaxy modules.
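A minimal sketch of one of the listed operations, the binned log2 ratio between a treatment and a control sample after library-size scaling, with a pseudocount to stabilize empty bins; toy per-bin counts stand in for coverage extracted from BAM files, and the actual tools run on multiple cores.

```python
# Sketch: log2 ratio between two samples over genomic bins, after scaling
# each sample to its library size (total counted reads). A pseudocount keeps
# empty bins from producing infinities. Toy per-bin counts only.
import numpy as np

treatment = np.array([120, 340, 15, 800,  0, 55], dtype=float)
control   = np.array([100, 150, 20, 300, 10, 60], dtype=float)
pseudocount = 1.0

# Scale both samples to counts-per-million so library size does not
# dominate the ratio.
treatment_cpm = treatment / treatment.sum() * 1e6
control_cpm = control / control.sum() * 1e6

log2_ratio = np.log2((treatment_cpm + pseudocount) /
                     (control_cpm + pseudocount))
for i, value in enumerate(log2_ratio):
    print(f"bin{i}: log2(treatment/control) = {value:+.2f}")
```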
TOP
B61 - BioXSD: An XML Schema for sequence data, features, alignments, and identifiers
Short Abstract: Exchange of data between various tools benefits from using common formats. Textual, tab-separated formats are convenient when used in text-processing frameworks, while RDF is advantageous within the Semantic Web. In parallel, XML Schema-based formats have the advantage of enabling both textual and binary representation, and are convenient to use in object-oriented frameworks or with Web services.
BioXSD has been developed to fill the niche of a missing canonical XML Schema-based exchange format for basic bioinformatics data: sequences, alignments, features, references to resources, and identifiers. The canonical format can either be consumed and produced by tools directly, or serve as an intermediate format that other formats can be converted to and from. The current version 1.1 of BioXSD includes improvements in optimizing data volume and in allowing highly flexible annotation, including for example feature semantics, complex scoring, and provenance metadata. In addition, the semantic annotation of the BioXSD Schema itself has been improved; it currently includes annotations with EDAM, Dublin Core, and RDFS, supporting semantic reasoning when integrating data and translating between formats.
Conversion between BioXSD and other formats is the main focus of our ongoing work, together with improving the representation of variation data, genome-scale alignments, sequence profiles and patterns, and involving the community through feedback and implementations.
TOP
B62 - Medilaxy: A Galaxy Platform For Medical Image Analysis
Short Abstract: We present Medilaxy, a workflow system based on the Galaxy framework for investigating and modeling diseases of the central nervous system. At the moment, Medilaxy is focused on Magnetic Resonance Imaging data, which are processed in order to study spatially varying properties, such as fractional anisotropy or diffusivity along brain fibers. In a current application scenario, the aim is to look for statistically significant differences in the relevant quantities along fibers crossing MS lesions. Here, statistically significant differences were found in fibers belonging to patients or controls, or in fibers belonging to patients but not crossing lesions (normal-appearing white matter); different behaviors were also found for lesions belonging to distinct regions of the brain.
To support input from various sources, all data are stored in a database; queries, evaluation of relevant quantities, statistical tests and plots are all performed by means of Python scripts and modules (NumPy for statistics and matplotlib for plots). All operations can be performed using the standard Galaxy XML GUI, insulating the user from programming details unless specific customizations (like file formats, font selection, colors or labels in the plots) are needed. The flowchart feature of Galaxy allows quick, easy and intuitive implementation of complex and repetitive tasks; for reporting results, both tables of p-values and plots are produced and included in a single LaTeX file for quick generation of draft presentations or papers.
TOP
View Posters By Category
TOP