ISMB/ECCB 2015 Special Sessions

SST01: Translational Medicine Informatics
SST02: Algorithms, Machine Learning and Data Complexity: From Chromatin Interactions to Nuclear Function
SST03: Towards Unifying a Computational Biology Ecosystem Across Europe and the US
SST04: Crowd-Sourced Benchmarking of Somatic Mutation Calling

Attention Conference Presenters - please review the Speaker Information Page available here.

SST01:  Translational Medicine Informatics Monday, July 13, 10:10 am – 12:40 pm
Room: The Auditorium


Organizer(s):

Bio:

Venkata P. Satagopam is a Research Scientist at the Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, actively involved in clinical and translational research. He is a co-founder of the ISCB Student Council, has been involved in organizing several Student Council symposia, and initiated internships for students from developing nations. He is the co-chair of the ISMB/ECCB 2011 Killer App Award committee, a member of the organizing committee of the ISMB poster session and Arts & Science, and co-organizer of a similar workshop and a grant writing tutorial at ISMB/ECCB 2011. He was also involved in organizing the ISMB 2012 workshop “P2P – From Postdoc to Principal Investigator” and organized the “JPI – Junior Principal Investigator” meeting at ISMB/ECCB 2013 and ISMB 2014. He is also involved in organizing the biohackathons Garuda 5 and 6 and is an active participant in several scientific conferences.

Bio:

Mansoor Saqi is a Senior Researcher at the European Institute for Systems Biology and Medicine in Lyon, France and is working on eTRIKS. Previously he was Principal Investigator in Bioinformatics at the Department of Computational and Systems Biology at Rothamsted Research, UK, and has worked in both academic and industrial settings. His work has covered a number of application domains, including sequence and structural bioinformatics, pathways, data integration and the analysis of integrated biological networks.

Bio:

Reinhard Schneider is Head of the Bioinformatics Core facility at the Luxembourg Centre for Systems Biomedicine (LCSB) at the University of Luxembourg. Between 1994 and 2010 he worked as a Team Leader at the European Molecular Biology Laboratory (EMBL) in Heidelberg, Germany. Before joining the EMBL he was cofounder of LION bioscience AG, Heidelberg, where he served as Chief Information Officer. Before founding LION bioscience, he worked as a scientist in the Biocomputing department at the EMBL, Heidelberg, where he studied various aspects of protein structures. Dr. Schneider received his Ph.D. in Biology at the University of Heidelberg, Germany and has over 90 research papers published. He is a member of the executive committee of the International Society for Computational Biology, where he serves as the treasurer and chairs the Governance, Fundraising and Finance Committee. He was organizer and co-organizer of several conferences (ISMB 2007 to ISMB 2014, ISCB-Asia/SCCG 2012, ISMB Latin America 2012, VizBi, BBC11, Garuda 5, 6). Besides his academic career, he is involved in several startup projects.

Bio:

Wei Gu is a Postdoctoral Researcher at the Bioinformatics Core facility at the Luxembourg Centre for Systems Biomedicine (LCSB) at the University of Luxembourg. He is a key player in the IMI projects eTRIKS and AETIONOMY in data curation and integration, hosting, analytics and visualization of clinical and multi-omics data. He also plays an important role in other IMI projects (through eTRIKS), namely ONCOTRACK, ABIRISK and APPROACH, in terms of data curation and integration as well as platform deployments. Dr. Gu received his Ph.D. in bioinformatics at the Centre for Bioinformatics (CBI), Saarland University, Germany in 2008. He has (co-)authored more than 20 scientific publications. Before joining LCSB, he worked as a bioinformatics/biostatistics scientist and a member of the IT support team at CBI Saarland and University Hospital of Saarland.

Bio:

Irina Roznovat is a Postdoctoral Researcher at the European Institute for Systems Biology and Medicine in Lyon, France and is working on eTRIKS. She holds a PhD in Computer Science (Computational Biology) from Dublin City University, Ireland and has worked on integrating information on genetic/epigenetic interdependencies, signalling pathways, stem cell dynamics, ageing/gender influences and epigenetic inhibitors to develop a multi-scale computational model for colon cancer dynamics. Her main research interests are complex systems modelling (epigenetics, cancer, neurodegenerative disorders), machine learning, data analysis and concurrent programming. She is a co-organizer of the ‘Empowering Systems Medicine Through Optimal Computational Modelling’ Workshop, held in conjunction with IEEE BIBM 2014, Belfast, UK, Nov. 2014.

 

Presentation Overview:

 

In this session, we will discuss the current status of computational biology approaches within the field of clinical and translational medicine. Large amounts of multi-omics and clinical data can now be captured for given patient populations. The molecular data, generated from high-throughput experiments, includes data relating to gene expression, copy number variation and single nucleotide polymorphisms. Harmonizing retrospective and prospective clinical data from several studies, and applying controlled terminologies and standards to facilitate cross-study comparisons, is a challenge. A variety of computational approaches are currently being used to harmonize molecular data and relate it to clinical outcomes in order to better understand disease conditions. These methods also have the potential to be used predictively to help suggest personalised therapeutic strategies for patients.

The session comprises four reports of 20 minutes each, each followed by 5 minutes for discussion or questions. The four topics will (a) give an overview of the importance of computational methods in translational medicine and recent progress in this area; (b) address the context related to data curation; (c) describe the recent European initiative eTRIKS (European Translational Information and Knowledge Management Services); and (d) present an application case that illustrates the value of combining carefully curated clinical data with high-dimensional omics data.

 

 

Part A: Translational Medicine: the current landscape and future directions
Bio:

Dr. Winston Hide trained at Temple University and did postdoctoral work with Wen-Hsiung Li and Richard Gibbs, and also at the Smithsonian Museum of Natural History in Washington DC. He founded the South African National Bioinformatics Institute in 1996. Recognised with an outstanding achievement award by the ISCB for his work in establishing bioinformatics in Africa, Hide has recently driven strategic development for bioinformatics at Harvard’s School of Public Health and Stem Cell Institute, increasingly focusing on translation. He now leads the MSc Programme for genome medicine at Sheffield and is establishing a centre for genome translation at the University.

Session Description:
We are now rapidly moving from single human genomes to deca-, centi- and even milligenome projects. With more ways to compare gene variation against a background come new methods to select variants and genes for their potential in prediction and impact for a disease. Gene hunting is still very much a fashion and genes represent tempting targets for drug development. But like David Bowie we need to push the boundaries to embrace the growing realisation that genes work in cohorts and it is the interaction of these cohorts that drives the disease phenotype. Identifying and targeting pathways and processes that drive disease is the new black. To action discovery, we need to address ways in which to benchmark selection of disease genes, pathways and processes. In turn we need to develop more efficient (read less ineffective) ways to select therapeutics that are likely to be acceptable for real health interventions. The talk will present how we address these challenges through commoning for data repository, provenance, reproducibility and workflows, benchmarks for assessment of approaches, standardisation for pathway activity, and new approaches to discovering the relationships between pathway interaction, genome variation and disease modelling.
Part B: Standards and ontologies in harmonization of clinical and omics data
Bio:
Dr. Rocca-Serra has been a Senior Research Lecturer at the University of Oxford since 2010, after 8 years at the European Bioinformatics Institute in Cambridge, UK. Dr. Rocca-Serra’s experience ranges from data management of complex, multi-omic datasets to development of open source, community-based syntactic and semantic artifacts such as the Ontology for Biomedical Investigation (http://obi-ontology.org, since 2004) and the ontology for statistical methods (http://www.stato-ontology.org). He is a member of the OBO Foundry. He is also co-founder of the BioSharing initiative (http://www.biosharing.org), engaging with publishers, librarians and researchers. He co-leads the ISA project (http://isa-tools.org) and coordinates its user community, including two leading data publishers (BioMedCentral’s GigaScience; Nature Publishing Group’s Scientific Data) and institutional and global repositories (e.g. Harvard Stem Cell Commons, EBI Metabolights). Via a pharma-driven project (eTRIKS), Dr. Rocca-Serra works with clinical standards such as CDISC to improve links among clinical, non-clinical and community-driven omic standards for the representation of biomedical data. He is also part of the NIH BD2K Centre for Expanded Data Annotation and Retrieval (CEDAR), focused on the creation of a framework for standards-driven metadata templates.
Session Description:

Scientific communication and scholarly publishing more generally are evolving rapidly under pressure from funding agencies, regulatory agencies, publishers and the general public to remove obstacles to data access and to facilitate assessment. This requires having data communication standards in place, but that alone is not enough. In this short presentation, we will provide an overview of the landscape of resources relevant to clinical research as well as the latest progress in the field of functional genomics data standards, showing how resources such as Biosharing.org, CDISC standards and the ISA format can be harnessed. Following an outline of the key issues, we will share the experience and the lessons learned as part of the IMI eTRIKS initiative, including the standardization and curation activities involved in improving and facilitating translational research. We will also discuss the opportunities for collaboration and cooperation with other major initiatives worldwide, such as the NIH Big Data to Knowledge initiative (NIH BD2K).

Part C: eTRIKS, European Translational Information and Knowledge Management Services
Bio:

David Henderson, PhD is Principal Scientist and Liaison Manager in Global External Innovation and Alliances at Bayer Pharma AG in Berlin, Germany. He earned his B.Sc from the University of Edinburgh and his PhD in Molecular Biology from Vanderbilt University. With over 30 years’ experience in drug discovery and development, he has worked on drug development and biomarker studies in clinical trials in oncology, ranging from Phase I to Phase III. In his present position, he is Liaison Manager for Bayer’s contributions to several projects funded by the Innovative Medicines Initiative (IMI-JU) and acts as Coordinator of the OncoTrack consortium.

Session Description:

eTRIKS (European Translational Information and Knowledge Management Services) is a public-private partnership made up of 17 pharmaceutical companies, academic institutions and SMEs, jointly financed by the Pharma partners and the Innovative Medicines Initiative of the EU (IMI-JU), with expertise in data management, systems biology, biomedical curation, collaboration and data exchange standards. The goal of the consortium is to leverage the open source tranSMART platform to provide a series of services to translational research and biomarker research programs, enabling disease stratification and biomarker discovery by:

  • Driving the adoption of a common open source platform
  • Promoting multi-study data harmonisation
  • Developing best practice guidelines and resources for the re-use of research data
  • Providing advice and support for translational research projects

In this presentation, we shall describe and outline the current state of the eTRIKS project, illustrate how we are working to support ongoing research projects and how we are providing a hub for a growing ecosystem of open source informatics technologies and providers to support the translational research community.

Part D: Data Integration and Analytics in Translational Medicine
Bio:

Dr. Anna Goldenberg is a Scientist in the Genetics and Genome Biology program at the SickKids Research Institute and an Assistant Professor in the Department of Computer Science at the University of Toronto. Dr. Goldenberg obtained her PhD in Machine Learning from Carnegie Mellon University, developing efficient methods for structural learning in graphical models with application to social networks. She then immersed herself in the field of computational medicine, first as a postdoc at UPenn and later as a postdoc at UofT's Donnelly Centre for Cellular and Biomolecular Research. Her current research focuses on developing novel machine learning methods for genomic and clinical data integration, addressing heterogeneity and identifying disease mechanisms in complex human diseases.

Session Description:
Rapidly evolving technologies are making it progressively easier to collect multiple and diverse genome-scale data sets to address clinical and biological questions. How do we take advantage of the omic data deluge to help patients? In this talk I will survey the field of data integration methods and then introduce patient networks - a flexible and powerful platform for data integration. I will describe Similarity Network Fusion (SNF) and illustrate its power by integrating mRNA expression, DNA methylation and miRNA expression in 5 different cancers. I will show how to use SNF in a realistic setting where the cohorts of patients are only partially overlapping. One of the key properties of patient networks is the ability to integrate very different kinds of data – from imaging to genetic to questionnaire data for the same set of patients. Yet another source of power is the ability to fine map the patient population. Using this paradigm and a new formulation for the Cox survival analysis I will show a substantial improvement in survival risk prediction compared to the regular subtyping analysis in breast cancer. Finally, I will discuss the implications and challenges of translating computational approaches for data integration to clinical use.

 

SST02:  Algorithms, Machine Learning and Data Complexity: From Chromatin Interactions to Nuclear Function Monday, July 13, 10:10 am – 12:40 pm
Room: Wicklow Hall 2A


Organizer(s):

Pietro Lio, University of Cambridge, United Kingdom
Yoli Shavit, University of Cambridge, United Kingdom

 

Presentation Overview:

 

 

Part A: Chromosome organization & polymer entanglements: Insights from computer simulations (10:10 am-10:30 am)
Speaker: Angelo Rosa, Scuola Internazionale Superiore di Studi Avanzati, Italy
Session Description:

Very recently, conspicuous effort has been dedicated to describing and predicting the three-dimensional organization of chromosomes inside the eukaryotic nucleus by generic polymer models [1]. In my talk, I will discuss recent results showing that chromosome structure and dynamics can be quantitatively described by a polymer model of decondensing chromosomes which takes into account only minimal physical ingredients like density, stiffness and topology conservation of the chromatin fiber [2-4]. Then, I will present preliminary results concerning how this model can be (1) employed in order to investigate the origin of the visco-elastic properties of the nucleus [5], and (2) suitably generalized for studying the consequences on chromosome structure arising from a mixed composition (10nm-fibers vs. 30nm-fibers) of the underlying chromatin fiber [6].

References:

  1. A. Rosa and C. Zimmer, Int. Rev. Cell & Mol. Biol. 307, 275 (2014).
  2. A. Rosa and R. Everaers, PLoS Comput. Biol. 4, e1000153 (2008).
  3. A. Rosa et al., Biophys. J. 98, 2410 (2010).
  4. A. Rosa and R. Everaers, Phys. Rev. Lett. 112, 118302 (2014).
  5. M. Valet and A. Rosa, in preparation.
  6. A.-M. Florescu and A. Rosa, in preparation.
Part B: Timing in the nucleus: replication domains and graph theory (10:30 am-10:50 am)
Speaker: Benjamin Audit, CNRS, France
Session Description:

Recent papers have investigated the link between replication timing, replication domains and topologically associated domains. Graph theory, machine learning and signal processing are key examples of methods applied to segmenting the genome based on contact frequency profiles, and to further linking it with replication domains. This presentation will discuss the latest algorithmic advancements and challenges in genome segmentation in the context of investigating the link between replication dynamics and chromatin folding.

Part C: Functionally guided multi-omic integration towards biomarker discovery (10:50 am-11:10 am)
Speaker: Syed Haider, University of Oxford, United Kingdom
Session Description:

An important concept in biology is the link between structure and function. For example, 'sequence makes structure makes function' is a key idea in protein folding. This presentation will address the concept of 'geometry makes function, function makes geometry'. It will present the latest methods for integrating chromatin conformation and multi-omic data, highlighting key results and open challenges.

Part D: Assessing the limits of restraint-based 3D modeling of genomes and genomic domains (11:40 am-12:00 (Noon))
Speaker: Marc Martí-Renom, CREA, CNAG, CRG, Spain
Session Description:

Restraint-based modeling of genomes has been recently explored with the advent of Chromosome Conformation Capture (3C-based) experiments. We previously developed a reconstruction method to resolve the 3D architecture of both prokaryotic and eukaryotic genomes using 3C-based data. These models were congruent with fluorescent imaging validation. However, the limits of such methods have not been systematically assessed. Here we propose the first evaluation of a mean-field restraint-based reconstruction of genomes by considering diverse chromosome architectures and different levels of data noise and structural variability. The results show that: first, current scoring functions for 3D reconstruction correlate with the accuracy of the models; second, reconstructed models are robust to noise but sensitive to structural variability; third, the local structure organization of genomes, such as Topologically Associating Domains, results in more accurate models; fourth, to a certain extent, the models capture the intrinsic structural variability in the input matrices; and fifth, the accuracy of the models can be a priori predicted by analyzing the properties of the interaction matrices. In summary, our work provides a systematic analysis of the limitations of a mean-field restraint-based method, which could be taken into consideration in further development of methods as well as their applications.

Part E: Mining chromatin interactions: challenges in data integration and classification (12:00 (Noon)-12:20 pm)
Speaker: Yoli Shavit, University of Cambridge, United Kingdom
Session Description:

3C, 4C and, more recently, 5C and Hi-C data were shown to be important for classification of disease taxonomy (for example, in leukaemia). To date, however, there is little established methodology for automated classification and inference. Thus, while imaging data of nuclear architecture, such as Fluorescence In Situ Hybridization (FISH), are commonly used for clinical applications, chromosome conformation data are still far from being adopted. This presentation will discuss some of the computational and bioinformatics challenges involved in data integration and data mining of chromosome conformation data and in their spatial interpretation, towards potential avenues for clinical applications.

Part F: Q&A panel session with speakers (12:20 pm-12:40 pm)
Session Description:

An opportunity to address questions and open discussion with Special Session speakers.


 

SST03:  Towards Unifying a Computational Biology Ecosystem Across Europe and the US Monday, July 13, 2:00 pm – 3:00 pm
Room: Wicklow Hall 2A


Organizer(s):

Bio:

Philip Bourne is Associate Director for Data Science at the National Institutes of Health (NIH) and leader of BD2K, a trans-NIH initiative established to enable biomedical research as a digital research enterprise, to facilitate discovery and support new knowledge, and to maximize community engagement.

Bio:

Niklas Blomberg is Director of ELIXIR, the European infrastructure for biological information, based at the ELIXIR Hub located alongside the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI) in Hinxton, UK.

 

Presentation Overview:

 

The session will examine the relationship between the Big Data to Knowledge (BD2K) initiative and ELIXIR. BD2K has been established by the National Institutes of Health (NIH) to enable biomedical research as a digital research enterprise, facilitate discovery and support new knowledge, and maximise community engagement in the US. ELIXIR, the European infrastructure for biological information, is the distributed effort of partners across Europe to coordinate, sustain and integrate data resources, bio-computing capacity, analysis tools, training and standards for the research community. The two initiatives share many parallels and plan close cooperation over the coming years.

The session will start with representatives from ELIXIR and BD2K talking about their respective priorities and programmes. It will conclude with a panel discussion comparing the NIH data management practices and ‘data commons’ approach with the more federated European approach. It will assess the added value to be gained in aligning efforts between the initiatives and provide the opportunity for lively discussion and audience questions to senior representatives from BD2K and ELIXIR Nodes.

The panel session will be chaired by Prof. Nicola Mulder from the Institute of Infectious Disease and Molecular Medicine at the University of Cape Town in South Africa.

 


 

SST04:  Crowd-Sourced Benchmarking of Somatic Mutation Calling Tuesday, July 14, 10:10 am – 11:10 am
Room: The Auditorium


Organizer(s):

Bio:

Dr. Boutros is an independent investigator in the Informatics and Biocomputing Platform of the Ontario Institute for Cancer Research in Toronto. He received his BSc (Chemistry) from the University of Waterloo and his PhD in Medical Biophysics from the University of Toronto. He has received several awards, including the CIHR/Next Generation First Prize. His research group focuses on using new DNA sequencing technologies to improve diagnosis and treatment of prostate cancer. Paul co-leads both the Canadian Prostate Cancer Genome Network and the ICGC-TCGA DREAM Somatic Mutation Calling Challenge.

Bio:

Dr. Lee is a Bioinformatician in the Boutros laboratory of the Ontario Institute for Cancer Research in Toronto. She received her BMath from the University of Waterloo and her PhD from McGill University, both in Computer Science with a focus on Bioinformatics. She also completed a postdoctoral fellowship involving chemical genomics with Drs. Guri Giaever and Corey Nislow at the University of Toronto.

 

Presentation Overview:

 

The analysis of cancer genome-sequencing data remains a significant challenge. Accurate and rapid identification of somatic mutations of all types – point-mutations and structural variants – is quickly becoming the key limiting step in data-analysis. However, the lack of accepted benchmarks has slowed the adoption of community standards and hindered the evolution of best-in-class methods through collaborative efforts.

The two largest international cancer genomics efforts, The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), joined forces to launch the ICGC-TCGA DREAM Somatic Mutation Calling Challenge: a crowd-sourcing effort to identify the best pipelines for detecting mutations in the high-throughput sequencing reads of cancer genomes (https://www.synapse.org/#!Synapse:syn312572). The Challenge is part of the DREAM series of open challenges in computational biology, and is divided into sub-challenges focused on specific aspects of mutation calling.

As organizers of the Challenge, we will present the results of community efforts to create benchmarks for mutation calling. In addition, a winning structural variant detection method, novoBreak, will be presented by Dr. Ken Chen from the University of Texas MD Anderson Cancer Center.

 

 

Part A: Single Nucleotide Variant Detection in DNA (10:10 am-10:30 am)
Bio:

Dr. Boutros is an independent investigator in the Informatics and Biocomputing Platform of the Ontario Institute for Cancer Research in Toronto. He received his BSc (Chemistry) from the University of Waterloo and his PhD in Medical Biophysics from the University of Toronto. He has received several awards, including the CIHR/Next Generation First Prize. His research group focuses on using new DNA sequencing technologies to improve diagnosis and treatment of prostate cancer. Paul co-leads both the Canadian Prostate Cancer Genome Network and the ICGC-TCGA DREAM Somatic Mutation Calling Challenge.

Session Description:

Benchmarking is needed for tool assessment and improvement but is complicated by a lack of gold standards, by extensive resource requirements and by difficulties in sharing personal genomic information. To resolve these issues, we launched the ICGC-TCGA DREAM Somatic Mutation Calling Challenge. The BAMSurgeon tool for simulating cancer genomes, and the results of 248 single nucleotide variant analyses of three in silico tumors created with it, will be presented. Different algorithms exhibit characteristic error profiles, and, intriguingly, false positives show a trinucleotide profile very similar to one found in human tumors. Although the three simulated tumors differ in sequence contamination (deviation from normal cell sequence) and in subclonality, an ensemble of pipelines outperforms the best individual pipeline in all cases. BAMSurgeon is available at https://github.com/adamewing/bamsurgeon/.

 

Part B: Structural Variant Detection in DNA (10:30 am-10:50 am)
Bio:

Dr. Lee is a Bioinformatician in the Boutros laboratory of the Ontario Institute for Cancer Research in Toronto. She received her BMath from the University of Waterloo and her PhD from McGill University, both in Computer Science with a focus on Bioinformatics. She also completed a postdoctoral fellowship involving chemical genomics with Drs. Guri Giaever and Corey Nislow at the University of Toronto.

Session Description:

The ICGC-TCGA DREAM Somatic Mutation Calling Challenge has highlighted the many difficulties in benchmarking and scoring structural variant detection, and has revealed areas for improving detection algorithms. The Challenge includes 206 structural variant analyses of three in silico tumors (created with BAMSurgeon). Different approaches to scoring these analyses for accuracy, and to defining ensembles of these analyses, will be presented. While different structural variant detection algorithms exhibit characteristic error profiles, errors generally tend to be associated with low mapping quality at the predicted breakpoints.

Part C: novoBreak: robust characterization of structural breakpoints in cancer genomes (10:50 am-11:10 am)
Bio:

Dr. Chen has a background in machine learning, statistical signal processing, and cancer genomics. He has developed a set of computational tools such as BreakDancer, TIGRA, CREST, and VarScan that have been applied to characterize individual and population genomics in the Cancer Genome Atlas (TCGA) and the 1000 Genomes Project. He is particularly interested in constructing the genomes and the transcriptomes of various cancer cell populations towards understanding the heterogeneity and the evolution of cancer as a consequence of genetics and environment. He is also interested in developing integrative approaches to identify biomarkers that are useful for clinical decision support.

Session Description:

Structural variation (SV) is a major source of genomic variation and plays an important role in cancer genome evolution. However, due to the methodological limitations in aligning and interpreting short reads spanning breakpoints, current methods cannot achieve high sensitivity and precision. Here, we present a novel algorithm, novoBreak, which comprehensively characterizes a variety of structural breakpoints at base-pair resolution. novoBreak first chops tumor reads into k-mers and indexes them; then, by filtering against reference and normal reads, it derives tumor-specific k-mers (novo-kmers); next, it clusters reads with the same breakpoints based on the novo-kmers and then locally assembles the reads associated with each breakpoint into contigs. After aligning the contigs to the reference, novoBreak identifies the precise breakpoints and infers various types of SVs. novoBreak consistently performed best in the SV breakpoint calling subchallenges in the ICGC-TCGA DREAM 8.5 Somatic Mutation Calling challenge. The framework of novoBreak can also be applied to discover germline events, gene fusions in RNA-seq data and SV breakpoints in whole exome data. The wider application of novoBreak is expected to reveal a comprehensive structural landscape that can be linked to novel mechanistic signatures in cancer genomes.
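
As a rough illustration of the first three steps described above (k-mer indexing, filtering against reference and normal reads, and clustering reads by shared tumor-specific k-mers), the toy Python sketch below may help. It is not the novoBreak implementation: the function names, the sequences and the choice of k=5 are all hypothetical, reads are grouped simply by sharing any novel k-mer, and the local assembly and contig alignment steps are omitted.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def novel_kmers(tumor_reads, normal_reads, reference, k):
    """Index tumor k-mers by read, keeping only k-mers that are absent
    from both the reference and the matched normal reads."""
    background = kmers(reference, k)
    for read in normal_reads:
        background |= kmers(read, k)

    index = defaultdict(set)  # novel k-mer -> indices of tumor reads containing it
    for i, read in enumerate(tumor_reads):
        for km in kmers(read, k):
            if km not in background:
                index[km].add(i)
    return index


def cluster_reads(index):
    """Group tumor reads that share at least one novel k-mer
    (connected components computed with a small union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for reads in index.values():
        reads = sorted(reads)
        root = find(reads[0])
        for r in reads[1:]:
            parent[find(r)] = root

    clusters = defaultdict(set)
    for r in list(parent):
        clusters[find(r)].add(r)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical toy data: the tumor carries a deletion joining "AAACCC" to "TTTACG".
reference = "AAACCCGGGTTTACGTACGT"
normal_reads = ["AAACCCGGG", "GGGTTTACG"]
tumor_reads = ["AACCCTTTA", "CCCTTTACG", "GTTTACGTA"]

index = novel_kmers(tumor_reads, normal_reads, reference, k=5)
print(sorted(index))         # tumor-specific k-mers
print(cluster_reads(index))  # groups of read indices sharing a breakpoint signal
```

In this toy case only the two reads spanning the simulated junction carry k-mers absent from the background, so they end up in one cluster, while the third, reference-like read is left out. A real caller would also have to handle sequencing errors, reverse complements and k-mer count thresholds, none of which are modelled here.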
