Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide

Taxonomic Names and Metadata: A Framework for Big Data Interoperability

Schedule subject to change.
All times in Central Daylight Time (CDT)

Thursday, May 13th

9:00-10:00

Keynote

Michael Osterholm

11:00-11:15

Taxonomic Names and Metadata Codes for Biodiversity Specimens: the Rules and the Road

Xiaojun Wang, Tulane University, United States
Yasin Bakis, Tulane University, United States
Henry Bart, Tulane University, United States

11:15-11:30

Taxon Names Global to Local: Uses, Issues, Potential

Matthew Yoder, University of Illinois Urbana Champaign, United States
Deborah Paul, University of Illinois Urbana Champaign, United States

Presentation Overview:

Humans do the work to reveal the interconnectedness of life on the planet that supports it. Taxonomic names and related metadata and media underpin this critical research. These scientific endeavors across disciplines support local, regional, and global initiatives such as species discovery; conservation assessments; restoration work; science communication; agricultural crop science; investigation of ecological and biological associations; preventing, mitigating, and predicting pathogen-related spill-over events; and discovering, understanding, and protecting natural resources.

Data organized, standardized, and aggregated taxonomically contribute to our efforts to address small- and large-scale science and synthetic research (e.g. invasive species, human encroachment, extinction prediction, water quality monitoring, and niche modeling). Taxonomic names help us manage and access related data in spreadsheets and in our respective databases for collection management, lab data, aggregated biodiversity data, and so on.

Taxon names (and related metadata) play roles at global, regional, institutional, and personal scales. For each, we offer some examples focusing on advances in development, best practices adopted and currently used, and opportunities for improvement. Globally, we will look at the Global Biodiversity Information Facility (GBIF), the Catalogue of Life (CoL, https://www.catalogueoflife.org/), Global Names (http://globalnames.org/, https://github.com/gnames), Biodiversity Information Standards (TDWG), the Biodiversity Community Integrated Knowledge Library (BiCIKL), and Bionomia (https://bionomia.net/). Regionally, we will look at iDigBio (https://www.idigbio.org/), DiSSCo (https://www.dissco.eu/), ITIS (https://www.itis.gov/), WoRMS (http://www.marinespecies.org/), TaxonWorks (https://taxonworks.org/), and iNaturalist (https://www.inaturalist.org/). Institutionally, we will examine the Biodiversity Heritage Library (BHL). Finally, at the personal scale, we share some current realities from several biodiversity disciplines, including those of people who manage collections.

Specimens vouchered in scientific collections ground taxonomy (Thompson CW 2021, Sharkey MJ 2021) and support reproducible research. Images of specimens offer opportunities to increase access to taxonomic information by both humans and computers. Rich (meta)data about these specimens and images make it possible for this information to be FAIR (findable, accessible, interoperable, reusable) and linked so that vouchered specimens, samples, and related sequences are unambiguously related (Thompson CW 2021). Collectively, these data create the requisite foundation for a robust biological science whose mission is increasingly focused on the genome (Thompson CW 2021).

Consider your own scientific needs for and reliance on taxonomic information through the lens of a user story, for example: As a(n) disease ecologist, ichthyologist, computer scientist, microbiologist, science writer, instructor, etc., I need information (I) in format (F) to answer question (Q) for audience (A). Many known issues arise when using, tracking, and trying to keep up with taxon names and related (meta)data. We look forward to hearing about your particular experiences. Keeping your own requirements and expertise networks in mind, join us as we look at specific examples of the ways in which taxon names and associated (meta)data support scientific work of varying scale and scope, and of the challenges they present.

Sharkey MJ, Janzen DH, Hallwachs W, Chapman EG, Smith MA, Dapkey T, Brown A, Ratnasingham S, Naik S, Manjunath R, Perez K, Milton M, Hebert P, Shaw SR, Kittel RN, Solis MA, Metz MA, Goldstein PZ, Brown JW, Quicke DLJ, van Achterberg C, Brown BV, Burns JM (2021) Minimalist revision and description of 403 new species in 11 subfamilies of Costa Rican braconid parasitoid wasps, including host records for 219 species. ZooKeys 1013: 1–665. https://doi.org/10.3897/zookeys.1013.55600

Thompson CW, Phelps KL, Allard MW, Cook JA, Dunnum JL, Ferguson AW, Gelang M, Khan FAA, Paul DL, Reeder DM, Simmons NB, Vanhove MPM, Webala PW, Weksler M, Kilpatrick CW (2021) Preserve a voucher specimen! The critical need for integrating natural history collections in infectious disease studies. mBio 12: e02698-20. https://doi.org/10.1128/mBio.02698-20

11:30-11:45

Biodiversity informatics from rows to things, and semantics to practice

Matthew Yoder, Illinois Natural History Survey, Prairie Research Institute, University of Illinois, United States
Deborah Paul, Illinois Natural History Survey, Prairie Research Institute, University of Illinois, United States

Presentation Overview:

Since the advent of computers, biodiversity experts have been using them to manage taxon names, their metadata, and the vast data they point to. After an explosion of digital tools and methods, various standards, best practices, and shared services have emerged. These federating efforts facilitate data sharing and re-use; however, they also place newfound constraints and demands on both developers and data curators, i.e. the scientists the tools seek to serve. Balancing the need to allow for growth and novelty with the need for standards and semantics, while also prioritizing the production of tools that day-to-day scientists require, is itself a major undertaking. Here we highlight a series of anecdotal examples that elucidate the complex nature of this enterprise.

First we focus on taxon names, the primary "address" of biological data. We clarify their role in the network of biodiversity data and describe cutting-edge efforts to detect them at a global scale using supercomputers, the HathiTrust corpus, and the GlobalNames (https://gnames.org) initiative. We also give an overview of NOMEN (https://github.com/SpeciesFileGroup/nomen), an ontology that seeks to provide a logical interpretation of the rules of zoological nomenclature. Names are not biological concepts; systems that fail to distinguish this fundamental difference are bound to be plagued by downstream issues that conflate nomenclatural synonymy with biological synonymy. We describe this distinction, the need for clear boundaries, and an approach provided in TaxonWorks (https://taxonworks.org), a curatorial platform for biodiversity data.

Names are just one class of identifiers, i.e. metadata that "localizes" users and machines to related data. Modern systems must assume that things have multiple identifiers, allowing curators to reference and assert data via different standards. In practice, systems that reference external standards require ORBs (object-request brokers) that provide temporary identifiers while fixed, more stable or permanent identifiers are discussed, refined, and minted.
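As a rough illustration of the "multiple identifiers" requirement, the sketch below models a curated thing that carries several identifiers, some of them temporary placeholders. The class names, namespaces, and identifier values are all hypothetical and are not taken from TaxonWorks or any standard; this is only one way such a structure might look.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Identifier:
    """One identifier, scoped by the (hypothetical) namespace that issued it."""
    namespace: str
    value: str
    temporary: bool = False  # True for a placeholder awaiting a stable identifier


@dataclass
class Thing:
    """Any curated object (name, specimen, image) that can carry several identifiers."""
    label: str
    identifiers: List[Identifier] = field(default_factory=list)

    def add_identifier(self, namespace: str, value: str, temporary: bool = False) -> None:
        self.identifiers.append(Identifier(namespace, value, temporary))

    def resolve(self, namespace: str) -> List[str]:
        """Return every identifier value recorded under one namespace."""
        return [i.value for i in self.identifiers if i.namespace == namespace]

    def stable_identifiers(self) -> List[Identifier]:
        """Return only the identifiers that are not temporary placeholders."""
        return [i for i in self.identifiers if not i.temporary]


# Invented example values, for illustration only.
specimen = Thing("fish specimen, lot 42")
specimen.add_identifier("local", "tmp-0001", temporary=True)
specimen.add_identifier("catalog", "EXAMPLE-42")
```

Keeping the temporary flag on the identifier rather than on the thing lets a placeholder coexist with a stable identifier once one has been minted.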

All of these requirements, and many others like them, must ultimately be boiled down to tools that are usable and understandable by scientists "on the ground". These tools must integrate knowledge across a wide range of domains, for example to cover the content produced in taxonomic monographs. We highlight our efforts in this area, the application workbench TaxonWorks, and demonstrate its ability to contribute a wide range of data, from highly semantic nomenclature, to images, specimen data, and annotations, to the knowledge-graph-of-life via multiple mechanisms including common standards, CSV, and JSON.

Finally, as we evolve from talking about rows of data to talking about digital "things", we make a plea for the biodiversity enterprise to adopt a questions-first perspective as a means of driving its evolution. For example: "To answer the question Q coming from audience A, I need (my) data D, packaged in format F". This approach helps to ensure that exercises in abstraction (e.g. coming up with data standards) are balanced with the actual needs of researchers on the ground. It also permits us to robustly answer the problem described to the best of our ability, then move forward to tackle the infinite number of next steps.

12:45-13:00

Systematic tissue annotations of –omic samples by modeling unstructured metadata

Arjun Krishnan, Michigan State University, United States
Nathaniel Hawkins, Michigan State University, United States
Marc Maldaver, Michigan State University, United States
Lindsay Guare, Michigan State University, United States

Presentation Overview:

There are currently >1.3 million human –omics samples that are publicly available. However, this valuable resource remains acutely underused because discovering samples, say from a particular tissue of interest, from this ever-growing data collection is still a significant challenge. The major impediment is that sample attributes such as tissue/cell-type of origin are routinely described using non-standard, varied terminologies written in unstructured natural language.

Here, we propose a natural-language-processing-based machine learning approach (NLP-ML) to infer tissue and cell type annotations for –omics samples based only on their free-text metadata. NLP-ML works by creating numerical representations of sample text descriptions and using these representations as features in a supervised learning classifier that predicts tissue/cell-type terms in a structured ontology.
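To make the general idea concrete (text descriptions turned into numbers, then fed to a supervised classifier), here is a minimal sketch. It is not the authors' NLP-ML method: it swaps in a plain bag-of-words representation with a multinomial naive Bayes classifier, uses flat string labels instead of ontology terms, and trains on invented descriptions.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a free-text description and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BagOfWordsNaiveBayes:
    """Multinomial naive Bayes over word counts, with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter(labels)
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        # Words never seen in training carry no evidence, so they are dropped.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training descriptions paired with hypothetical tissue labels.
train_texts = [
    "total RNA from left ventricle biopsy",
    "cardiac muscle, heart failure patient",
    "hepatocytes isolated from liver resection",
    "liver biopsy, hepatic steatosis",
]
train_labels = ["heart", "heart", "liver", "liver"]

model = BagOfWordsNaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("hepatic tissue from liver donor")
```

Because the classifier only sees text, the same trained model can in principle be applied to descriptions of any sample type, which is the versatility the abstract emphasizes.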

Our approach significantly and substantially outperforms an advanced text annotation method (MetaSRA) that uses graph-based reasoning and a baseline method (Tagger) that annotates text based on exact string matching. We demonstrate the biological interpretability of tissue NLP-ML models using an analysis of their similarity to each other and an evaluation of their ability to classify tissue- and disease-associated biological processes based on their text descriptions alone.
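For contrast, an exact-string-matching baseline can be sketched in a few lines. This is only an illustration of the general technique, not Tagger's implementation, and the synonym dictionary is hypothetical; a real one would be derived from an ontology.

```python
def exact_match_annotate(description, term_synonyms):
    """Return the terms whose listed synonyms appear verbatim (as substrings) in the text."""
    text = description.lower()
    return sorted(
        term
        for term, synonyms in term_synonyms.items()
        if any(synonym in text for synonym in synonyms)
    )


# Hypothetical mini-dictionary mapping a term to its accepted spellings.
term_synonyms = {
    "liver": ["liver", "hepatic"],
    "heart": ["heart", "cardiac"],
}

hits = exact_match_annotate("Hepatic biopsy, donor 12", term_synonyms)
misses = exact_match_annotate("hepatocytes, primary culture", term_synonyms)
```

The second description is missed because "hepatocytes" is not in the dictionary, which shows how non-standard, varied terminology defeats exact matching.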

Previous studies have shown that the molecular profiles associated with –omics samples are highly predictive of a variety of sample attributes. Using transcriptome data, we show that NLP-ML models can be nearly as accurate as expression-based models in predicting sample tissue annotations. However, the latter (models based on –omics profiles) need to be trained anew for each –omics experiment type. On the other hand, once trained using any text-based gold-standard, approaches such as NLP-ML can be used to classify sample descriptions irrespective of sample type. We demonstrate this versatility by using NLP-ML models trained on microarray sample descriptions to classify RNA-seq, ChIP-seq, and methylation samples.

All the tissue NLP-ML models will be available on GitHub, along with code for applying them to assign tissue labels to any piece of text.

13:00-13:15

Gathering specified and standardized image quality metadata through a cyber infrastructure

Xiaojun Wang, Tulane University Biodiversity Research Institute, United States
Yasin Bakis, Tulane University Biodiversity Research Institute, United States
Henry L. Bart Jr., Tulane University Biodiversity Research Institute, United States

13:15-13:30

Assessment of Image Quality Metadata of Digitized Biodiversity Collection Specimens

Xiaojun Wang, Tulane University Biodiversity Research Institute, United States
Henry Bart, Tulane University Biodiversity Research Institute, United States
Yasin Bakış, Tulane University Biodiversity Research Institute, United States

13:45-14:00

Automated Metadata Generation for Biological Specimen Image Collections

Joel Pepper, Drexel University, United States
David Breen, Drexel University, United States
Jane Greenberg, Drexel University, United States

Presentation Overview:

Over the last several decades advances in computing, imaging, and cyberinfrastructure have had a major impact on scientific research and discovery. One area of considerable activity is the digitization of the biological specimens that have been collected worldwide by museums and other research institutions. The scanning of these specimen collections and the placement of the resulting images into easily accessible repositories on the Internet is enabling new scientific studies based on the previously unavailable data. Unfortunately, potential scientific advances are hindered by the lack of high-quality and pertinent metadata associated with the image collections. Metadata is required to search the repositories for the imaged specimens needed for a particular study. Since the collections may each contain tens of thousands of images, producing metadata for each image via a manual process is prohibitively labor-intensive and infeasible. Methods for automatically computing metadata from images are therefore needed to fully exploit biological image repositories for scientific discovery.

As a step towards improving metadata in specimen research image collections, our team has been developing methods for automatically analyzing fish images to extract a variety of important features. These fish specimens are being studied for a larger project titled Biology Guided Neural Networks (BGNN), which is developing a novel class of artificial neural networks that can exploit the machine readable and predictive knowledge about biology that is available in the form of specimen images, phylogenies and anatomy ontologies. Using a combination of machine learning and image informatics tools and techniques, we can accurately determine metadata such as fish quantity and location within images, fish orientation and other quantitative fish features, image scaling based on ruler identification and measurement, and general image quality metrics for a substantial number of the images being used in the BGNN project.

Metadata is often unavailable, sparse or incorrect within specimen image repositories, but is vital for subsequent machine learning, analysis and scientific discovery. Our goal is to develop image metadata generation methods that both support the novel machine learning research underway within the BGNN project, and provide a framework for future technology developments that can be deployed by repository curators to improve and bolster the metadata they provide with their specimen images. A longer term goal is to extend the image analysis methods for computing specific quantitative features in support of specific biological investigations. For example, we are able to automatically measure the length of a fish specimen. Associating these measurements with location and acquisition date may provide insights into the influence of habitat factors on fish development/health. Since it is prohibitively expensive for scientists to manually gather this data, we are also interested in applying our tools to images of other species stored in a variety of repositories (e.g. iDigBio). The technical challenges in achieving a broader usage of our approach mostly involve training new classifiers for different types of species, learning to segment and read annotation tags, and generalizing our classifier to find and interpret different types of rulers. This presentation will report on our current efforts to automatically generate metadata for fish specimen images and offer thoughts on how to extend these techniques for other specimen image collections.
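The arithmetic behind ruler-based scaling and length measurement can be sketched as follows. This is a simplified illustration, not the BGNN pipeline: it assumes the ruler tick positions and the fish bounding box have already been detected by some upstream step, and all function names and pixel values are hypothetical.

```python
def pixels_per_cm(tick_positions_px, tick_spacing_cm=1.0):
    """Estimate image scale from the pixel positions of evenly spaced ruler ticks."""
    if len(tick_positions_px) < 2:
        raise ValueError("need at least two ruler ticks to estimate scale")
    ordered = sorted(tick_positions_px)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    mean_gap_px = sum(gaps) / len(gaps)
    return mean_gap_px / tick_spacing_cm


def length_cm(bbox, scale_px_per_cm):
    """Convert the longer side of an (x_min, y_min, x_max, y_max) box to centimetres."""
    x_min, y_min, x_max, y_max = bbox
    return max(x_max - x_min, y_max - y_min) / scale_px_per_cm


# Invented detections: four ruler ticks 1 cm apart and one fish bounding box.
scale = pixels_per_cm([100, 150, 200, 250])
fish_length = length_cm((300, 400, 900, 560), scale)
```

Averaging the gaps between ticks makes the scale estimate less sensitive to a single misdetected tick than using one pair alone.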

14:30-15:45

Keynote: Predicting the evolution of syntenies: An algorithmic overview

Nadia El-Mabrouk

International Society for Computational Biology
525-K East Market Street, RM 330
Leesburg, VA, USA 20176
