Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide

Taxonomic Names and Metadata: A Framework for Big Data Interoperability

Schedule subject to change.
All times in Central Daylight Time (CDT)
Thursday, May 13th
9:00-10:00
Keynote
  • Michael Osterholm
11:00-11:15
Taxonomic Names and Metadata Codes for Biodiversity Specimens: the Rules and the Road
  • Xiaojun Wang, Tulane University, United States
  • Yasin Bakis, Tulane University, United States
  • Henry Bart, Tulane University, United States

Research involving organisms depends critically on a long-standing tradition of defining species and higher groups of organisms and assigning names to them (taxonomy), as well as conventions for recording metadata on where, when and under what conditions specimens of organisms were captured and the complete subsequent history of study of those specimens (provenance). In this presentation, we provide an overview of nomenclatural codes that govern the names of taxa and ensure uniqueness and universality of taxon names (the Rules), and the codes (standards) that have been adopted for recording metadata describing provenance of biological specimens (the proverbial Road). By “biodiversity specimens” we mean specimens of organisms captured in the field and archived in museums, and digital representations (i.e., images) of those specimens. We discuss problems involving the taxonomic names assigned to biodiversity specimens and/or images of them. We also discuss conventions for metadata descriptions specific to biodiversity specimens and digital representations of specimens. We provide examples of the challenges that problems in both of these arenas present to using biodiversity specimens/images in research, and some solutions to those challenges based on our work with fish specimen images in the Biology-Guided Neural Networks for Discovering Phenotypic Traits project.

11:15-11:30
Taxon Names Global to Local: Uses, Issues, Potential
  • Matthew Yoder, University of Illinois Urbana Champaign, United States
  • Deborah Paul, University of Illinois Urbana Champaign, United States

Humans do the work to reveal the interconnectedness of life on the planet that supports it. Taxonomic names and related metadata and media underpin this critical research. These scientific endeavors across disciplines support local, regional, and global initiatives such as species discovery, conservation assessments, restoration work, science communication, agricultural crop science, ecological and biological associations investigation, preventing, mitigating, and predicting pathogen-related spill-over events, and discovering, understanding, and protecting natural resources.

Data organized, standardized, and aggregated taxonomically contribute to our efforts to address small and large-scale science and synthetic research (e.g. invasive species, human encroachment, extinction prediction, water quality monitoring, niche modeling, etc.). Taxonomic names help us manage and access related data in spreadsheets and in our respective databases for collection management, lab data, aggregated biodiversity data, and so on.

Taxon names (and related metadata) play roles at global, regional, institutional, and personal scales. For each, we offer some examples focusing on advances in development, best practices adopted and currently used, and opportunities for improvement. Globally, we will look at the Global Biodiversity Information Facility (GBIF), the Catalogue of Life (CoL, https://www.catalogueoflife.org/), Global Names (http://globalnames.org/, https://github.com/gnames), the Biodiversity Information Standards (TDWG) Group, the Biodiversity Community Integrated Knowledge Library (BiCIKL), and Bionomia (https://bionomia.net/). Regionally, we check out iDigBio (https://www.idigbio.org/), DiSSCo (https://www.dissco.eu/), ITIS (https://www.itis.gov/), WoRMS (http://www.marinespecies.org/), TaxonWorks (https://taxonworks.org/), and iNaturalist (https://www.inaturalist.org/). Institutionally, we will examine the Biodiversity Heritage Library (BHL). Then, we share some current realities of every-person examples from several biodiversity disciplines, including those who manage collections.

Specimens vouchered in scientific collections ground Taxonomy (Thompson CW 2021, Sharkey MJ 2021) and support reproducible research. Images of specimens offer opportunities to increase access to taxonomic information both by humans and computers. Rich (meta)data about these specimens and images make it possible for this information to be FAIR (findable, accessible, interoperable, reusable) and linked to unambiguously relate vouchered specimens, samples, and related sequences (Thompson CW 2021). Collectively these data create the requisite foundation for a robust biological science whose mission is increasingly focused on the genome (Thompson CW 2021).

Consider your own known scientific needs for and reliance on taxonomic information through the lens of a user story, for example: As a(n) disease ecologist, ichthyologist, computer scientist, microbiologist, science writer, instructor, etc., I need information (I) in format (F) to answer question (Q) for audience (A). Many known issues arise when using, tracking, and trying to keep up with taxon names and related (meta)data. We look forward to hearing about your particular experiences. Keeping your own requirements and expertise networks in mind, join us as we look at some specific examples of the ways in which taxon names and associated (meta)data support scientific work varying in both scale and scope and present challenges.

Sharkey MJ, Janzen DH, Hallwachs W, Chapman EG, Smith MA, Dapkey T, Brown A, Ratnasingham S, Naik S, Manjunath R, Perez K, Milton M, Hebert P, Shaw SR, Kittel RN, Solis MA, Metz MA, Goldstein PZ, Brown JW, Quicke DLJ, van Achterberg C, Brown BV, Burns JM (2021) Minimalist revision and description of 403 new species in 11 subfamilies of Costa Rican braconid parasitoid wasps, including host records for 219 species. ZooKeys 1013: 1–665. https://doi.org/10.3897/zookeys.1013.55600

Thompson CW, Phelps KL, Allard MW, Cook JA, Dunnum JL, Ferguson AW, Gelang M, Khan FAA, Paul DL, Reeder DM, Simmons NB, Vanhove MPM, Webala PW, Weksler M, Kilpatrick CW. 2021. Preserve a voucher specimen! The critical need for integrating natural history collections in infectious disease studies. mBio 12:e02698-20. https://doi.org/10.1128/mBio.02698-20.

11:30-11:45
Biodiversity informatics from rows to things, and semantics to practice
  • Matthew Yoder, Illinois Natural History Survey, Prairie Research Institute, University of Illinois, United States
  • Deborah Paul, Illinois Natural History Survey, Prairie Research Institute, University of Illinois, United States

Since the advent of computers, biodiversity experts have been using them to manage taxon names, their metadata, and the vast data they point to. After an explosion of digital tools and methods, various standards, best practices, and shared services have emerged. These federating efforts facilitate data sharing and re-use; however, they also place newfound constraints and demands on both developers and data curators, i.e. the scientists the tools seek to serve. Balancing the need to allow for growth and novelty with the need for standards and semantics, while also prioritizing the production of tools that day-to-day scientists require, is itself a major undertaking. Here we highlight a series of anecdotal examples that elucidate the complex nature of this enterprise.

First we focus on taxon names, the primary "address" of biological data. We clarify their role in the network of biodiversity data, describe cutting-edge efforts to detect them at a global scale using supercomputers, the Hathi-Trust corpus, and the GlobalNames (https://gnames.org) initiative, and give an overview of NOMEN (https://github.com/SpeciesFileGroup/nomen), an ontology that seeks to provide a logical interpretation of the rules of zoological nomenclature. Names are not biological concepts; systems that fail to distinguish this fundamental difference are bound to be plagued by downstream issues that conflate nomenclatural synonymy with biological synonymy. We describe this distinction, the need for clear boundaries, and an approach provided in TaxonWorks (https://taxonworks.org), a curatorial platform for biodiversity data.

Names are just one class of identifiers, i.e. metadata that "localizes" users and machines to related data. Modern systems must assume that things have multiple identifiers, allowing curators to reference and assert data via different standards. In practice, systems that reference external standards require ORBs (object-request brokers) that provide temporary identifiers while fixed, more stable or permanent identifiers are discussed, refined, and minted.
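
The idea of one "thing" carrying several identifiers, with a temporary identifier standing in until a permanent one is minted, can be illustrated with a minimal Python sketch. This is not how TaxonWorks or any ORB is implemented; the class name IdentifierRegistry, its methods, the scheme labels, and the example identifier values are all hypothetical and chosen only for illustration.

```python
import itertools


class IdentifierRegistry:
    """Toy registry in which one record can be reached through many identifiers."""

    def __init__(self):
        self._index = {}                  # (scheme, value) -> record dict
        self._counter = itertools.count(1)

    def register(self, data):
        """Store a new record under a temporary local identifier and return it."""
        key = ("local", f"tmp-{next(self._counter)}")
        self._index[key] = {"data": data, "identifiers": [key]}
        return key

    def add_identifier(self, known_key, scheme, value):
        """Attach another identifier (e.g. a newly minted permanent one)."""
        record = self._index[known_key]
        new_key = (scheme, value)
        # Refuse to let one identifier point at two different records.
        if self._index.get(new_key, record) is not record:
            raise ValueError(f"{new_key} already points to a different record")
        if new_key not in record["identifiers"]:
            record["identifiers"].append(new_key)
        self._index[new_key] = record

    def resolve(self, scheme, value):
        """Return the data for whichever identifier the caller happens to hold."""
        return self._index[(scheme, value)]["data"]


# Hypothetical example values; none of these identifiers refer to real records.
registry = IdentifierRegistry()
temp = registry.register({"kind": "specimen", "note": "example record"})
registry.add_identifier(temp, "catalog_number", "EXAMPLE-0001")
registry.add_identifier(temp, "uri", "https://example.org/specimen/0001")

same = registry.resolve("uri", "https://example.org/specimen/0001")
print(same is registry.resolve(*temp))
```

In this sketch the temporary identifier is never discarded, so references made before the permanent identifier existed keep resolving to the same record.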

All of these requirements, and many others like them, must ultimately be boiled down to tools that are usable, and understandable by scientists "on the ground". These tools must integrate knowledge across a wide range of domains, for example to cover the content produced in taxonomic monographs. We highlight our efforts in this area, the application workbench TaxonWorks, and demonstrate its ability to contribute a wide range of data, from highly semantic nomenclature, to images, specimen data, and annotations, to the knowledge-graph-of-life via multiple mechanisms including common standards, CSV, and JSON.

Finally, as we evolve from talking about rows of data to talking about digital "things" we make a plea for the biodiversity enterprise to adopt a questions-first perspective as a means of driving its evolution. For example "To answer the question Q coming from audience A, I need (my) data D, packaged in format F". This approach helps to ensure that exercises in abstraction (e.g. coming up with data-standards) are balanced with the actual needs of researchers on-the-ground. It also permits us to robustly answer the problem described to the best of our ability, then move forward to tackle the infinite number of next steps.

12:45-13:00
Systematic tissue annotations of –omic samples by modeling unstructured metadata
  • Arjun Krishnan, Michigan State University, United States
  • Nathaniel Hawkins, Michigan State University, United States
  • Marc Maldaver, Michigan State University, United States
  • Lindsay Guare, Michigan State University, United States

There are currently >1.3 million human –omics samples that are publicly available. However, this valuable resource remains acutely underused because discovering samples, say from a particular tissue of interest, from this ever-growing data collection is still a significant challenge. The major impediment is that sample attributes such as tissue/cell-type of origin are routinely described using non-standard, varied terminologies written in unstructured natural language.

Here, we propose a natural-language-processing-based machine learning approach (NLP-ML) to infer tissue and cell type annotations for –omics samples based only on their free-text metadata. NLP-ML works by creating numerical representations of sample text descriptions and using these representations as features in a supervised learning classifier that predicts tissue/cell-type terms in a structured ontology.
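
The general pattern described here (free text turned into a numerical representation, then fed to a supervised classifier) can be sketched in a few lines of Python. The abstract does not specify the representation or the classifier, so this sketch substitutes two simple stand-ins: bag-of-words counts and a nearest-centroid rule with cosine similarity. It is not the NLP-ML method itself, and the training descriptions and labels below are invented toy data.

```python
import math
import re
from collections import Counter, defaultdict


def embed(text):
    """Turn free text into a sparse bag-of-words vector (token -> count)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labeled):
    """Sum the vectors of each label's training descriptions into one centroid."""
    centroids = defaultdict(Counter)
    for text, label in labeled:
        centroids[label].update(embed(text))
    return dict(centroids)


def predict(centroids, text):
    """Return the label whose centroid is most similar to the description."""
    vec = embed(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Invented toy examples standing in for a text-based gold standard.
TRAINING = [
    ("hepatocytes isolated from liver biopsy", "liver"),
    ("liver tissue, hepatocellular carcinoma", "liver"),
    ("whole blood, peripheral blood mononuclear cells", "blood"),
    ("PBMC from healthy donor blood", "blood"),
]

model = train(TRAINING)
print(predict(model, "primary hepatocytes, adult liver"))
```

Because the classifier only ever sees text, the same trained model can be applied to a description of any sample type, which is the property the abstract highlights.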

Our approach significantly and substantially outperforms an advanced text annotation method (MetaSRA) that uses graph-based reasoning and a baseline method (Tagger) that annotates text based on exact string matching. We demonstrate the biological interpretability of tissue NLP-ML models using an analysis of their similarity to each other and an evaluation of their ability to classify tissue- and disease-associated biological processes based on their text descriptions alone.

Previous studies have shown that the molecular profiles associated with –omics samples are highly predictive of a variety of sample attributes. Using transcriptome data, we show that NLP-ML models can be nearly as accurate as expression-based models in predicting sample tissue annotations. However, the latter (models based on –omics profiles) need to be trained anew for each –omics experiment type. On the other hand, once trained using any text-based gold-standard, approaches such as NLP-ML can be used to classify sample descriptions irrespective of sample type. We demonstrate this versatility by using NLP-ML models trained on microarray sample descriptions to classify RNA-seq, ChIP-seq, and methylation samples.

All the tissue NLP-ML models will be available on GitHub, along with code to apply them to any text data to assign tissue labels.

13:00-13:15
Gathering specified and standardized image quality metadata through a cyber infrastructure
  • Xiaojun Wang, Tulane University Biodiversity Research Institute, United States
  • Yasin Bakis, Tulane University Biodiversity Research Institute, United States
  • Henry L. Bart Jr., Tulane University Biodiversity Research Institute, United States

Interest is growing in assigning quality metrics to images that can influence the selection of training and testing datasets and improve experimental results in machine learning applications. One of the aims of the Biology-Guided Neural Networks for Discovering Phenotypic Traits Project (BGNN) is to establish a set of standardized, human-gathered, image quality properties that improve the performance of neural networks in classification, trait segmentation and trait extraction tasks involving a collection of 70,000 fish specimen images produced by the Great Lakes Invasives Network (GLIN). Our review of metadata associated with the GLIN images revealed that the metadata needed for assessing many aspects of the quality of the images were insufficient, unspecified, and unstandardized. To address this issue, we developed cyber infrastructure consisting of a web-based form and relational database to allow technicians to visualize the images and capture metadata for 22 image-quality properties. We imported 23,066 fish specimen images from the Illinois Natural History Survey (INHS) Fish Collection into the system. At the time of this abstract submission, metadata for 22 image quality properties had been captured for 21,496 images representing 188 fish species. The image-quality metadata have been shown to improve the performance of neural networks in image-based, species-identification experiments.

13:15-13:30
Assessment of Image Quality Metadata of Digitized Biodiversity Collection Specimens
  • Xiaojun Wang, Tulane University Biodiversity Research Institute, United States
  • Henry Bart, Tulane University Biodiversity Research Institute, United States
  • Yasin Bakış, Tulane University Biodiversity Research Institute, United States

The imaging of biodiversity collection specimens has increased substantially within the last few decades, especially after the introduction of new methods of morphological analysis and the use of artificial intelligence technologies such as neural networks for species identification and trait data extraction. However, this requires multimedia to be captured in a specific format and presented with appropriate descriptors. In this study we present an analysis of two image repositories, each consisting of 2D images of fish specimens from several institutions: Integrated Digitized Biocollections (iDigBio) and Great Lakes Invasives Network (GLIN). We processed approximately 70 thousand images from the GLIN repository and 450 thousand images from the iDigBio repository and assessed their suitability for use in neural network-based species identification and trait extraction applications. We found that GLIN images were more successful for our purposes. Almost 40% of the species are represented by fewer than 10 images, while 20% have more than 100 images. 70% of the GLIN images have been found to be appropriate for further analysis according to overall image quality score. Quality issues with these images included curved specimens; objects in the images such as tags, labels and rocks that obstructed the view of the specimen; color, focus and brightness issues; folded or overlapping parts; and missing parts. We searched the iDigBio database for fish taxa, which returned 450 thousand image records. We were able to filter these down to 90 thousand fish images by using multimedia metadata to exclude non-fish images, fossil samples, X-ray and CT scans, and several other categories. Only 44% of these 90 thousand images have been found suitable for further analysis so far. We introduce new approaches to processing images of biodiversity specimens and new metadata fields specific to assessing image quality.

13:45-14:00
Automated Metadata Generation for Biological Specimen Image Collections
  • Joel Pepper, Drexel University, United States
  • David Breen, Drexel University, United States
  • Jane Greenberg, Drexel University, United States

Over the last several decades advances in computing, imaging, and cyberinfrastructure have had a major impact on scientific research and discovery. One area of considerable activity is the digitization of the biological specimens that have been collected worldwide by museums and other research institutions. The scanning of these specimen collections and the placement of the resulting images into easily accessible repositories on the Internet is enabling new scientific studies based on the previously unavailable data. Unfortunately, potential scientific advances are hindered by the lack of high-quality and pertinent metadata associated with the image collections. Metadata is required to search the repositories for the imaged specimens needed for a particular study. Since the collections may each contain tens of thousands of images, producing metadata for each image via a manual process is prohibitively labor-intensive and infeasible. Methods for automatically computing metadata from images are therefore needed to fully exploit biological image repositories for scientific discovery.

As a step towards improving metadata in specimen research image collections, our team has been developing methods for automatically analyzing fish images to extract a variety of important features. These fish specimens are being studied for a larger project titled Biology Guided Neural Networks (BGNN), which is developing a novel class of artificial neural networks that can exploit the machine readable and predictive knowledge about biology that is available in the form of specimen images, phylogenies and anatomy ontologies. Using a combination of machine learning and image informatics tools and techniques, we can accurately determine metadata such as fish quantity and location within images, fish orientation and other quantitative fish features, image scaling based on ruler identification and measurement, and general image quality metrics for a substantial number of the images being used in the BGNN project.

Metadata is often unavailable, sparse or incorrect within specimen image repositories, but is vital for subsequent machine learning, analysis and scientific discovery. Our goal is to develop image metadata generation methods that both support the novel machine learning research underway within the BGNN project, and provide a framework for future technology developments that can be deployed by repository curators to improve and bolster the metadata they provide with their specimen images. A longer term goal is to extend the image analysis methods for computing specific quantitative features in support of specific biological investigations. For example, we are able to automatically measure the length of a fish specimen. Associating these measurements with location and acquisition date may provide insights into the influence of habitat factors on fish development/health. Since it is prohibitively expensive for scientists to manually gather this data, we are also interested in applying our tools to images of other species stored in a variety of repositories (e.g. iDigBio). The technical challenges in achieving a broader usage of our approach mostly involve training new classifiers for different types of species, learning to segment and read annotation tags, and generalizing our classifier to find and interpret different types of rulers. This presentation will report on our current efforts to automatically generate metadata for fish specimen images and offer thoughts on how to extend these techniques for other specimen image collections.
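
The arithmetic behind ruler-based scaling and length measurement can be shown with a short Python sketch. It assumes that an upstream step (not shown, and not described in the abstract) has already located two ruler tick marks a known distance apart and two landmarks at the ends of the specimen, all as pixel coordinates. The function names and the coordinates are hypothetical.

```python
import math


def pixels_per_cm(tick_a, tick_b, cm_between):
    """Image scale from two ruler tick positions a known distance apart."""
    if cm_between <= 0:
        raise ValueError("cm_between must be positive")
    return math.dist(tick_a, tick_b) / cm_between


def length_cm(point_a, point_b, scale):
    """Convert a pixel distance between two landmarks into centimetres."""
    return math.dist(point_a, point_b) / scale


# Hypothetical pixel coordinates: ruler ticks 5 cm apart, then snout and tail tip.
scale = pixels_per_cm((100, 50), (400, 50), cm_between=5.0)
fish_length = length_cm((120, 300), (720, 300), scale)
print(f"{scale:.1f} px/cm, specimen length {fish_length:.1f} cm")
```

A straight-line distance between two landmarks underestimates the length of a curved specimen, so a measurement like this is only meaningful for images in which the specimen lies flat and straight.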

14:30-15:45
Keynote: Predicting the evolution of syntenies: An algorithmic overview
  • Nadia El-Mabrouk

Syntenies are genomic segments of consecutive genes identified by a certain conservation in gene content and order. Regardless of the way they are identified, the goal is to characterize homologous genomic regions, i.e. regions deriving from a common ancestral region. Most algorithmic studies for inferring the evolutionary history that has led from the ancestral segment to the extant ones focus on inferring the rearrangement scenarios explaining their disruption in gene order. However, syntenies also evolve through other events modifying their gene content, such as duplications, losses or horizontal transfers. While the reconciliation approach between a gene tree and a species tree addresses the problem of inferring such events for single genes, little effort has been dedicated to generalizing it to segmental events and to syntenies. In this presentation, I will review the main algorithmic methods for inferring ancestral syntenies and focus on those integrating both gene orders and gene trees.



International Society for Computational Biology
525-K East Market Street, RM 330
Leesburg, VA, USA 20176
