Research

The Centre for Infectious Disease Genomics and One Health (CIDGOH) at SFU is an interdisciplinary group of researchers with backgrounds in microbiology, molecular biology, computer science, cognitive science, statistics, public health, and genomics. We are interested in solving practical health problems using multidisciplinary approaches within a One Health framework. Our research combines knowledge engineering techniques (e.g. ontology modeling, data curation, semantic web), bioinformatics tools (e.g. genomic sequence analysis, phylogenetics, comparative genomics, text mining, workflow and platform development) and molecular laboratory experiments (microbial genomics, metagenomics, eDNA, and virome) to understand the impact of infectious diseases on human and animal health. Given the diversity of microorganisms (bacteria, viruses, parasites) and the complexity of the biological phenomena that we seek to understand, data-driven research in microbiology requires researchers to collaborate and to share data. Developing technologies to foster data harmonization and sharing, and building trusted networks of health care practitioners, researchers, and policy makers, are therefore key focuses of the Centre. We are interested in working with partners to generate, analyze, and share (and re-use) data for the broader benefit of the scientific community. Our facility includes a state-of-the-art molecular laboratory for microbial sample processing and omics data generation, together with access to high-performance computing clusters and cloud computing (courtesy of Compute Canada), providing a “one-stop-shop” for research and collaboration.

Projects

Our team members lead and participate in the following projects and consortia. Please click on the link for more information.

CS-DCC

The Climate-Smart Data Collaboration Centre (CS-DCC) connects ICT projects and supports data governance, harmonization, and analytics to advance omics research in climate-smart agriculture and food systems.

Environmental Surveillance of Avian Influenza Viruses

Avian Influenza (AI) is a viral disease that can cause significant morbidity and mortality in domestic poultry. Wild waterfowl are the reservoir for AI and the focus of AI surveillance programmes around the world. This project focuses on refining the AI sediment surveillance technology and methodology, validating the sediment surveillance approach in the field, and identifying the optimal combination of AI surveillance techniques for maximum efficiency and efficacy. The information generated in this project will be used to develop and implement a new Provincial Waterfowl AI Surveillance Program, with genomic analysis of wet sediments as its cornerstone. This project is a collaboration with Dr. Chelsea Himsworth at the BC Ministry of Agriculture Animal Health Centre and Dr. Natalie Prystajecky at the BC Centre for Disease Control Public Health Laboratory.

Canadian COVID Genomics Network VirusSeq

With $40M in funding from Innovation, Science, and Economic Development Canada (ISED), Genome Canada established the Canadian COVID-19 Genomics Network (CanCOGeN) in April 2020 in response to the COVID-19 pandemic. The mission of CanCOGeN is to establish a coordinated pan-Canadian, cross-agency network for large-scale SARS-CoV-2 viral and human host genome sequencing in order to track viral origin, spread and evolution; to characterize the role of human genetics in COVID-19 disease; and to inform time-sensitive, critical decision making by health authorities across Canada during the pandemic. The network further contributes to improving national public health capacity to address future outbreaks and pandemics. Led by Drs. William Hsiao and Emma Griffiths, members of the CIDGOH team develop data standards, provide data curation services, and develop software applications to facilitate the management, integration, and harmonization of COVID-19 data. Members of CIDGOH also participate in various CanCOGeN working groups, contributing bioinformatics, analytical, and curation expertise to the network.

VirusSeq National Data Portal:

The goal of the CanCOGeN VirusSeq project is to sequence up to 150,000 viral samples from Canadians testing positive for COVID-19. The VirusSeq Data Portal is an open-source and open-access data portal for all Canadian SARS-CoV-2 sequences and associated non-personal contextual data. It harmonizes and validates the data and automates submission to international databases. The CIDGOH curation team coordinates with public health agencies on data curation and upload to the portal.

Common Infrastructure for National Cohorts in Europe, Canada, and Africa (CINECA)

Over the last forty years, we have seen the emergence of increasing numbers of large cohorts of human samples from research and national healthcare initiatives (e.g. UK Biobank, H3Africa). In Europe alone, the BBMRI-ERIC directory lists nearly 600 distinct biobanks with over 100 million samples of biomaterials and genetic data. In Canada there are sixteen biobanks, and in Africa three H3Africa Biorepositories are now established. Rapid advances in molecular biology techniques such as DNA genotyping and sequencing have enabled ever larger cohorts of human exomes and genomes to be generated, facilitating the study of the underlying causes of disease. The CINECA project aims to build a federated solution that makes population-scale genomic and biomolecular data accessible across international borders, accelerating research and improving the health of individuals across continents. CIDGOH members are involved in Work Package 3: Cohort Level Metadata Representation, and facilitate the development of common metadata standards and natural language processing tools to harmonize the data.

Integrated Rapid Infectious Disease Analysis (IRIDA) Platform

Public health, food safety, and clinical microbiology labs around the world are embracing whole genome sequencing and genomic epidemiological approaches to modernize their infectious disease research, surveillance, diagnostics, and outbreak investigation programs. This new technology requires powerful yet intuitive, easy-to-use software in order to move these new Big Data methods out of the lab and onto the front lines of public health activity. In the past few years, the IRIDA Consortium has built an easy-to-use, open, and freely available platform. The IRIDA platform is designed to make infectious disease genomics accessible to epidemiologists, clinical microbiologists, and the broader research community. Ongoing development of the platform will see it become cloud-enabled. Dr. Hsiao is a founding member of the IRIDA Consortium.

Pandemics & Borders

During public health emergencies of international concern (PHEICs), effective global responses require coordinated action across jurisdictions. During the COVID-19 pandemic, countries have used travel measures to an unprecedented degree and in an uncoordinated way. Our Pandemics and Borders Project is analysing a global dataset on travel measures; systematically reviewing evidence of their impacts; and conducting case studies of decision making in Canada, the USA and Hong Kong. The project will support evidence-informed decision making on whether, when, and how travel measures should be used, and which ones. Our aims are to: 1) comparatively review and apply new methods to assess public health risks from travel during COVID-19; 2) evaluate the effectiveness of specific travel measures in mitigating public health risks during COVID-19 under different conditions; and 3) use the findings to develop scenarios and pilot training exercises that simulate decision making on managing borders during PHEICs. We will apply systematic reviews, various types of modelling, and viral genomic analyses to newly available datasets. We will focus on our three current case studies, with potential extension to other jurisdictions. The primary outcome will be strengthened capacity to make evidence-informed choices that enhance the coordinated use of travel measures during PHEICs.

Ontologies

Using the latest semantic web technologies, we develop standard vocabularies and data representations to support data management, integration, and sharing.

Next Generation Biobanking Ontology (NGBO)

A biobank contains a collection of biological samples, along with associated medical information about the sample donors, which can be used for different types of studies. Given the wealth of information that can be derived from stored information and biological materials, there is a pressing need to structure biobank data for more computer-amenable analyses. The utility of first-generation biobanks was originally evaluated simply on the number of samples that they contained. Currently, the value of biobank data lies in how it can be linked with other molecular and clinical data (“-omics data”) to provide new insights into health and disease. Linking data has thus far proven challenging, however, due to unstructured and incompatible data types. Here, we describe the development of a Next Generation Biobanking Ontology (NGBO) (https://github.com/Dalalghamdi/NGBO) that is capable both of supporting biospecimen processing, management, storage and retrieval infrastructure, and of acting as a knowledge hub for an integrated clinical and translational research ecosystem that incorporates -omics data. NGBO harmonizes the instrumentation and procedures used to prepare and process specimens, and also covers terminology used to describe computational biology algorithms, analytical tools, electronic-communication protocols, and in vitro assays. Laboratories, investigators, and other biobanks would also benefit from the knowledge contained in the ontology by using NGBO as a biobank data catalogue to which any existing unstructured data can be mapped.

FoodOn: A farm to fork food ontology

FoodOn is a consortium-driven project to build a comprehensive and easily accessible global farm-to-fork ontology about food, which accurately and consistently describes foods commonly known in cultures from around the world. Through our involvement with the IRIDA.ca project, the Hsiao Lab found that incompatible food vocabulary resources were a problem for agencies seeking to share genomic and contextual information about foodborne pathogen biosamples, ranging from Canada’s federal departments and provincial health authorities to international partners such as the US FDA, the European EFSA, and WHO/FAO. Our current work focuses on developing FoodOn to handle the vast majority of food sample descriptions, and on creating the cross-referencing necessary for food sample data translation between European and North American health authorities. Partners like IC-Foods.com and agencies like the USDA ARS are exploring the use of FoodOn within their research projects and database systems, and we see an opportunity to serve commercial platforms with FoodOn’s “lingua franca” of food products. More information regarding FoodOn can be found at http://foodon.org.

Genomic Epidemiology Ontology (GenEpiO)

To better harmonize and integrate genomics data into food microbiology and public health workflows, we have developed the Genomic Epidemiology Ontology (GenEpiO), which aims to provide a single, open-source, globally accessible set of terms to use in databases and software user interfaces. GenEpiO is being tested for use in a number of different platforms and initiatives, such as Canada’s Integrated Rapid Infectious Disease Analysis (IRIDA) platform, the US FDA’s GenomeTrakr Foodborne Pathogen Surveillance Network, the University of Warwick’s Enterobase sequence typing platform, and a new International Organization for Standardization (ISO) standard for the implementation of WGS for food microbiology. The use of GenEpiO has also been included as part of best practices for the application of genomic data supporting regulatory food safety. Our current projects focus on expanding the GenEpiO vocabulary with regard to animal health and agriculture in order to better fit a One Health approach to surveillance, analyses and investigations. In addition to foodborne pathogens, GenEpiO continues to expand its scope, incorporating vocabulary for other pathogens such as influenza and tuberculosis. The Hsiao Lab is also working to create ontology-driven tools and metadata specifications (fields of information) to help harmonize, exchange and integrate data across sectors. More information regarding GenEpiO can be found at www.genepio.org.

Tools

The CIDGOH team develops various software applications to support and enhance our research. We make these tools freely available under open-source licenses. Please find a list of tools below and check out our GitHub site for more tools!

AIV Seeker

AIV_seeker is a pipeline optimized for detecting and identifying low-abundance avian influenza virus (AIV) in metagenomic NGS data. The tool automates and orchestrates a workflow that processes metagenomic NGS data and generates subtyping results. A heatmap is generated for visualization, and a new prototype addresses the index-hopping issue that arises on the Illumina platform. Available as a web service.

COVID-MVP

SARS-CoV-2 Variants of Concern (VOC) & Interest (VOI) pose high risks to global public health. COVID-MVP tracks mutations from VOCs and VOIs to enable interactive visualization in near-real time. COVID-MVP has three modules: a Nextflow-wrapped workflow, nf-ncov-voc (https://github.com/cidgoh/nf-ncov-voc), for identifying mutations in genomic data; a Python module for functional annotation, based on literature curation; and an interactive visualization of the prevalence of mutations in variants and their functional impact, based on the Dash & Plotly frameworks. COVID-MVP is hosted at covidmvp.cidgoh.ca. More details can be found at https://github.com/cidgoh/COVID-MVP.

Data Harmonizer

A standardized spreadsheet editor and validator that can be run offline and locally, and which includes templates for SARS-CoV-2 sampling data. More details can be found at https://github.com/cidgoh/DataHarmonizer.

Genomic Epidemiology Entity Mart

The Genomic Epidemiology Entity Mart (GEEM) aims to provide a faster way for system implementers to gain access to the vocabulary standards provided by ontology communities like http://OBOFoundry.org, without requiring users to have training in ontology development. More information regarding GenEpiO can be found at http://genepio.org/.

LexMapr

http://lexmapr.cidgoh.ca/

Researchers often require open-source software tools for cleaning up and harmonizing free-text specimen descriptions. However, few options are currently available, as the majority of publicly available text mining systems focus on longer, grammatically well-formed texts. The Hsiao Lab has developed a text mining system, “LexMapr”, that mines short free-text specimen descriptions and maps the detected entities to terms from selected domain ontologies that provide the breadth of terms sought for sample description. LexMapr is an open-source tool; the source code is available at https://github.com/lexmapr/LexMapr, and we intend to provide an API platform for LexMapr in the near future.
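
To give a feel for the general idea of mapping short specimen descriptions to ontology terms, here is a minimal Python sketch. It is not LexMapr's implementation or API: the lexicon, the placeholder identifiers (EX:0001 and so on), and the function names normalize and map_description are all hypothetical, and the matching is a simple greedy longest-phrase lookup after naive normalization.

```python
import re

# Hypothetical mini-lexicon: labels and IDs are placeholders,
# not terms from any real ontology.
LEXICON = {
    "chicken breast": "EX:0001",
    "chicken": "EX:0002",
    "ground beef": "EX:0003",
    "lettuce": "EX:0004",
}

def normalize(text):
    """Lowercase, drop punctuation, and strip a trailing 's' (naive plural handling)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

def map_description(text, lexicon=LEXICON, max_len=3):
    """Greedily match the longest token n-gram found in the lexicon at each position."""
    tokens = normalize(text)
    matches = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                matches.append((phrase, lexicon[phrase]))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no match starting here; move to the next token
    return matches

print(map_description("Chicken breasts, frozen"))
print(map_description("Ground beef & lettuce"))
```

Preferring the longest phrase means “chicken breast” is mapped as a single entity rather than as the more general “chicken”; unmatched words such as “frozen” are simply skipped in this sketch.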

Nextflow Raw QC Analysis Tool

A Nextflow workflow for checking the quality of NGS data. It runs tasks across multiple compute infrastructures using a modular and portable stack, and comes with Docker/Singularity containers for ease of installation and maximum reproducibility.

SeqUDAS: Sequence Upload and Data Archiving System

Modern DNA sequencing machines generate several gigabytes (GB) of data per run. Organizing and archiving this data presents a challenge for small labs. The Hsiao Lab has created a Sequence Upload and Data Archiving System (SeqUDAS) that aims to ease the task of maintaining a sequence data repository through process automation and an intuitive web interface.