Molecular Data

Description

Here, we detail the data that will be collected from human clinical samples subjected to molecular approaches (including next-generation sequencing [NGS]) to characterize the genetic information of the microorganisms present in these samples. We will use biosensors based on DNA sequencing to identify and quantify microorganisms and viruses in clinical samples, informed by the first-level signals coming from the surveillance of respiratory syndromes. Molecular characterization can accelerate the identification of (re)emerging pathogens and, with shotgun-based approaches, also allows the identification of previously unknown microorganisms and viruses.

The data generated by DNA sequencing (NGS) have considerable potential, since they can reveal uncultivable viruses and microorganisms at the same time, allowing a comprehensive characterization of the clinical or environmental microbiota.

Several studies have used molecular/metagenomic data to detect pathogens in human and environmental samples. A non-exhaustive list of examples is given below (see References [1] to [5]).

Data access information

A specific data use agreement covering the transfer of raw sequence data outside AESOP team members is being created. Aggregated data tables containing the number or proportions of microorganisms in panel/sequencing data across samples will be made available to federation members as data is produced.

The AESOP team will be able to access the raw and processed metagenomic data in the AESOP HPC facility. The bioinformatics pipelines and the data analysis and visualization code will be available in the AESOP GitHub repository.

We already have the necessary permits to perform metagenomic sequencing focused on the microbial and viral community within the subjects. All genetic material sequenced in Brazil will be registered in the SisGen platform. SisGen was created following the enactment of Law 13.123/2015 (“Biodiversity Law”), which regulates access to components of the genetic heritage, protection of and access to associated traditional knowledge, and the fair and equitable sharing of benefits for the conservation and sustainable use of Brazilian biodiversity.

Methods of data collection

When a potential outbreak is identified, we ask local health authorities to immediately collect samples from 100 patients who meet the flu case definition and have had symptoms for fewer than 5 days, through a systematic convenience sampling process. The samples will be individually submitted to RT-qPCR for SARS-CoV-2, Influenza A (FluA), Influenza B (FluB), and RSV detection. In parallel, pools of 10 samples (500 µL of each) will be prepared and stored in aliquots for pathogen detection using NGS and future validation.
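The pooling step described above can be illustrated with a minimal sketch. It assumes samples are pooled in the order they were collected; the sample identifiers, the `make_pools` helper, and the pool naming scheme are hypothetical and are not part of the AESOP pipeline.

```python
from typing import Dict, List


def make_pools(sample_ids: List[str], pool_size: int = 10) -> Dict[str, List[str]]:
    """Split sample IDs into consecutive pools of `pool_size` samples."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    pools: Dict[str, List[str]] = {}
    for start in range(0, len(sample_ids), pool_size):
        pool_id = f"pool_{start // pool_size + 1:02d}"  # hypothetical naming scheme
        pools[pool_id] = sample_ids[start:start + pool_size]
    return pools


# Hypothetical identifiers for the 100 patients sampled after one alert.
samples = [f"S{i:03d}" for i in range(1, 101)]
pools = make_pools(samples, pool_size=10)

ALIQUOT_UL = 500  # volume taken from each sample, as stated in the text
for pool_id, members in pools.items():
    total_ul = ALIQUOT_UL * len(members)
    print(pool_id, len(members), "samples,", total_ul, "uL total")
```

Keeping the pool-to-sample mapping makes it possible to trace an NGS finding in a pool back to the individual RT-qPCR results of its members.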

The biological samples will be stored at the regional unit of the Fiocruz Genomics Network closest to where the outbreak is identified. The DNA sequence data will be stored in the CIDACS HPC cluster.

The samples will be collected based on outbreak alerts given by the integration of i) respiratory infection symptoms, ii) OTC, iii) social media, iv) environmental, and v) socioeconomic data. We expect that, despite being irregular, the sampling might have seasonal behavior.

Data-specific information

The molecular sensor of AESOP will generate tens of terabytes of data. The data types will be:

  • Raw DNA short-read sequence data (.fastq): FASTQ files contain the DNA sequences themselves and a quality score for each base. Each record spans four lines: the sequence identifier, the DNA sequence, a separator line, and the per-base quality scores. In AESOP, we will generate 2 million sequences for a pool of 100 individuals. We will use the following sequencing approaches:
    1. Respiratory Pathogen ID/AMR Enrichment Panel (RPIP); and

    2. Metatranscriptomics.

    The former targets several well-known pathogens, such as viruses, bacteria, and fungi, as well as antimicrobial resistance genes (AMRs) (Table 1). The latter will allow us to identify new pathogens.

  • Processed DNA short-read sequence data (.fasta): Artifacts from the sequencing process and host sequences will be removed. FASTA files contain only the DNA sequences: each record has the sequence identifier on one line and the DNA sequence on the following line.

  • Assembled sequences (.fasta): These files consist of complete genomes, draft genomes, or contigs. Each record has the sequence identifier on one line and the DNA sequence on the following line. Compared with short-read files, assembled FASTA files contain fewer but longer sequences.

  • Raw annotation files (.csv, .txt, .tsv): Tabular data produced by the annotation software. One file holds the taxonomic annotation and another the functional annotation. Usually, this type of file has four columns and hundreds of thousands of rows. All taxonomic or functional levels are provided in a single file.

  • Processed annotation files (.csv, .txt, .tsv): Tabular data processed for statistical analysis and visualization. Usually, this type of file has hundreds to tens of thousands of columns and hundreds or thousands of rows, depending on the taxonomic or functional level of the analysis. Each taxonomic or functional level generates a single file.

  • Statistical analysis outputs (.csv, .txt, .tsv): Tabular data containing the results of several statistical analyses. The layout depends on the analysis performed, but files usually have dozens of columns and hundreds of rows.

  • Visualization products (.png, .jpg, .tiff): Generated figures, e.g., stacked bar charts, scatter plots, heatmaps, and multivariate analysis biplots (PCA, nMDS, CCA).
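The relationship between the raw .fastq and processed .fasta files listed above can be sketched as follows. This is an illustration only: it assumes four-line FASTQ records with Phred+33 quality encoding, and the mean-quality filter stands in for the actual artifact and host-sequence removal, which the text does not specify. The function names and the threshold are hypothetical.

```python
import io
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str, List[int]]


def read_fastq(handle) -> Iterator[Record]:
    """Yield (identifier, sequence, Phred scores) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        # Assumption: Phred+33 encoding (quality = ASCII code - 33).
        yield header[1:], seq, [ord(c) - 33 for c in qual]


def to_fasta(records: Iterable[Record], min_mean_quality: float = 20.0) -> str:
    """Keep reads whose mean quality passes a hypothetical threshold; emit FASTA text."""
    lines: List[str] = []
    for name, seq, scores in records:
        if scores and sum(scores) / len(scores) >= min_mean_quality:
            lines.append(f">{name}")
            lines.append(seq)
    return "\n".join(lines) + ("\n" if lines else "")


# Two made-up reads: read1 has high quality ('I' = 40), read2 has low quality ('!' = 0).
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\n!!!!\n")
fasta_text = to_fasta(read_fastq(example))
print(fasta_text, end="")
```

The FASTA output drops the quality line entirely, which is why processed and assembled files are smaller than the raw reads they derive from.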

All data will be stored in the AESOP HPC facility, and the raw sequence data will be held on other servers at Fiocruz. All the code used for the bioinformatics analysis, including the pipeline implementation, the statistical analysis, and the data visualization, will be maintained in the AESOP GitHub repository.
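As an illustration of how a raw annotation file relates to the processed, per-level tables and to the count or proportion tables shared with federation members, the sketch below pivots a long four-column table into a sample-by-taxon table. The column names (`sample`, `level`, `taxon`, `count`) and the example values are assumptions made for this sketch; the actual columns depend on the annotation software.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Tuple

# Made-up raw annotation in long format: all levels in a single file.
RAW = (
    "sample\tlevel\ttaxon\tcount\n"
    "pool_01\tgenus\tStreptococcus\t30\n"
    "pool_01\tgenus\tPrevotella\t10\n"
    "pool_02\tgenus\tStreptococcus\t5\n"
    "pool_02\tgenus\tCandida\t15\n"
    "pool_01\tfamily\tStreptococcaceae\t30\n"
)


def wide_table(raw_text: str, level: str) -> Tuple[List[str], Dict[str, List[int]]]:
    """Pivot one taxonomic level into rows = samples, columns = taxa."""
    counts: Dict[str, Dict[str, int]] = defaultdict(dict)
    taxa = set()
    for row in csv.DictReader(io.StringIO(raw_text), delimiter="\t"):
        if row["level"] != level:
            continue
        per_sample = counts[row["sample"]]
        per_sample[row["taxon"]] = per_sample.get(row["taxon"], 0) + int(row["count"])
        taxa.add(row["taxon"])
    columns = sorted(taxa)
    # Taxa absent from a sample get a count of zero.
    table = {s: [c.get(t, 0) for t in columns] for s, c in counts.items()}
    return columns, table


def to_proportions(table: Dict[str, List[int]]) -> Dict[str, List[float]]:
    """Convert each sample's counts into proportions of that sample's total."""
    out: Dict[str, List[float]] = {}
    for sample, values in table.items():
        total = sum(values)
        out[sample] = [v / total if total else 0.0 for v in values]
    return out


columns, genus_counts = wide_table(RAW, "genus")
genus_props = to_proportions(genus_counts)
print(columns)
print(genus_counts)
print(genus_props)
```

Producing one wide table per level is what makes the processed files much wider than the raw ones: every taxon observed in any sample becomes a column.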

Table 1 - Major pathogens and AMRs targeted in the RPIP sequencing approach.

| Pathogen type | Number of strains/genes | Examples |
|---|---|---|
| Viruses | 42 | Coxsackievirus A; Human adenovirus B; Influenza A viruses; Rhinovirus; SARS coronavirus; SARS-CoV-2 (2019-nCoV) |
| Bacteria | 187 | Nocardia nova; Ochrobactrum anthropi; Pseudomonas stutzeri; Prevotella melaninogenica; Streptococcus agalactiae; Treponema denticola; Yersinia pestis |
| Fungi | 54 | Alternaria alternata; Candida auris; Exophiala dermatitidis; Purpureocillium lilacinum; Schizophyllum commune; Trichosporon asahii |
| AMRs | 1218 | Antibacterials (Aminoglycosides, Carbapenems, Fluoroquinolones); Antimycobacterials (First-line: Isoniazids, Pyrazinamides. Second-line: Ethionamides, Aminoglycosides); Antivirals (Oseltamivir, Zanamivir, Peramivir, Laninamivir, Baloxavir) |

Limitations of the biological dataset

For logistical reasons, the most significant limitation will be accessing remote areas in Brazil to collect biological samples. Difficult-to-access regions, which may be the origin centers of outbreaks, will be monitored using other AESOP data. In addition, we will focus efforts on collecting patient samples in larger city centers close to those locations. The choice of sampling locations will consider how connected these areas are, including information about road, airport, and river networks.

References

Contributors

Pedro Milet Meirelles

Institute of Biology, Federal University of Bahia, Salvador, Brazil

National Institute for Interdisciplinary and Transdisciplinary Studies in Ecology and Evolution (IN-TREE), Salvador, Brazil