
Methodology

Contributing country: This is the "country of origin" of the sequence as determined by the /country field, a metadata field associated with the sequence entry that is defined as "locality of isolation of the sequenced organism indicated in terms of political names for nations, oceans or seas, followed by regions and localities." This is NOT the country where the DNA was sequenced; the sequencing country is not available in this dataset using these methods.

Primary Publications: Publications directly linked to a sequence accession record. A primary publication will, in most cases, be published by the scientists that originally generated the sequence(s) listed in the sequence accession record.

Secondary Publications: Publications in ePMC that cite a sequence accession number. Secondary publications were identified via text mining. Secondary publications may be indications of "re-use" of sequence data because they are subsequent to publication of the sequence data. However, the majority of secondary publications identified here do not overlap with primary publications, which perhaps suggests that they result simply from "normal" use of the INSDC database and subsequent publication based on this use.

Data Source

  1. European Nucleotide Archive (ENA) provides a comprehensive record of nucleotide sequence information
  2. Europe PubMed Central (Europe PMC) is an open science platform enabling access to life science publications and preprints from trusted sources

Data Extraction Process

The data processing extracts, filters and joins the ENA and ePMC data sets. ENA records are parsed (A1), filtered for a valid country tag, and fed into the ePMC RESTful API to extract matching secondary publications (B1) by ENA accession or project accession number. Primary publications are linked from the ENA record (A2) by DOI, PMCID or PMID. The resulting data sets are normalized into the tables ENA_SEQUENCES and PMC_REFERENCES and loaded into the data warehouse (A3, B2), alongside a curated list of the world's countries in table COUNTRIES and economic groups in table COUNTRY2GRP (C1). SQL queries (C2) are applied to generate the charts and reports in the Web application.
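The country filter in step A1 could look roughly like the following Python sketch. It is an illustration only, not the project's actual code: the record layout, the function names parse_country and filter_records, the sample accessions and the small KNOWN_COUNTRIES set (standing in for the curated COUNTRIES table) are all hypothetical. It also assumes that the nation name in a /country value precedes the first colon, with regions and localities after it.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for the curated COUNTRIES table.
KNOWN_COUNTRIES = {"Germany", "Brazil", "Kenya"}


def parse_country(qualifier: Optional[str]) -> Optional[str]:
    """Return the nation part of a /country value, or None if it is unusable."""
    if not qualifier:
        return None
    # Assumption: the nation precedes the first colon; regions/localities follow.
    nation = qualifier.split(":", 1)[0].strip()
    return nation if nation in KNOWN_COUNTRIES else None


def filter_records(records: List[Dict[str, Optional[str]]]) -> List[Dict[str, Optional[str]]]:
    """Keep only records with a valid country tag, normalized to the nation name."""
    kept = []
    for record in records:
        nation = parse_country(record.get("country"))
        if nation is not None:
            kept.append({**record, "country": nation})
    return kept


# Made-up sample records; the accessions are placeholders, not real ENA entries.
records = [
    {"accession": "ACC0001", "country": "Germany: Bavaria, Munich"},
    {"accession": "ACC0002", "country": "Atlantis"},
    {"accession": "ACC0003", "country": None},
    {"accession": "ACC0004", "country": "Brazil"},
]
filtered = filter_records(records)
```

Only the records that survive this filter would then be passed on to the publication lookup (B1).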



Data Warehouse Schema

Table schema of the WiLDSI data warehouse: The table ENA_SEQUENCES comprises metadata of a sequence stored in the EBI ENA database. The attributes accession and project accession are used to join secondary literature that cites sequences. The attribute country refers to the COUNTRIES table to resolve and group country-tagged ENA sequences. The table PMC_REFERENCES consists of all ePMC-published papers that reference an ENA sequence by accession or project accession, as well as references from ENA records to primary publications by DOI, PMID or PMCID.
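To make the join described above concrete, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The column names (project_accession, pub_id, ref_type), the sample rows and the choice of SQLite are assumptions made for illustration; the actual warehouse schema and database engine may differ, and the COUNTRY2GRP table is omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical column layout; only the table names come from the description above.
conn.executescript("""
CREATE TABLE COUNTRIES (name TEXT PRIMARY KEY);
CREATE TABLE ENA_SEQUENCES (
    accession         TEXT PRIMARY KEY,
    project_accession TEXT,
    country           TEXT REFERENCES COUNTRIES(name)
);
CREATE TABLE PMC_REFERENCES (
    pub_id            TEXT,  -- DOI, PMID or PMCID
    accession         TEXT,  -- set when the paper references a sequence accession
    project_accession TEXT,  -- set when the paper references a project accession
    ref_type          TEXT   -- 'primary' or 'secondary'
);
""")

# Made-up sample data, not real accessions or publications.
conn.executemany("INSERT INTO COUNTRIES VALUES (?)", [("Germany",), ("Brazil",)])
conn.executemany("INSERT INTO ENA_SEQUENCES VALUES (?, ?, ?)", [
    ("ACC0001", "PRJ0001", "Germany"),
    ("ACC0002", "PRJ0001", "Germany"),
    ("ACC0003", "PRJ0002", "Brazil"),
])
conn.executemany("INSERT INTO PMC_REFERENCES VALUES (?, ?, ?, ?)", [
    ("PUB1", "ACC0001", None, "primary"),
    ("PUB2", "ACC0001", None, "secondary"),
    ("PUB3", None, "PRJ0001", "secondary"),
    ("PUB4", "ACC0003", None, "secondary"),
])

# Count distinct secondary publications per contributing country, matching
# references by either sequence accession or project accession.
rows = conn.execute("""
SELECT s.country, COUNT(DISTINCT r.pub_id) AS n_secondary
FROM ENA_SEQUENCES AS s
JOIN PMC_REFERENCES AS r
  ON r.accession = s.accession
  OR r.project_accession = s.project_accession
WHERE r.ref_type = 'secondary'
GROUP BY s.country
ORDER BY s.country
""").fetchall()
```

COUNT(DISTINCT ...) matters here because a paper that cites a project accession matches every sequence in that project and would otherwise be counted once per sequence.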