GBIF Data Processing

Introduction

Every occurrence record in GBIF goes through a series of processing steps prior to being made available on GBIF.org. The goal of this processing is to increase the consistency, discoverability and usability for GBIF end users. During processing and data normalization, GBIF may apply flags to highlight changes or issues to both end users and publishers. All unprocessed verbatim data is available to end users via Darwin Core Archive (DwC-A) downloads.

Crawling

The first processing step is to harvest data from the service endpoint listed in the GBIF registry. If multiple services are registered, GBIF prefers Darwin Core Archives. On every dataset detail page, you can see all registered services in the external data section of the GBIF registry block. Similarly, endpoints are also included in a dataset detail from our REST API.

In addition to DwC-A, GBIF supports crawling of the XML-based BioCASe, TAPIR and DiGIR protocols. The outcome of any crawling regardless of its protocol is a set of fragments each representing a single occurrence record in its raw form. There are many billions of fragments in GBIF’s internal fragment table. In the case of Darwin Core Archives, this is a JSON representation of an entire star record: a single core record with all the related extension records attached. In the case of the XML protocols, a fragment is the exact piece of XML that we have extracted. Each protocol and content schema (ABCD1.2, ABCD2.06, DwC1.0, DwC1.4, …) therefore still exposes its entire content and nature. For example, here are fragments of ABCD2.06 and Darwin Core.

Occurrence identifier

An important part of GBIF data processing is to assign a stable gbifID to each new record. This is a somewhat complex process that uses the occurrenceID, catalogNumber, collectionCode, and institutionCode in combination with the GBIF datasetkey to either mint a new identifier or reuse an existing one. When publishers do not supply an occurrenceID, GBIF processing will construct an identifier using the so-called triplet code (catalogNumber, collectionCode and institutionCode).

If a previously published dataset alters more than 50% of its existing occurrenceIDs, it will get flagged by our ingestion management system. Typically, a publisher will get an email from GBIF within a day or two asking for a file mapping the old occurrenceIDs to the new occurrenceIDs. A GBIF data blog post has been written on the topic of id stability here.

Occurrence interpretation

Once all records are available in the standard verbatim form, they go through a set of interpretations. If a term is interpreted by GBIF, an attempt is typically made to normalize the values supplied to a limited controlled vocabulary. If a term is not interpreted by GBIF, it will be accepted simply as free text. Other numeric values, such as dwc:decimalLatitude and dwc:decimalLongitude, are checked for validity.

Coordinate interpretation

When available, the coordinates of an occurrence record are given particular attention. If a dwc:geodeticDatum is given, GBIF will attempt to interpret the datum. If the publisher supplied datum is not WGS84, GBIF will attempt to reproject it into WGS84.

GBIF will attempt to parse and verify the following verbatim terms, in the specified order, to derive a valid WGS84 coordinate:

dwc:decimalLatitude and dwc:decimalLongitude
dwc:verbatimLatitude and dwc:verbatimLongitude
dwc:verbatimCoordinates

Suspicious records

GBIF uses four data quality flags to apply a special “suspicious label” to occurrence records that have been flagged with one or more of these:

Zero coordinate
Coordinate out of range
Country coordinate mismatch
Coordinate invalid

Various other processing decisions are made based on the value of the supplied coordinates. For example, by looking at the supplied country, GBIF can detect whether latitude and longitude values are likely to have been erroneously swapped or have negated values. GBIF will then swap these back and apply an appropriate data quality flag.

Geocoding

GBIF will use the interpreted coordinates to make various enrichments and processing decisions for other publisher supplied fields.

Country

The publisher supplied dwc:country is interpreted as a fixed enumeration matching the current ISO countries. When no country but coordinates are supplied, GBIF derives the country from the coordinates using our reverse geocoding API. If the coordinates disagrees with the supplied value, the record will get an appropriate data quality flag.

GADM (Database of Global Administrative Areas)

GADM code at levels 0, 1, 2 and 3 are attached to all occurrence records with coordinates. This is an additional field and does not change or use dwc:stateProvince.

Continent processing

GBIF dwc:continent processing is somewhat more complex than other fields. GBIF first attempts to parse the publisher supplied dwc:continent to a limited number of continent names. If the supplied dwc:continent cannot be mapped or is left empty, an attempt is made to derive the continent from the publisher supplied coordinates. If coordinates are not available, GBIF will finally attempt to use the dwc:country and dwc:countryCode fields to assign the continent using a mapping from country or area to continent, which can be found in this mapping file. Note that if the country or area in this final scenario spans more than one continent, then the field is left blank. Seas and oceans are not considered to be part of a continent in GBIF processing, and where the publisher provides a continent for a marine occurrence this is removed, and an issue flag added.

Normalization

Often publisher supplied data is not sufficiently normalized to be useful for GBIF end users. For this reason GBIF will in certain instances, attempt to normalize the data as much as possible. This is typically accomplished by using a mapping file from a publisher supplied value to a controlled vocabulary.

Vocabulary server

To improve the transparency and flexibility of the mappings between publisher supplied values and controlled vocabularies, GBIF created the vocabulary server.

Occurrence status processing

Occurrence status processing (whether an occurrence is ABSENT or PRESENT) is somewhat more complex than other fields. GBIF first attempts to parse the publisher supplied dwc:occurrenceStatus using a controlled vocabulary. If the supplied dwc:occurrenceStatus cannot be mapped or is left empty, GBIF assumes that record is PRESENT. If the publisher has supplied a valid dwc:individualCount, occurrence status will be inferred using the value of the count. If the individual count is zero, the record will be inferred to be ABSENT. Alternatively, if the dwc:individualCount is greater than zero, the record will be inferred to be PRESENT. However, dwc:individualCount will be ignored if the basis of record is equal to PRESERVED_SPECIMEN or FOSSIL_SPECIMEN, and GBIF will infer the occurrence status as PRESENT. Data quality flags are applied when appropriate.

Issues and flags

GBIF flags records with various issues detected during data processing to help publishers improve data quality and inform users of potential problems. These flags do not always indicate errors—some simply highlight changes made by GBIF, such as normalized values or inferred values.

Taxonomy interpretation

To facilitate searching and metric generation, all occurrence records are tied to a single global taxonomy, known as the GBIF backbone. GBIF builds this taxonomy multiple times per year, which is primarily based on the Catalogue of Life. Higher-level classification above the family level exclusively comes from the Catalogue of Life, while lower taxa can be added in an automated way from other taxonomic datasets available through the GBIF Checklist Bank.

Backbone matching

Every occurrence is assigned a taxonKey which points to the matching taxon in the GBIF backbone. This key is retrieved by querying our taxon match service, submitting the scientificName, taxonRank, genus, family and all other higher verbatim classification. If the scientificName is not present it will be assembled from the individual name parts if present: genus, specificEpithet and infraspecificEpithet. Having a higher classification qualifying the scientificName improves the accuracy of the taxonomic match in two ways, even if it is just the family or even kingdom:

In the case of homonyms or similar spelled names, the service has no way to verify the potential matches, so such names will often get higher taxon matches.
In case a given scientific name is not (yet) part of the GBIF backbone, GBIF can at least match the record to some higher taxon, such as the genus.
Fuzzy name matching, matching to higher taxon or matching to no taxon are issue flags we assign to records.

Type status

The type status of a specimen is interpreted from dwc:typeStatus using the TypeStatusParser according to our type status vocabulary.