Data quality recommendations

The required terms in the tables are the minimal terms to include for GBIF to index the data. By following the data quality recommendations, data publishers can improve the quality, completeness and value of their datasets. The class or category of each Darwin Core term is indicated by the label that precedes the term name: Record, Taxon, Location, Occurrence or Event.

All tables included in a Darwin Core Archive must include a unique identifier column.
Data published with the support of one of the programmes managed by GBIF (e.g. BID, BIFA, CESP) have stricter data quality requirements than data published outside of these programmes.
It is the publisher’s responsibility to obscure sensitive species information. Please consult the Current Best Practices for Generalizing Sensitive Species Occurrence Data.

Data quality requirements for checklists

Checklist datasets provide a catalogue, rapid summary or baseline inventory of a set of named organisms, or taxa. While they may include additional details like local species names or specimen citations, checklists typically categorize information along taxonomic, geographic, and thematic lines or some combination of the three.

Table 1. Data quality requirements for checklists
Term | Status | Status GBIF managed programmes
Taxon taxonID | Required | Required
Taxon scientificName | Strongly recommended | Required
Taxon taxonRank | Strongly recommended | Required
Taxon kingdom | Strongly recommended | Strongly recommended
Taxon parentNameUsageID | Strongly recommended | Strongly recommended
Taxon acceptedNameUsageID | Strongly recommended | Strongly recommended
Taxon vernacularName | Share if available | Share if available
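The status levels in Table 1 can be turned into a quick pre-publication check. The following is a minimal sketch, not a GBIF tool: it assumes the checklist is a delimited file whose column headers are the bare Darwin Core term names, and the function name missing_terms is hypothetical.

```python
# Status levels for checklist terms, copied from the general "Status"
# column of Table 1. Stricter programme requirements are not modelled here.
CHECKLIST_TERMS = {
    "taxonID": "Required",
    "scientificName": "Strongly recommended",
    "taxonRank": "Strongly recommended",
    "kingdom": "Strongly recommended",
    "parentNameUsageID": "Strongly recommended",
    "acceptedNameUsageID": "Strongly recommended",
    "vernacularName": "Share if available",
}


def missing_terms(header):
    """Group the Table 1 terms that are absent from a header row by status."""
    present = set(header)
    missing = {}
    for term, status in CHECKLIST_TERMS.items():
        if term not in present:
            missing.setdefault(status, []).append(term)
    return missing


# Hypothetical header row of a checklist file.
report = missing_terms(["taxonID", "scientificName", "kingdom"])
for status, terms in report.items():
    print(status, "->", ", ".join(terms))
```

The same pattern applies to the occurrence and sampling-event tables by swapping in their term lists.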

Data quality requirements for occurrences

Occurrence datasets are the core of data published through GBIF, offering evidence of the occurrence of a species (or other taxon) at a particular place on a specified date.

Table 2. Data quality requirements for occurrences
Term | Status | Status funded GBIF programmes
Occurrence occurrenceID | Required | Required
Record basisOfRecord | Strongly recommended | Required
Taxon scientificName | Strongly recommended | Required
Event eventDate | Strongly recommended | Required
Location countryCode | Strongly recommended | Strongly recommended
Taxon taxonRank | Strongly recommended | Strongly recommended
Taxon kingdom | Strongly recommended | Strongly recommended
Location decimalLatitude | Strongly recommended | Strongly recommended
Location decimalLongitude | Strongly recommended | Strongly recommended
Location geodeticDatum | Strongly recommended | Strongly recommended
Location coordinateUncertaintyInMeters | Strongly recommended | Strongly recommended
Occurrence individualCount | Strongly recommended | Strongly recommended
Occurrence organismQuantity | Strongly recommended | Strongly recommended
Occurrence organismQuantityType | Strongly recommended | Strongly recommended
Record informationWithheld | Share if available | Share if available
Record dataGeneralizations | Share if available | Share if available
Event eventTime | Share if available | Share if available
Location country | Share if available | Share if available

Data quality requirements for sampling events

Sampling-event datasets provide greater detail about a species occurring at a given location and date, including the methods, events and relative abundance of species recorded in a sample. By improving comparisons of data collected using the same protocols at different times and places, these datasets can enable researchers to infer the absence of particular species from particular sites.

Table 3. Data quality requirements for sampling events
Term | Status | Status funded GBIF programmes
Record type | Required | Required
Event eventID | Strongly recommended | Required
Event eventDate | Strongly recommended | Required
Event samplingProtocol | Strongly recommended | Required
Event sampleSizeValue | Strongly recommended | Required
Event sampleSizeUnit | Strongly recommended | Required
Location countryCode | Strongly recommended | Strongly recommended
Event parentEventID | Strongly recommended | Strongly recommended
Event samplingEffort | Strongly recommended | Strongly recommended
Location locationID | Strongly recommended | Strongly recommended
Location decimalLatitude | Strongly recommended | Strongly recommended
Location decimalLongitude | Strongly recommended | Strongly recommended
Location geodeticDatum | Strongly recommended | Strongly recommended
Location coordinateUncertaintyInMeters | Strongly recommended | Strongly recommended
Location footprintWKT | Strongly recommended | Strongly recommended
Occurrence occurrenceStatus | Strongly recommended | Strongly recommended

Terms

Record basisOfRecord

dwc:basisOfRecord

The type of the individual record. Choose one of the available options in dwc:basisOfRecord.


Record informationWithheld

Record dataGeneralizations

Record type

dc:type

The nature or genre of the resource.


Taxon taxonID

dwc:taxonID

A unique identifier for the taxon, allowing the same taxon to be recognized across dataset versions as well as through data downloads and use. Ideally, the taxonID is a persistent globally unique identifier. As a minimum requirement, it has to be unique within the published dataset. It allows the same set of taxon information to be recognized over time when the dataset indexing is refreshed; it links additional data like images or occurrence records; and it makes it possible to cite records, e.g. in usage reports or in publications. This means that the taxonID needs to reliably stay with the taxon information at source and to consistently refer to the same set of taxon information in published datasets and any underlying source data.


Taxon scientificName

dwc:scientificName

The full scientific name, including authorship and year of the name where applicable. In the context of a checklist, the scientific name is the core data element of a taxon list or hierarchy that the dataset sets out to collate and publish.

Depending on the purpose of the checklist, scientific names may be of any hierarchical level, though typically would be of species rank or below for, e.g., regional floristic or faunistic checklists, Red List collations, or thematic inventories like marine organisms or taxonomic revisions of species groups. If the checklist is intended to publish a hierarchy (tree-like structure), add separate entries for the relevant upper taxonomic ranks, e.g. kingdom, class and family, and link them into a hierarchical structure using the parentNameUsageID (see below) to support unambiguous interpretation of the checklist entries.

Valid scientific names are Latin names following the syntax rules of the respective taxon group (e.g. botanical nomenclature). Not permitted are, for example, working names (Mallomonas sp.4), common names (fruit fly), or names containing identification qualifiers (Anemone cf. nemorosa). If common names are used, they should be supplied in addition to the scientific names, using the Taxon vernacularName set of fields.


Taxon taxonRank

dwc:taxonRank

The taxonomic rank of the supplied scientific name. The taxon rank supports the interpretation of the scientific name during indexing and supports matching the checklist records to the core taxonomy, especially in the case of names at the genus level or above (monomials). While the format of higher taxon names in some groups contains indicators of their rank, this is not consistent across or even within groups, and cannot be reliably used for interpretation. For placing names correctly, explicitly specifying the taxon rank, alongside information on the higher taxonomy, is an important criterion. For practical purposes, the ranks used have to be (major) Linnean ranks: kingdom, phylum, class, order, family, genus, and species. Both Latin and English terms are accepted.


Taxon kingdom

dwc:kingdom

The full scientific name specifying the kingdom that the scientific name is classified under and other higher taxonomy, if possible.

With scientific names, there are numerous cases where the matching of a given name against the core taxonomy is uncertain or ambiguous. This is the case, for example, with homonyms (identical names exist for different organisms, usually across groups), newly described names that are not yet part of the existing taxonomic tree, or spelling variants (typos, hyphenation etc.). To support exact matching of a scientific name against the core taxonomy, additional names at higher ranks help interpretation and error prevention. For datasets where the hierarchical representation in the published data is not important, higher-level names can be supplied as part of the record itself by adding the relevant Darwin Core fields, similar to occurrence datasets.

Names should be scientific (Latin) names at major Linnean ranks, like Animalia (kingdom) or Rosaceae (family). Not: common names (animals), abbreviations (Rosac.), intermediate rank levels (Tetrapoda (superclass)), or polyphyletic or non-taxonomic groupings (algae, herbivore).


Taxon parentNameUsageID

dwc:parentNameUsageID

The taxonID of the next available higher-ranked (parent) entry within the checklist dataset, if higher taxon names are supplied as separate entries in the list. This supports the representation of the dataset as a hierarchy, e.g. for the publication of a taxonomy.
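How parentNameUsageID links entries into a hierarchy can be illustrated with a short sketch. The rows, their identifiers and the function name lineage are hypothetical; the sketch assumes every parentNameUsageID refers to a taxonID within the same checklist.

```python
# Hypothetical checklist rows:
# taxonID -> (scientificName, taxonRank, parentNameUsageID)
rows = {
    "t1": ("Plantae", "kingdom", None),
    "t2": ("Rosaceae", "family", "t1"),
    "t3": ("Rosa canina L.", "species", "t2"),
}


def lineage(taxon_id, rows):
    """Follow parentNameUsageID links from a taxon up to the root entry."""
    path = []
    seen = set()
    while taxon_id is not None:
        if taxon_id in seen:
            raise ValueError(f"cycle detected at {taxon_id}")
        if taxon_id not in rows:
            raise KeyError(f"parentNameUsageID {taxon_id} has no entry")
        seen.add(taxon_id)
        name, rank, parent = rows[taxon_id]
        path.append((rank, name))
        taxon_id = parent
    return path


for rank, name in lineage("t3", rows):
    print(rank, name)
```

A parent that is missing from the list or a circular reference raises an error, which are the two ways such a hierarchy typically breaks.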


Taxon acceptedNameUsageID

dwc:acceptedNameUsageID

Within the record of a synonym, the taxonID of the accepted taxon name entry within the checklist dataset, if both synonyms and accepted names are supplied. This supports the representation of synonymy for a taxonomic dataset.
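Since acceptedNameUsageID must point to a taxonID in the same checklist, a simple consistency check is possible before publishing. This is a sketch under that assumption; the records use placeholder names, and the function name dangling_accepted_ids is hypothetical.

```python
def dangling_accepted_ids(records):
    """Return the taxonIDs of synonyms whose accepted name is not in the list."""
    known = {record["taxonID"] for record in records}
    return [
        record["taxonID"]
        for record in records
        if record.get("acceptedNameUsageID")
        and record["acceptedNameUsageID"] not in known
    ]


# Placeholder records: s1 is a synonym of a1; s2 points to a missing entry.
records = [
    {"taxonID": "a1", "scientificName": "Aus bus"},
    {"taxonID": "s1", "scientificName": "Aus cus", "acceptedNameUsageID": "a1"},
    {"taxonID": "s2", "scientificName": "Aus dus", "acceptedNameUsageID": "a9"},
]
print(dangling_accepted_ids(records))
```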


Taxon vernacularName

dwc:vernacularName

When supplied, also add at least the language of the name, using ISO 639-1 language codes.


Location countryCode

dwc:countryCode

A two-letter standard abbreviation for the country, territory or island of the occurrence locality. Information on the collection or observation locality (geographic reference) is essential for any record. The country code is the proposed minimum standard to supply this information. The format for this field follows the ISO 3166-1-alpha-2 standard for country codes. These are two-letter codes for each country, territory or island; lists can be found online. Publishers who also wish to supply the country name may add the appropriate element. In most cases, occurrences can be linked to a specific country, territory or island. In cases where it is not possible to supply a country code (e.g. marine data outside of coastal zones), geographical coordinates should be supplied instead.
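A shape check catches the most common mistake, supplying a country name or a lower-case code where the two-letter code is expected. The sketch below only tests the format (two upper-case letters); it does not confirm that the code is actually assigned in ISO 3166-1, which would require a code list. The function name is hypothetical.

```python
import re

# Two upper-case ASCII letters, the shape of an ISO 3166-1 alpha-2 code.
COUNTRY_CODE_SHAPE = re.compile(r"[A-Z]{2}")


def looks_like_country_code(value):
    """True if the value has the shape of an alpha-2 code (not a membership test)."""
    return bool(COUNTRY_CODE_SHAPE.fullmatch(value))


for value in ["DK", "dk", "Denmark", ""]:
    print(repr(value), looks_like_country_code(value))
```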


Location decimalLatitude

dwc:decimalLatitude

The geographic latitude in decimal degrees. Where coordinate values are available, Location decimalLongitude should also be filled in. Valid values lie between -90 and 90 inclusive (latitude = 0 is the equator). Decimal coordinate values provide a geolocation of the occurrence that is much more informative than the country name alone, and that is stable over time (unlike the borders of countries). Many data use cases require coordinates if the data are to be of value or usable at all, for example, species distribution modelling or population studies in specific areas.

Several issues concerning coordinates are encountered frequently. While the indexing process makes efforts to identify such cases and propose corrections, e.g. by plausibility-testing coordinates against country names, attention is needed already at the level of data preparation and publication. Such issues include transformation errors (resulting from e.g. conversion of degrees-minutes-seconds into decimal values), accidental swapping of values, either in the dataset or during the mapping process (latitude and longitude are reversed), or negation of values (transposition of locations from north to south, east to west or vice versa through the accidental or systematic loss or addition of minus-values). Additional points to keep in mind during data preparation are technical defaults (e.g. database settings substituting 0-values for unknown values, resulting in records supplying lat/long as 0/0), over-precision of data by automatic number-padding (lat -17.79200000 where lat -17.792 would be appropriate), or the need to blur coordinate precision, e.g. for the protection of sensitive species. Also note that gridded data, i.e. where coordinates represent centroids of grid cells in a field survey rather than the actual occurrence locality, may be better represented by publishing the dataset as event data rather than as occurrence records. Especially in such cases, it is essential also to supply the Location coordinateUncertaintyInMeters.


Location decimalLongitude

dwc:decimalLongitude

The geographic longitude in decimal degrees. Where coordinate values are available, Location decimalLatitude should also be filled in. Valid values lie between -180 and 180 inclusive (longitude = 0 is the Greenwich Meridian). Decimal coordinate values provide a geolocation of the occurrence that is much more informative than the country name alone, and that is stable over time (unlike the borders of countries). Many data use cases require coordinates if the data are to be of value or usable at all, for example, species distribution modelling or population studies in specific areas.

Several issues concerning coordinates are encountered frequently. While the indexing process makes efforts to identify such cases and propose corrections, e.g. by plausibility-testing coordinates against country names, attention is needed already at the level of data preparation and publication. Such issues include transformation errors (resulting from e.g. conversion of degrees-minutes-seconds into decimal values), accidental swapping of values, either in the dataset or during the mapping process (latitude and longitude are reversed), or negation of values (transposition of locations from north to south, east to west or vice versa through the accidental or systematic loss or addition of minus-values). Additional points to keep in mind during data preparation are technical defaults (e.g. database settings substituting 0-values for unknown values, resulting in records supplying lat/long as 0/0), over-precision of data by automatic number-padding (lat -17.79200000 where lat -17.792 would be appropriate), or the need to blur coordinate precision, e.g. for the protection of sensitive species. Also note that gridded data, i.e. where coordinates represent centroids of grid cells in a field survey rather than the actual occurrence locality, may be better represented by publishing the dataset as event data rather than as occurrence records. Especially in such cases, it is essential also to supply the Location coordinateUncertaintyInMeters.
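Several of the coordinate issues listed for latitude and longitude can be screened for during data preparation. The following is a rough sketch, not the checks GBIF runs during indexing: it assumes coordinates arrive as text (so that number padding is still visible), and the function name and messages are hypothetical.

```python
def coordinate_issues(lat_text, lon_text):
    """Return a list of possible problems with a latitude/longitude pair."""
    issues = []
    lat, lon = float(lat_text), float(lon_text)

    # 0/0 is a valid point, but often a database default for "unknown".
    if lat == 0 and lon == 0:
        issues.append("0/0 may be a database default for unknown")

    if not -90 <= lat <= 90:
        # A latitude that only fits the longitude range hints at swapped values.
        if -90 <= lon <= 90 and -180 <= lat <= 180:
            issues.append("latitude out of range; values may be swapped")
        else:
            issues.append("latitude out of range")
    if not -180 <= lon <= 180:
        issues.append("longitude out of range")

    # Heuristic only: a trailing zero after the decimal point suggests padding.
    for name, text in (("latitude", lat_text), ("longitude", lon_text)):
        if "." in text and text.endswith("0"):
            issues.append(f"{name} has trailing zeros (possible number padding)")
    return issues


print(coordinate_issues("-17.79200000", "31.05"))
print(coordinate_issues("120.5", "45.1"))
```

Negated values (a lost minus sign) cannot be detected from the numbers alone; they need a comparison against the country code or another independent reference.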


Location geodeticDatum

dwc:geodeticDatum

The coordinate system and set of reference points upon which the geographic coordinates are based. Different geodetic systems exist, and the exact locality of a point depends on which reference system the coordinates refer to. This is why the system should always be explicitly named when known: depending on the geographic region, the datum shift between two systems can vary from zero to hundreds of meters for a given point. When no value is supplied, GBIF’s indexing process assumes the reference system to be WGS 84 (World Geodetic System 1984, a global approximation at sea level and, for example, the basis of GPS data); but the more frequently the geodetic datum can be supplied explicitly by data publishers, the more reliable the geographic representation of occurrences will become, e.g. through datum conversion. It is likewise important to explicitly document the lack of knowledge of the system used, as this increases confidence in data interpretation. Examples: WGS84; EPSG:4326; unknown.


Location coordinateUncertaintyInMeters

dwc:coordinateUncertaintyInMeters

The horizontal distance from the given Location decimalLatitude and Location decimalLongitude in meters, describing the smallest circle containing the whole of the Location. This is an indicator of the accuracy of the coordinate location, described as the radius of a circle around the stated point location. It allows estimating the potential distance of the real occurrence location from the recorded values and largely depends on the methodology used in coordinate determination. Thus, the value may be specific to or estimated from the methodology or device used for geolocating, e.g. 30 (reasonable lower limit of a GPS reading under good conditions if the actual precision was not recorded at the time). Note that 0 (zero) is not a valid value for this measure. If the value is unknown or not applicable, the value should be empty (null). If for some reason the coordinateUncertaintyInMeters was artificially increased, for example by rounding the coordinate values, the fields Record informationWithheld or Record dataGeneralizations must be filled in addition. Examples: 30; 71; [empty]. Not: 0.
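The rule that an unknown uncertainty stays empty while 0 is never valid can be expressed as a small cleaning step. This is a sketch only; the function name clean_uncertainty is hypothetical, and it additionally assumes that negative values are invalid because the value is a radius.

```python
def clean_uncertainty(value):
    """Return the uncertainty as a positive number, or None when it is empty."""
    if value is None or str(value).strip() == "":
        return None  # unknown or not applicable: leave the field empty
    number = float(value)
    if number <= 0:
        raise ValueError(
            "coordinateUncertaintyInMeters must be greater than 0; "
            "leave it empty if unknown"
        )
    return number


print(clean_uncertainty("30"))
print(clean_uncertainty(""))
```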


Location country

Location locationID

dwc:locationID

An internal or external reference that links to a set of data describing the sample event location, if available. Example: http://www.geonames.org/10793757/dnb-6.html. Note: if such a reference cannot be meaningfully supplied, consider supplying more location detail, e.g. through use of the data elements locality, minimumElevationInMeters, minimumDepthInMeters, stateProvince, locationRemarks.


Location footprintWKT

dwc:footprintWKT

An alternative area description, specifying the location of the sample event in Well-known text (WKT) markup language. A WKT representation of the shape (footprint, geometry) that defines the location. This differs from the point-radius representation that is combined from the elements Location decimalLatitude, Location decimalLongitude and Location coordinateUncertaintyInMeters in that it can define shapes that are not circles. Example: a one-degree bounding box with opposite corners at (longitude=10, latitude=20) and (longitude=11, latitude=21) would be expressed in well-known text as POLYGON ((10 20, 11 20, 11 21, 10 21, 10 20)). Note that it is possible to supply both a point-radius and a footprintWKT location for the same sample event.
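A bounding box like the one in the example can be written out as WKT with a few lines of code. This is a minimal sketch using plain string formatting; the function name bbox_wkt is hypothetical. WKT lists each point as longitude then latitude, and the ring closes by repeating the first point.

```python
def bbox_wkt(min_lon, min_lat, max_lon, max_lat):
    """Build a WKT polygon for a bounding box, as 'longitude latitude' pairs."""
    corners = [
        (min_lon, min_lat),
        (max_lon, min_lat),
        (max_lon, max_lat),
        (min_lon, max_lat),
        (min_lon, min_lat),  # repeat the first corner to close the ring
    ]
    ring = ", ".join(f"{lon} {lat}" for lon, lat in corners)
    return f"POLYGON (({ring}))"


print(bbox_wkt(10, 20, 11, 21))
```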


Occurrence occurrenceID

dwc:occurrenceID

A unique identifier for the occurrence, allowing the same occurrence to be recognized across dataset versions as well as through data downloads. As a minimum requirement, it has to be unique within the published dataset, but can also be a globally unique identifier. It allows users to recognize the same occurrence over time when the dataset indexing is refreshed. OccurrenceIDs also link additional data like images to the record, and it makes it possible to cite records. This means that the occurrenceID needs to reliably stay with the occurrence at source, and to consistently refer to the same occurrence in published datasets and any underlying source data.

The occurrenceID in a dataset helps GBIF identify whether an occurrence record is new. If it is new, GBIF assigns it a new unique gbifID. Some publishers include information, such as the collection or institution code, within the occurrenceID. However, if the collection or institution changes, the occurrenceID must also change, even though the actual occurrence record remains the same. This practice can lead to unnecessary instability in occurrenceIDs and gbifIDs. Where possible, we now encourage publishers to use more stable occurrenceIDs that do not encode information about the occurrence or specimen, for example a simple large integer or a UUID.
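One way to obtain identifiers that carry no information about the record is to mint a random UUID once and store it with the record at source. The sketch below uses Python's standard uuid module; the function name ensure_occurrence_ids and the record layout are hypothetical.

```python
import uuid


def ensure_occurrence_ids(records):
    """Mint a UUID for records that lack an occurrenceID; keep existing ones."""
    for record in records:
        if not record.get("occurrenceID"):
            record["occurrenceID"] = str(uuid.uuid4())
    return records


records = ensure_occurrence_ids([
    {"occurrenceID": "keep-me", "scientificName": "Rosa canina L."},
    {"scientificName": "Rosa canina L."},
])
for record in records:
    print(record["occurrenceID"])
```

The identifier is only stable if it is written back to the source database. Regenerating UUIDs on every export would change every occurrenceID with each publication.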

An important part of GBIF data processing is to assign a stable gbifID to each new record. This is a somewhat complex process that uses the occurrenceID, catalogNumber, collectionCode, and institutionCode in combination with the GBIF datasetKey to either mint a new identifier or reuse an existing one. When publishers do not supply an occurrenceID, GBIF processing will construct an identifier using the so-called triplet code (catalogNumber, collectionCode and institutionCode).

If a previously published dataset alters more than 50% of its existing occurrenceIDs, it will get flagged by our ingestion management system. Typically, a publisher will get an e-mail from GBIF within a day or two asking for a file mapping the old occurrenceIDs to the new occurrenceIDs. A GBIF data blog post covers the topic of ID stability.
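Publishers can estimate before republishing whether an update is likely to cross this threshold. The sketch below compares the identifiers of the previous and the new version; how GBIF's ingestion system computes the share is not described here, so treating a changed ID as one that is present in the old version and absent from the new one is an assumption, and the function name is hypothetical.

```python
def share_of_ids_lost(previous_ids, current_ids):
    """Fraction of previously published occurrenceIDs missing from the new version."""
    previous = set(previous_ids)
    if not previous:
        return 0.0
    return len(previous - set(current_ids)) / len(previous)


old_version = ["a", "b", "c", "d"]
new_version = ["a", "x", "y", "z"]
share = share_of_ids_lost(old_version, new_version)
print(f"{share:.0%} of existing occurrenceIDs would change")
```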


Occurrence individualCount

dwc:individualCount

Use the individualCount field to capture the number of individuals for the species associated with the occurrence.


Occurrence organismQuantity

dwc:organismQuantity

Records the quantity of a species occurrence. Use together with Occurrence organismQuantityType to specify the quantity, e.g. organismQuantity: 5 / organismQuantityType: individuals, or organismQuantity: r / organismQuantityType: BraunBlanquetScale.


Occurrence organismQuantityType

dwc:organismQuantityType

Records the quantity type of a species occurrence. Use together with Occurrence organismQuantity to specify the type of measurement, e.g. organismQuantity: 5 / organismQuantityType: individuals, or organismQuantity: r / organismQuantityType: BraunBlanquetScale.
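Because the two fields are only meaningful together, records that fill one without the other can be flagged during data preparation. This is a minimal sketch; the function names and the record layout are hypothetical.

```python
def is_blank(value):
    """True for None or a value that is empty once whitespace is stripped."""
    return value is None or str(value).strip() == ""


def quantity_pair_issue(record):
    """Return a message if only one of the two quantity fields is filled."""
    has_quantity = not is_blank(record.get("organismQuantity"))
    has_type = not is_blank(record.get("organismQuantityType"))
    if has_quantity and not has_type:
        return "organismQuantity given without organismQuantityType"
    if has_type and not has_quantity:
        return "organismQuantityType given without organismQuantity"
    return None


print(quantity_pair_issue({"organismQuantity": "r", "organismQuantityType": "BraunBlanquetScale"}))
print(quantity_pair_issue({"organismQuantity": "5"}))
```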


Occurrence occurrenceStatus

dwc:occurrenceStatus

Note: this applies to associated occurrence data, not to the sample event itself. A qualifier for individual occurrence records, marking a taxon as either present or absent at a location during the sampling event. Since sample datasets document the sampling effort exerted during the event, it can often be valuable to not only document taxa as being present (observed, collected) at the location at the time, but also to record negative occurrences (absences) for taxa that could be reasonably expected, but were not encountered in the event. An example is a floristic survey that estimates the abundance or coverage of plants in a certain area, working from a list of species that were encountered on earlier surveys of that same region. Recommendation: use the standard values of either present or absent to mark individual occurrence records.


Event eventDate

dwc:eventDate

Dates and times published in Darwin Core should use the ISO 8601-1:2019 standard. Please see the following documentation for more details.
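Source data often store dates in a local format such as day/month/year. A conversion to the ISO 8601 calendar-date form (YYYY-MM-DD) can be sketched with Python's standard datetime module; the function name to_event_date is hypothetical, and the source format has to be known in advance because a value like 07/03/2021 is ambiguous on its own.

```python
from datetime import datetime


def to_event_date(text, source_format):
    """Parse a date in a known source format and return it as YYYY-MM-DD."""
    return datetime.strptime(text, source_format).date().isoformat()


# Day-first source value, converted to ISO 8601.
print(to_event_date("07/03/2021", "%d/%m/%Y"))
```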


Event eventTime

Event eventID

dwc:eventID

A unique identifier for the sampling event, allowing individual occurrences to be linked to a specific event, and events to be cross-referenced to document e.g. time series (resampling) or synchronized sampling across a wider area.

The eventID can be a persistent globally unique identifier, or an identifier specific to the dataset. Its main function is to allow linking to related data (occurrences, other sampling events, site images etc.). While dataset-specific eventIDs are sufficient to refer to occurrence records published within the same dataset, it is worth considering that very simple IDs like numbers could easily reoccur in other, unrelated datasets, and make external linkages ambiguous. In addition, the eventID needs to reliably stay with the sampling event information at source and consistently refer to the same event, or else any data links will be broken.


Event samplingProtocol

dwc:samplingProtocol

The name of, reference to, or description of the method or protocol used during a sample event. Sample events typically use specific methods or follow certain protocols that standardize the sampling effort to a certain degree. Knowledge about the sampling protocol gives users additional information that is helpful for the interpretation of the attached occurrence records, e.g. what kind of organisms to expect or not expect within the dataset and whether the absence of a recording signifies absence in nature, or was outside the target group of the applied sampling methodology (e.g. UV light trap). If a more detailed description of the method or protocol exists, providing a reference is strongly encouraged (e.g. Penguins from space: faecal stains reveal the location of emperor penguin colonies). While there is no controlled vocabulary for this element, the goal is to, across datasets, gradually assemble a library of references for reuse, and to allow users to identify datasets that are based on comparable methods and protocols.


Event sampleSizeValue

dwc:sampleSizeValue

Note: Event sampleSizeUnit should always be shared with the corresponding sampleSizeValue.

A numeric value and the corresponding unit for the value, specifying the size of an individual sample in the sampling event. The two sampleSize fields always go together, and specify the size of an individual sample within a sample event. The sample size can relate to time duration, a spatial length (e.g. of a trawl), an area or a volume. A vegetation plot, for example, may have a sampleSizeValue of 2 with a sampleSizeUnit of square kilometer. Recommended best practice is to use a controlled vocabulary for the Event sampleSizeUnit.


Event sampleSizeUnit

dwc:sampleSizeUnit

Note: Event sampleSizeValue should always be shared with the corresponding sampleSizeUnit.

A numeric value and the corresponding unit for the value, specifying the size of an individual sample in the sampling event. The two sampleSize fields always go together, and specify the size of an individual sample within a sample event. The sample size can relate to time duration, a spatial length (e.g. of a trawl), an area or a volume. A vegetation plot, for example, may have a sampleSizeValue of 2 with a sampleSizeUnit of square kilometer. Recommended best practice is to use a controlled vocabulary for the Event sampleSizeUnit.
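The two rules for the sampleSize fields, a numeric value and a value that never appears without its unit, can be checked together. This is a sketch only; the function name sample_size_issues and the record layout are hypothetical, and no controlled vocabulary for the unit is enforced.

```python
def sample_size_issues(record):
    """Check that sampleSizeValue is numeric and paired with sampleSizeUnit."""
    value = str(record.get("sampleSizeValue") or "").strip()
    unit = str(record.get("sampleSizeUnit") or "").strip()
    issues = []
    if bool(value) != bool(unit):
        issues.append("sampleSizeValue and sampleSizeUnit must be supplied together")
    if value:
        try:
            float(value)
        except ValueError:
            issues.append("sampleSizeValue must be numeric")
    return issues


print(sample_size_issues({"sampleSizeValue": "2", "sampleSizeUnit": "square kilometer"}))
print(sample_size_issues({"sampleSizeValue": "two", "sampleSizeUnit": "square kilometer"}))
```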


Event parentEventID

dwc:parentEventID

A cross-reference to the eventID of a broader event, e.g. a long-term monitoring project that the specific event is a part of or a general vegetation survey of a larger area that is comprised of a number of sub-plots. To be able to reference a parent event, this event needs to be specified as a separate entry, typically within the same dataset, carrying its own eventID. Refer to the eventID of the parent event in the sample event record to specify the relationship between the two entries.
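A quick check can confirm that every parentEventID has a matching event entry. The sketch below assumes parent events are published in the same dataset, which the text describes as the typical case; a parent held in another dataset would be reported here as missing. The identifiers and the function name are hypothetical.

```python
def missing_parent_events(events):
    """Return parentEventIDs that do not match any eventID in the dataset."""
    known = {event["eventID"] for event in events}
    return sorted({
        event["parentEventID"]
        for event in events
        if event.get("parentEventID") and event["parentEventID"] not in known
    })


events = [
    {"eventID": "site-1"},
    {"eventID": "plot-1a", "parentEventID": "site-1"},
    {"eventID": "plot-2a", "parentEventID": "site-2"},
]
print(missing_parent_events(events))
```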


Event samplingEffort

dwc:samplingEffort

The measure for the amount of effort that was expended during a sampling event. The amount of effort expended during a sampling event often influences the result. It includes factors like the number of observers involved, the total time spent collecting, the number of traps exposed over a certain amount of time, the total distance covered, and the mode of transport used while surveying a plot. Examples of sampling effort are 40 trap-nights, 10 observer-hours. While there is no controlled vocabulary, the recommendation is to keep this information brief and factual, giving users enough information to compare between sampling events.


Status

Required information

The terms constitute the minimum formal requirements for publishing an occurrence dataset. GBIF will not accept a dataset without these terms and will not index the records. While these items are mandatory for publishing the dataset, they are only the starting point. The usefulness of the published data will still be severely limited unless additional information is supplied.

Strongly recommended

In addition to the mandatory terms, GBIF strongly recommends completing several more fields that help improve the usefulness of the dataset:

  • some information supports the integration into global data resources and prevents ambiguity, e.g. in matching scientific names that could apply to more than one organism (homonyms) to the correct place within the backbone taxonomy

  • more precise geo-location data (coordinates) significantly increase the usefulness of the data for a wide range of use cases

  • additional qualifiers for some data elements, e.g. coordinates, support the interpretation of those elements and help users to better estimate their usefulness for a given data use case

  • some data redundancy supports quality control and error detection (e.g. testing country codes against coordinates where both are supplied)

  • last but not least, the richer the spectrum of available information of a dataset is, the more potential usage areas it becomes available for, meaning the dataset will be more widely accessible and used, and cited more often

Share if available

If additional data are available, consider sharing them to increase the usefulness of your published data.