Occurrence clustering

There might be duplicated records in the GBIF index. Some records may have the same date, scientific name, catalogue number and location but come from two different publishers or have slightly different attributes.

There are many valid reasons why these duplicates appear on GBIF. Sometimes an observation was recorded in two different systems, sometimes several records correspond to herbaria duplicates (you can check the work of Nicky Nicolson on the topic), sometimes a specimen was digitized twice, sometimes a record has been enriched with genetic information and republished via a different platform…

This is why we released an experimental data-clustering feature aiming to identify potentially related records on GBIF. The goal was not only to detect potential duplicates but to also find interesting relationships, such as between typification records or records originating from Natural History collections, DNA-derived sequences and citations of materials examined when publishing taxonomic treatments in journal articles.

Records that have been included in a cluster can be found with the “is in cluster” filter in the occurrence search. Each occurrence page belonging to a cluster will have a “CLUSTER” tab displaying the potentially related records (see a screenshot of this example below).

You can read this news item for more information and some exciting examples.

Clustering records

Step 1: Select candidates

Comparing nearly 2 billions records with each other is very resource intensive and quite impractical, so the first step of the data-clustering process is to select and group candidate records to compare.

The system first creates a series of "hashes" for each record based on specified fields. All records sharing a hash are candidates to compare against each other.

For example, one of the "hashes" used is based on the species key, rounded coordinates, year, month and day. This means that the records that share the same values for those fields will be grouped together in the candidate table for further inspection.

The fields used to identify and group the candidates are a subset of what is used later on for comparing them (see the table below). See the source code to check the details.

Step 2: Compare and assess

In this second phase, the system will compare all records in the candidate set to each other and generate assertions. The assertions are inspected and the algorithms decides if there is enough evidence in the assertions to create a relationship between them.

The table below summarizes how those assertions are made but for more details, check the source code.

Assertion Fields checked Condition checked

Assertion	Fields checked	Condition checked
Same specimen	`taxonKey`, `typeStatus`	same taxonKey between records and typeStatus is "Holotype" for both records
Typification relation	`scientificName`, `typeStatus`	same between records
Same accepted species	`speciesKey`	same between records
Same date	`eventDate` or `day`, `month`, `year`	same between records
Approximate date	`day`, `month`, `year`	dates one day apart
Different date	`eventDate`	differs between records
Non conflicting date	`eventDate`, `day`, `month`, `year`	no date on either record
Same recorder name	`recordedBy`	same between records
Same coordinates	`decimalLatitude`, `decimalLongitude`	same between records
Non conflicting Coordinates	`decimalLatitude`, `decimalLongitude`	no coordinate on one or both sides
Within 200 m	`decimalLatitude`, `decimalLongitude`	distance ≤ 0.200km
Within 2 km	`decimalLatitude`, `decimalLongitude`	distance ≤ 2.00km
Same country	`countryCode`	same between records
Non conflicting country	`countryCode`	country only on one record
Different country	`countryCode`	differs between records
Identifiers overlap	`occurrenceID`, `fieldNumber`, `recordNumber`, `otherCatalogueNumber`, grouped(`institutionCode`, `collectionCode`, `catalogueNumber`), grouped(`institutionCode`, `catalogueNumber`)	checks any overlap of identifiers between records
Other catalogue number overlap	`otherCatalogueNumber`, grouped(`institutionCode`, `collectionCode`, `catalogueNumber`) with any separator	checks if the other catalogue number correspond to the institution code, collection code and catalogue number from another record
Is from sequence repository	`datasetKey`	checks if one of the datasetKey corresponds to one of the sequence repository datasets: INSDC sequences, INSDC host organisms, INSDC environmental samples, iBOL (see keys `d8cd16ba-bb74-4420-821e-083f2bac17c2`, `393b8c26-e4e0-4dd0-a218-93fc074ebf4e`, `583d91fe-bbc0-4b4a-afe1-801f88263016` and `040c5662-da76-4782-a48e-cdea1892d14c`)
Are specimens	`basisOfRecord`	checks if the basis of record for both records are one of the following: `PRESERVED_SPECIMEN`, `LIVING_SPECIMEN`, `FOSSIL_SPECIMEN`, `MATERIAL_CITATION`

Same specimen

taxonKey, typeStatus

same taxonKey between records and typeStatus is "Holotype" for both records

Typification relation

scientificName, typeStatus

same between records

Same accepted species

speciesKey

same between records

Same date

eventDate or day, month, year

same between records

Approximate date

day, month, year

dates one day apart

Different date

eventDate

differs between records

Non conflicting date

eventDate, day, month, year

no date on either record

Same recorder name

recordedBy

same between records

Same coordinates

decimalLatitude, decimalLongitude

same between records

Non conflicting Coordinates

decimalLatitude, decimalLongitude

no coordinate on one or both sides

Within 200 m

decimalLatitude, decimalLongitude

distance ≤ 0.200km

Within 2 km

decimalLatitude, decimalLongitude

distance ≤ 2.00km

Same country

countryCode

same between records

Non conflicting country

countryCode

country only on one record

Different country

countryCode

differs between records

Identifiers overlap

occurrenceID, fieldNumber, recordNumber, otherCatalogueNumber, grouped(institutionCode, collectionCode, catalogueNumber), grouped(institutionCode, catalogueNumber)

checks any overlap of identifiers between records

Other catalogue number overlap

otherCatalogueNumber, grouped(institutionCode, collectionCode, catalogueNumber) with any separator

checks if the other catalogue number correspond to the institution code, collection code and catalogue number from another record

Is from sequence repository

datasetKey

checks if one of the datasetKey corresponds to one of the sequence repository datasets: INSDC sequences, INSDC host organisms, INSDC environmental samples, iBOL (see keys d8cd16ba-bb74-4420-821e-083f2bac17c2, 393b8c26-e4e0-4dd0-a218-93fc074ebf4e, 583d91fe-bbc0-4b4a-afe1-801f88263016 and 040c5662-da76-4782-a48e-cdea1892d14c)

Are specimens

basisOfRecord

checks if the basis of record for both records are one of the following: PRESERVED_SPECIMEN, LIVING_SPECIMEN, FOSSIL_SPECIMEN, MATERIAL_CITATION

The table below summarizes the combinations of assertions that are sufficient to link the records in a cluster. If a group of occurrences share the combinations of assertions for any given column, they will be clustered together.

Table 1. Combination of assertion conditions to create cluster
Assertion
Same specimen	x
Typification relation		x
Same accepted species				x	x	x	x	x	x	x	x	x	x
Same date				x	x			x		x
Non-conflicting date						x	x		x				x
Approximate date											x	x
Same coordinates				x		x					x
Non-conflicting coordinates										x			x
Within 200m					x		x
Within 2km								x	x			x
Identifiers overlap						x	x	x	x	x			x
Other catalogue number overlap			x
Same recorder name											x	x
Is from sequence repository													x
Are specimen			x

Any group of occurrence associated with the assertion Different date or Different country will not be clustered together.

Why are some occurrences not clustered?

It is possible that some occurrences check one of the combinations of assertions but aren’t shown as clustered yet. This could be the case for several reasons:

The occurrences are newly published. Right now, the clustering process is quite resource intensive and doesn’t run automatically. We need to trigger it manually. This means that it can take a few weeks before newly published occurrences get clustered.
The "duplicates" come from the same dataset. The clustering algorithm only compares occurrences across datasets, not within datasets.
There can be a delay between the moment the occurrences are clustered and the moment they become searchable with the "is in cluster" filter (this is due to some technical reasons a bit too long to explain in this post, but relate to updating the search indexes separately from the clustering table)

There could be other unforeseen reasons, and if in doubt, please contact us at helpdesk@gbif.org.

Improve linkage

If for one reason or another, you need to publish on GBIF occurrences for observations or specimens that you know are already on GBIF, how best to do it?

Make sure that you reuse the same identifiers as much as possible, including the formating. Same catalogue numbers, occurrenceID, etc.
Use the associatedOccurrences term and resource relationship extension. These are not used during the clustering today, but are expected to be in the future, and are the correct way to communicate relationships within Darwin Core.

How to contribute

If you have suggestions to improve the clustering feature or questions on how it works, you are very welcome to use the feedback feature on this page.