Species occurrence cubes

Species occurrence cubes are views of GBIF occurrence data, filtered and summarized by spatial, temporal and/or taxonomic dimensions, and providing measures for each dimension.

They can be produced using the GBIF website, the GBIF API, or by using the GBIF occurrence data deposited on external cloud services. This page covers generating a cube using GBIF’s website or API. See Generate a species occurrence cube using Microsoft Azure Databricks to generate a cube using a public cloud service.

Demonstration cube specification

The guide below will produce a data cube for occurrences of animals from Poland, recorded during or after 2000.

The dimensions of the cube will be

  1. Species

    • We must therefore exclude occurrences not identified to species level

    • We will also exclude occurrences with identifications noted to be uncertain

  2. Year-Month

    • We must exclude occurrences with only year accuracy, or with an imprecise date spanning multiple months

  3. EEA Reference Grid (PDF) with a 1km grid size

    • We will randomize the point using the coordinateUncertaintyInMeters, defaulting to 1000m

    • We will exclude occurrences without coordinates

    • We will exclude occurrences with a coordinateUncertaintyInMeters greater than 1000m

  4. Sex

    • This is included to demonstrate a basic dimension

  5. Life stage

    • This is included to demonstrate a more complex field (a vocabulary field) used for a dimension

The measures of the cube will be

  1. Occurrence count

  2. Minimum coordinate uncertainty

  3. Minimum temporal uncertainty

To account for sampling bias, we will also count occurrences at family level.

Create an initial GBIF search filter

Perhaps using the GBIF.org website, find an approximate search filter, for example for a particular taxon, country or region, year range, etc. This will need to be expressed in SQL for submission to the API:

WHERE occurrenceStatus = 'PRESENT'
  AND countryCode = 'PL' (1)
  AND "year" >= 2000 (2)
  AND kingdomKey = 1 (3)
  AND hasCoordinate = TRUE (4)
1 Poland
2 YEAR(…) (or year(…)) is an SQL function, so we must quote the name of the column as "year" using double quotes.
3 Animalia
4 Include only occurrences with coordinates

This corresponds to this query on the website.

Exclude unwanted data

The GBIF website and basic search APIs do not allow excluding data from searches, but this is often required for data cubes. We add additional expressions to our WHERE clause:

  AND (coordinateUncertaintyInMeters <= 1000 OR coordinateUncertaintyInMeters IS NULL) (1)
  AND speciesKey IS NOT NULL (2)
  AND NOT ARRAY_CONTAINS(issue, 'ZERO_COORDINATE') (3)
  AND NOT ARRAY_CONTAINS(issue, 'COORDINATE_OUT_OF_RANGE')
  AND NOT ARRAY_CONTAINS(issue, 'COORDINATE_INVALID')
  AND NOT ARRAY_CONTAINS(issue, 'COUNTRY_COORDINATE_MISMATCH')
  AND "month" IS NOT NULL (4)
1 Exclude occurrences with a large coordinate uncertainty, but keep those without an uncertainty
2 Exclude occurrences not identified to species level
3 Exclude occurrences with these georeferencing issues
4 Exclude occurrences without a month, or spanning multiple months
If requesting a data cube using the website, additional WHERE clauses like this can be added by choosing "Edit as SQL" after choosing the basic cube download parameters.

Prepare dimension expressions

The dimensions are the species, year-month, 1km EEA grid, sex and life stage.

For species we could choose either the key (numeric identifier), the name, or both:

speciesKey

The year and month fields are separate, we want both in the form 2024-03:

PRINTF('%04d-%02d', "year", "month") AS yearMonth (1)
1 YEAR(…) (or year(…)) and MONTH(…) are SQL functions, so we must quote the name of the columns as "year" and "month" using double quotes.

Arguments for the grid function are given on the SQL functions page. We use an EEA reference grid with 1km grid cell size:

GBIF_EEARGCode(
  1000, (1)
  decimalLatitude,
  decimalLongitude,
  COALESCE(coordinateUncertaintyInMeters, 1000) (2)
) AS eeaCellCode,
1 The grid size, 1000m
2 COALESCE sets a default value (1000) if the first value is absent.

Sex and life stage should both show NOT_SUPPLIED when the value is not provided:

COALESCE(sex, 'NOT_SUPPLIED') AS sex,
COALESCE(occurrence.lifestage.concept, 'NOT_SUPPLIED') AS lifestage, (1)
1 The lifestage field is hierarchical. We take the main value, not the whole hierarchy.

Prepare measurement expressions

The measures of the cube are the occurrence count:

COUNT(*) AS occurrences

and the minimum coordinate uncertainty and temporal uncertainty:

MIN(COALESCE(coordinateUncertaintyInMeters, 1000)) AS minCoordinateUncertaintyInMeters,
MIN(GBIF_TemporalUncertainty(eventDate)) AS minTemporalUncertainty

Add sampling bias expression

Measuring sampling bias requires including an occurrence count for a higher taxon, in this case the family. This can be done using an SQL window function. This is a more advanced SQL technique, see the PostgreSQL documentation which has a good introduction to the concept.

IF(ISNULL(familyKey), NULL, SUM(COUNT(*)) OVER (PARTITION BY familyKey)) AS familyCount

-- The same, shown on several lines:
IF(ISNULL(familyKey), (1)
   NULL, (2)
     SUM(COUNT(*)) (3)
     OVER (PARTITION BY familyKey, (4)
                        PRINTF('%04d-%02d', "year", "month"),
                        GBIF_EEARGCode(1000, decimalLatitude, decimalLongitude, COALESCE(coordinateUncertaintyInMeters, 1000)))
) AS familyCount
1 Test for a null family; we still want to include species without a family in GBIF’s taxonomic backbone
2 Value to return for a null family
3 Measurement for the partition, the sum of the counts of the number of occurrences by species within this family
4 Partition by the chosen family, keeping the other dimensions of yearMonth and eeaCellCode

Write full query

The WHERE clause is the filter, and the dimensions and measurements are SELECT expressions which must also appear in the GROUP BY clause.

Typical data cubes will use an SQL query similar to this one:

SELECT (1)
  -- Dimensions: (2)
  PRINTF('%04d-%02d', "year", "month") AS yearMonth,
  GBIF_EEARGCode(
    1000,
    decimalLatitude,
    decimalLongitude,
    COALESCE(coordinateUncertaintyInMeters, 1000)
  ) AS eeaCellCode,
  familyKey,
  family,
  speciesKey,
  species,
  COALESCE(sex, 'NOT_SUPPLIED') AS sex,
  COALESCE(occurrence.lifestage.concept, 'NOT_SUPPLIED') AS lifestage,
  -- Measurements (3)
  COUNT(*) AS occurrences, (4)
  MIN(COALESCE(coordinateUncertaintyInMeters, 1000)) AS minCoordinateUncertaintyInMeters,
  MIN(GBIF_TemporalUncertainty(eventDate)) AS minTemporalUncertainty,
  -- Higher taxon measurement
  IF(ISNULL(familyKey), (1)
     NULL, (2)
       SUM(COUNT(*)) (3)
       OVER (PARTITION BY familyKey, (4)
                          PRINTF('%04d-%02d', "year", "month"),
                          GBIF_EEARGCode(1000, decimalLatitude, decimalLongitude, COALESCE(coordinateUncertaintyInMeters, 1000)))
  ) AS familyCount
FROM
  occurrence
WHERE occurrenceStatus = 'PRESENT'
  AND countryCode = 'PL'
  AND "year" >= 2000
  AND kingdomKey = 1
  AND hasCoordinate = TRUE
  AND (coordinateUncertaintyInMeters <= 1000 OR coordinateUncertaintyInMeters IS NULL)
  AND speciesKey IS NOT NULL
  AND NOT ARRAY_CONTAINS(issue, 'ZERO_COORDINATE')
  AND NOT ARRAY_CONTAINS(issue, 'COORDINATE_OUT_OF_RANGE')
  AND NOT ARRAY_CONTAINS(issue, 'COORDINATE_INVALID')
  AND NOT ARRAY_CONTAINS(issue, 'COUNTRY_COORDINATE_MISMATCH')
  AND "month" IS NOT NULL
GROUP BY
  yearMonth,
  eeaCellCode,
  familyKey,
  family,
  speciesKey,
  species,
  sex,
  lifestage
ORDER BY (5)
  yearMonth DESC,
  eeaCellCode ASC,
  speciesKey ASC;
1 These are the columns in the resulting cube, i.e. the dimensions and measurements for the cube.
2 The dimensions must also appear in the GROUP BY section and usually also in any PARTITION BY expressions. They can include functions.
3 The measurements must be SQL aggregate functions, like COUNT, MIN, MAX, AVERAGE, SUM etc
4 AS gives a name to the column, used as the header in the result file
5 The ORDER BY section is optional.

See API SQL Downloads – Functions for descriptions and arguments for the functions, including the GBIF_EEARGCode (EEA reference grid) function.

Submit the query to GBIF

See API SQL Downloads for instructions.

Note the queries above are shown in SQL format. The double quote characters, newlines and comments must be escaped/removed when converting the query to the JSON request for the API. The submitted request is available here.

A cube made using this query is available here.

yearmonth eeacellcode familykey family specieskey species sex lifestage occurrences mincoordinateuncertaintyinmeters mintemporaluncertainty familycount

2025-02

1kmE4612N3427

3454

Mactridae

5188971

Rangia cuneata

NOT_SUPPLIED

NOT_SUPPLIED

1

1000.0

1

1

2025-02

1kmE4612N3434

6858

Cardiidae

2286585

Cerastoderma glaucum

NOT_SUPPLIED

NOT_SUPPLIED

1

1000.0

1

1

2025-02

1kmE4613N3421

2986

Anatidae

2498273

Clangula hyemalis

NOT_SUPPLIED

NOT_SUPPLIED

1

1000.0

1

1

2025-02

1kmE4613N3433

6858

Cardiidae

2286585

Cerastoderma glaucum

NOT_SUPPLIED

NOT_SUPPLIED

1

1000.0

1

1

2025-02

1kmE4613N3434

3476

Mytilidae

2285681

Mytilus trossulus

NOT_SUPPLIED

NOT_SUPPLIED

1

1000.0

1

1

2025-02

1kmE4613N3434

6858

Cardiidae

2286585

Cerastoderma glaucum

NOT_SUPPLIED

NOT_SUPPLIED

1

1000.0

1

1

2025-02

1kmE4613N3434

3454

Mactridae

5188971

Rangia cuneata

NOT_SUPPLIED

NOT_SUPPLIED

1

1000.0

1

1

2025-02

1kmE4613N3435

3476

Mytilidae

2285681

Mytilus trossulus

NOT_SUPPLIED

NOT_SUPPLIED

1

1000.0

1

1

2025-02

1kmE4622N3382

6950

Geometridae

4524740

Phigalia pilosaria

Male

Adult

1

1000.0

1

1

2025-02

1kmE4643N3450

5235

Corvidae

2482515

Corvus cornix

NOT_SUPPLIED

Unknown

1

5.0

60

1