Data cubes

This is an experimental feature, and the implementation may change throughout 2024. The feature is currently only available for preview by invited users.

Data cubes are views of GBIF occurrence data, filtered and summarized by spatial, temporal and/or taxonomic dimensions, and providing measures for each dimension.

They can be produced using a prototype GBIF API, or using the GBIF occurrence data deposited on external cloud services. This page covers generating a cube using GBIF’s API. See Generate a species occurrence cube using Microsoft Azure Databricks to generate a cube using a public cloud service.

Demonstration cube specification

The guide below will produce a data cube for occurrences of animals from Poland, recorded during or after 2000.

The dimensions of the cube will be

  1. Species

    • We must therefore exclude occurrences not identified to species level

    • We will also exclude occurrences with identifications noted to be uncertain

  2. Year-Month

    • We must exclude occurrences with only year accuracy, or with an imprecise date spanning multiple months

  3. EEA Reference Grid (PDF) with a 1km grid size

    • We will randomize the point using the coordinateUncertaintyInMeters, defaulting to 1000m

    • We will exclude occurrences without coordinates

    • We will exclude occurrences with a coordinateUncertaintyInMeters greater than 1000m

  4. Sex

    • This is included to demonstrate a basic dimension

  5. Life stage

    • This is included to demonstrate a more complex field (a vocabulary field) used for a dimension

The measures of the cube will be

  1. Occurrence count

  2. Minimum coordinate uncertainty

  3. Minimum temporal uncertainty

To account for sampling bias, we will also count occurrences at family level.

Create an initial GBIF search filter

Perhaps using the GBIF.org website, find an approximate search filter, for example for a particular taxon, country or region, year range, etc. This will need to be expressed in SQL for submission to the API:

WHERE occurrenceStatus = 'PRESENT'
  AND countryCode = 'PL' (1)
  AND year >= 2000
  AND kingdomKey = 1 (2)
  AND hasCoordinate = TRUE (3)
1 Poland
2 Animalia
3 Include only occurrences with coordinates

This corresponds to this query on the website.

Exclude unwanted data

The GBIF website and basic search APIs do not allow excluding data from searches, but this is often required for data cubes. We add additional expressions to our WHERE clause:

  AND (coordinateUncertaintyInMeters <= 1000 OR coordinateUncertaintyInMeters IS NULL) (1)
  AND speciesKey IS NOT NULL (2)
  AND NOT ARRAY_CONTAINS(issue, 'ZERO_COORDINATE') (3)
  AND NOT ARRAY_CONTAINS(issue, 'COORDINATE_OUT_OF_RANGE')
  AND NOT ARRAY_CONTAINS(issue, 'COORDINATE_INVALID')
  AND NOT ARRAY_CONTAINS(issue, 'COUNTRY_COORDINATE_MISMATCH')
  AND month IS NOT NULL (4)
1 Exclude occurrences with a large coordinate uncertainty, but keep those without an uncertainty
2 Exclude occurrences not identified to species level
3 Exclude occurrences with these georeferencing issues
4 Exclude occurrences without a month, or spanning multiple months

Prepare dimension expressions

The dimensions are the species, year-month, 1km EEA grid, sex and life stage.

For species we could choose either the key (numeric identifier), the name, or both:

speciesKey

The year and month fields are separate, we want both in the form 2024-03:

PRINTF('%04d-%02d', "year", "month") AS yearMonth (1)
1 YEAR(…) (or year(…)) and MONTH(…) are SQL functions, so we must quote the name of the columns as "year" and "month" using double quotes.

Arguments for the grid function are given on the SQL functions page. We use an EEA reference grid with 1km grid cell size:

GBIF_EEARGCode(
  1000, (1)
  decimalLatitude,
  decimalLongitude,
  COALESCE(coordinateUncertaintyInMeters, 1000) (2)
) AS eeaCellCode,
1 The grid size, 1000m
2 COALESCE sets a default value (1000) if the first value is absent.

Sex and life stage should both show NOT_SUPPLIED when the value is not provided:

COALESCE(sex, 'NOT_SUPPLIED') AS sex,
COALESCE(occurrence.lifestage.concept, 'NOT_SUPPLIED') AS lifestage, (1)
1 The lifestage field is hierarchical. We take the main value, not the whole hierarchy.

Prepare measurement expressions

The measures of the cube are the occurrence count:

COUNT(*) AS occurrences

and the minimum coordinate uncertainty and temporal uncertainty:

MIN(COALESCE(coordinateUncertaintyInMeters, 1000)) AS minCoordinateUncertaintyInMeters,
MIN(GBIF_TemporalUncertainty(eventDate)) AS minTemporalUncertainty

Add sampling bias expression

Measuring sampling bias requires including an occurrence count for a higher taxon, in this case the family. This can be done using an SQL window function. This is a more advanced SQL technique, see the PostgreSQL documentation which has a good introduction to the concept.

IF(ISNULL(familyKey), NULL, SUM(COUNT(*)) OVER (PARTITION BY familyKey)) AS familyCount

-- The same, shown on several lines:
IF(ISNULL(familyKey), (1)
   NULL, (2)
     SUM(COUNT(*)) (3)
     OVER (PARTITION BY familyKey) (4)
) AS familyCount
1 Test for a null family; we still want to include species without a family in GBIF’s taxonomic backbone
2 Value to return for a null family
3 Measurement for the partition, the sum of the counts of the number of occurrences by species within this family
4 Partition by the family

Write full query

The WHERE clause is the filter, and the dimensions and measurements are SELECT expressions which must also appear in the GROUP BY clause.

Typical data cubes will use an SQL query similar to this one:

SELECT (1)
  -- Dimensions: (2)
  PRINTF('%04d-%02d', "year", "month") AS yearMonth,
  GBIF_EEARGCode(
    1000,
    decimalLatitude,
    decimalLongitude,
    COALESCE(coordinateUncertaintyInMeters, 1000)
  ) AS eeaCellCode,
  familyKey,
  family,
  speciesKey,
  species,
  COALESCE(sex, 'NOT_SUPPLIED') AS sex,
  COALESCE(occurrence.lifestage.concept, 'NOT_SUPPLIED') AS lifestage,
  -- Measurements (3)
  COUNT(*) AS occurrences, (4)
  MIN(COALESCE(coordinateUncertaintyInMeters, 1000)) AS minCoordinateUncertaintyInMeters,
  MIN(GBIF_TemporalUncertainty(eventDate)) AS minTemporalUncertainty,
  -- Higher taxon measurement
  IF(ISNULL(familyKey), NULL, SUM(COUNT(*)) OVER (PARTITION BY familyKey)) AS familyCount
FROM
  occurrence
WHERE occurrenceStatus = 'PRESENT'
  AND countryCode = 'PL'
  AND "year" >= 2000
  AND kingdomKey = 1
  AND hasCoordinate = TRUE
  AND (coordinateUncertaintyInMeters <= 1000 OR coordinateUncertaintyInMeters IS NULL)
  AND speciesKey IS NOT NULL
  AND NOT ARRAY_CONTAINS(issue, 'ZERO_COORDINATE')
  AND NOT ARRAY_CONTAINS(issue, 'COORDINATE_OUT_OF_RANGE')
  AND NOT ARRAY_CONTAINS(issue, 'COORDINATE_INVALID')
  AND NOT ARRAY_CONTAINS(issue, 'COUNTRY_COORDINATE_MISMATCH')
  AND "month" IS NOT NULL
GROUP BY
  yearMonth,
  eeaCellCode,
  familyKey,
  family,
  speciesKey,
  species,
  sex,
  lifestage
ORDER BY (5)
  yearMonth DESC,
  eeaCellCode ASC,
  speciesKey ASC;
1 These are the columns in the resulting cube, i.e. the dimensions and measurements for the cube.
2 The dimensions must also appear in the GROUP BY section. They can include functions.
3 The measurements must be SQL aggregate functions, like COUNT, MIN, MAX, AVERAGE, SUM etc
4 AS gives a name to the column, used as the header in the result file
5 The ORDER BY section is optional.

See API SQL Downloads – Functions for descriptions and arguments for the functions, including the GBIF_EEARGCode (EEA reference grid) function.

Submit the query to GBIF

See API SQL Downloads for instructions.

A cube using this query is available here.

yearmonth eeacellcode familykey family specieskey species sex lifestage occurrences mincoordinateuncertaintyinmeters mintemporaluncertainty familycount

2024-03

1kmE4630N3262

6537

Limacidae

2295438

Limax maximus

NOT_SUPPLIED

NOT_SUPPLIED

1

122

60

548

2024-03

1kmE4630N3262

5298

Cervidae

5220126

Capreolus capreolus

NOT_SUPPLIED

NOT_SUPPLIED

1

122

60

4877

2024-03

1kmE4630N3263

9608

Emberizidae

2491557

Emberiza schoeniclus

NOT_SUPPLIED

NOT_SUPPLIED

1

31

60

57742

2024-03

1kmE4631N3260

9342

Rallidae

5228199

Gallinula chloropus

NOT_SUPPLIED

NOT_SUPPLIED

1

31

60

26391

2024-03

1kmE4631N3263

7782

Coccinellidae

1043171

Calvia quatuordecimguttata

NOT_SUPPLIED

NOT_SUPPLIED

1

1000

60

4996

2024-03

1kmE4632N3257

2986

Anatidae

2498036

Anser anser

NOT_SUPPLIED

NOT_SUPPLIED

1

122

60

186923

2024-03

1kmE4632N3258

2986

Anatidae

2498252

Alopochen aegyptiaca

NOT_SUPPLIED

NOT_SUPPLIED

1

61

60

186923