GBIF forecast: increasing chance of clouds for species occurrence data

Azure-ready data snapshot provides potential support for GEO-Microsoft Planetary Computer Programme research grants

Brazilian free-tailed bat (Tadarida brasiliensis), United States. Photo 2021 Caitlin Campbell via iNaturalist Research-grade Observations, licensed under CC BY-NC 4.0.

A dataset containing nearly 1.3 billion species occurrence records from the GBIF network is now available for use through the Microsoft Planetary Computer data catalogue. The snapshot currently comprises openly licensed, georeferenced records shared through GBIF up to mid-April, and it gives users of the Microsoft Azure cloud-computing environment easy access to primary biodiversity data in consistent, analysis-ready formats.

Developed by Microsoft AI for Earth, the Planetary Computer combines petabytes of global-scale environmental monitoring data and makes it readily available to users of the large-scale virtual computing system. Documentation, a sample notebook and a post by GBIF data analyst John Waller on the GBIF Data Blog outline how users can start accessing GBIF-mediated data from blob storage on Azure, whether inside or outside the Planetary Computer.

The entry into the Planetary Computer catalogue is timely, given its potential to support early adopters' grant applications for up to US $60,000 in both funding and computing credits along with other resources through the GEO-Microsoft Planetary Computer Programme. The deadline to submit 12-month research proposals applying the Planetary Computer to grand environmental challenges described in the current work programme of the Group on Earth Observations (GEO) is 15 June 2021.

"Having routine access to a complete and up-to-date collection of species occurrences through the Planetary Computer will greatly enhance the contribution that the GBIF network can make to deriving indicators for the UN Convention on Biological Diversity's post-2020 Global Biodiversity Framework," said Simon Ferrier, a chief research scientist at CSIRO, Australia's national science agency.

Ferrier, project co-lead Andrew Hoskins and colleagues are developing a solution that applies state-of-the-art machine-learning tools on the Azure platform to extract the signal of biodiversity change from masses of less-structured observation data. "This is made possible only by co-locating GBIF-mediated data and remote-sensing data on land cover and climate change alongside the high-performance computational capability that the Azure cloud-computing platform provides," said Hoskins. "The approach opens up whole new opportunities for monitoring change in our planet’s biodiversity."

Ferrier and Hoskins' host institution, CSIRO, also hosts the Atlas of Living Australia, which coordinates the country's national-scale GBIF activities.

The GBIF Secretariat aims to update the snapshots monthly to keep pace with the dynamic and ever-changing data available through GBIF.org and the GBIF API. The snapshots will continue to include all records shared through GBIF under CC0 and CC BY designations whose coordinates have passed automated quality checks.
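The inclusion rule described above can be sketched as a simple filter. This is only an illustration under stated assumptions: the field names (`license`, `latitude`, `longitude`, `coordinate_issue`) and the sample records are hypothetical and do not reflect the actual snapshot schema or GBIF's own quality checks.

```python
from typing import Optional, TypedDict

# Licence designations named in the article as eligible for the snapshot.
OPEN_LICENSES = {"CC0", "CC BY"}


class Record(TypedDict):
    """Hypothetical, simplified occurrence record (not the real schema)."""
    species: str
    license: str
    latitude: Optional[float]
    longitude: Optional[float]
    coordinate_issue: bool  # assumed flag: True if a coordinate check failed


def in_snapshot(record: Record) -> bool:
    """Return True if the record meets the three stated criteria."""
    has_coordinates = (
        record["latitude"] is not None and record["longitude"] is not None
    )
    return (
        record["license"] in OPEN_LICENSES
        and has_coordinates
        and not record["coordinate_issue"]
    )


# Made-up sample records for illustration only.
records: list[Record] = [
    {"species": "Tadarida brasiliensis", "license": "CC0",
     "latitude": 30.1, "longitude": -97.7, "coordinate_issue": False},
    {"species": "Tadarida brasiliensis", "license": "CC BY-NC",
     "latitude": 30.2, "longitude": -97.8, "coordinate_issue": False},
    {"species": "Tadarida brasiliensis", "license": "CC BY",
     "latitude": None, "longitude": None, "coordinate_issue": False},
    {"species": "Tadarida brasiliensis", "license": "CC BY",
     "latitude": 0.0, "longitude": 0.0, "coordinate_issue": True},
]

eligible = [r for r in records if in_snapshot(r)]
print(len(eligible), "of", len(records), "sample records would be included")
```

In this sketch, a record licensed under CC BY-NC is excluded even when its coordinates are valid, which matches the article's statement that only CC0 and CC BY records are included.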

This first snapshot contains records on 939,601 species, drawn from 22,517 datasets and 1,029 data-publishing institutions. Like all datasets shared through GBIF, it has been assigned a DOI, or digital object identifier, that maintains a persistent and transparent record of its sources. Given the Secretariat's success in developing one of the world's leading systems for data citation and attribution, preserving provenance has remained an emphasis while introducing the data into new research and computing communities.

To that end, GBIF developers have created a new service that produces citable records of 'derived datasets' and enables cloud-compute users to follow citation guidelines and best practices. The tool, which science communications coordinator Daniel Noesgaard has described on the GBIF Data Blog, is available through both the GBIF API and the GBIF.org interface. It lets users account for the fact that they may run analyses against a significantly filtered portion of the data available in any given snapshot. Citing the DOI assigned to the resulting derived dataset will improve citation accuracy and tracking while ensuring analytic transparency and reproducibility.
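A user working with a filtered subset would typically need to know which source datasets contributed to it, and how many records each supplied, before describing it as a derived dataset. The sketch below shows one way to tally that. It assumes each record carries a `datasetKey` identifying its source dataset. The keys, species names and records shown are invented placeholders, and the code does not call the GBIF service itself.

```python
from collections import Counter


def dataset_counts(records):
    """Count how many records each source dataset contributed."""
    return dict(Counter(record["datasetKey"] for record in records))


# Hypothetical slice of a snapshot; the dataset keys are placeholders.
snapshot = [
    {"datasetKey": "dataset-a", "species": "Tadarida brasiliensis"},
    {"datasetKey": "dataset-a", "species": "Tadarida brasiliensis"},
    {"datasetKey": "dataset-b", "species": "Tadarida brasiliensis"},
    {"datasetKey": "dataset-b", "species": "Myotis lucifugus"},
    {"datasetKey": "dataset-c", "species": "Myotis lucifugus"},
]

# An analysis that uses only part of the snapshot, here a single species.
subset = [r for r in snapshot if r["species"] == "Tadarida brasiliensis"]

counts = dataset_counts(subset)
for key, n in sorted(counts.items()):
    print(f"{key}: {n} record(s)")
```

Only the datasets that actually appear in the filtered subset are tallied, so the resulting list credits the sources of the analysed data and not every dataset in the snapshot.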

More is planned: a similar snapshot has already been placed in the Registry of Open Data on Amazon Web Services, and another is being prepared for inclusion in the public datasets available through Google BigQuery. Taken together, these developments signal the first step toward enabling cloud-computing systems to "help foster novel research, lower technical barriers of large-scale data analysis and raise the visibility" of the GBIF network, as GBIF head of informatics Tim Robertson suggested last February.

April 2021 snapshot, by the numbers

Kingdom          Number of species   Number of records
Animalia                   531,074       1,064,194,305
Plantae                    315,369         193,391,585
Fungi                       64,291          12,210,273
Chromista                   18,113           9,440,667
Bacteria                     8,364          13,313,089
Protozoa                     1,635             793,176
Virus                          383              42,019
Archaea                        216             226,905
incertae sedis                 164           4,223,619
TOTAL                      939,601       1,297,835,638
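
As a quick check on the record counts in the table, the following sketch sums the per-kingdom figures and computes each kingdom's share of the snapshot. The numbers are copied from the table. The variable names are illustrative only.

```python
# Record counts per kingdom, copied from the table above.
records_by_kingdom = {
    "Animalia": 1_064_194_305,
    "Plantae": 193_391_585,
    "Fungi": 12_210_273,
    "Chromista": 9_440_667,
    "Bacteria": 13_313_089,
    "Protozoa": 793_176,
    "Virus": 42_019,
    "Archaea": 226_905,
    "incertae sedis": 4_223_619,
}

# Total number of records as stated in the table's final row.
STATED_TOTAL = 1_297_835_638

computed_total = sum(records_by_kingdom.values())
print("Per-kingdom counts match stated total:", computed_total == STATED_TOTAL)

# Share of all records contributed by each kingdom, in percent.
shares = {k: 100 * n / computed_total for k, n in records_by_kingdom.items()}
for kingdom, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{kingdom:<15}{share:6.2f}%")
```

The per-kingdom record counts add up to the stated total of 1,297,835,638, with animal records accounting for roughly 82% of the snapshot.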