ChatIPT system wins the 2024 Ebbe Nielsen Challenge

Assistant developed by Norwegian engineer Rukaya Johaadien helps transform spreadsheets into standardized GBIF-ready datasets; Planetary Knowledge Base and CoreTech Assistant place second and third in annual incentive prize

Rukaya Johaadien, head engineer from GBIF Norway and winner of the first prize entry in the 2024 Ebbe Nielsen Challenge. Photo courtesy of Ms Johaadien.

ChatIPT, a chatbot that cleans and standardizes spreadsheets, creates basic metadata and guides students and researchers through the process of publishing data into the GBIF network, has won GBIF's 2024 Ebbe Nielsen Challenge.

Developed by Rukaya Johaadien, head engineer at GBIF Norway in the Natural History Museum, University of Oslo, ChatIPT helps new or occasional data publishers without specialized knowledge transform a raw, unformatted spreadsheet and share a standardized dataset on GBIF.org.

The annual incentive prize honours the memory of Dr Ebbe Schmidt Nielsen, a Danish-Australian entomologist who was one of the principal founders of GBIF and an inspiring leader in the biosystematics and biodiversity informatics community. An expert jury, led by Birgitte Gemeinholzer, professor of botany at the University of Kassel and chair of the GBIF Science Committee, reviewed a set of 11 eligible entries and selected second- and third-prize winners, respectively:

  • Planetary Knowledge Base, an automated transcription service for specimen labels developed by a team from the Natural History Museum, London, led by postdoc Gu "Hiris" Qianqian with Vince Smith and Ben Scott
  • CoreTech Assistant, an AI-based help desk for beginner-level data publishers built by Chen Yao, a Taiwanese crop scientist and volunteer at TaiBIF currently performing military service.

1st Prize: ChatIPT

Rukaya Johaadien's chatbot provides conversation-style support to students and researchers who hold biodiversity data but are first-time or infrequent data publishers. Its prompts guide users as it cleans and standardizes spreadsheets, creates basic metadata, and publishes well-structured datasets on GBIF.org as a Darwin Core Archive.
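As a rough illustration of what standardizing a spreadsheet involves, the sketch below renames recognizable column headers to Darwin Core terms and reports the columns it cannot map. It is not ChatIPT's implementation: the Darwin Core term names are real, but the header variants, the function name and the sample data are assumptions made for this example.

```python
import csv
import io

# Hypothetical mapping from messy spreadsheet headers to Darwin Core terms.
# The Darwin Core term names are real; the header variants are invented.
HEADER_TO_DWC = {
    "species": "scientificName",
    "scientific name": "scientificName",
    "date": "eventDate",
    "lat": "decimalLatitude",
    "latitude": "decimalLatitude",
    "lon": "decimalLongitude",
    "long": "decimalLongitude",
    "longitude": "decimalLongitude",
    "collector": "recordedBy",
}


def standardize_headers(csv_text):
    """Rename recognized columns to Darwin Core terms; list the rest."""
    reader = csv.DictReader(io.StringIO(csv_text))
    renamed = {}   # original header -> Darwin Core term
    unmapped = []  # headers with no known Darwin Core equivalent
    for header in reader.fieldnames or []:
        key = header.strip().lower()
        if key in HEADER_TO_DWC:
            renamed[header] = HEADER_TO_DWC[key]
        else:
            unmapped.append(header)
    # Keep only the mapped columns and trim stray whitespace from values.
    rows = [
        {renamed[h]: (row[h] or "").strip() for h in renamed}
        for row in reader
    ]
    return rows, unmapped


# Invented sample with inconsistent header spacing and an extra column.
sample = "Species , Lat,Long,Notes\nParus major ,59.92,10.77,garden\n"
rows, unmapped = standardize_headers(sample)
print(rows)
print(unmapped)
```

In a conversational tool, the unmapped columns are the kind of detail a chatbot could ask the user about rather than silently dropping.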

To date, publishing high-quality data from PhD and Master's degrees and other small-scale biodiversity research studies has been difficult to do at scale. Standardizing data typically requires specialist knowledge of programming languages and data management techniques, as well as familiarity with specialist software.

Meanwhile, gaining access to an existing instance of the Integrated Publishing Toolkit (IPT), the GBIF network's workhorse application for data sharing, can test a novice's patience, since these instances are run by node staff with limited time and resources. Training can do little to surmount such logistical barriers, or others like language, when occasional users forget the precise steps and details from year to year.

"Data standardization is hard, and biologists don't become biologists because they like coding or Excel, so a lot of potentially valuable data falls by the wayside," said Johaadien. "Recognizing that large language models have gotten really good at generating code and working with data, I built an automated tool to guide non-technical users through routine questions and process their messy data as much as possible, then publish it quickly and automatically to GBIF."

ChatIPT's selection marks the second year in a row that a Norwegian entry has emerged among the Ebbe's top finishers, following the 2023 third-prize win by Open Data Biodiversity Mapper developed by a three-person team from the Norwegian University of Science and Technology (NTNU).

Get acquainted with ChatIPT through its demo, video summary and GitHub repository

2nd Prize: Planetary Knowledge Base

Developed by Qianqian (Hiris) Gu, Ben Scott and Vince Smith of the Natural History Museum, London, this early prototype provides an automated transcription service that captures structured semantic data from specimen label images by leveraging large language models (LLMs) and Graph Convolutional Neural Networks (GCNNs). Through its innovative approach, the Planetary Knowledge Base (PKB) may transform processes for digitizing and analysing natural history collections.

This initial prototype provides a user interface for a transcription service. Users upload an image of a herbarium sheet, and the service extracts text from the label and aligns its information to nodes in its knowledge graph. With its initial focus on botanical specimens, the PKB returns an Open Digital Specimen object referencing plant taxa and specimens accessed through GBIF, geographic data from GeoNames and biographical data from Wikidata, Bionomia, the Harvard Index of Botany, and TL2.

"The Planetary Knowledge Base is an experiment in unlocking the value of biodiversity data at scale, commensurate with challenges facing the planet, the volume of data now available and the latest in AI technologies," said Smith. "Our first service focuses on supporting the mass digitization of natural science collections, but we hope to grow the knowledge base as a platform to innovate new services that unlock the potential of this data."

For data users, the PKB's network knowledge graph enables deeper interrogation of current knowledge of collections. It may help to identify species in need of taxonomic review, flag potential data discrepancies and conflicts, or even detect potential errors, outliers and gaps in the knowledge graph, for example, by pointing out misidentified specimens that may represent new species.

Institutions that hold scientific collections may see significant savings of time and resources from PKB, reducing backlogs of undigitized specimens. By giving curators, researchers and other stakeholders access to a better and more accurate picture of the world's biodiversity, the PKB can also support more informed decision-making across topics from taxonomy to conservation policy.

Team member Ben Scott previously received third prize in the 2020 Challenge for Voyager, an entry developed with Ivvet Abdullah-Modinou. In 2008, Vince Smith earned recognition for his career achievements through the Ebbe Nielsen Prize, prior to its relaunch as an incentive challenge in 2014.

Learn more about the Planetary Knowledge Base from its demo and video

3rd Prize: CoreTech Assistant

Chen Yao's multilingual chatbot prototype leverages Retrieval-Augmented Generation (RAG) technology to bridge users' linguistic gaps while reducing the steep learning curve for the Darwin Core data standard.

CoreTech Assistant's developer is a crop scientist and self-taught composer and programmer who first joined the GBIF community at a September 2023 workshop organized by TaiBIF, which he has since served as a volunteer mentor and translator. His entry seeks to support people, particularly beginners, who encounter issues as they seek to access, understand, and contribute biodiversity data at any time, even when the network's help desks are not available.

CoreTech Assistant's AI model bridges language gaps, retrieves expert guidance from available documentation, and simplifies both the preparation and use of Darwin Core-formatted data. The chatbot detects the user's language from their input, retrieving and providing technical information in their preferred language. By reducing barriers to entry for non-English speakers, CoreTech Assistant closes a gap in the sharing and use of biodiversity information, especially for users in the countries of Asia.
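The retrieval step behind a RAG chatbot of this kind can be sketched in a few lines. The example below is not CoreTech Assistant's code: it swaps in a simple bag-of-words cosine similarity where a production system would more likely use embeddings, it leaves out language detection entirely, and the documentation snippets, function names and prompt wording are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical documentation snippets; a real system would index actual guides.
DOCS = [
    "eventDate should use ISO 8601 format such as 2023-09-15.",
    "decimalLatitude and decimalLongitude hold coordinates in decimal degrees.",
    "scientificName holds the full scientific name of the organism.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, docs=DOCS):
    """Return the snippet most similar to the question."""
    q = Counter(tokenize(question))
    return max(docs, key=lambda d: cosine(q, Counter(tokenize(d))))


def build_prompt(question):
    """Combine the retrieved snippet with the question for a language model."""
    context = retrieve(question)
    return (
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Answer in the language of the question."
    )


print(build_prompt("What format should eventDate use?"))
```

Grounding the prompt in retrieved documentation is what lets such an assistant answer from expert guidance rather than from the language model's general knowledge alone.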

Learn about CoreTech Assistant through its demo, video description and GitHub repository


The Challenge's 2024 prize pool of €20,000 awards €10,000 to the first-place winner, €6,000 to second place and €4,000 to third.

Jury for 2024 Ebbe Nielsen Challenge