
October 29, 2025
Author: Bolaji Ayodeji, DPG Evangelist and Technical Coordinator, DPGA Secretariat
Last year, the DPGA Secretariat launched its first-ever set of Calls for Collaborative Action following discussions with experts in the open source ecosystem, digital public infrastructure, climate action, and public interest AI. These calls are designed to galvanise support and signal to stakeholders the actions they can take to contribute to the success of digital public goods in highly impactful areas. One of those calls is Open Data for Public Interest AI, which called for DPGs that make it easier to identify, prepare, share, and use high-quality open training data. The development of public interest AI depends on the opportunity to train models on both existing and new high-quality, openly licensed datasets. At a time when generative AI is advancing at breakneck speed, the term “open-source AI” is often misconstrued to describe systems with varying degrees of openness, such as those that release model weights without transparency around the training data. It has therefore become increasingly imperative to work towards a transparent and open way of building AI systems that serve the public interest.
Several challenges impede this at a larger scale, including infrastructure limitations, funding constraints, and limited access to open solutions, underscoring the need for greater resources to produce and share open data across diverse geographical contexts. In an earlier blog post, DPGA Secretariat CEO Liv Marte Nordhaug noted that “DPGs, as open, adaptable digital solutions, with documentation that can help facilitate reuse, can play an important role as tools for addressing common challenges to scaling public interest AI – both in the near future and longer term. In particular, DPGs can help unlock more and higher-quality open training data and data sharing.” Over the past several months, we have therefore focused on exploring how DPGs can help reduce some of the technical barriers to producing more high-quality open training data, particularly for use cases such as language models that address language gaps in AI development, solutions for public service delivery, and research-based climate action (monitoring, mitigation, adaptation). One fundamental way we addressed this challenge, with multiple stakeholders participating in the call, was by creating an adaptable and reusable toolkit that countries and stakeholders can use to facilitate the collection, extraction, processing, validation, and preparation of data.
Our key activity during the past months was the development of this toolkit, which includes a list of existing DPGs, potential DPGs, and other open-source tools that could be relevant for advancing public interest AI. We use four core terms to summarise the focus areas of the solutions and their relevance to the call to action:
Our initial observations from compiling the first version of the toolkit showed that the existing DPGs included are primarily focused on data collection, with a few also involved in data validation and processing. This points to a gap in data identification, validation, and processing. We then aimed to coordinate stakeholder efforts to ensure that potential solutions on the list could become certified DPGs, while sharing the toolkit with relevant open-source community members who are building datasets that could eventually lead to new AI DPGs on the DPG Registry.
Our coordination efforts started by gathering stakeholders, many of them DPGA members and DPG product owners, to garner their initial thoughts on how this topic affects their current workstreams. These included Creative Commons, Geoprism Registry, Open Knowledge Foundation, the Government of the Dominican Republic, Open Future, BMZ (GIZ FAIR Forward), and Open Data Services. We also had exploratory discussions with more than ten dataset creators to learn more about the challenges they faced in collecting and developing their datasets, any unique open-source digital solutions that supported the process, and any they wished had existed at the time. All these inputs helped validate our focus on tools for this call as one means of addressing the open data challenge. Below is an overview of the aspects the consulted stakeholders work on that are relevant to the C4CA and the toolkit.
We also discovered complementary toolkits developed by various organisations, including educational materials, resources, templates, and checklists, which all aid different stakeholders in the production of open datasets. Here are a few examples:
When we spoke to the developers of impressive datasets like the Simula Datasets, Africa Biomass, Hyperlocal Mapping of Air Pollution, Inclusive Digital Monitoring System in the Eastern Himalayas, High Carbon Stock Approach for Forest Protection, Kenyan Language Corpora, Masakhane African Languages Hub, etc., we discovered several challenges that affirmed the need for more technical tools and resources.

For example, the team working with fishing communities faced complications in data collection due to the diversity of dialects across communities, with language barriers arising between those who spoke the local languages and those responsible for gathering the data. Environmental challenges, such as sensor failures due to heat and maintenance issues, further slowed progress. Although Excel was primarily used for processing, there was a need for better open-source tools to automate sensor data handling and improve interoperability. Working with universities helped refine and analyse the data, but issues around data ownership, transparency in payment, release agreements, and government buy-in persisted.

Others working on biomass data relied heavily on paper-based processes that were later digitised, which often led to inefficiencies: data scientists had to review information line by line, while field teams repeatedly verified corrections, creating a time-consuming back-and-forth.

For those building on existing data, broader systemic challenges included unclear or missing data licenses that caused long delays, fragmented storage without a central repository, and restricted access to locked or government-owned datasets.

Those working on medical data rely on an in-house data processing pipeline. For data labelling, they partnered with proprietary platforms that offer a free academic license and strong compliance with GDPR and privacy regulations. While they explored open-source alternatives, most required local setups and lacked the scalability needed for multi-expert remote collaboration.
A broader challenge lies in the economics of data sharing. Many see datasets as commercial assets rather than public goods, leading to data being bought, sold, or locked away instead of shared openly. Overall, progress is steady but slow: people are warming up to the idea that openness and privacy can coexist, and that responsible data sharing can drive scientific and AI advancement for the common good. Altogether, these experiences underscore a clear need for open, community-centred, and interoperable digital solutions that streamline data collection and validation, ensure standardisation, improve transparency, and promote collaboration between local communities, researchers, and government stakeholders, encouraging them to consider sharing national data openly to aid public interest AI innovation.
Although the toolkit and the ongoing work of the stakeholders mentioned earlier provide a first step towards closing the identified gaps in enabling more and better open data for public interest AI, much still needs to be done to improve the availability and quality of open data resources for AI training. For this reason, the toolkit was established as a dynamic collection that allows for real-time updates, rather than a static resource. We aim to expand it over the next year through our research, sourcing, and resource assessments, developing a more granular understanding of the needs of the toolkit’s target groups and adding further resources. Additionally, we plan to demonstrate the use of the products included in the toolkit through several case studies, thereby bridging the gap between theory and practice. Our work in 2026 will also focus on sourcing more AI System DPGs by engaging the open science community and exploring the creation of open datasets for SDG-relevant models, fine-tuning, and public benchmarks that are representative of lived realities in global majority contexts. In the spirit of the Calls for Collaborative Action, we urge DPGA members, stakeholders, and partners to collectively support this effort to expand the open data for public interest AI ecosystem. Only collectively can we realise the potential of public interest AI and thereby make progress toward the SDGs.