We call for DPGs that can make identifying, preparing, sharing, and using higher-quality open training data easier, particularly for the following use cases:

Identify, create, and fund relevant open-source technical and governance toolkits, case studies, and capacity-building efforts that can increase the availability and use of high-quality open training data, particularly for the following use cases:
The development of public interest AI, including AI systems as digital public goods, depends on the opportunity to train models on both existing and new high-quality, openly licensed datasets. Many challenges impede doing this at scale, one of which is the resources required to produce and share open data in different geographical contexts. One way to address this challenge is to create an adaptable, reusable toolkit that can be recommended to countries and stakeholders to facilitate the collection, extraction, processing, and preparation of data.
Generative AI is advancing at breakneck speed, and the term “open-source AI” is often misused to describe systems that have open weights but offer no transparency about, or access to, the data they were trained on. This lack of transparency poses a significant risk, as these systems increasingly shape our norms, values, understanding of reality, and access to information and services at the most fundamental level. It is urgent to overcome the barriers to a more transparent and open way of building AI systems that serve the public interest. This includes reducing some of the main technical barriers to the availability of more high-quality open training data.