Copyright Compliance and Licensing Patterns in African NLP Datasets: A Data-Driven Governance Analysis

Delwyn Gordon, Nokhutula Madekufamba

Partner: ip-law

Year: 2026

Abstract: The rapid growth of African Natural Language Processing (NLP) has increased the development and reuse of language datasets that support modern artificial intelligence systems. However, the copyright status, licensing terms, and governance practices of many of these datasets remain unclear. This project investigates copyright compliance and licensing patterns across 249 African NLP projects using a structured governance dataset developed by the Data Science Law Lab at the University of Pretoria. Using exploratory data analysis, statistical modelling, clustering techniques, and interactive visualisation, the study evaluates copyright and licensing risk and examines how governance practices relate to factors such as funding source, authorship geography, and commercialisation intent. The results reveal significant governance challenges, with 57% of projects classified as medium risk and 37% as high risk. Common issues include unclear copyright status, undocumented permissions, ambiguous licensing, and limited use of rights-management mechanisms.
