Copyright Compliance and Licensing Patterns in African NLP Datasets: A Data-Driven Governance Analysis

Tankiso Kolobe, Andries Monyebodi

Partner: ip-law

Year: 2026

Abstract: NLP datasets for African languages are becoming increasingly important for AI development in African contexts, but many lack licensing and governance documentation, which creates legal risk for researchers. This research analyzed 249 African NLP projects by combining rule-based scoring with machine learning. Projects were classified using a simple scoring system based on four legal attributes: copyright status, licensing clarity, legal basis for use, and whether an IP analysis was done. Logistic regression outperformed the Decision Tree and Random Forest models, achieving a macro F1 of 0.88. Medium risk dominated with 144 projects, while high and low risk accounted for 60 and 45 projects respectively. Projects were also grouped into five typologies using unsupervised clustering. A Streamlit dashboard was developed to visualize the key findings and act as a decision-support tool, integrating descriptive analytics with an interactive prediction interface.
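
To make the rule-based step concrete, the following is a minimal sketch of how a score over the four legal attributes could be mapped to a risk level. The abstract does not give the actual point values or cut-offs, so the equal weighting, the thresholds, and all names here (LegalAttributes, risk_score, risk_level) are illustrative assumptions rather than the authors' method.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class LegalAttributes:
    """Hypothetical record of the four legal attributes for one project."""
    copyright_status_known: bool
    licensing_clear: bool
    legal_basis_stated: bool
    ip_analysis_done: bool


def risk_score(attrs: LegalAttributes) -> int:
    """Count how many of the four attributes are missing (0 to 4).

    Assumption: every attribute carries equal weight.
    """
    return sum(not flag for flag in astuple(attrs))


def risk_level(score: int) -> str:
    """Map a score to a risk level using assumed, illustrative thresholds."""
    if score == 0:
        return "low"
    if score <= 2:
        return "medium"
    return "high"


# Example: licensing is clear, but nothing else is documented.
project = LegalAttributes(
    copyright_status_known=False,
    licensing_clear=True,
    legal_basis_stated=False,
    ip_analysis_done=False,
)
level = risk_level(risk_score(project))
```

Labels produced this way could then serve as targets for the classifiers the abstract compares, although the abstract does not say exactly how the two stages were connected.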
