Copyright Compliance and Licensing Patterns in African NLP Datasets: A Veridical Audit

Malwandla Ngobeni, Merhawi Hailu

Partner: ip-law

Year: 2026

Abstract: This study examines 249 African Natural Language Processing (NLP) projects from the perspective of legal compliance and data governance. A weighted risk model, scored from 0 to 8, classifies each project as low, medium, or high risk. The PCS framework was applied to transform 38 qualitative legal variables into computable indicators, enabling evidence-based analysis of copyright and licensing practices across the African NLP landscape. The results show that most projects fall within the medium-to-high risk categories, indicating systemic gaps in legal documentation and governance. An interactive Streamlit dashboard with dynamic filtering, cluster profiling, and a real-time Risk Auditor tool was deployed to help policymakers and institutional leaders identify governance failures and make informed decisions. The study concludes with recommendations for dataset expansion, permanent hosting, and multilingual accessibility to advance equitable AI governance across Africa.
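
As a rough illustration of how a weighted score on a 0 to 8 scale could be mapped to the three risk categories mentioned in the abstract, the sketch below sums the weights of whichever risk indicators are flagged for a project. The indicator names, their weights, and the band cut-offs are all hypothetical placeholders chosen for this example; the abstract does not state the study's actual variables, weights, or thresholds.

```python
# Hypothetical indicators and weights (assumptions, not the study's values).
# They are chosen only so that the maximum possible score is 8.
HYPOTHETICAL_WEIGHTS = {
    "no_license_stated": 3,
    "unclear_data_source": 2,
    "no_consent_documentation": 2,
    "no_attribution_policy": 1,
}

MAX_SCORE = sum(HYPOTHETICAL_WEIGHTS.values())  # 8 under these assumed weights


def risk_score(flags: dict[str, bool]) -> int:
    """Sum the weights of all indicators flagged True; missing flags count as False."""
    return sum(
        weight
        for name, weight in HYPOTHETICAL_WEIGHTS.items()
        if flags.get(name, False)
    )


def risk_band(score: int) -> str:
    """Map a 0-8 score to a band. The cut-offs (2 and 5) are assumed for illustration."""
    if not 0 <= score <= MAX_SCORE:
        raise ValueError(f"score must be between 0 and {MAX_SCORE}, got {score}")
    if score <= 2:
        return "low"
    if score <= 5:
        return "medium"
    return "high"


if __name__ == "__main__":
    # A made-up project with no stated license and an unclear data source.
    example_project = {"no_license_stated": True, "unclear_data_source": True}
    score = risk_score(example_project)
    print(score, risk_band(score))
```

Keeping the weights in a single mapping makes it easy to swap in a real rubric once its indicators and thresholds are known, without touching the scoring logic.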
