Copyright Compliance and Licensing Patterns in African NLP Datasets: A Veridical Audit
Malwandla Ngobeni, Merhawi Hailu
Partner: ip-law
Year: 2026
Abstract:
This study examines 249 African Natural Language Processing (NLP) projects from the perspective of legal compliance and data governance. A weighted risk model, scored from 0 to 8, classifies each project as low, medium, or high risk. The PCS framework is applied to transform 38 qualitative legal variables into computable indicators, enabling evidence-based analysis of copyright and licensing practices across the African NLP landscape. The results show that most projects fall within the medium-to-high risk categories, indicating systemic gaps in legal documentation and governance. An interactive Streamlit dashboard with dynamic filtering, cluster profiling, and a real-time Risk Auditor tool is deployed to help policymakers and institutional leaders identify governance failures and make informed decisions. The study concludes with recommendations for dataset expansion, permanent hosting, and multilingual accessibility to advance equitable AI governance across Africa.
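To make the idea of a weighted 0-to-8 risk score concrete, the following is a minimal sketch of how such a model could be computed. The indicator names, their weights, and the band thresholds are all hypothetical and chosen only for illustration; the abstract does not specify the study's actual variables, weights, or cut-offs.

```python
# Illustrative sketch only: indicator names, weights, and thresholds
# are assumptions, not the values used in the study.
HYPOTHETICAL_WEIGHTS = {
    "no_license_stated": 3,
    "source_copyright_unclear": 2,
    "no_consent_documentation": 2,
    "no_governance_policy": 1,
}  # weights sum to 8, matching the 0-to-8 scale

MAX_SCORE = sum(HYPOTHETICAL_WEIGHTS.values())


def risk_score(flags: dict) -> int:
    """Sum the weights of every indicator flagged True for a project."""
    return sum(
        weight
        for name, weight in HYPOTHETICAL_WEIGHTS.items()
        if flags.get(name, False)
    )


def risk_band(score: int) -> str:
    """Map a 0-to-8 score to a risk band (thresholds are assumed)."""
    if not 0 <= score <= MAX_SCORE:
        raise ValueError(f"score must be between 0 and {MAX_SCORE}")
    if score <= 2:
        return "low"
    if score <= 5:
        return "medium"
    return "high"


# Example: a project with no stated license and unclear source copyright.
example_flags = {"no_license_stated": True, "source_copyright_unclear": True}
example_score = risk_score(example_flags)
example_band = risk_band(example_score)
```

Under these assumed weights, the example project scores 5 and falls in the medium band. A missing indicator is treated as not flagged, which is itself a design choice: an audit could instead treat undocumented items as risk.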