Corpus Based NLP Framework for Thematic Analysis of English Climate Documents and Policies in Africa

Steven Groenewald, Thulisile Mthembu

Partner: ems

Year: 2026

Abstract: Climate policy priorities are embedded in long, inconsistent governance documents, making systematic comparison across countries and institutions difficult at scale. This project develops a corpus-guided natural language processing framework for analysing English-language African climate governance documents across three governance contexts: national climate policies, UNFCCC submissions, and African Union policy frameworks. A corpus of 163 documents was converted from raw PDFs into a cleaned paragraph-level dataset of 85,057 paragraphs, then classified across 15 climate policy themes using weak labelling, manual annotation, and supervised modelling. Three modelling approaches were evaluated, with TF-IDF and Logistic Regression performing best overall, achieving a micro-F1 score of 71.2% and supporting document-level thematic analysis. The results show that mitigation is the shared anchor across African climate governance documents, but that policy emphasis shifts by governance context. UNFCCC submissions are more technical and reporting-oriented, national policies place greater emphasis on domestic implementation priorities, and African Union documents provide broader continental strategic framing. The strongest thematic relationship appears between African Union and national documents, while UNFCCC and national policy alignment is weaker.
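As a rough illustration of two steps named in the abstract, the sketch below applies keyword-based weak labelling to paragraphs and scores the result with micro-averaged F1 against hand-assigned labels. It is a minimal sketch, not the project's implementation: the theme names, keyword lists, example paragraphs, and the helper functions weak_label and micro_f1 are all hypothetical, and it assumes a paragraph may carry more than one theme.

```python
from typing import Dict, List, Set

# Hypothetical keyword rules for three themes. The project's actual
# 15 themes and labelling rules are not given in the abstract.
THEME_KEYWORDS: Dict[str, List[str]] = {
    "mitigation": ["emission", "renewable", "low-carbon"],
    "adaptation": ["resilience", "adaptation", "drought"],
    "finance": ["finance", "funding", "investment"],
}


def weak_label(paragraph: str) -> Set[str]:
    """Return every theme with at least one keyword in the paragraph."""
    text = paragraph.lower()
    return {
        theme
        for theme, words in THEME_KEYWORDS.items()
        if any(word in text for word in words)
    }


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1: pool TP, FP and FN over all paragraphs and themes."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Invented example paragraphs with invented manual annotations.
paragraphs = [
    "The country will cut emissions through renewable energy deployment.",
    "Drought resilience programmes require international funding.",
    "Reporting cycles follow the agreed timetable for climate finance.",
]
gold = [{"mitigation"}, {"adaptation", "finance"}, set()]

pred = [weak_label(p) for p in paragraphs]
score = micro_f1(gold, pred)
print(f"micro-F1 = {score:.3f}")
```

Micro-averaging pools counts across themes before computing F1, so frequent themes weigh more heavily than rare ones. In this toy data the third paragraph picks up a spurious "finance" label, which counts as one false positive against three true positives.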
