Visualizing University Research and SDG Contributions in South Africa
Phoebe Mailwane, Desire Sundire
Partner: sdg-hub
Year: 2023
Abstract:
The seventeen Sustainable Development Goals (SDGs) adopted by the United Nations General Assembly in 2015 guide global efforts to achieve a better and more sustainable future for people and the planet. The South African SDG Hub maintains a large database of SDG-related research metadata collected from South African and selected international university repositories, and classifies the research articles according to the SDGs. However, users find it difficult to make sense of the huge volume of research metadata, which limits the ability of stakeholders to promote sustainable development. This project is an initiative of the South African SDG Hub to promote sustainable development in South Africa by improving access to, and understanding of, SDG research efforts through interactive dashboards that provide various analyses of and insights into the SDG research landscape. The project follows the stages of the Data Science Life Cycle, from project scoping to deployment, with the aim of generating meaningful and reliable insights that facilitate better access to, use of, and understanding of research contributions towards sustainable development. Python libraries were used for exploring the research metadata, pre-processing the data, building machine learning models and visualising the model results. The Extreme Gradient Boosting (XGBoost) algorithm was used for analysing research trends, while the Latent Dirichlet Allocation (LDA) topic modelling algorithm was used to discover latent topics and themes in the large collection of research metadata. Both algorithms performed reasonably well, with XGBoost achieving an R² goodness-of-fit score of around 0.63 and LDA reaching a coherence score of up to 0.75. The open-source Streamlit framework was used for creating the dashboards and deploying the web application. The principles of predictability, computability, and stability (PCS) were applied throughout the project to design and evaluate the project outcomes.
The approaches used in this project were reasonably successful in meeting the project goals. However, technical constraints such as limited memory and processing power restricted the ability to fully optimise model training and performance. Adding more data sources and running the models in a high-performance computing environment could improve model training and accuracy.
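
For readers unfamiliar with the R² goodness-of-fit score quoted in the abstract, the following is a minimal sketch of how such a score is conventionally computed (one minus the ratio of the residual sum of squares to the total sum of squares). It is not the project's own code: the function name `r2_score` and the sample numbers are hypothetical and chosen only for illustration, and the project's actual features, data, and evaluation setup are not described here.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Hypothetical values (NOT project data), e.g. yearly publication counts
# and the corresponding predictions from some regression model.
observed = [10.0, 12.0, 15.0, 19.0, 24.0]
predicted = [11.0, 12.0, 14.0, 20.0, 22.0]

score = r2_score(observed, predicted)
print(f"R^2 = {score:.3f}")
```

A score of 1 means the predictions match the observations exactly, while a score of 0 means the model does no better than always predicting the mean. On that scale, a value of around 0.63 indicates that the model accounts for roughly 63% of the variance in the target.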