Abstract:
The UPSpace online repository is an open archive for collecting, preserving, and distributing digital materials created by the University of Pretoria (UP) community. Each record carries metadata about research completed by its associated faculty and department. However, understanding which research areas are covered and quantifying the completed research has been challenging and labour-intensive. This hampers the UP community's ability to assess research output comprehensively, identify trends within faculties and departments, suggest potential transdisciplinary research, and make recommendations for future research and collaborations. To address this, an end-to-end solution was developed that harvests data from the UPSpace repository, cleans it, applies BERTopic modelling (topic coherence of 95% on 100 random samples), expands the data, and visualizes it on a Power BI dashboard. All data processing steps, excluding visualization, are performed in Python. The pipeline stores the harvested, cleaned, modelled, and expanded data in a SQLite database, to which the Power BI dashboard connects. The Predictability, Computability, and Stability (PCS) framework was used to ensure the model's accuracy, clarity, efficiency, robustness, scalability, and reproducibility. The solution has proved successful and is ready for deployment to the UP community. Before deployment, however, UP needs to decide where and how often the solution will run and where the data will be stored, obtain a Power BI license for publishing the dashboard, and secure a sufficiently powerful processor so that the model executes smoothly and quickly.
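As a rough illustration of the storage layout described above, the following minimal sketch keeps one SQLite table per pipeline stage using Python's standard sqlite3 module. The table names, the two-column schema, and the helper functions (create_stage_tables, save_stage, load_stage, clean) are assumptions made for illustration only; the abstract names the stages but does not specify a schema, and the cleaning step here is a trivial placeholder rather than the cleaning actually applied to UPSpace records.

```python
import sqlite3

# Hypothetical stage tables: the four stages come from the abstract,
# but the table names and schema are assumptions for this sketch.
STAGES = ("harvested", "cleaned", "modelled", "expanded")


def create_stage_tables(conn):
    """Create one table per pipeline stage if it does not exist yet."""
    for stage in STAGES:
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {stage} ("
            "record_id TEXT PRIMARY KEY, payload TEXT NOT NULL)"
        )
    conn.commit()


def save_stage(conn, stage, rows):
    """Insert or overwrite (record_id, payload) rows for one stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    conn.executemany(
        f"INSERT OR REPLACE INTO {stage} (record_id, payload) VALUES (?, ?)",
        rows,
    )
    conn.commit()


def load_stage(conn, stage):
    """Return all rows of one stage, ordered by record_id."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    cursor = conn.execute(
        f"SELECT record_id, payload FROM {stage} ORDER BY record_id"
    )
    return cursor.fetchall()


def clean(rows):
    """Placeholder cleaning: collapse whitespace and drop empty records."""
    cleaned = []
    for record_id, text in rows:
        text = " ".join(text.split())
        if text:
            cleaned.append((record_id, text))
    return cleaned


# An in-memory database keeps the example self-contained; a file path
# would be used for a database that a dashboard reads from.
conn = sqlite3.connect(":memory:")
create_stage_tables(conn)
save_stage(conn, "harvested", [("r1", "  Soil   microbiology in maize  "), ("r2", "   ")])
save_stage(conn, "cleaned", clean(load_stage(conn, "harvested")))
```

Keeping each stage in its own table means a later stage can be rerun from the stored output of the previous one without harvesting again, and a reporting tool only needs read access to the final tables.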