KeBoNa: Knowledge-driven bootstrapping of computational language resources for Niger-Congo B languages

National Research Foundation CPRR grant (2024-2026), Grant number 23040389063

Project summary

Natural language processing is ubiquitous, especially for well-resourced languages. Yet speakers of low-resourced languages, such as those of the Niger-Congo B (NCB, ‘Bantu’) language family, also want tools such as spelling and grammar checkers, chatbots, and customised patient discharge notes in their first language. Because language data are insufficient, such technologies require a more laborious knowledge-based approach. It is therefore imperative to bootstrap a new resource in one language from an existing one in a related language, so that resources are reused efficiently. Insight into bootstrapping for NCB languages is sporadic, however, and is hindered by a lack of options for meaningful annotations across the languages. Ontology-mediated annotation systems, such as GOLD and OLiA, do not include linguistically important NCB-specific elements, nor are those resources harmonised and aligned with foundational ontology principles.
The main aim of the project is to investigate bootstrapping strategies with a novel, enhanced knowledge-mediated approach, so as to eventually be able to state, in an informed way, which task can bootstrap well from which language resources, and why. In one strand of research, we will carry out an ontological investigation and design an integrative ontology or knowledge graph module that complements data-driven strategies with meaning, incorporates specifics of NCB languages for resource annotation and comparison, and is compatible with extant ontology ecosystems. The other strand concerns devising NCB-relevant metrics to compute bootstrapping effects and similarity among languages, in order to quantify the potential for, and benefits of, bootstrapping computational resources while availing of the knowledge resources. The two strands will inform each other, and we will evaluate the theory with concrete existing and novel computational tasks for NCB languages, developing new computational resources in the process.

Outputs

Articles:
1. Sayed, I., Mahlaza, Z., Van der Leek, A., Mopp, J., Keet, C.M. On the usage of semantics, syntax, and morphology for noun classification in isiZulu. Resources and representations for under-resourced languages and domains (RESOURCEFUL-2025), co-located with NoDaLiDa/Baltic-HLT 2025. March 2, 2025, Tallinn, Estonia.
2. Mahlaza, Z., Sayed, I., Van der Leek, A., Keet, C.M. IsiZulu noun classification based on replicating the ensemble approach for Runyankore. The First Workshop on Language Models for Low-Resource Languages (LoResLM25), co-located with COLING'25. January 20, 2025, Abu Dhabi. ACL.
3. Mahlaza, Z., Magwenzi, T., Keet, C.M., Khumalo, L. Automatically Generating IsiZulu Words From Indo-Arabic Numerals. 17th International Natural Language Generation Conference (INLG'24), Tokyo, Japan, September 23-27, 2024. ACL.
Abstracts, Demo papers:
1. Keet, C.M. Preliminary steps toward an ontology for noun classes in Niger-Congo languages (abstract). CAOS: Cognition And OntologieS (CAOS'24). Enschede, The Netherlands, 15 July 2024.
2. Marquard, C. IsiXhosa.click: online, open, user-friendly, and searchable isiXhosa-English dictionary software. African Association for Lexicography - 28th International Conference 2024, 35-36, African Association for Lexicography.
3. Buthelezi, M., Marquard, C. Using computational tools and a corpus lexicography framework in developing an isiZulu LSP Dictionary. Proceedings of 28th International Conference of the African Association for Lexicography, 1-4 July 2024, Pretoria, South Africa, 20-22, African Association for Lexicography.
Honours projects:
- Imaan Sayed: Reproducing a Combined Semantic-Syntactic Method for Noun Classification in isiZulu (2024)
- Alexander van der Leek: TBA (2024)
- Jonathan Mopp: TBA (2024)

Members and collaborators

Assoc. Prof. Maria Keet, UCT; PI
Prof. Langa Khumalo, SADiLaR at NWU; research associate in linguistics
Dr. Zubeida Khan (Dawood), CSIR; research associate in computer science
Dr. Zola Mahlaza, Department of Computer Science at UCT; research associate
Dr. Wanga Gambushe, African Languages and Literatures, UCT; research associate
Mr. Mthuli Buthelezi, PhD student in linguistics, UKZN
Mr. Phuthang Makhupane, MSc student in Computer Science, UCT
BSc honours project students (since 2024): Imaan Sayed, Jonathan Mopp, Alexander van der Leek
Scientific programmers and research assistants (since 2024): Tadiwa Magwenzi, Imaan Sayed, Sanele Dlamini
