MoRe NL: foundations of a Modular Realisation Engine for Nguni Languages

NRF CPRR grant (2020-2022), Grant number 120852


Project summary

A multitude of socio-economic and political factors cause language barriers to persist in healthcare and other areas, such as weather forecasts, for the vast majority of people in South Africa. Computer applications may alleviate these issues through translation or by generating the required contextually relevant text from structured input. The latter is addressed by Natural Language Generation (NLG). NLG for the Nguni languages--one of the two main groups of languages indigenous to South Africa--is still at an exploratory stage, which has exposed a clear set of problems that need to be resolved. As templates are generally inapplicable, once-off patterns were defined, but there is no NLG pattern specification language. The algorithms for the few knowledge-to-text sentences supported are ad hoc, rather than systematic and modular for flexible reuse across application scenarios. Further, looking beyond isiZulu to related languages, there is no theory, tool, or even an approach for easily reusing and adapting--that is, bootstrapping--the resources for those other languages that are also widely spoken.
The aim of this project is to carry out the research needed to build a generic framework for an NLG realisation engine for at least the Nguni language group, inclusive of an entirely novel NLG pattern specification language with annotation model. The framework will be modular and domain-independent, so that one can 'mix and match' word fragments, clitics, and concords as needed for the task. It will be computationally tractable and usable with popular NLP tools and knowledge representation systems, such as NLTK, RDF, and OWL. This will enable designers to generate sentences in the Nguni languages and in related Bantu languages for a range of applications. Further, in aiming for generalisability of such a realisation engine, a solution will be found for devising computationally usable measures with predictive power for bootstrapping across related Bantu languages.

Outputs

  1. Dawson, W.L., Keet, C.M. Ontology Pattern Substitution: Toward their use for domain ontologies. FOIS'24 Demonstrations Track. Enschede, The Netherlands, 15-19 July 2024. CEUR-WS (in print)

  2. Mahlaza, Z., Keet, C.M. Surface realisation architecture for low-resourced African languages. ACM Transactions on Asian and Low-Resource Language Information Processing, 2023, 22(3):1-26.

  3. Keet, C.M., Khumalo, L., Mahlaza, Z. Considerations for a model for NCB noun classes in Wikidata. WikiWorkshop 2022, April 25, 2022, online. (abstract)

  4. Gillis-Webber, F., Keet, C.M. A Survey of Multilingual OWL Ontologies in BioPortal. 13th Semantic Web Applications and Tools for Healthcare and Life Sciences (SWAT4HCLS'22). Wolstencroft, K. et al. (Eds.). CEUR-WS Vol. 3127, 87-96. Leiden, the Netherlands, January 10-13 2022.

  5. Mahlaza, Z., Keet, C.M. ToCT: A task ontology to manage complex templates. FOIS'21 Ontology Showcase, 13-16 September 2021, Bolzano, Italy. Sanfilippo, E.M. et al. (Eds.). CEUR-WS vol. 2969. 9p.

  6. Keet, C.M. Natural Language Generation Requirements for Social Robots in Sub-Saharan Africa. IST-Africa 2021, 10-14 May 2021, online. Cunningham, M. and Cunningham, P. (Eds.). IST-Africa Institute and IIMC Ireland.

  7. Mahlaza, Z., Keet, C.M. Formalisation and classification of grammar and template-mediated techniques to model and ontology verbalisation. International Journal of Metadata, Semantics and Ontologies, 2020, 14(3): 249-262.

  8. Mahlaza, Z., Keet, C.M. OWLSIZ: An isiZulu CNL for structured knowledge validation. 3rd Workshop on Natural Language Generation from the Semantic Web (WebNLG'20), ACL, pp. 15-25. 18 Dec 2020, Dublin, Ireland.

  9. Keet, C.M., Khumalo, L. Parthood and Part--Whole Relations in Zulu Language and Culture. Applied Ontology, 2020, 15(3): 361-384.

  10. Theses and dissertations:
  11. Honours projects:
  12. Technical reports:
  13. Talks and tutorials:
  14. Proof-of-concept programs and related software artefacts:


Members and collaborators