MoRe NL: foundations of a Modular Realisation Engine for Nguni Languages
NRF CPRR grant (2020-2022), Grant number 120852
Project summary
A multitude of socio-economic and political factors cause language barriers to persist in healthcare and other areas, such as weather forecasts, for the vast majority of people in South Africa. Computer applications may alleviate these issues by translating text or by generating the required contextually relevant text from structured input. The latter is addressed by Natural Language Generation (NLG). NLG for the Nguni languages, one of the two main groups of languages indigenous to South Africa, is still at an exploratory stage, which has exposed a clear set of problems that need to be resolved. As templates are generally inapplicable, once-off patterns were defined, but there is no NLG pattern specification language. The algorithms for the few knowledge-to-text sentences supported are ad hoc rather than systematic and modular for flexible reuse across application scenarios. Further, looking beyond isiZulu to related languages, there is no theory, tool, or even an approach for easily reusing and adapting (that is, bootstrapping) the resources for those other languages that are also widely spoken.
The aim of this project is to carry out the research needed to build a generic framework for an NLG realisation engine for at least the Nguni language group, inclusive of an entirely novel NLG pattern specification language with annotation model, that will be modular and domain-independent so that one can 'mix and match' word fragments, clitics, and concords as needed for the task. The framework is intended to be computationally tractable and usable with popular NLP tools and knowledge representation systems, such as NLTK, RDF, and OWL. This will enable designers to generate sentences in the Nguni languages and in related Bantu languages for a range of applications. Further, in aiming for generalisability of such a realisation engine, a solution will be sought for devising computationally usable measures with predictive power for bootstrapping across related Bantu languages.
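To give a flavour of what 'mix and match' of word fragments and concords means in practice, the following is a minimal sketch in Python. It is not the project's engine or its pattern specification language: the names `SUBJECT_CONCORD`, `NOUN_CLASS`, and `realise` are hypothetical, the lexicon is a toy one, and phonological conditioning, tense, and negation are ignored. It only illustrates why fixed templates fall short: the verb's prefix depends on the noun class of the subject.

```python
# Hypothetical, simplified sketch; not the MoRe NL engine or its API.

# isiZulu subject concords for a few noun classes (assumed subset).
SUBJECT_CONCORD = {1: "u", 2: "ba", 9: "i", 10: "zi"}

# Toy lexicon mapping a noun to its noun class (illustrative only).
NOUN_CLASS = {
    "umfana": 1,    # boy
    "abafana": 2,   # boys
    "inja": 9,      # dog
    "izinja": 10,   # dogs
}


def realise(subject: str, verb_root: str, obj: str, final_vowel: str = "a") -> str:
    """Build 'subject verb object', where the verb is assembled from
    subject concord + verb root + final vowel (present tense, short form)."""
    noun_class = NOUN_CLASS[subject]
    verb = SUBJECT_CONCORD[noun_class] + verb_root + final_vowel
    return f"{subject} {verb} {obj}"


if __name__ == "__main__":
    print(realise("umfana", "dl", "isinkwa"))   # the boy eats bread
    print(realise("abafana", "dl", "isinkwa"))  # the boys eat bread
    print(realise("izinja", "dl", "inyama"))    # the dogs eat meat
```

The same verb root surfaces with a different prefix for each subject, so a plain fill-in-the-slot template with a fixed verb form cannot produce all three sentences.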
Outputs
- Dawson, W.L., Keet, C.M. Ontology Pattern Substitution: Toward their use for domain ontologies. FOIS'24 Demonstrations Track. Enschede, The Netherlands, 15-19 July 2024. CEUR-WS (in print)
- Mahlaza, Z., Keet, C.M. Surface realisation architecture for low-resourced African languages. ACM Transactions on Asian and Low-Resource Language Information Processing, 2023, 22(3):1-26.
- Keet, C.M., Khumalo, L., Mahlaza, Z. Considerations for a model for NCB noun classes in Wikidata. WikiWorkshop 2022, April 25, 2022, online. (abstract)
- Gillis-Webber, F., Keet, C.M. A Survey of Multilingual OWL Ontologies in BioPortal. 13th Semantic Web Applications and Tools for Healthcare and Life Sciences (SWAT4HCLS'22). Wolstencroft, K. et al. (Eds.). CEUR-WS Vol. 3127, 87-96. Leiden, the Netherlands, January 10-13 2022.
- Mahlaza, Z., Keet, C.M. ToCT: A task ontology to manage complex templates. FOIS'21 Ontology Showcase, 13-16 September 2021, Bolzano, Italy. Sanfilippo, E.M. et al. (Eds.). CEUR-WS vol. 2969. 9p.
- Keet, C.M. Natural Language Generation Requirements for Social Robots in Sub-Saharan Africa. IST-Africa 2021, 10-14 May 2021, online. IST-Africa Institute and IIMC Ireland. Cunningham, M. and Cunningham, P. (Eds).
- Mahlaza, Z., Keet, C.M. Formalisation and classification of grammar and template-mediated techniques to model and ontology verbalisation. International Journal of Metadata, Semantics and Ontologies, 2020, 14(3): 249-262.
- Mahlaza, Z., Keet, C.M. OWLSIZ: An isiZulu CNL for structured knowledge validation. 3rd Workshop on Natural Language Generation from the Semantic Web (WebNLG'20), ACL, pp. 15-25. 18 Dec 2020, Dublin, Ireland.
- Keet, C.M., Khumalo, L. Parthood and Part--Whole Relations in Zulu Language and Culture. Applied Ontology, 2020, 15(3): 361-384.
Theses and dissertations:
- Foundations for reusable and maintainable surface realisers for isiXhosa and isiZulu by Dr. Zola Mahlaza, graduated in 2022.
Honours projects:
- Digital Assistant for Financial Transactions by Junior Moraba and Amy Solomons, in 2021.
- Generating natural language text in isiZulu from mathematical expressions by Shan Smith (main supervisor: Zola Mahlaza), in 2020.
Technical reports:
- Keet, C.M., Khumalo, L. Contextualising Levels of Language Resourcedness affecting Digital Processing of Text. Technical Report, Arxiv.org, number 2309.17035. 18p. 29 September 2023.
- Arrieta, K., Fillottrani, P.R., Keet, C.M. CoSMo: A constructor specification language for Abstract Wikipedia's content selection process. Technical Report, Arxiv.org, number 2308.02539. 32p. 1 August 2023.
- Keet, C.M. Bootstrapping NLP tools across low-resourced African languages: an overview and prospects. Technical report (arxiv). October 2022.
- Gillis-Webber, F., Keet, C.M. A Review of Multilingualism in and for Ontologies. Technical report (arxiv). October 2022.
- Gutman, A., Keet, C.M. Abstract Wikipedia/Template Language for Wikifunctions. Proposal. 27 July 2022.
Talks and tutorials:
- Knowledge-to-text Natural Language Generation for Agglutinating African Languages. TechTalk at the Wikimedia Foundation google.org fellows offsite workshop, Google Zurich, Switzerland, 23-26 August 2022. video on Wikimedia
- JOWO 2022 tutorial: Generating text from ontologies in multiple languages. Jönköping, Sweden, 15-19 August.
- Encoding Biases' Influences on Development and Use of Ontologies in the Life Sciences. Keynote at Bio-Ontologies, part of Intelligent Systems for Molecular Biology 2022 (ISMB'22), 10-14 July 2022, Madison, USA.
- Natural Language Generation for Agglutinating African Languages -- A brief overview. Digital Humanities Colloquium, at SADiLaR, 18 May 2022 (online). screen recording on YouTube - slides
- Natural Language Generation Requirements for Social Robots in Sub-Saharan Africa. Conference presentation of the paper at the IST-Africa'21 conference. screen recording
Proof-of-concept programs and related software artefacts:
- The project on Github
- Modular realisation engine for isiZulu and isiXhosa
- ToCT: Task ontology for CNL-based Templates
- TEdi: Tool for creating templates
Members and collaborators
- Assoc. Prof. Maria Keet, UCT; PI
- Prof. Langa Khumalo, SADiLaR; research associate
- Dr. Zubeida Khan, CSIR; research associate
- Mr. Zola Mahlaza, PhD student, UCT; research associate
- Ms. Frances Gillis-Webber, PhD student, UCT
- Mr. Leighton Dawson, MSc student, UCT
- Scientific programmers and research assistants (since 2020): Blessed Chitamba, Kouthar Dollie, Sindiso Mkhatshwa, Junior Moraba, Gerald Ngumbulu, Toky Raboanary