A Grammar engine for Nguni natural language interfaces (GeNi)
Project funded by the National Research Foundation of South Africa under the Competitive Programme for Rated Researchers (CPRR) -- Y-rated development grant, 2015-2017 (3 years)
Overview
Introduction and background
The use of natural languages in applications is ubiquitous. Canned, unchangeable, text can be used for some scenarios, but not when the information to be communicated depends on the context and large amounts of text. This is addressed by controlled natural languages and natural language generation (NLG) systems, which take structured data or knowledge as domain input, and are matched at runtime with templates or a grammar engine to generate the text. NLG systems mainly focus on generating English, however, and neither an NLG system nor sufficient theoretical foundations exist for the indigenous South African languages, despite the requirements for it. Preliminary results in isiZulu NLG have shown that a template-based approach is unfeasible for Bantu languages, due to, mainly, their complex grammar rules, noun class system, and agglutination. Thus, extant NLG systems cannot be adopted for Bantu languages, and a grammar engine is required to obtain automatically generated understandable text.
The aims of this project are to define the formal and algorithmic foundations for an isiZulu/isiXhosa grammar engine and to implement it to realize a (controlled) NLG system. The project will uncover sentence and linguistic realization patterns, postulated to be very similar for isiZulu and isiXhosa, and it will ensure incorporation of multilingualism. The rules and modular, efficient, algorithms will make the grammar usable for computation. This will be optimized on linguistic annotations of the input and text generation at runtime. A proof-of-concept grammar engine for isiZulu/isiXhosa will be developed to validate the theory. To ensure broad usability and interoperability with related theoretical and technological advances, such as linguistic linked data and ontology-driven information systems, it will use as input files domain knowledge that is represented in ontologies serialized in the Semantic Web language OWL, which also facilitates incremental system development.
Participants and collaborators
- Maria Keet (PI), Department of Computer Science, University of Cape Town (UCT)
- Joan Byamugisha (PhD student), Department of Computer Science, UCT
- Zola Mahlaza (MSc student), Department of Computer Science, UCT
- Jarvis Mutakha (MSc student), Department of Computer Science, UCT
- Programmers: Takunda Chirema and Musa Xakaza, Department of Computer Science, UCT
- Former students affiliated with the project at CS@UCT: Catherine Chavula, Victor Kabine, Lyneve Laing, Balone Ndaba
- Langa Khumalo, Linguistics Program, School of Arts, University of KwaZulu-Natal
- Mantoa Smouse, African Languages and Literatures Section, UCT
- Zukile Jama, African Languages and Literatures Section, UCT
Outputs
A simplified view is as follows. One has the data, information, or knowledge represented in a structured way, e.g., in a Description Logic (DL; right-hand side of the figure below). These serve as input to certain algorithms (arrows pointing to their respective names). Each algorithm determines how its input is verbalised (implemented as a set of functions written in Python in this case). Their respective automatically generated outputs, which are sentences in isiZulu, are shown in the line below them. This involves a set of core functions for the axioms [RuleML14, CNL14, LRE16], how to pluralise isiZulu nouns [CICLing16], and how to handle part-whole relations [INLG16].
The above figure shows the various components being linked up 'conceptually', i.e., which axiom types are linked to which functions in Python (well, a subset of what is supported). This has been implemented in the meantime. That is: the "DL axiom" on the right-hand side of the figure is serialised in an OWL file so that a computer can process it, which is then linked to the implemented verbalisation algorithms using Owlready to process that OWL file. A graphical user interface is wrapped around it. This GUI is shown in the following screenshot, which also has some annotations added to it afterward so as to provide some explanation about what's going on.
There is a bit of a disconnect between how the relations (verbs, object properties) are represented in that structured knowledge representation and what we need for isiZulu. For instance, in the figure above, on the right-hand side, it says "dla" (eat), but the output on the left-hand side shows it as, e.g., "zidla" and "azidli". The algorithm takes care of that by using the noun class of the noun and whether the verb is negated or not. There are more such issues, notably with prepositions, such as in 'part of' and 'contained in' [INLG16]. This is now dealt with using a new model for annotations and a separate data structure [EKAW16 and examples].
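The verb handling described above, where a root such as "dla" surfaces as "zidla" or "azidli", can be illustrated with a minimal Python sketch. It is not the project's implementation: the function name `conjugate` and the two lookup tables are hypothetical, the tables cover only a handful of noun classes, and the sketch assumes present tense, a consonant-initial verb root, and no object concord or other phonological conditioning.

```python
# Illustrative subset only (an assumption for this sketch): subject concords
# for a few isiZulu noun classes, in positive and negative present tense.
SUBJECT_CONCORD = {1: "u", 2: "ba", 7: "si", 8: "zi", 9: "i", 10: "zi"}
NEG_SUBJECT_CONCORD = {1: "aka", 2: "aba", 7: "asi", 8: "azi", 9: "ayi", 10: "azi"}


def conjugate(verb_root: str, noun_class: int, negated: bool = False) -> str:
    """Attach the subject concord for `noun_class` to `verb_root`.

    For negation, the negative subject concord is used and a final
    vowel -a is replaced by -i. Vowel-initial roots and other
    phonological conditioning are deliberately not handled.
    """
    if noun_class not in SUBJECT_CONCORD:
        raise KeyError(f"noun class {noun_class} is not in this sketch's table")
    if negated:
        # Swap the final vowel -a for -i, if present.
        stem = verb_root[:-1] + "i" if verb_root.endswith("a") else verb_root
        return NEG_SUBJECT_CONCORD[noun_class] + stem
    return SUBJECT_CONCORD[noun_class] + verb_root


if __name__ == "__main__":
    # The noun governing the verb is in noun class 10 in this example call.
    print(conjugate("dla", 10))                # zidla
    print(conjugate("dla", 10, negated=True))  # azidli
```

The point of the sketch is that the surface form of the verb cannot be stored in a fixed template: it has to be computed from the noun class of the subject and the polarity of the statement, which is why the relation name in the ontology ("dla") differs from what appears in the generated sentence.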
There are some indications as to how well these fundamentals will, or will not, work with languages related to isiZulu. Runyankore, a language spoken several thousand km to the north in Uganda, was experimented with, and the bootstrapping approach from isiZulu was promising [CNL16]. While this was initially surprising, an orthographic analysis showed it to be fairly similar to isiZulu regarding agglutination, as are several other languages outside the Nguni language cluster, such as chiShona, but not Kiswahili [arxiv16].
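One rough way to picture an orthographic comparison related to agglutination is to compare word-length distributions of two texts, since heavily agglutinating orthographies tend to produce longer written words. The following Python sketch is only an assumption about what such a comparison could look like; the function names are hypothetical and the measures are not claimed to be those used in [arxiv16].

```python
from collections import Counter
from typing import Dict


def mean_word_length(text: str) -> float:
    """Average number of characters per whitespace-separated token."""
    words = text.split()
    return sum(len(w) for w in words) / len(words)


def length_distribution(text: str) -> Dict[int, float]:
    """Relative frequency of each word length in the text."""
    words = text.split()
    counts = Counter(len(w) for w in words)
    return {length: n / len(words) for length, n in counts.items()}


def distribution_distance(text_a: str, text_b: str) -> float:
    """Total variation distance between two word-length distributions.

    0.0 means identical distributions; 1.0 means no word length in common.
    """
    dist_a = length_distribution(text_a)
    dist_b = length_distribution(text_b)
    lengths = set(dist_a) | set(dist_b)
    return 0.5 * sum(abs(dist_a.get(k, 0.0) - dist_b.get(k, 0.0)) for k in lengths)
```

In practice one would run this over sizeable corpora per language rather than single sentences, and a small distance would only suggest, not establish, that rules bootstrapped from one language might transfer to the other.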
- Keet, C.M., Chirema, T. A model for verbalising relations with roles in multiple languages. 20th International Conference on Knowledge Engineering and Knowledge Management (EKAW'16). Blomqvist, E., Ciancarini, P., Poggi, F., Vitali, F. (Eds.). Springer LNAI vol. 10024, 384-399. 19-23 November 2016, Bologna, Italy. tool and examples
- Keet, C.M. An assessment of orthographic similarity measures for several African languages. Technical Report, Arxiv.org, http://arxiv.org/abs/1608.03065. Aug 10, 2016. 9p.
- Keet, C.M., Khumalo, L. On the verbalization patterns of part-whole relations in isiZulu. 9th International Natural Language Generation conference (INLG'16), 5-8 September, 2016, Edinburgh, UK. Association for Computational Linguistics, 174-183.
- Byamugisha, J., Keet, C.M., DeRenzi, B. Tense and Aspect in Runyankore using a Context-Free Grammar. 9th International Natural Language Generation conference (INLG'16), 5-8 September, 2016, Edinburgh, UK. Association for Computational Linguistics, 84-88.
- Byamugisha, J., Keet, C.M., DeRenzi, B. Bootstrapping a Runyankore CNL from an isiZulu CNL. 5th Workshop on Controlled Natural Language (CNL'16), Springer LNAI vol 9767, 25-36. 25-27 July 2016, Aberdeen, UK. BEST STUDENT PAPER AWARD
- Byamugisha, J., Keet, C.M., Khumalo, L. Pluralising Nouns in isiZulu and Related Languages. 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing'16), Springer LNCS. April 3-9, 2016, Konya, Turkey. (in print) FIRST PRIZE "Verifiability, Reproducibility and Working Description Award"
- Ndaba, B., Suleman, H., Keet, C.M., Khumalo, L. The Effects of a Corpus on isiZulu Spellcheckers based on N-grams. IST-Africa 2016. Paul Cunningham and Miriam Cunningham (Eds). IIMC International Information Management Corporation. May 11-13, 2016, Durban, South Africa.
- Keet, C.M., Khumalo, L. Toward a knowledge-to-text controlled natural language of isiZulu. Language Resources and Evaluation, 2016, in print.
- Chavula, C., Keet, C.M. An Orchestration Framework for Linguistic Task Ontologies. 9th Metadata and Semantics Research Conference (MTSR'15), Garoufallou, E. et al. (Eds.). Springer CCIS vol. 544, 3-14. 9-11 September, 2015, Manchester, UK.
- Screencast of the verbaliser that verbalises an OWL ontology (70MB .mov file)
- Toward isiZulu Natural Language Generation. Computer Science Department, University of Cape Town, South Africa, June 12, 2014.
- Protege plugin impala.jar for the positionalist ontology view and language annotation (iMPALA could be an abbreviation for Model for Positionalism And Language Annotation) and examples of annotated ontologies and screenshots thereof.
- A partial grammar of the isiZulu verb, in JFlap format and several screenshots of the evaluation.
- An implementation of the isiZulu verbalisation up to the LRE paper + pluraliser, which was used to create the examples in the first figure above, and extended with verbalising part-whole relations described in the INLG16 paper. The latter algorithms are packaged together as the OWL verbaliser for isiZulu, which was used to take the screenshot above (see 'presentations', above, for a screencast of the tool).
- Supplementary material of the CICLing'16 paper: the isiZulu and Runyankore pluralisers, test data, and data analysis.
- isiZulu news articles Aug-Sept 2015 mini corpus.
- Latest version of the NCS ontologies (in NCSxx.zip)
- Blog posts:
- Surprising similarities and differences in orthography across several African languages, October 18, 2016
- Brief report on the INLG16 conference, September 12, 2016
- On generating isiZulu sentences with part-whole relations, August 20, 2016
- Bootstrapping a Runyankore CNL from an isiZulu one mostly works well, July 31, 2016
- Preliminary promising results on a data-driven spellchecker for isiZulu, May 12, 2016
- Pluralising isiZulu nouns, automatically, March 30, 2016
- More results on a CNL for isiZulu, Feb 14, 2016
- Quasi wordles of isiZulu online newspaper articles from this weekend, Aug 10, 2015
- An orchestration of ontologies for linguistic knowledge, Aug 5, 2015