Spellcheckers for isiZulu and isiXhosa


Overview - outputs and download - participants

Overview

Spellcheckers for the world's most widely spoken languages are ubiquitous: they are integrated into text processing software, browsers, email software, and smartphones, among others. This is much less the case for under-resourced languages, even though speakers of those languages would want one too. What got our efforts to develop spellcheckers for isiZulu off the ground was an explicit user request (from UKZN's ULPDO).

There are several ways to develop a spellchecker for a language: encode the rules of its writing (orthography), use a dictionary, or learn a language model from lots of text (a corpus). Using a dictionary is a non-starter for an agglutinating language, since the number of possible word forms is far too large to list, but we tried the other two approaches. For the types of words (POS categories) that the rules cover, the rule-based approach works better than the statistical one, but we have only a subset of those rules. Currently, the two released spellcheckers both use the data-driven approach.

The data-driven approach is illustrated in the following figure. First, there are the words in the corpus (just 4 in the example, but in practice one needs at least some 20 000 words of modern-day texts and preferably some 300 000). Second, the words are split up into trigrams. Third, a language model is computed from the trigrams: how often each trigram appears in the corpus compared to the other trigrams, where the uncommon ones are discarded as noise. Fourth, when a word is fed into the error detection algorithm, it splits up that word into trigrams and checks whether each one is sufficiently common; if so, then the word is probably spelled correctly (e.g., ngihamba and ngivela in the figure); if not, then it is probably spelled incorrectly and flagged as such (gnihamba and ngivea).
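The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code of the released tool: the four-word toy corpus, the function names, and the `min_freq` threshold value are all assumptions made for the example.

```python
from collections import Counter


def trigrams(word):
    """Return the overlapping character trigrams of a word."""
    return [word[i:i + 3] for i in range(len(word) - 2)]


def build_model(corpus_words, min_freq):
    """Keep the trigrams whose relative frequency in the corpus is at least
    min_freq; rarer trigrams are discarded as noise."""
    counts = Counter(t for w in corpus_words for t in trigrams(w.lower()))
    total = sum(counts.values())
    return {t for t, c in counts.items() if c / total >= min_freq}


def is_probably_correct(word, model):
    """A word passes if every one of its trigrams is in the model.
    Words shorter than three characters have no trigrams and always pass."""
    return all(t in model for t in trigrams(word.lower()))


# Hypothetical toy corpus; a real model needs tens of thousands of words.
toy_corpus = ["ngihamba", "ngivela", "uhamba", "uvela"]
# Assumed threshold, set low enough that nothing in this tiny corpus is dropped.
model = build_model(toy_corpus, min_freq=0.05)

for w in ["ngihamba", "ngivela", "gnihamba", "ngivea"]:
    print(w, "ok" if is_probably_correct(w, model) else "flagged")
```

With this toy corpus, gnihamba is flagged because the trigram gni never occurs, and ngivea because vea never occurs. On a realistic corpus the threshold would have to be tuned so that trigrams from rare typos in the corpus itself are filtered out.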
For error correction, we use the same language model and some other statistics, like how probable it is that one trigram follows another, the Levenshtein distance, and a list of words that were common in the corpus.
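To give an idea of how such a corrector can rank suggestions, here is a much simplified sketch in Python. It uses only two of the ingredients mentioned above, the Levenshtein distance and a list of common words with their counts, and leaves out the trigram statistics entirely. The function names, the word counts, and the `max_distance` and `limit` parameters are assumptions for illustration, not the tool's actual implementation.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def suggest(word, word_counts, max_distance=2, limit=3):
    """Return up to `limit` known words within max_distance edits of `word`,
    closest first, with ties broken by higher corpus count."""
    scored = [(levenshtein(word, w), -count, w)
              for w, count in word_counts.items()]
    scored = [s for s in scored if s[0] <= max_distance]
    return [w for _, _, w in sorted(scored)[:limit]]


# Hypothetical counts of common corpus words.
common_words = {"ngihamba": 5, "ngivela": 3, "uhamba": 2}

print(suggest("gnihamba", common_words))
print(suggest("ngivea", common_words))
```

Note that plain Levenshtein distance counts a transposition of two adjacent letters as two edits (gnihamba to ngihamba has distance 2), which is why the sketch allows a `max_distance` of 2.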



Here's a screenshot with some actual text and the performance on it. Overall, testing showed that accuracy and recall are around 90%. The tool also suggests several corrections, which works especially well for transpositions (two adjacent letters swapped). Note that both the detector and the corrector only consider so-called non-word errors where a single character is misspelt.



Other features of the tool include copy-paste of text, opening and saving files (txt or MS Word), and adding a word to your spellchecker installation's dictionary. You can switch between isiZulu and isiXhosa with the drop-down menu.

We know that a standalone jar file (which requires the JDK) is not optimal, but integrating it with existing tools (browsers, text processing software) proved harder than it sounds for various reasons. So, for the time being, it remains a standalone jar file.

For what it is worth: we did evaluate v1 of the spellchecker, and users apparently liked it well enough; it also turned out to contribute to the intellectualisation of isiZulu.

There will probably be a v3 in the future, as there are several interesting 'loose ends' that we know of already. If you have any comments or suggestions, please contact us.

Outputs and download

Participants, collaborators, contributors