Journal article
RAIL, 2020
APA
Malema, G., Okgetheng, B., Tebalo, B., Motlhanka, M., & Rammidi, G. (2020). Complex Setswana Parts of Speech Tagging. RAIL.
Chicago/Turabian
Malema, G., Boago Okgetheng, Bopaki Tebalo, Moffat Motlhanka, and Goaletsa Rammidi. “Complex Setswana Parts of Speech Tagging.” RAIL (2020).
MLA
Malema, G., et al. “Complex Setswana Parts of Speech Tagging.” RAIL, 2020.
BibTeX
@article{g2020a,
title = {Complex Setswana Parts of Speech Tagging},
year = {2020},
journal = {RAIL},
author = {Malema, G. and Okgetheng, Boago and Tebalo, Bopaki and Motlhanka, Moffat and Rammidi, Goaletsa}
}
Setswana is one of the Bantu languages written disjunctively. Some of its parts of speech, such as qualificatives and some adverbs, consist of a group of words rather than a single word. This disjunctive writing style poses a challenge for both tokenization and tagging. A few studies have addressed the identification of multi-word parts of speech. In this study we go further and tokenize complex parts of speech, which are formed by extending the basic forms of multi-word parts of speech: further parts of speech are recursively concatenated to a basic form. We developed rules for building complex relative parts of speech. A morphological analyzer and Python NLTK are used to tag individual words and basic forms of multi-word parts of speech. The developed rules are then used to identify complex parts of speech. Results on 300 sentences of text give a performance of 74%. The tagger fails when it encounters expansion rules that are not implemented and when the tagging by the morphological analyzer is incorrect.
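The two-stage idea described above (first merge tagged words into basic multi-word units, then keep attaching further parts of speech to those units) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The tag names (`CONC`, `VERB`, `NOUN`, `ADV`, `REL`), the two rule tables, and the placeholder tokens `w1` to `w6` are all hypothetical, chosen only to show the mechanism, and the repeated extension is written as a single left-to-right pass rather than as explicit recursion.

```python
# Hypothetical rules for illustration only; they are NOT the paper's rules.
# A basic multi-word unit is formed from a pair of adjacent tags.
BASIC_RULES = {("CONC", "VERB"): "REL"}
# A unit with the key tag may absorb a following unit with any listed tag.
EXTENSIONS = {"REL": {"NOUN", "ADV", "REL"}}


def merge_basic(tagged):
    """Merge adjacent (word, tag) pairs that match a basic rule."""
    out = []
    i = 0
    while i < len(tagged):
        if i + 1 < len(tagged):
            key = (tagged[i][1], tagged[i + 1][1])
            if key in BASIC_RULES:
                text = tagged[i][0] + " " + tagged[i + 1][0]
                out.append((text, BASIC_RULES[key]))
                i += 2
                continue
        out.append(tagged[i])
        i += 1
    return out


def extend(units):
    """Repeatedly attach following units to an extendable unit."""
    out = []
    for text, tag in units:
        if out and tag in EXTENSIONS.get(out[-1][1], ()):
            prev_text, prev_tag = out.pop()
            # The grown unit keeps its tag, so it can absorb more units.
            out.append((prev_text + " " + text, prev_tag))
        else:
            out.append((text, tag))
    return out


def tag_complex(tagged):
    """Run both stages on a list of (word, tag) pairs."""
    return extend(merge_basic(tagged))


# Placeholder tokens stand in for words already tagged by an analyzer.
sample = [("w1", "NOUN"), ("w2", "CONC"), ("w3", "VERB"),
          ("w4", "NOUN"), ("w5", "ADV"), ("w6", "VERB")]
print(tag_complex(sample))
```

In this sketch, a sequence with no matching extension entry is simply left as separate units, which mirrors the failure mode noted above for expansion rules that are not implemented.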