Gideon Kotzé - Publications

The importance of documenting place names has been well established. Conversely, place names in sign languages, or signed place names, have been largely ignored. Just as with written place names, signed place names constitute cultural heritage and function as socio-political markers. Signed place names are usually not standardised. Instead, they are assigned through a process of conventionalisation by the Deaf community. As such, they form a unique parallel toponymy, in that it is unwritten as well as dynamic. For these conventionalised names to be recognised, whether officially or informally, they must first be documented. However, South African Sign Language (SASL) place names are not systematically recorded in any SASL dictionaries, word lists or other databases. This paper reports on the development of a mobile application of South African place names that incorporates a selection of SASL place names. A mobile app is an accessible platform that can create awareness of this parallel toponymy. Using languages on public platforms like these also legitimises them, a key factor considering the imminent officialization of SASL. We describe the project and the features of the app in more detail and elaborate on the impact we anticipate it could have on the promotion of SASL. Since the data selection is limited, the app and its supporting database will serve as an initiatory repository for SASL place names that will hopefully stimulate further research.

@InProceedings{LothEA:2022,

author = {Loth, Chrismi-Rinda and Kotz\'{e}, Gideon and De Lange, Jani},

title = {Finding place names: Improving the digital documentation and accessibility of SASL place names},

booktitle = {Conference proceedings of the 6th International Symposium on Place Names},

editor = {Chrismi-Rinda Loth},

year = {2022},

pages = {},

volume = {},

issue = {},

publisher = {Sun Bonani},

address = {Virtual/Bloemfontein, South Africa},

doi = {10.18820/9781928424970}

}

We describe past and present work surrounding the development of treebank related NLP resources for Georgian. In particular, we provide an overview of efforts made in the development of a morphologically and syntactically annotated treebank for this non-configurational language, as well as its application in the development of a syntactic parser. Building on this, we also report ongoing work in utilizing manual and automatic alignment solutions for the creation of a Georgian/German parallel treebank. The end goal is the development of resources and tools for improved computational processing and linguistic analysis of the Georgian language.

@Article{KapanadzeEA:2022,

author = {Kapanadze, Oleg and Kotz\'{e}, Gideon and Hanneforth, Thomas},

title = {Building Resources for Georgian Treebanking-based NLP},

journal = {Lecture Notes in Computer Science},

year = {2022},

pages = {60--78},

volume = {13206},

issue = {},

publisher = {Springer},

doi = {10.1007/978-3-030-98479-3_4}

}

As more natural language processing (NLP) applications benefit from neural network based approaches, it makes sense to re-evaluate existing work in NLP. A complete pipeline for digitisation includes several components handling the material in sequence. Image processing after scanning the document has been shown to be an important factor in final quality. Here we compare two different approaches for visually enhancing documents before Optical Character Recognition (OCR): (1) a combination of ImageMagick and Unpaper and (2) OCRopus. We also compare Calamari, a new line-based OCR package using neural networks, with the well-known Tesseract 3 as the OCR component. Our evaluation on a set of Setswana documents reveals that the combination of ImageMagick/Unpaper and Calamari improves on a current baseline based on Tesseract 3 and ImageMagick/Unpaper by over 30%, achieving a mean character error rate of 1.69 across all combined test data.

@Article{KotzeWolff:2020,

author = {Kotz\'{e}, Gideon and Wolff, Friedel},

title = {Exchanging image processing and OCR components in a Setswana digitisation pipeline},

journal = {South African Computer Journal},

year = {2020},

pages = {281--231},

volume = {32},

number = {2},

publisher = {South African Institute of Computer Scientists and Information Technologists (SAICSIT)},

doi = {10.1007/s10579-016-9369-0}

}

@InProceedings{HanneforthEtAl:2019,

author = {Hanneforth, Thomas and Kapanadze, Oleg and Kotz\'{e}, Gideon},

title = {Applying Computer Technologies to the Georgian Language: From a Treebank to a Syntactic Parser},

booktitle = {Thirteenth International Tbilisi Symposium on Language, Logic and Computation},

year = {2019},

month = {September},

address = {Batumi, Georgia},

url = {http://events.illc.uva.nl/Tbilisi/Tbilisi2019/uploaded_files/inlineitem/Kapanadze.pdf}

}

We present a collection of parallel treebanks that have been automatically aligned on both the terminal and the non-terminal constituent level for use in syntax-based machine translation. We describe how they were constructed and applied to a syntax- and example-based machine translation system called Parse and Corpus-Based Machine Translation (PaCo-MT). For the language pair Dutch to English, we present non-terminal alignment evaluation scores for a variety of tree alignment approaches. Finally, based on the parallel treebanks created by these approaches, we evaluate the MT system itself and compare the scores with those of Moses, a current state-of-the-art statistical MT system, when trained on the same data.

@Article{KotzeEA:2017,

author = {Kotz\'{e}, Gideon and Vandeghinste, Vincent and Martens, Scott and Tiedemann, J\"{o}rg},

title = {Large aligned treebanks for syntax-based machine translation},

journal = {Language Resources and Evaluation},

year = {2017},

pages = {1--34},

volume = {51},

number = {2},

publisher = {Springer},

doi = {10.1007/s10579-016-9369-0}

}

Optical Character Recognition (OCR) plays an important role in the creation of digital language resources. As OCR solutions are often language specific, the availability of models for South African languages also contributes to alleviating the language data scarcity problem. We describe the development of a digitisation pipeline in the context of a multilingual corpus project. We test a recently developed OCR model for the Setswana language against a selection of quality assured texts, while improving our output using image processing software and a newly developed tool, Ontrafel, for post-processing OCR output in PDF files. Each step in the pipeline is shown to improve the output quality when measured against the Character Error Rate metric. Finally, a qualitative analysis provides some insights that may contribute to refining steps or improving the existing OCR model. Apart from the creation of new digital language data for Setswana, we hope that our work stimulates and contributes to further research into high-quality digitisation of South African language resources.

@InProceedings{KotzeWolff:2017,

author = {Kotz\'{e}, Gideon and Wolff, Friedel},

title = {{Developing and evaluating a pipeline for Setswana OCR}},

booktitle = {2017 Pattern Recognition Association of South Africa and Robotics and Mechatronics (PRASA-RobMech)},

year = {2017},

pages = {236--241},

publisher = {IEEE},

volume = {},

number = {},

doi = {10.1109/RoboMech.2017.8261154},

month = {Nov}

}

Although their use in training quality machine translation systems has been proven, parallel corpora—large collections of translated texts—are generally hard to come by for the majority of languages. To counteract this fact, a relatively small collection may be processed in more depth by further cleaning and more accurately splitting and aligning the texts. We apply this to an existing English/Zulu parallel corpus that has been used for statistical machine translation experiments. After these preprocessing steps, we run the same experiments for comparative purposes. Our results suggest that compatibility of bitexts, the choice of sentence splitters used on different parts of the text, as well as manual work, may have a notable effect on both the corpus size and on automatic translation quality.

@InProceedings{Kotze:2016,

author = {Kotz\'{e}, Gideon},

title = {Refining semi-automatic parallel corpus creation for Zulu to English statistical machine translation},

booktitle = {Proceedings of the 2016 Pattern Recognition Association of South Africa and Robotics and Mechatronics International Conference (PRASA-RobMech)},

year = {2016},

pages = {48--53},

publisher = {IEEE},

isbn = {978-1-5090-3334-8},

doi = {10.1109/RoboMech.2016.7813168}

}

We present a series of experiments involving the machine translation of Zulu to English using a well-known statistical software system. Due to morphological complexity and relative scarcity of resources, the case of Zulu is challenging. Against a selection of baseline models, we show that a relatively naive approach of dividing Zulu words into syllables leads to a surprising improvement. We further improve on this model through manual configuration changes. Our best model significantly outperforms the baseline models (BLEU measure, at p < 0.001) even when they are optimised to a similar degree, only falling short of the well-known Morfessor morphological analyser that makes use of relatively sophisticated algorithms. These experiments suggest that even a simple optimisation procedure can improve the quality of this approach to a significant degree. This is promising particularly because it improves on a mostly language independent approach—at least within the same language family. Our work also drives the point home that sub-lexical alignment for Zulu is crucial for improved translation quality.

@Article{KotzeWolff:2015,

author = {Kotz\'{e}, Gideon and Wolff, Friedel},

title = {Syllabification and parameter optimisation in Zulu to English machine translation},

journal = {South African Computer Journal},

year = {2015},

number = {57},

pages = {1--23},

publisher = {South African Institute of Computer Scientists and Information Technologists (SAICSIT)},

doi = {10.18489/sacj.v0i57.323}

}

Due to morphological complexity and scarce resources, machine translation from Zulu to English is challenging. We investigate the possibility of phrase-based statistical machine translation from Zulu to English using syllables as the tokens in the Zulu source text. Initial experiments on a relatively small but multi-domain data set suggest merit in our approach, with our best syllable-based model outperforming the best word-based model by 12.90% using the BLEU evaluation measure. Our syllabification approach is largely language independent, at least within the Bantu language family, and holds promise for similar efforts in related languages.

@InProceedings{WolffKotze:2014,

author = {Wolff, Friedel and Kotz\'{e}, Gideon},

title = {Experiments with syllable-based {Zulu-English} machine translation},

booktitle = {Proceedings of the 2014 PRASA, RobMech and AfLaT International Joint Symposium},

year = {2014},

editor = {Puttkammer, M. and Eiselen, R.},

publisher = {PRASA},

pages = {217--222},

address = {Cape Town, South Africa},

isbn = {978-0-620-62617-0}

}

As a morphologically complex language, Zulu has notable challenges aligning with English. One of the biggest concerns for statistical machine translation is the fact that the morphological complexity leads to a large number of words for which there exist very few examples in a corpus. To address the problem, we set about establishing an experimental baseline for lexical alignment by naively dividing the Zulu text into syllables, resembling its morphemes. A small quantitative as well as a more thorough qualitative evaluation suggests that our approach has merit, although certain issues remain. Although we have not yet determined the effect of this approach on machine translation, our first experiments suggest that an aligned parallel corpus with reasonable alignment accuracy can be created for a language pair, one of which is under-resourced, in as little as a few days. Furthermore, since very little language-specific knowledge was required for this task, our approach can almost certainly be applied to other language pairs and perhaps for other tasks as well.

@InProceedings{KotzeWolff:2014,

author = {Kotz\'{e}, Gideon and Wolff, Friedel},

title = {Experiments with syllable-based {English-Zulu} alignment},

booktitle = {Proceedings of the SaLTMiL Workshop on free/open-source language resources for the machine translation of less-resourced languages (at LREC 2014)},

year = {2014},

pages = {7--11},

address = {Reykjav\'{i}k, Iceland},

isbn = {978-2-9517408-8-4},

url = {http://www.lrec-conf.org/proceedings/lrec2014/workshops/LREC2014Workshop-SALTMIL%20Proceedings.pdf}

}

Large collections of translated texts—called parallel corpora—are often automatically aligned on word and sentence level to be used as training data for machine translation systems. We may also choose to syntactically analyze the sentences to produce syntax trees. If we do this on both sides and the nodes of the trees are also aligned, the end result is called a parallel treebank. The best translation systems are statistically based, but in recent years there has been a shift to the incorporation of more linguistically motivated data, which includes the use of parallel treebanks. These are only useful on a very large scale because of the amount of information a system needs about how one language is to be translated into another in order to be effective. Because of this, we investigate techniques for the automatic and accurate alignment of these nodes. Another motive for our research is the fact that parallel treebanks are also useful for other techniques and that as a linguistic resource, remain scientifically interesting. This process is called tree alignment. We find that a combination of statistical and rule-based techniques, using relatively small sets of training data and few features, is sufficient to produce very accurate alignments. Finally, we also find that when we apply alignments covering a relatively large set of nodes—even though some of them are wrong—on a syntax-based machine translation system, this leads to better translation results than applying alignments that are more accurate but fewer in number.

@PhDThesis{Kotze:2013,

title = {Complementary Approaches to Tree Alignment: Combining Statistical and Rule-Based Methods},

author = {Kotz\'{e}, Gideon},

school = {University of Groningen},

month = {June},

year = {2013},

isbn = {978-90-367-6177-2},

url = {http://hdl.handle.net/11370/316d1696-276f-456d-b1cd-bc81fd4262ab},

note = {\url{http://hdl.handle.net/11370/316d1696-276f-456d-b1cd-bc81fd4262ab}},

}

In this paper the PaCo-MT project is described, in which Parse and Corpus-based Machine Translation has been investigated: a data-driven approach to stochastic syntactic rule-based machine translation. In contrast to the phrase-based statistical machine translation systems (PB-SMT) which are string-based and do not use any linguistic knowledge, an MT engine in a different paradigm was built: a tree-based data-driven system that automatically induces translation rules from a large syntactically analysed parallel corpus. The architecture is presented in detail as well as an evaluation in comparison with our previous work and with the current state-of-the-art PB-SMT system Moses.

@InBook{VandeghinsteEA:2013,

author = {Vandeghinste, Vincent and Martens, Scott and Kotz\'{e}, Gideon and Tiedemann, J\"{o}rg and Van den Bogaert, Joachim and De Smet, Koen and Van Eynde, Frank and Van Noord, Gertjan},

chapter = {Parse and Corpus-based Machine Translation},

title = {Essential Speech and Language Technology for Dutch},

pages = {305--319},

year = {2013},

isbn = {978-3-642-30909-0},

editor = {Spyns, P. and Odijk, J.},

publisher = {Springer},

url = {https://link.springer.com/chapter/10.1007%2F978-3-642-30910-6_17}

}

Previous experiments suggest that a rule-based approach to tree alignment error correction serves as an effective complement to statistical alignment. We show how, using relatively few features, an implementation of Brill’s Transformation-Based Learning algorithm improves the results of a high precision model of the statistical aligner Lingua-Align. Using our system to correct already tree aligned data, we achieve balanced F-scores of 80.6 on our test set and 85.2 on our development test set. Using it as a tree aligner on word aligned data, our best F-scores using the same model amount to 78.7 and 83.0 respectively. Finally, we apply a pipeline of alignment and error correction tools to create several versions of a large parallel treebank consisting of various domains for Dutch to English for use in a syntax-based MT system. We conclude that transformation-based learning is a promising approach for the large-scale creation of parallel treebanks for various NLP purposes.

@Article{Kotze:2012,

author = {Kotz\'{e}, Gideon},

title = {Transformation-based tree-to-tree alignment},

journal = {Computational Linguistics in the Netherlands Journal},

volume = {2},

pages = {71--96},

year = {2012},

issn = {2211-4009},

url = {https://www.clinjournal.org/clinj/article/view/17/15}

}

We present a collection of parallel treebanks that have been automatically aligned on both the terminal and the nonterminal constituent level for use in syntax-based machine translation. We describe how they were constructed and applied to a syntax- and example-based machine translation system called Parse and Corpus-Based Machine Translation (PaCo-MT). For the language pair Dutch to English, we present evaluation scores of both the nonterminal constituent alignments and the MT system itself, and in the latter case, compare them with those of Moses, a current state-of-the-art statistical MT system, when trained on the same data.

@InProceedings{KotzeEA:2012,

author = {Kotz\'{e}, Gideon and Vandeghinste, Vincent and Martens, Scott and Tiedemann, J\"{o}rg},

title = {Large Aligned Treebanks for Syntax-based Machine Translation},

booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},

pages = {467--473},

year = {2012},

address = {Istanbul, Turkey},

editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet U\v{g}ur Do\v{g}an and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},

publisher = {European Language Resources Association (ELRA)},

isbn = {978-2-9517408-7-7},

language = {english},

url = {http://www.lrec-conf.org/proceedings/lrec2012/pdf/924_Paper.pdf}

}

In this paper, we present results of an ongoing investigation of a manually aligned parallel treebank and an automatic tree aligner. We establish the features that show a significant correlation with alignment performance. We present those features with the biggest correlation scores and discuss their significance, with mention of future applications of these findings.

@InProceedings{Kotze2011a,

author = {Kotz\'{e}, Gideon},

title = {Finding Statistically Motivated Features Influencing Subtree Alignment Performance},

editor = {Bolette Sandford Pedersen and Gunta Ne\v{s}pore and Inguna Skadi\c{n}a},

booktitle = {Proceedings of the 18th Nordic Conference of Computational Linguistics},

series = {NEALT Proceedings Series},

volume = {11},

publisher = {Tartu University Library},

pages = {332--335},

year = {2011},

address = {Riga, Latvia},

issn = {1736-8197},

url = {https://dspace.ut.ee/bitstream/handle/10062/17366/0Kotze_81.pdf}

}

Automatic alignment of parallel treebanks often displays regular errors that can be corrected by improving the alignment model. However, if the aligner is statistical, often much more training data is needed to properly address these errors. In some cases, a rule-based approach to error correction may provide a quick and convenient solution. We present an approach that highlights problematic phenomena, which enables us to pinpoint systematic error patterns for which we can devise rules for correction. Finally, we investigate the application of two manually constructed rules on a large parallel treebank.

@InProceedings{Kotze2011b,

author = {Kotz\'{e}, Gideon},

title = {Improving syntactic tree alignment through rule-based error correction},

booktitle = {Proceedings of ESSLLI 2011 Student Session},

pages = {122--127},

year = {2011},

address = {Ljubljana, Slovenia},

url = {https://web.stanford.edu/~danlass/esslli2011stus/kotze.pdf}

}

Automatic sub-tree alignment of parallel treebanks often displays regular errors that can be corrected by improving the alignment model. However, if the aligner is statistical, often much more training data is needed to properly address these errors. In some cases, a rule-based approach to error correction can provide a quick and convenient solution. We present an approach that highlights problematic phenomena, which enables us to pinpoint regular error patterns for which we can devise rules for correction.

@InProceedings{Kotze2011c,

author = {Kotz\'{e}, Gideon},

title = {Rule-Induced Error Correction of Aligned Parallel Treebanks},

booktitle = {Proceedings of the International Conference "Corpus Linguistics - 2011"},

pages = {35--40},

year = {2011},

address = {Saint Petersburg, Russia},

isbn = {978-5-8465-0005-5},

url = {http://www.ccl.kuleuven.be/Projects/PACO/CorpusLinguistics_Kotze_final.pdf}

}

The PaCo-MT project is building a stochastic example-based transfer system translating from Dutch into English and French, and vice versa. It is a data-driven tree-to-tree based approach towards MT, transducing the input parse tree into a set of target language parse trees without node ordering. This Synchronous Tree Substitution Grammar (limited to regular subtrees) is induced from a subtree-aligned parallel treebank, using a discriminative model for tree alignment. Monolingual parses were created by pre-existing parsers, such as the Alpino parser for Dutch, the Stanford parser for English, and the Berkeley parser for French. A tree-based target language modeler using a probabilistic context-free grammar based on large monolingual treebanks decodes the output forest and determines node ordering.

By this approach we aim at combining the strengths of data-driven MT with the strengths of rule-based MT, avoiding the weaknesses of each of these approaches. Results show that although BLEU scores are not yet at par with Moses, long distance movements pose no problems for our approach, and we do not drop important words, yielding a more grammatical output than PBSMT systems.

@InProceedings{VandeghinsteEA:2011,

author = {Vandeghinste, Vincent and Van den Bogaert, Joachim and Martens, Scott and Kotz\'{e}, Gideon},

title = {PaCo-MT: Parse and Corpus-based Machine Translation},

booktitle = {Proceedings of the 15th International Conference of the European Association for Machine Translation},

year = {2011},

month = {May},

pages = {347},

editor = {Forcada, M.L. and Depraetere, H. and Vandeghinste, V.},

address = {Leuven, Belgium},

series = {EAMT},

isbn = {9789081486118},

url = {http://www.mt-archive.info/EAMT-2011-PaCoMT.pdf}

}

In this paper we propose a discriminative framework for automatic tree alignment. We use a rich feature set and a log-linear model trained on small amounts of hand-aligned training data. We include contextual features and link dependencies to improve the results even further. We achieve an overall F-score of almost 80% which is significantly better than other scores reported for this task.

@InProceedings{TiedemannKotze2009,

author = {Tiedemann, J\"{o}rg and Kotz\'{e}, Gideon},

title = {A Discriminative Approach to Tree Alignment},

booktitle = {Proceedings of the International Workshop on Natural Language Processing Methods and Corpora in Translation, Lexicography and Language Learning (in connection with RANLP'09)},

year = {2009},

pages = {33--39},

publisher = {Association for Computational Linguistics},

address = {Borovets, Bulgaria},

month = {September},

editor = {Ilisei, I. and Pekar, V. and Bernardini, S.},

url = {http://www.aclweb.org/anthology/W09-4206},

isbn = {978-954-452-010-6}

}

This paper reports on-going work on building a large automatically tree-aligned parallel treebank in the context of a syntax-based machine translation (MT) approach. For this we develop a discriminative tree aligner based on a log-linear model with a rich feature set. We incorporate various language-independent and language-specific features taking advantage of existing tools and annotation. Our initial experiments on a small hand-aligned treebank show promising results even with small amounts of training data. The performance of our approach is well above unsupervised techniques reported elsewhere. This enables us to quickly create training material and alignment models for additional language pairs. In recent work, we aligned more than one million sentence pairs and started our experiments with the extraction of transfer knowledge for our example-based machine translation system.

@InProceedings{TiedemannKotze2009b,

author = {Tiedemann, J\"{o}rg and Kotz\'{e}, Gideon},

title = {{Building a Large Machine-Aligned Parallel Treebank}},

booktitle = {Proceedings of the 8th International Workshop on Treebanks and Linguistic Theories (TLT'08)},

year = {2009},

pages = {197--208},

topic = {Alignment and Parallel corpora},

editor = {Passarotti, M. and Przepi\'{o}rkowski, A. and Raynaud, S. and Van Eynde, F.},

publisher = {EDUCatt, Milano/Italy},

address = {Milan, Italy},

isbn = {978-88-8311-712-1},

url = {https://convegni.unicatt.it/meetings_B_1.pdf}

}

The Afrikaans wordnet is a lexical-conceptual network in the form of an electronic lexical database, developed at the North-West University. In this article, a methodology for a semi-automatic construction of the entries - so-called synonym sets - is investigated. Firstly, a background is given on the nature of a wordnet, as well as "WordNet", on which it is based. Other wordnets, as well as applications of wordnets, are also discussed here. Next, the macrostructure of a wordnet in terms of its integration and compatibility with other wordnets is investigated, after which the proposed methodology is presented with a discussion of the results. Finally, a projection is made to the integration of the Afrikaans wordnet with other resources, which include "WordNet" and an Afrikaans lexical database, called ALEXANDER.

@Article{Kotze2008,

author = {Kotz\'{e}, Gideon},

title = {Development of an Afrikaans wordnet: methodology and integration / Ontwikkeling van 'n Afrikaanse woordnet: metodologie en integrasie},

journal = {Literator: Journal of Literary Criticism, Comparative Linguistics and Literary Studies: Human language technology for South African languages: Special Issue 1},

year = {2008},

volume = {29},

number = {1},

pages = {163--184},

url = {https://literator.org.za/index.php/literator/article/view/105},

issn = {0258-2279}

}

This thesis is written as part of the preliminary research for a proposed project at the Centre for Text Technology at the North-West University in Potchefstroom, North-West Province, South Africa. In this work a methodology for constructing a wordnet for Afrikaans is proposed, which will be based on the Princeton WordNet that was developed at Princeton University, USA. All relevant concepts are introduced, starting with an analysis of the prototype wordnet, the Princeton WordNet, after which the focus shifts to its extension to multilingual wordnet databases. An investigation is made into the resources and tools available for the proposed project, after which we verify its feasibility. Afterwards, various methodologies for wordnet construction are investigated and analysed. Based on all these analyses, a detailed methodology with a work schedule is proposed for building a core Afrikaans wordnet, also keeping in mind future extension and potential problems. We conclude that the existence of several automatic techniques has greatly improved the process of wordnet construction compared to a few years ago, but that they are still heavily dependent on quality lexical resources and tools. Finally, some suggestions are made for future extensions and applications of the wordnet.

@MastersThesis{Kotze:Thesis:2006,

author = {Kotz\'{e}, Gideon},

title = {Building a WordNet for Afrikaans: Preliminary research in the form of an inquiry into the feasibility and optimal methodology for the development of a wordnet database},

school = {Free University of Amsterdam},

address = {Amsterdam, The Netherlands},

year = {2006},

url = {http://www.gideonkotze.co.za/downloads/MastersThesis_Kotze2006.pdf}

}

Gideon J. Kotzé

This is a list of my academic publications.