CSTlemma version 8.1.2
Please use the following text to cite this item:
Centre for Language Technology, NorS, University of Copenhagen, 2021,
CSTlemma version 8.1.2, CLARIN-DK-UCPH Centre Repository,
http://hdl.handle.net/20.500.12115/45.
Date issued
2021-05-21
Description
CSTlemma is a lemmatizer that treats prefixes, infixes, and suffixes alike.
CST's lemmatizer can be trained for many languages, and has already been trained for tens of them, including languages whose lemmatization rules add or remove prefixes and/or infixes to obtain the lemma. In Dutch, for example, the word "afgemaakt" has the lemma "afmaken": the infix "ge" has to be removed, the double "aa" has to be shortened to a single "a", and the ending "t" has to be replaced by "en".
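The Dutch example can be sketched in a few lines of Python. This is only an illustration of the idea that one rule can change the middle and the end of a word while leaving the prefix alone. The rule notation used here (a pattern and a replacement that share `*` wildcards) and the function name `apply_rule` are assumptions made for this sketch; they are not the rule format or API of CSTlemma or affixtrain.

```python
import re


def apply_rule(word, pattern, replacement):
    """Apply one hypothetical affix rule to a word.

    Each '*' in the pattern matches an arbitrary stretch of letters that is
    copied unchanged into the position of the corresponding '*' in the
    replacement. Returns None if the pattern does not match the word.
    """
    pattern_parts = pattern.split("*")
    replacement_parts = replacement.split("*")
    if len(pattern_parts) != len(replacement_parts):
        raise ValueError("pattern and replacement need the same number of '*'")

    # Literal parts are escaped; every wildcard becomes a non-greedy group.
    regex = "(.*?)".join(re.escape(part) for part in pattern_parts)
    match = re.fullmatch(regex, word)
    if match is None:
        return None

    # Interleave the captured stretches with the replacement literals.
    result = replacement_parts[0]
    for captured, literal in zip(match.groups(), replacement_parts[1:]):
        result += captured + literal
    return result


# Illustrative rule: drop the infix "ge" and rewrite the ending "aakt" as "aken".
print(apply_rule("afgemaakt", "*ge*aakt", "**aken"))
```

In this sketch the first wildcard captures "af" and the second captures "m", so the prefix and stem consonant survive while the infix and the ending are rewritten in a single step.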
New in version 8 of CSTlemma is the option to output the rule by which a given word is transformed into its lemma. It is also possible to output only a unique identifier for that rule. In practice, this identifier is essentially a pointer into the data structure that holds the rule set.
Rules for CSTlemma must be created with the affixtrain program (https://github.com/kuhumcst/affixtrain), but ready-made rules are available online. For example, the https://github.com/kuhumcst/texton-linguistic-resources repository contains rules for about 30 languages.
To build CSTlemma, you need not only the source code in https://github.com/kuhumcst/cstlemma, but also some source files from https://github.com/kuhumcst/letterfunc and https://github.com/kuhumcst/parsesgml. The easiest way is to copy https://github.com/kuhumcst/cstlemma/blob/master/doc/makecstlemma.bash to a folder on a Linux (and possibly macOS) system and run that script. It fetches all the required repositories and builds cstlemma.
Files in this item
- Name: cstlemma-8.1.2.tar.gz
- Size: 163.48 KB
- Format: application/gzip
- Description: Source code & Makefile
- MD5: 627b300945873cdf284b8adece6e3555
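The MD5 value listed above can be used to check that a downloaded copy of the archive is intact. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_expected`, and the assumption that the archive sits in the current directory under its listed name, are illustrative only.

```python
import hashlib

# Checksum as published in the file listing for cstlemma-8.1.2.tar.gz.
EXPECTED_MD5 = "627b300945873cdf284b8adece6e3555"


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Compare the file's digest with the expected one, ignoring letter case."""
    return md5_of_file(path) == expected.lower()
```

A call such as `matches_expected("cstlemma-8.1.2.tar.gz")` would then return `True` only if the local file has the published checksum. MD5 guards against accidental corruption during download, not against deliberate tampering.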


