The SemDaX Corpus
Please use the following text to cite this item:
Centre for Language Technology, NorS, University of Copenhagen, 2015,
The SemDaX Corpus, CLARIN-DK-UCPH Centre Repository,
http://hdl.handle.net/20.500.12115/38.
Authors
Pedersen, Bolette Sandford ; et al.
Date issued
2015
Size
90,000 words, 673 files
Description
The SemDaX Corpus is a Danish human-annotated corpus that relies on the combined wordnet and dictionary resources DanNet and Den Danske Ordbog. It is available through a CLARIN academic license. The corpus contains approximately 90,000 words, spans six textual domains, and is annotated with sense inventories of different granularity. All nouns, verbs and adjectives in the corpus were annotated with supersenses (all-words task). In addition, 20 highly polysemous nouns were annotated both with the full set of senses from Den Danske Ordbog and with a reduced set of clustered senses (lexical sample task).
The corpus has two aims: i) to assess the reliability of the different sense annotation schemes for Danish, as measured by qualitative analyses and annotation agreement scores, and ii) to serve as training and test data for machine learning algorithms, with the practical purpose of developing sense taggers for Danish.
To these ends, we take a new approach to human-annotated corpus resources by double annotating a much larger part of the corpus than is usual: 60% of the material for the all-words task and 100% for the lexical sample task. The corpus includes not only the curated files but also the diverging annotations. In other words, we do not treat all disagreement as noise; we take it to contain valuable linguistic information that can help us improve our annotation schemes and our learning algorithms.
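As an illustration of what an annotation agreement score over double-annotated material can look like, here is a minimal sketch of Cohen's kappa for two annotators. This is an assumption for illustration only: the description does not state which agreement measure was used for SemDaX, and the label strings in the usage example are hypothetical placeholders rather than the corpus's actual supersense inventory.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same tokens.

    labels_a and labels_b are equal-length sequences of sense labels,
    one label per annotated token.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty token list")

    n = len(labels_a)

    # Observed agreement: share of tokens on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal proportions, summed over all labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four tokens (placeholders, not SemDaX tags).
annotator_1 = ["sense.x", "sense.x", "sense.y", "sense.y"]
annotator_2 = ["sense.x", "sense.y", "sense.y", "sense.y"]
print(cohen_kappa(annotator_1, annotator_2))  # 0.5
```

Because the diverging annotations are distributed alongside the curated files, a score of this kind could in principle be recomputed per domain or per word class, depending on how the files are organised.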
Acknowledgement
Forskningsrådet for Kultur og Kommunikation
Project code: DFF-1319-00123
Project name: Semantic Processing across Domains
Files in this item
- Name: lexicalsample.zip
- Size: 6.38 MB
- Format: application/zip
- Description: Lexical sample annotations
- MD5: c4dd63180cd72d5225b3190ecb65db58

- Name: README.md
- Size: 2.38 KB
- Format: application/octet-stream
- Description: Readme
- MD5: 07aa20d002dbea3e7adb89e51aa21430

- Name: SemDax-supersenses.zip
- Size: 572.27 KB
- Format: application/zip
- Description: All words supersense annotations
- MD5: 92345cdd96051473e146d3018323bd91

- Name: LICENSE
- Size: 1.33 KB
- Format: application/octet-stream
- Description: License
- MD5: 85b100e5d075024f48089b7a4eb34a51
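The MD5 values listed for each file can be used to check that a download is intact. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_of_file` and `verify_downloads` are hypothetical, and the sketch assumes the four files were saved under their listed names in a single directory.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file listing for this item.
EXPECTED_MD5 = {
    "lexicalsample.zip": "c4dd63180cd72d5225b3190ecb65db58",
    "README.md": "07aa20d002dbea3e7adb89e51aa21430",
    "SemDax-supersenses.zip": "92345cdd96051473e146d3018323bd91",
    "LICENSE": "85b100e5d075024f48089b7a4eb34a51",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory):
    """Map each expected file name to True (match), False (mismatch),
    or None (file not found in the directory)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if path.is_file():
            results[name] = md5_of_file(path) == expected
        else:
            results[name] = None
    return results


if __name__ == "__main__":
    # Assumes the files were downloaded into the current directory.
    for name, status in verify_downloads(".").items():
        print(name, {True: "OK", False: "MISMATCH", None: "missing"}[status])
```

MD5 is suitable here only as an integrity check against accidental corruption during transfer, not as protection against deliberate tampering.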

