DK-CLARIN Rapid Aligned Corpus 1993-2011 (da-en, da-de)
Please use the following text to cite this item:
Centre for Language Technology, NorS, University of Copenhagen and European Commission, 2012,
DK-CLARIN Rapid Aligned Corpus 1993-2011 (da-en, da-de), CLARIN-DK-UCPH Centre Repository,
http://hdl.handle.net/20.500.12115/30.
Authors
Item identifier
Date issued
2012
Size
5,000,000 tokens
270,000 sentences
Description
The aligned corpus consists of press releases from the European Commission Press Release Database (Rapid) harvested in 2009 and 2011 (http://europa.eu/rapid/search.htm).
The corpus comprises 5330 + 2200 press releases (files) for each of the languages Danish, English and German, with approximately 5,000,000 words per language and 260,000 to 270,000 aligned sentences for each of the language pairs Danish-English and Danish-German.
All documents are processed with Uplug (https://bitbucket.org/tiedemann/uplug/wiki/Home) and aligned with HunAlign.
Files with more than 10 % negative alignments have been removed, as have all 0-alignments.
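
The filtering rule can be illustrated with a minimal Python sketch. It is not the script used to build the corpus. It assumes that "negative alignments" means links with an aligner score below zero and that a "0-alignment" is a link where one side is empty; the names `Link`, `negative_ratio` and `filter_file` are hypothetical, and whether the 10 % share was measured before or after dropping 0-alignments is not stated in the description (here it is measured before).

```python
from typing import List, Optional, Tuple

# One alignment link: (source sentence, target sentence, aligner score).
# This tuple layout is an assumption made for the example.
Link = Tuple[str, str, float]


def is_zero_alignment(link: Link) -> bool:
    """Assumed meaning of "0-alignment": one side of the link is empty."""
    src, tgt, _ = link
    return not src.strip() or not tgt.strip()


def negative_ratio(links: List[Link]) -> float:
    """Share of links with a score below zero (0.0 for an empty file)."""
    if not links:
        return 0.0
    negatives = sum(1 for _, _, score in links if score < 0)
    return negatives / len(links)


def filter_file(links: List[Link], max_negative: float = 0.10) -> Optional[List[Link]]:
    """Return the cleaned links, or None if the whole file should be dropped."""
    # Drop the file when more than 10 % of its links are negative.
    if negative_ratio(links) > max_negative:
        return None
    # Otherwise keep the file but remove every 0-alignment.
    return [link for link in links if not is_zero_alignment(link)]
```

A file for which `filter_file` returns `None` would be left out of the corpus entirely; for every other file only the returned links would be kept.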
The documents are in txt-format for each language and in tmx-format for the aligned language pairs (da-en and da-de).
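
As a rough illustration of how the aligned files might be read, the following Python sketch extracts sentence pairs from a TMX file using only the standard library. It assumes the files follow the usual TMX layout (`tu` elements containing `tuv` elements with an `xml:lang` attribute and a `seg` child); the exact language codes used in the corpus files are not stated here, so the function compares only the part before any hyphen. The function name `read_tmx_pairs` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, Tuple

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source, src_lang: str = "da", tgt_lang: str = "en") -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from a TMX file path or file object."""
    tree = ET.parse(source)
    for tu in tree.getroot().iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            # Accept both xml:lang and the older plain lang attribute.
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            # Reduce codes such as "da-DK" to "da" (an assumption about the data).
            lang = lang.lower().split("-")[0]
            seg = tuv.find("seg")
            if seg is not None:
                segments[lang] = "".join(seg.itertext()).strip()
        # Skip translation units that lack one of the two languages.
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]
```

Calling `read_tmx_pairs(path, "da", "de")` would give the Danish-German pairs in the same way, where `path` points to one of the tmx files from the unpacked archive.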
Subject(s)
Collections
Files in this item
- Name: Rapid-2004-2011.zip
- Size: 39.2 MB
- Format: application/zip
- Description: Corpus 2004 - 2011
- MD5: ce84f48a004e249fcbe511faf0856e77

- Name: README.txt
- Size: 1.01 KB
- Format: text/plain
- Description: Documentation
- MD5: 8a7d86a2ef03a56751b93a15b60a4d63

- Name: Rapid-1993-2003.zip
- Size: 68.55 MB
- Format: application/zip
- Description: Corpus 1993 - 2003
- MD5: d73a47ab17a22afeff024a360100e907
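
The MD5 values listed for the files can be used to check a download. The following is a minimal Python sketch using the standard `hashlib` module; the helper names `md5_of_file` and `verify_download` are hypothetical, and the checksums in `EXPECTED_MD5` are copied from the file listing of this item.

```python
import hashlib

# Checksums as given in the file listing of this item.
EXPECTED_MD5 = {
    "Rapid-2004-2011.zip": "ce84f48a004e249fcbe511faf0856e77",
    "README.txt": "8a7d86a2ef03a56751b93a15b60a4d63",
    "Rapid-1993-2003.zip": "d73a47ab17a22afeff024a360100e907",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read until read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, name: str) -> bool:
    """Compare a local file against the checksum listed for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, `verify_download("Rapid-1993-2003.zip", "Rapid-1993-2003.zip")` would return `True` only if the local copy matches the listed checksum.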


