DK-CLARIN Rapid Aligned Corpus 1993-2011 (da-en, da-de)

Please use the following text to cite this item:
Centre for Language Technology, NorS, University of Copenhagen and European Commission, 2012, DK-CLARIN Rapid Aligned Corpus 1993-2011 (da-en, da-de), CLARIN-DK-UCPH Centre Repository, http://hdl.handle.net/20.500.12115/30.
Date issued
2012
Size
5000000 tokens,
270000 sentences
Language(s)
Danish, English, German
Description
The aligned corpus consists of press releases from the European Commission Press Release Database (Rapid), harvested in 2009 and 2011 (http://europa.eu/rapid/search.htm). The corpus comprises 5330 + 2200 press releases (files) for each of the languages Danish, English and German, with approx. 5,000,000 words per language and 260,000 - 270,000 aligned sentences for the language pairs Danish - English and Danish - German. All documents have been processed with Uplug (https://bitbucket.org/tiedemann/uplug/wiki/Home) and aligned with HunAlign. Files with more than 10 % negative alignments have been removed, and so have all 0-alignments. The documents are in txt format for each language and in tmx format for the aligned language pairs (da-en and da-de).
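As an illustration of how the tmx files might be consumed, the following is a minimal Python sketch that reads aligned sentence pairs from a TMX document using only the standard library. It assumes the usual TMX layout (tu elements containing one tuv per language, each with a seg child); the function name read_tmx_pairs and the embedded sample are hypothetical and are not taken from the corpus or its README.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source, src_lang="da", tgt_lang="en"):
    """Yield (source_segment, target_segment) tuples from a TMX file.

    `source` may be a file path or a binary file object. Language codes
    are compared on their first two letters, so "da-DK" matches "da".
    """
    tree = ET.parse(source)
    for tu in tree.getroot().iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            # TMX 1.4 uses xml:lang; older versions use a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segments[lang[:2]] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]


# Hypothetical two-sentence sample, not real corpus content.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4"><header srclang="da"/><body>
<tu><tuv xml:lang="da"><seg>God morgen.</seg></tuv>
<tuv xml:lang="en"><seg>Good morning.</seg></tuv></tu>
</body></tmx>"""

pairs = list(read_tmx_pairs(io.BytesIO(SAMPLE)))
for danish, english in pairs:
    print(danish, "|", english)
```

For the Danish - German files, the same function could be called with tgt_lang="de". The actual file names and directory layout inside the zip archives are not described on this page, so paths would need to be adapted after unpacking.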
This item is Academic Use
and licensed under:
 Files in this item
Name
Rapid-2004-2011.zip
Size
39.2 MB
Format
application/zip
Description
Corpus 2004 - 2011
MD5
ce84f48a004e249fcbe511faf0856e77
Name
README.txt
Size
1.01 KB
Format
text/plain
Description
Documentation
MD5
8a7d86a2ef03a56751b93a15b60a4d63
Name
Rapid-1993-2003.zip
Size
68.55 MB
Format
application/zip
Description
Corpus 1993 - 2003
MD5
d73a47ab17a22afeff024a360100e907
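The MD5 values listed for each file can be used to check that a download is complete and uncorrupted. A minimal Python sketch, assuming the three files have been saved under their listed names in one directory (the helper names md5_of and verify_downloads are hypothetical):

```python
import hashlib
from pathlib import Path

# File names and MD5 checksums as listed for this item.
EXPECTED_MD5 = {
    "Rapid-2004-2011.zip": "ce84f48a004e249fcbe511faf0856e77",
    "README.txt": "8a7d86a2ef03a56751b93a15b60a4d63",
    "Rapid-1993-2003.zip": "d73a47ab17a22afeff024a360100e907",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify_downloads(".").items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

A file reported as "missing or checksum mismatch" should be downloaded again before unpacking.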