Please use the following text to cite this item:
University of Copenhagen, 2023, ensiwiki-2011 dataset for readability modelling, CLARIN-DK-UCPH Centre Repository, http://hdl.handle.net/20.500.12115/49.
dc.creator: Van der Sluis, Frans
dc.date.accessioned: 2023-09-26T10:02:54Z
dc.date.available: 2023-09-26T10:02:54Z
dc.date.issued: 2023-09-26
dc.description: The ensiwiki dataset contains Wikipedia pages sampled from Simple-English and regular English Wikipedia. For each Simple-English page, a paired page was sampled from the regular English Wikipedia if available. The result is a list of pairs between Simple-English and regular English pages. Only pages that form a pair were included. In total, 138,790 pages were sampled from Simple-English Wikipedia and English Wikipedia as of August 2011. The purpose of this dataset is to train and test readability detection systems. The dataset is intended to be sufficiently large to detect intricate relations between different features of readability. The dataset is used for this purpose in Van der Sluis (2013, 2014) and is described in further detail in Van der Sluis (2013). The dataset furthermore contains plain-text versions of the wikitext pages. These were parsed using the JWPL MediaWiki parser (see https://dkpro.github.io/dkpro-jwpl/JWPLParser/) and split to the level of articles, sections, and paragraphs. Only the oldest 38,955 wikitext pages were parsed in order to arrive at a more mature set of pages that more clearly distinguishes between different levels of readability, which proved superior for training readability models. Note: This data is a result of work done at the Human-Media Interaction group of the University of Twente, The Netherlands. Its release is in accordance with the original licensing requirements and has been coordinated with the relevant parties.
dc.identifier.uri: http://hdl.handle.net/20.500.12115/49
dc.language.iso: eng
dc.publisher: University of Copenhagen
dc.relation.isreferencedby: https://doi.org/10.1002/asi.23095
dc.relation.isreferencedby: https://doi.org/10.3990/1.9789036505673
dc.rights: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.label: PUB
dc.rights.uri: http://creativecommons.org/licenses/by-sa/4.0/
dc.subject: readability
dc.subject: textual complexity
dc.subject: wikipedia
dc.subject: simple english
dc.title: ensiwiki-2011 dataset for readability modelling
dc.type: corpus
local.annotationInfo.annotationType: readability labels: simple (english) or (regular) english
local.branding: CLARIN-DK
local.contact.person: Frans Van der Sluis frans@hum.ku.dk University of Copenhagen
local.files.count: 2
local.files.size: 961669085
local.has.files: yes
local.language.name: English
local.size.info: 138790 articles
local.sponsor: euFunds FP7-ICT-2007-3 7th Framework ICT Programme of the European Union. PuppyIR
metashare.ResourceInfo#ContentInfo.mediaType: text
 Files in this item
Name: ensiwiki2011.db.tar.gz
Size: 917.12 MB
Format: application/gzip
Description: Tar+gzipped SQLite3 database file containing all data and metadata
MD5: 664ccbe0aed88212a6863d29338a0632
Name: readme.md
Size: 3.59 KB
Format: application/octet-stream
Description: Readme
MD5: 1f4a663fcf415e9cf2089d255f9124b1
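The record lists an MD5 checksum for the database archive and describes it as a tar+gzipped SQLite3 file. As a minimal sketch, the Python snippet below checks a downloaded copy against that checksum, unpacks it, and lists the tables it contains. The local file location, the output directory name `ensiwiki2011`, and the helper names `md5sum` and `list_tables` are assumptions made for illustration. The database schema is not documented in this record, so the sketch only inspects `sqlite_master` rather than querying any specific table; readme.md should be consulted for the actual layout.

```python
import hashlib
import sqlite3
import tarfile
from contextlib import closing
from pathlib import Path

# File name and checksum as listed in this record; the local path is an assumption.
ARCHIVE = Path("ensiwiki2011.db.tar.gz")
EXPECTED_MD5 = "664ccbe0aed88212a6863d29338a0632"


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def list_tables(db_path):
    """Return the names of all tables in an SQLite database, sorted by name."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


# Only runs when the archive has already been downloaded next to this script.
if ARCHIVE.exists():
    actual = md5sum(ARCHIVE)
    if actual != EXPECTED_MD5:
        raise SystemExit(f"MD5 mismatch: expected {EXPECTED_MD5}, got {actual}")

    out_dir = Path("ensiwiki2011")  # hypothetical extraction directory
    out_dir.mkdir(exist_ok=True)
    with tarfile.open(ARCHIVE, "r:gz") as tar:
        members = tar.getnames()
        tar.extractall(out_dir)

    # The member names inside the archive are not stated in this record,
    # so every extracted regular file is tried as a candidate database.
    for member in members:
        candidate = out_dir / member
        if candidate.is_file():
            print(candidate, list_tables(candidate))
```

Reading the archive in chunks keeps memory use small for a file of roughly 917 MB, and listing the tables first avoids guessing at table or column names.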