ensiwiki-2011 dataset for readability modelling

Van der Sluis, Frans

Please use the following text to cite this item or export to a predefined format:

University of Copenhagen, 2023, ensiwiki-2011 dataset for readability modelling, CLARIN-DK-UCPH Centre Repository, http://hdl.handle.net/20.500.12115/49.

dc.creator	Van der Sluis, Frans
dc.date.accessioned	2023-09-26T10:02:54Z
dc.date.available	2023-09-26T10:02:54Z
dc.date.issued	2023-09-26
dc.description	The ensiwiki dataset contains Wikipedia pages sampled from Simple-English and regular English Wikipedia. For each Simple-English page, a paired page was sampled from the regular English Wikipedia if available. The result is a list of pairs between Simple-English and regular English pages. Only pages that form a pair were included. In total, 138,790 pages were sampled from Simple-English Wikipedia and English Wikipedia as of August 2011. The purpose of this dataset is to train and test readability detection systems. The dataset is intended to be sufficiently large to detect intricate relations between different features of readability. The dataset is used for this purpose in Van der Sluis (2013, 2014) and is described in further detail in Van der Sluis (2013). The dataset furthermore contains plain-text versions of the wikitext pages. These were parsed using the JWPL MediaWiki parser (see https://dkpro.github.io/dkpro-jwpl/JWPLParser/) and split to the level of articles, sections, and paragraphs. Only the oldest 38,955 wikitext pages were parsed in order to arrive at a more mature set of pages that more clearly distinguishes between different levels of readability, which proved superior for training readability models. Note: This data is a result of work done at the Human-Media Interaction group of the University of Twente, The Netherlands. Its release is in accordance with original licensing requirements and aligned with relevant parties.
dc.identifier.uri	http://hdl.handle.net/20.500.12115/49
dc.language.iso	eng
dc.publisher	University of Copenhagen
dc.relation.isreferencedby	https://doi.org/10.1002/asi.23095
dc.relation.isreferencedby	https://doi.org/10.3990/1.9789036505673
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.label	PUB
dc.rights.uri	http://creativecommons.org/licenses/by-sa/4.0/
dc.subject	readability
dc.subject	textual complexity
dc.subject	wikipedia
dc.subject	simple english
dc.title	ensiwiki-2011 dataset for readability modelling
dc.type	corpus
local.annotationInfo.annotationType	readability labels: simple (english) or (regular) english
local.branding	CLARIN-DK
local.contact.person	Frans Van der Sluis frans@hum.ku.dk University of Copenhagen
local.files.count	2
local.files.size	961669085
local.has.files	yes
local.language.name	English
local.size.info	138790 articles
local.sponsor	euFunds FP7-ICT-2007-3 7th Framework ICT Programme of the European Union. PuppyIR
metashare.ResourceInfo#ContentInfo.mediaType	text

Collections

CLARIN-DK-UCPH Repository

This item is Publicly Available

and licensed under:

Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Files in this item

Name: ensiwiki2011.db.tar.gz
Size: 917.12 MB
Format: application/gzip
Description: Tar+gzipped SQLite3 database file containing all data and metadata
MD5: 664ccbe0aed88212a6863d29338a0632
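
Since the archive is described as a tar+gzipped SQLite3 database with a published MD5 checksum, a downloaded copy can be checked and inspected along the following lines. This is a minimal sketch, not part of the dataset's own documentation: the function names (`md5_of`, `extract_archive`, `list_tables`, `verify_and_inspect`) and the target directory are illustrative assumptions, and because the database schema is not given in this record, the sketch only lists whatever tables exist rather than assuming any table or column names.

```python
import hashlib
import sqlite3
import tarfile
from pathlib import Path

# MD5 checksum as published in the file listing of this record.
EXPECTED_MD5 = "664ccbe0aed88212a6863d29338a0632"


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def extract_archive(archive_path, target_dir):
    """Extract a tar+gzip archive; return paths of the extracted regular files."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = [member.name for member in archive.getmembers() if member.isfile()]
        archive.extractall(target)
    return [target / name for name in names]


def list_tables(db_path):
    """Return the names of all tables in an SQLite database file."""
    connection = sqlite3.connect(str(db_path))
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]


def verify_and_inspect(archive_path, target_dir="ensiwiki2011"):
    """Check the MD5, extract the archive, and map each extracted file to its tables.

    Assumption: every regular file in the archive is an SQLite database.
    """
    if md5_of(archive_path) != EXPECTED_MD5:
        raise ValueError("MD5 mismatch: the download may be incomplete or corrupted")
    return {str(path): list_tables(path) for path in extract_archive(archive_path, target_dir)}
```

Calling `verify_and_inspect("ensiwiki2011.db.tar.gz")` from the directory holding the download would return a mapping from each extracted file to its table names, which is a reasonable starting point before consulting readme.md for the actual schema.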

Name: readme.md
Size: 3.59 KB
Format: application/octet-stream
Description: Readme
MD5: 1f4a663fcf415e9cf2089d255f9124b1
