iRead4Skills Dataset 2: annotated corpora by level of complexity for FR, PT and SP

Pintard, Alice; François, Thomas; Justine, Nagant de Deuxchaisnes; Barbosa, Sílvia; Reis, Maria Leonor; Moutinho, Michell; Monteiro, Ricardo; Amaro, Raquel; Correia, Susana; Rodríguez Rey, Sandra; Mu, Keran; Garcia González, Marcos; Bernárdez Braña, André; Blanco Escoda, Xavier

doi:10.5281/ZENODO.12821882

Pintard, Alice ¹²
François, Thomas ¹²
Justine, Nagant de Deuxchaisnes ¹²
Barbosa, Sílvia ³⁴
Reis, Maria Leonor ³⁴
Moutinho, Michell ³⁴
Monteiro, Ricardo ³⁴
Amaro, Raquel ³⁴
Correia, Susana ³⁴
Rodríguez Rey, Sandra ⁵⁶
Mu, Keran ⁷
Garcia González, Marcos ⁵⁶
Bernárdez Braña, André ⁵⁶
Blanco Escoda, Xavier ⁷

1 CENTAL
2 UCLouvain
3 CLUNL
4 NOVA FCSH
5 CITIUS
6 Universidade de Santiago de Compostela, Santiago de Compostela, Spain (ROR: https://ror.org/030eybx10)
7 Universitat Autònoma de Barcelona, Barcelona, Spain (ROR: https://ror.org/052g8jq94)

Publisher: Zenodo

Publication year: 2024

Type: Dataset

CC BY 4.0

DOI: 10.5281/ZENODO.12821882 (open access)

Abstract

The iRead4Skills Dataset 2: annotated corpora by level of complexity for FR, PT and SP is a collection of texts categorized by complexity level and annotated for complexity features, presented in xlsx format. These corpora were compiled, classified and annotated within the scope of the project iRead4Skills – Intelligent Reading Improvement System for Fundamental and Transversal Skills Development, funded by the European Commission (grant number: 1010094837). The project aims to enhance reading skills within the adult population by creating an intelligent system that assesses text complexity and recommends suitable reading materials to adults with low literacy skills, contributing to reducing skills gaps and facilitating access to information and culture (https://iread4skills.com/).

This dataset is the result of specifically devised classification and annotation tasks, in which selected texts were organized and distributed to trainers in Adult Learning (AL) and Vocational Education and Training (VET) centres, as well as to adult students in those centres. The tasks were conducted via the Qualtrics platform.

Dataset 2 is derived from the iRead4Skills Dataset 1: corpora by level of complexity for FR, PT and SP (https://doi.org/10.5281/zenodo.10055909), which comprises written texts of various genres and complexity levels. From this collection, a subset of texts was selected for classification and annotation. The classification and annotation task aimed to provide additional data and test sets for the complexity analysis systems for the three languages of the project: French, Portuguese, and Spanish. The texts in each language corpus were selected taking into account the diversity of topics/domains, genres, and the reading preferences of the target audience of the iRead4Skills project.
The selected subset amounted to a total of 462 texts per language, divided by level of complexity as follows:

· 140 Very Easy texts
· 140 Easy texts
· 140 Plain texts
· 42 More Complex texts

Trainers were asked to classify the texts according to the complexity levels of the project, here informally defined as:

· Very Easy: everyone can understand the text or most of the text.
· Easy: a person with less than the 9th year of schooling can understand the text or most of the text.
· Plain: a person with the 9th year of schooling can understand the text the first time he/she reads it.
· More Complex: a person with the 9th year of schooling cannot understand the text the first time he/she reads it.

They were also asked to annotate the parts of the texts they considered complex according to various types of features, at word level and at sentence level (e.g., word order, sentence composition, etc.), using the following categories:

Lexical/word-related features
- unknown word
- word too technical/specialized or archaic
- complex derived word
- points to a previous reference that is not obvious
- word (other)

Syntactic/sentence-level features
- unusual word order
- too much embedded secondary information
- too many connectors in the same sentence
- sentence (other)
- other (please specify)

The sets were divided into three parts in Qualtrics and, in each part, the texts were shown to the annotator in random order.

Students were asked to confirm that they could read without difficulty texts adequate to their literacy level. Each set contained texts from a given level, plus one text of the level immediately above.
Students were also asked to annotate words or sequences of words in the text that they did not understand, according to the following categories:

- difficult word
- difficult part of the text

The complete results and datasets are in TSV/Excel format, in pairs of files: one file with the results of the classification (trainers) or validation (students) task and one file with the results of the annotation task. The complete datasets will be available under a Creative Commons CC BY-NC-ND 4.0 licence.
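
As a rough illustration of the figures and categories given in the abstract, the following Python sketch encodes the per-language distribution of texts and the annotation categories, and checks that the level counts add up to 462. All identifiers (TEXTS_PER_LEVEL, TRAINER_CATEGORIES, level_above and so on) are hypothetical names chosen for this example and are not taken from the dataset files. The ordering of levels from easiest to hardest, and the treatment of More Complex as having no level above it, are assumptions.

```python
# Texts per complexity level and per language, as stated in the abstract.
# Assumption: levels are ordered from easiest to hardest.
TEXTS_PER_LEVEL = {
    "Very Easy": 140,
    "Easy": 140,
    "Plain": 140,
    "More Complex": 42,
}

# Languages covered by the dataset, using the codes from its title.
LANGUAGES = ("FR", "PT", "SP")

# Annotation categories offered to trainers, grouped as in the abstract.
TRAINER_CATEGORIES = {
    "lexical": [
        "unknown word",
        "word too technical/specialized or archaic",
        "complex derived word",
        "points to a previous reference that is not obvious",
        "word (other)",
    ],
    "syntactic": [
        "unusual word order",
        "too much embedded secondary information",
        "too many connectors in the same sentence",
        "sentence (other)",
        "other (please specify)",
    ],
}

# Annotation categories offered to students.
STUDENT_CATEGORIES = ["difficult word", "difficult part of the text"]


def total_per_language(dist=TEXTS_PER_LEVEL):
    """Return the number of texts per language across all levels."""
    return sum(dist.values())


def level_shares(dist=TEXTS_PER_LEVEL):
    """Return the fraction of texts that falls in each level."""
    total = sum(dist.values())
    return {level: count / total for level, count in dist.items()}


def level_above(level, dist=TEXTS_PER_LEVEL):
    """Return the level immediately above `level`, or None for the top level.

    Assumption: the dict order (easiest first) defines what "above" means.
    """
    levels = list(dist)
    i = levels.index(level)
    return levels[i + 1] if i + 1 < len(levels) else None


if __name__ == "__main__":
    print(total_per_language())                   # 462
    print(total_per_language() * len(LANGUAGES))  # 1386
    print(level_above("Easy"))                    # Plain
```

Under these assumptions, the More Complex level accounts for 42 of the 462 texts per language (about 9%), and level_above gives the level from which the extra text in a student set would be drawn.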
