iRead4Skills Dataset 1: corpora by complexity level for FR, PT and SP
- Pintard, Alice (1, 2)
- François, Thomas (1, 2)
- Nagant de Deuxchaisnes, Justine (1, 2)
- Barbosa, Sílvia (3, 4)
- Reis, Maria Leonor (3, 4)
- Moutinho, Michell (3, 4)
- Monteiro, Ricardo (3, 4)
- Amaro, Raquel (3, 4)
- Correia, Susana (3, 4)
- Rodríguez Rey, Sandra (5, 6)
- Garcia González, Marcos (5, 6)
- Mu, Keran (7)
- Blanco Escoda, Xavier (7)
- 1 CENTAL
- 2 Université Catholique de Louvain
- 3 CLUNL
- 4 Universidade Nova de Lisboa
- 5 CITIUS
- 6 Universidade de Santiago de Compostela
- 7 Universitat Autònoma de Barcelona
Publisher: Zenodo
Year of publication: 2023
Type: Dataset
Abstract
The iRead4Skills Dataset 1: corpora by level of complexity for FR, PT and SP is a collection of written texts of several genres and levels of complexity, in txt format, compiled within the scope of the project iRead4Skills – Intelligent Reading Improvement System for Fundamental and Transversal Skills Development. The project, funded by the European Commission (grant number: 1010094837), aims to improve reading skills in the adult population by creating an intelligent system that assesses text complexity and suggests appropriate reading materials to adults with low literacy skills, thereby helping to reduce skills gaps and to provide access to information and culture (https://iread4skills.com/).

The compilation of this first dataset was based on the complexity levels established as relevant for the project (Very Easy (approx. A1), Easy (approx. A2) and Clear (approx. B1)) and on the expected needs of learners and trainers. For some genres, there are also texts of a more complex level. The data will provide the basis for the training and test sets for the complexity analysis systems for the three languages of the project: French, Portuguese, and Spanish. The dataset will be further enhanced, validated, and annotated by end-users, giving rise to forthcoming versions and to a second, derived dataset.

The resource is composed of three subcorpora: French, Portuguese and Spanish. Each subcorpus spans different complexity levels and covers texts from the following communication domains:
- 01_personal communication
- 02_institutional/professional communication
- 03_social media
- 04_commercial communication/dissemination
- 05_non-fiction book
- 06_fiction book
- 07_didactic book
- 08_academic/school
- 09_political communication/dissemination
- 10_legal documentation
- 11_religious texts/dissemination
Size of each subcorpus:
- French corpus: 1271 texts, 315 930 tokens
- Spanish corpus: 2000 texts, 889 857 tokens
- Portuguese corpus: 2186 texts, 942 818 tokens
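To compare the three subcorpora, the reported sizes can be turned into totals and mean text lengths. The following is a minimal sketch using only the counts stated above; the names `CORPORA` and `summarize` are illustrative and are not part of the dataset or of any project tooling.

```python
# Corpus sizes copied from the dataset description (texts and tokens).
CORPORA = {
    "French": {"texts": 1271, "tokens": 315_930},
    "Spanish": {"texts": 2000, "tokens": 889_857},
    "Portuguese": {"texts": 2186, "tokens": 942_818},
}


def summarize(corpora):
    """Return mean tokens per text for each language, plus overall totals."""
    means = {lang: c["tokens"] / c["texts"] for lang, c in corpora.items()}
    total_texts = sum(c["texts"] for c in corpora.values())
    total_tokens = sum(c["tokens"] for c in corpora.values())
    return means, total_texts, total_tokens


means, total_texts, total_tokens = summarize(CORPORA)
for lang, mean in means.items():
    print(f"{lang}: {mean:.1f} tokens per text on average")
print(f"Total: {total_texts} texts, {total_tokens} tokens")
```

On these figures the French subcorpus is the smallest and also has the shortest texts on average, which may be worth keeping in mind when building balanced training and test sets across languages.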