English-Slovenian text genre dataset X-GENRE

Name: English-Slovenian text genre dataset X-GENRE
License: https://creativecommons.org/licenses/by-sa/4.0/

Kuzman, Taja; Ljubešić, Nikola

dc.contributor.author	Kuzman, Taja
dc.contributor.author	Ljubešić, Nikola
dc.date.accessioned	2024-09-25T09:10:36Z
dc.date.available	2024-09-25T09:10:36Z
dc.date.issued	2024-09-25
dc.identifier.uri	http://hdl.handle.net/11356/1960
dc.description	The X-GENRE dataset comprises almost 3,000 web texts in English and Slovenian, manually annotated with genre labels. The dataset supports automated genre identification, genre analyses, and other research on web corpora. Among other uses, it served to develop the multilingual X-GENRE classifier (http://hdl.handle.net/11356/1961). The X-GENRE dataset was constructed by merging three manually annotated datasets, mapping their original schemata to a joint genre schema (the "X-GENRE schema"): 1) the Slovenian GINCO dataset (http://hdl.handle.net/11356/1467), 2) the English CORE dataset (https://github.com/TurkuNLP/CORE-corpus), and 3) the English FTD dataset (https://github.com/ssharoff/genre-keras). All of the original genre datasets are based on web corpora. The X-GENRE schema consists of 9 genre labels: Information/Explanation, News, Instruction, Opinion/Argumentation, Forum, Prose/Lyrical, Legal, Promotion and Other (refer to the README provided with the files for details on the labels). The dataset is divided into train, development and test splits. The train split consists of 1,772 texts and 1,940,317 words, the development split of 592 texts and 798,025 words, and the test split of 592 texts and 583,595 words. The splits are stratified by label. As the dataset combines two English datasets and one Slovenian dataset, the distribution of texts in the two languages is roughly two to one: 2,063 English texts and 893 Slovenian texts. The dataset is in JSONL format, with the following attributes: text (text instance), labels (genre label), dataset (original manually annotated genre dataset from which the instance was obtained – CORE, GINCO or FTD), and language (language of the text – Slovenian or English). This work received funding from the European Union’s Connecting Europe Facility 2014–2020 – CEF Telecom – under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the authors’ views. The Agency is not responsible for any use that may be made of the information it contains.
dc.language.iso	slv
dc.language.iso	eng
dc.publisher	Jožef Stefan Institute
dc.relation.isreferencedby	https://doi.org/10.3390/make5030059
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.subject	manual annotation
dc.subject	automatic genre identification
dc.subject	genre corpus
dc.subject	genre
dc.title	English-Slovenian text genre dataset X-GENRE
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
demo.uri	https://huggingface.co/datasets/TajaKuzman/X-GENRE-text-genre-dataset
contact.person	Taja Kuzman taja.kuzman@ijs.si Jožef Stefan Institute
sponsor	Connecting Europe Facility (CEF) Telecom INEA/CEF/ICT/A2020/2278341 MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages Other
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	ARRS (Slovenian Research Agency) N06-0099 and FWO-G070619N Linguistic landscape of hate speech on social media nationalFunds
size.info	3321937 words
size.info	2956 texts
files.count	1
files.size	6859836

Files in this item

This item is publicly available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Name: X-GENRE-dataset.zip
Size: 6.54 MB
Format: application/zip
Description: X-GENRE dataset: train, dev and test splits, and the README file
MD5: 5b3d4c586a584ef37f9da1e81173abd6
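
The MD5 value listed above can be used to check a downloaded copy of the archive. The following is a minimal sketch, assuming the file was saved locally as X-GENRE-dataset.zip in the working directory; the function names md5_of_file and matches_expected are hypothetical helpers introduced for illustration and are not part of the dataset or the repository.

```python
import hashlib
from pathlib import Path

# MD5 value as listed in this record for X-GENRE-dataset.zip.
EXPECTED_MD5 = "5b3d4c586a584ef37f9da1e81173abd6"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Compare the file's digest with the expected value (case-insensitive)."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed local download location; adjust to wherever the file was saved.
    archive = Path("X-GENRE-dataset.zip")
    if archive.exists():
        print("checksum OK" if matches_expected(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch usually indicates an incomplete or corrupted download rather than a change in the dataset, so re-downloading is a reasonable first step.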

File Preview

X-GENRE-dataset
- README.md (6 kB)
- X-GENRE-dev.jsonl (3 MB)
- X-GENRE-train.jsonl (10 MB)
- X-GENRE-test.jsonl (4 MB)
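
Since the splits are JSONL files with the attributes text, labels, dataset and language, a split can be read line by line and summarized with the Python standard library. The sketch below is illustrative only: the path X-GENRE-dataset/X-GENRE-train.jsonl assumes the archive was extracted into the working directory, UTF-8 encoding is assumed, and read_jsonl and summarize are hypothetical helper names. The record does not state whether labels holds a single string or a list, so the code accepts either.

```python
import json
from collections import Counter
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file (UTF-8 assumed)."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(records):
    """Count records per genre label, language, and source dataset."""
    labels, languages, sources = Counter(), Counter(), Counter()
    for record in records:
        value = record["labels"]
        # Assumption: "labels" may be a single string or a list of strings.
        for label in value if isinstance(value, list) else [value]:
            labels[label] += 1
        languages[record["language"]] += 1
        sources[record["dataset"]] += 1
    return labels, languages, sources


if __name__ == "__main__":
    # Assumed extraction path; adjust to the actual location of the split.
    train_path = Path("X-GENRE-dataset") / "X-GENRE-train.jsonl"
    if train_path.exists():
        labels, languages, sources = summarize(read_jsonl(train_path))
        for name, counts in (("labels", labels), ("language", languages), ("dataset", sources)):
            print(name, dict(counts))
    else:
        print(f"{train_path} not found; extract the archive first")
```

Comparing the resulting text counts with the figures in the description (for example, 1,772 texts in the train split) is a quick way to confirm that a split was read completely.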
