dc.contributor.author Kuzman, Taja
dc.contributor.author Ljubešić, Nikola
dc.date.accessioned 2024-09-25T09:10:36Z
dc.date.available 2024-09-25T09:10:36Z
dc.date.issued 2024-09-25
dc.identifier.uri http://hdl.handle.net/11356/1960
dc.description The X-GENRE dataset comprises almost 3,000 web texts in English and Slovenian, manually annotated with genre labels. The dataset supports automated genre identification, genre analyses and other web corpora research. Among other uses, it served for the development of the multilingual X-GENRE classifier (http://hdl.handle.net/11356/1961). The X-GENRE dataset was constructed by merging three manually annotated datasets, mapping their original schemata to a joint genre schema (the "X-GENRE schema"): 1) the Slovenian GINCO dataset (http://hdl.handle.net/11356/1467), 2) the English CORE dataset (https://github.com/TurkuNLP/CORE-corpus), and 3) the English FTD dataset (https://github.com/ssharoff/genre-keras). All of the original genre datasets are based on web corpora. The X-GENRE schema consists of 9 genre labels: Information/Explanation, News, Instruction, Opinion/Argumentation, Forum, Prose/Lyrical, Legal, Promotion and Other (refer to the README provided with the files for details on the labels). The dataset is divided into train, development and test splits. The train split consists of 1,772 texts and 1,940,317 words, the development split of 592 texts and 798,025 words, and the test split of 592 texts and 583,595 words. The splits are stratified by label. As the dataset consists of two English datasets and one Slovenian dataset, the distribution of texts in the two languages is roughly two to one: 2,063 English texts and 893 Slovenian texts. The dataset is in JSONL format. It has the following attributes: text (text instance), labels (genre label), dataset (original manually annotated genre dataset from which the instance was obtained – CORE, GINCO or FTD), and language (language of the text – Slovenian or English). This work received funding from the European Union’s Connecting Europe Facility 2014–2020 – CEF Telecom – under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the authors’ views. The Agency is not responsible for any use that may be made of the information it contains.
dc.language.iso slv
dc.language.iso eng
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby https://doi.org/10.3390/make5030059
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.subject manual annotation
dc.subject automatic genre identification
dc.subject genre corpus
dc.subject genre
dc.title English-Slovenian text genre dataset X-GENRE
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
demo.uri https://huggingface.co/datasets/TajaKuzman/X-GENRE-text-genre-dataset
contact.person Taja Kuzman taja.kuzman@ijs.si Jožef Stefan Institute
sponsor Connecting Europe Facility (CEF) Telecom INEA/CEF/ICT/A2020/2278341 MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages Other
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor ARRS (Slovenian Research Agency) N06-0099 and FWO-G070619N Linguistic landscape of hate speech on social media nationalFunds
size.info 3321937 words
size.info 2956 texts
files.count 1
files.size 6859836


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name X-GENRE-dataset.zip
Size 6.54 MB
Format application/zip
Description X-GENRE dataset: train, dev and test splits, and the README file
MD5 5b3d4c586a584ef37f9da1e81173abd6
 File Preview
  • X-GENRE-dataset
    • README.md (6 kB)
    • X-GENRE-dev.jsonl (3 MB)
    • X-GENRE-train.jsonl (10 MB)
    • X-GENRE-test.jsonl (4 MB)
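
The description states that the splits are JSONL files with the attributes text, labels, dataset and language. A minimal sketch of reading such a file and tallying records per label and per language might look like the following. The two sample records are hypothetical and stand in for real lines; the assumption that labels holds a single string and that language holds the values "English" or "Slovenian" is not confirmed by this record, so consult the README in the archive. The helper names read_jsonl and count_by are illustrative only.

```python
import io
import json
from collections import Counter


def read_jsonl(lines):
    """Parse an iterable of JSONL lines into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def count_by(records, field):
    """Count how many records carry each value of the given attribute."""
    return Counter(record[field] for record in records)


# Hypothetical two-record sample using the attribute names from the description.
# Assumption: "labels" is a single string per record (check the README).
sample = io.StringIO(
    '{"text": "Preheat the oven and mix the flour.", "labels": "Instruction", '
    '"dataset": "CORE", "language": "English"}\n'
    '{"text": "Vlada je danes sprejela nov zakon.", "labels": "News", '
    '"dataset": "GINCO", "language": "Slovenian"}\n'
)

records = read_jsonl(sample)
label_counts = count_by(records, "labels")
language_counts = count_by(records, "language")
```

To apply the same helpers to the extracted archive, the in-memory sample can be swapped for a real file handle, for example `open("X-GENRE-train.jsonl", encoding="utf-8")`, since a file object is also an iterable of lines.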
