Slovene Web genre identification corpus GINCO 1.0

Name: Slovene Web genre identification corpus GINCO 1.0
License: https://creativecommons.org/licenses/by-sa/4.0/

Kuzman, Taja; Brglez, Mojca; Rupnik, Peter; Ljubešić, Nikola

dc.contributor.author	Kuzman, Taja
dc.contributor.author	Brglez, Mojca
dc.contributor.author	Rupnik, Peter
dc.contributor.author	Ljubešić, Nikola
dc.date.accessioned	2021-12-03T10:38:56Z
dc.date.available	2021-12-03T10:38:56Z
dc.date.issued	2021-12-02
dc.identifier.uri	http://hdl.handle.net/11356/1467
dc.description	The Slovene Web genre identification corpus GINCO 1.0 contains web texts, manually annotated with genre, from two Slovene web corpora: the slWaC 2.0 corpus, crawled in 2014, and a web corpus crawled in 2021 within the MaCoCu project. The corpus supports automated genre identification, genre analysis, and other web corpus research. It comprises two parts:
- a subcorpus of suitable texts, containing 1002 texts (478,969 words), manually annotated with 24 genre categories (News/Reporting, Announcement, Research Article, Instruction, Recipe, Call (such as a Call for Papers), Legal/Regulation, Information/Explanation, Opinionated News, Review, Opinion/Argumentation, Promotion of a Product, Promotion of Services, Invitation, Promotion, Interview, Forum, Correspondence, Script/Drama, Prose, Lyrical, FAQ (Frequently Asked Questions), List of Summaries/Excerpts, and Other);
- a subcorpus of unsuitable texts, containing 123 texts (173,778 words), discarded as not suitable for genre annotation for reasons encoded by the labels Machine Translation, Generated Text, Not Slovene, Encoding Issues, HTML Source Code, Boilerplate, Too Short/Incoherent, Too Long (longer than 5,000 words), Non-Textual (no full sentences, e.g. tables, lists), and Multiple texts.
The texts in the suitable subcorpus are annotated with up to three genre categories: the primary label is the most prevalent genre, and the secondary and tertiary labels denote the presence of additional genres. The labels are encoded at three levels of detail, allowing experiments with the full set (24 labels), a set of 21 labels (labels with fewer than 5 instances are merged into the label Other), and a set of 12 labels (similar labels are merged). The corpus also contains metadata about each text (e.g. url, domain, year) and its paragraphs (e.g. near-duplicate status and usefulness for genre identification).
dc.language.iso	slv
dc.publisher	Jožef Stefan Institute
dc.relation.isreferencedby	https://aclanthology.org/2022.lrec-1.170.pdf
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	https://macocu.eu/
dc.subject	manual annotation
dc.subject	genre
dc.subject	automatic genre identification
dc.subject	web corpus
dc.subject	genre corpus
dc.title	Slovene Web genre identification corpus GINCO 1.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Taja Kuzman taja.kuzman@ijs.si Jožef Stefan Institute
sponsor	Connecting Europe Facility (CEF) Telecom INEA/CEF/ICT/A2020/2278341 MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages Other
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info	1125 texts
size.info	652747 words
files.count	2
files.size	1856811

Files in this item

Download all files in item (1.77 MB)

This item is

Publicly Available

and licensed under:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Name: GINCO-1.0-suitable.json.zip
Size: 1.36 MB
Format: application/zip
Description: Subcorpus of GINCO 1.0 with texts, manually annotated with genre.
MD5: 04b0a441b3e82c74bb7997a55dbda40c

GINCO-1.0-suitable.json
- README.txt (8 kB)
- GINCO-1.0-suitable.json (5 MB)
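
As a rough illustration of working with the archive listed above, the following Python sketch reads the first .json member directly from a downloaded zip file without unpacking it. The helper name load_json_from_zip is hypothetical, and the layout of the JSON itself is not described on this page: the usage part assumes the top level is a list with one record per text, which should be checked against the README.txt in the archive.

```python
import json
import zipfile
from pathlib import Path


def load_json_from_zip(zip_path, member_suffix=".json"):
    """Parse and return the first archive member whose name ends with member_suffix.

    Hypothetical helper for illustration; it makes no assumption about the
    structure of the JSON content itself.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Directory entries end with "/", so they never match the suffix.
        members = [name for name in archive.namelist() if name.endswith(member_suffix)]
        if not members:
            raise FileNotFoundError(f"no {member_suffix} member found in {zip_path}")
        with archive.open(members[0]) as handle:
            # json.load accepts a binary file object and detects the encoding.
            return json.load(handle)


if __name__ == "__main__":
    archive_path = Path("GINCO-1.0-suitable.json.zip")  # file name as listed on this page
    if archive_path.is_file():
        data = load_json_from_zip(archive_path)
        # Assumption: the top level is a list of text records.
        if isinstance(data, list):
            print(f"{len(data)} records loaded")
        else:
            print(f"top-level JSON type: {type(data).__name__}")
```

If the list assumption holds, the record count can be compared with the 1002 texts stated in the description.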

Name: GINCO-1.0-nonsuitable.json.zip
Size: 422.23 KB
Format: application/zip
Description: Subcorpus of GINCO 1.0 with texts, not suitable for genre annotation.
MD5: c18e0cc6c8cb6a28c2502ec680956fcd

GINCO-1.0-nonsuitable.json
- README.txt (4 kB)
- GINCO-1.0-nonsuitable.json (1 MB)
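
The MD5 values listed for the two archives can be used to check that a download is intact. The sketch below is one minimal way to do this with Python's standard hashlib module. The function names and the assumption that both zip files sit in the current directory under their listed names are illustrative only. The checksums are copied from this page.

```python
import hashlib
from pathlib import Path

# File names and MD5 checksums as listed on this page.
EXPECTED_MD5 = {
    "GINCO-1.0-suitable.json.zip": "04b0a441b3e82c74bb7997a55dbda40c",
    "GINCO-1.0-nonsuitable.json.zip": "c18e0cc6c8cb6a28c2502ec680956fcd",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to True only if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    # Assumption: both archives were saved to the current directory.
    for name, ok in verify_downloads().items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

MD5 is suitable here only as an integrity check against accidental corruption, not as a security guarantee.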
