dc.contributor.author Kuzman, Taja
dc.contributor.author Brglez, Mojca
dc.contributor.author Rupnik, Peter
dc.contributor.author Ljubešić, Nikola
dc.date.accessioned 2021-12-03T10:38:56Z
dc.date.available 2021-12-03T10:38:56Z
dc.date.issued 2021-12-02
dc.identifier.uri http://hdl.handle.net/11356/1467
dc.description The Slovene Web genre identification corpus GINCO 1.0 contains web texts, manually annotated with genre, drawn from two Slovene web corpora: the slWaC 2.0 corpus, crawled in 2014, and a web corpus crawled in 2021 within the MaCoCu project. The corpus supports automated genre identification, genre analysis, and other web corpus research. It comprises two parts:
- a subcorpus of suitable texts, containing 1002 texts (478,969 words), manually annotated with 24 genre categories (News/Reporting, Announcement, Research Article, Instruction, Recipe, Call (such as a Call for Papers), Legal/Regulation, Information/Explanation, Opinionated News, Review, Opinion/Argumentation, Promotion of a Product, Promotion of Services, Invitation, Promotion, Interview, Forum, Correspondence, Script/Drama, Prose, Lyrical, FAQ (Frequently Asked Questions), List of Summaries/Excerpts, and Other);
- a subcorpus of unsuitable texts, containing 123 texts (173,778 words), discarded as not suitable for genre annotation for reasons encoded by the labels (Machine Translation, Generated Text, Not Slovene, Encoding Issues, HTML Source Code, Boilerplate, Too Short/Incoherent, Too Long (longer than 5,000 words), Non-Textual (no full sentences, e.g. tables, lists), and Multiple texts).
The texts in the suitable subcorpus are annotated with up to three genre categories: the primary label marks the most prevalent genre, and the secondary and tertiary labels mark the presence of additional genres. The labels are encoded at three levels of detail, allowing experiments with the full set (24 labels), a set of 21 labels (labels with fewer than 5 instances are merged into the label Other), and a set of 12 labels (similar labels are merged). The corpus also contains some metadata about each text (e.g. URL, domain, year) and its paragraphs (e.g. near-duplicate status and usefulness for genre identification).
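The rule behind the 21-label set (labels with fewer than 5 instances are merged into Other) can be illustrated with a minimal sketch. The corpus already ships with all three label sets, so this is not needed to use the data; the function name `merge_rare_labels` and the plain list-of-strings input are assumptions made for illustration, not the corpus's own field names or tooling.

```python
from collections import Counter


def merge_rare_labels(labels, min_count=5, fallback="Other"):
    """Replace every label that occurs fewer than `min_count` times with `fallback`.

    `labels` is a hypothetical list of primary genre labels, one per text.
    """
    counts = Counter(labels)
    return [
        label if label == fallback or counts[label] >= min_count else fallback
        for label in labels
    ]


# Toy input, not real corpus counts.
toy_labels = ["News/Reporting"] * 6 + ["Recipe"] * 2 + ["Other"]
merged = merge_rare_labels(toy_labels)
print(Counter(merged))
```

With the toy input above, the two Recipe instances fall below the threshold and are folded into Other, while News/Reporting is kept as is.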
dc.language.iso slv
dc.publisher Jožef Stefan Institute
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://macocu.eu/
dc.subject manual annotation
dc.subject genre
dc.subject automatic genre identification
dc.subject web corpus
dc.subject genre corpus
dc.title Slovene Web genre identification corpus GINCO 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Taja Kuzman taja.kuzman@ijs.si Jožef Stefan Institute
sponsor Connecting Europe Facility (CEF) Telecom INEA/CEF/ICT/A2020/2278341 MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages Other
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info 1125 texts
size.info 652747 words
files.count 2
files.size 1856811


 Files in this item

 Download all files in item (1.77 MB)
This item is publicly available and licensed under:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name GINCO-1.0-suitable.json.zip
Size 1.36 MB
Format application/zip
Description Subcorpus of GINCO 1.0 with texts manually annotated with genre.
MD5 04b0a441b3e82c74bb7997a55dbda40c
Name GINCO-1.0-nonsuitable.json.zip
Size 422.23 KB
Format application/zip
Description Subcorpus of GINCO 1.0 with texts not suitable for genre annotation.
MD5 c18e0cc6c8cb6a28c2502ec680956fcd
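The MD5 values listed for the two files can be used to check a download for corruption. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify_download` are assumptions for illustration, and the files are assumed to have been saved locally under the names given in this record.

```python
import hashlib
from pathlib import Path

# File names and checksums as listed in this record.
EXPECTED_MD5 = {
    "GINCO-1.0-suitable.json.zip": "04b0a441b3e82c74bb7997a55dbda40c",
    "GINCO-1.0-nonsuitable.json.zip": "c18e0cc6c8cb6a28c2502ec680956fcd",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the value listed for its name.

    Raises KeyError if the file name is not one of the two listed files.
    """
    path = Path(path)
    return md5_of_file(path) == EXPECTED_MD5[path.name]
```

For example, `verify_download("GINCO-1.0-suitable.json.zip")` returns `True` only when the local copy matches the listed checksum. MD5 is suitable here as an integrity check against transfer errors, not as a security guarantee.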
