dc.contributor.author Martinc, Matej
dc.contributor.author Dobrovoljc, Kaja
dc.contributor.author Pollak, Senja
dc.date.accessioned 2022-07-18T13:05:04Z
dc.date.available 2022-07-18T13:05:04Z
dc.date.issued 2022-07-15
dc.identifier.uri http://hdl.handle.net/11356/1651
dc.description This dataset is intended for evaluating systems for semantic change detection in Slovenian. The "semantic_shift_gs_dataset" folder contains 3 files:
1) "gigafida_to_1997_vs_2018.tsv": contains source texts from the Gigafida 2.0 reference corpus (http://www.gigafida.net/), dating either from 1997 (or earlier) or from 2018. The corpus, which can be used for training, domain adaptation or word representation extraction, is in .tsv format with the following columns:
   - 'title': title of the text
   - 'publisher': name of the text's publisher
   - 'date': year the text was published
   - 'type': type of the text (e.g., whether it was scraped from the internet or appeared in print)
   - 'text': the text in unprocessed form
2) "word_usage_annotations_1997_2018.tsv": contains example word usages for 105 predefined words. For each word, 30 usage examples (sentences) from 1997 and 30 usage examples from 2018 were extracted from the Gigafida 2.0 corpus. The sentences from the two time periods were randomly matched (i.e. each pair contains a random sentence from 1997 and a random sentence from 2018, both containing the same target word), resulting in 3150 sentence pairs. These pairs were annotated by three human annotators on a scale from 1 to 4:
   1: usages in the sentences are unrelated
   2: usages in the sentences are distantly related
   3: usages in the sentences are closely related
   4: usages are identical, i.e. they have the same sense
   Label 0 was also allowed, meaning "I can't decide", e.g. due to insufficient context.
   The file, in .tsv format, contains the following columns:
   - 'id': id of the sentence pair
   - 'word': target word
   - 'sentence 1997': sentence from 1997
   - 'sentence 2018': sentence from 2018
   - 'score_anno1': score given by annotator 1
   - 'score_anno2': score given by annotator 2
   - 'score_anno3': score given by annotator 3
3) "semantic_shift_scores.tsv": contains the final "gold standard" score for each word, obtained by averaging the scores across sentence pairs and across all three annotators to give a single numerical value per word. Examples containing zeros were excluded, and the word 'zenit' was removed from the list because too many of its sentence pairs contained zeros. The file, in .tsv format, contains the following columns:
   - 'word': target word
   - 'score': semantic change score
dc.language.iso slv
dc.publisher Jožef Stefan Institute
dc.rights The MIT License (MIT)
dc.rights.uri https://opensource.org/licenses/mit-license.php
dc.rights.label PUB
dc.source.uri https://slovenscina.eu/
dc.subject semantic change detection
dc.subject Slovenian
dc.subject evaluation
dc.title Semantic change detection datasets for Slovenian 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Matej Martinc matej.martinc@ijs.si Jožef Stefan Institute
sponsor Ministry of Culture C3340-20-278001 Development of Slovene in a Digital Environment Other
files.count 1
files.size 293574337
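
The averaging described in dc.description for "semantic_shift_scores.tsv" can be sketched as follows. This is a minimal illustration, not the authors' script. It assumes the column names listed for "word_usage_annotations_1997_2018.tsv", and it reads "examples containing zeros were excluded" as dropping a whole sentence pair when any annotator gave 0, which is an interpretation rather than something the record confirms. The function name `gold_scores` and the sample rows are made up for the example, and the sketch does not reproduce the removal of 'zenit', since the record gives no threshold for it.

```python
import csv
import io
from collections import defaultdict

# Column names as listed in the record's description of the annotation file.
SCORE_COLUMNS = ("score_anno1", "score_anno2", "score_anno3")


def gold_scores(tsv_file):
    """Average annotator scores per word over all usable sentence pairs.

    Assumption: a sentence pair is skipped entirely if any annotator
    assigned label 0 ("I can't decide").
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    # QUOTE_NONE keeps quotation marks inside sentences from confusing the parser.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        scores = [float(row[col]) for col in SCORE_COLUMNS]
        if any(score == 0 for score in scores):
            continue
        totals[row["word"]] += sum(scores)
        counts[row["word"]] += len(scores)
    return {word: totals[word] / counts[word] for word in totals}


# Made-up rows for illustration only; they are not taken from the dataset.
SAMPLE_TSV = (
    "id\tword\tsentence 1997\tsentence 2018\tscore_anno1\tscore_anno2\tscore_anno3\n"
    "1\talpha\ts1\ts2\t4\t3\t2\n"
    "2\talpha\ts3\ts4\t0\t4\t4\n"
    "3\talpha\ts5\ts6\t1\t2\t3\n"
    "4\tbeta\ts7\ts8\t4\t4\t4\n"
)

scores = gold_scores(io.StringIO(SAMPLE_TSV))
```

To run it on the real file, pass an open file handle instead of the sample string, for example `open("word_usage_annotations_1997_2018.tsv", encoding="utf-8", newline="")`. The UTF-8 encoding is an assumption.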


 Files in this item

This item is Publicly Available and licensed under: The MIT License (MIT)
Name: slovenian_semantic_shift_dataset.zip
Size: 279.97 MB
Format: application/zip
Description: slovenian_semantic_shift_dataset
MD5: 85a2883b1e787ebf738ac193527ae84a