Semantic change detection datasets for Slovenian 1.0

Name: Semantic change detection datasets for Slovenian 1.0
License: https://opensource.org/licenses/mit-license.php

Martinc, Matej; Dobrovoljc, Kaja; Pollak, Senja

dc.contributor.author	Martinc, Matej
dc.contributor.author	Dobrovoljc, Kaja
dc.contributor.author	Pollak, Senja
dc.date.accessioned	2022-07-18T13:05:04Z
dc.date.available	2022-07-18T13:05:04Z
dc.date.issued	2022-07-15
dc.identifier.uri	http://hdl.handle.net/11356/1651
dc.description	This dataset is meant for evaluating systems for semantic change detection in Slovenian. The entry contains the following files:
1) "gigafida_to_1997_vs_2018.tsv": contains sources from the Gigafida 2.0 reference corpus (http://www.gigafida.net/), dating either from 1997 (or earlier) or from 2018. The corpus, which can be used for training, domain adaptation or word representation extraction, is in .tsv format with 5 columns:
- 'title': title of the text
- 'publisher': name of the text's publisher
- 'date': year the text was published
- 'type': type of the text (e.g., whether it was scraped from the internet or appeared in print)
- 'text': the text in unprocessed form
2) "word_usage_annotations_1997_2018.tsv": contains example word usages for 105 predefined words. For each word, 30 usage examples (sentences) from 1997 and 30 usage examples from 2018 were extracted from the Gigafida 2.0 corpus. The sentences from the two time periods were randomly matched (i.e., each pair contains a random sentence from 1997 and a random sentence from 2018, both containing the same target word), resulting in 3150 sentence pairs. These pairs were annotated by three human annotators on a scale from 1 to 4:
1: usages in the sentences are unrelated
2: usages in the sentences are distantly related
3: usages in the sentences are closely related
4: usages are identical, i.e. they have the same sense
Label 0 was also allowed, meaning "I can't decide", e.g. due to insufficient context.
The file, in .tsv format, contains the following columns:
- 'id': id of the sentence pair
- 'word': target word
- 'sentence 1997': sentence from 1997
- 'sentence 2018': sentence from 2018
- 'score_anno1': score given by annotator 1
- 'score_anno2': score given by annotator 2
- 'score_anno3': score given by annotator 3
3) "semantic_shift_scores.tsv": contains the final "gold standard" score for each word, obtained by averaging the scores across sentence pairs and across all three annotators to give a single numerical value per word. Examples containing zeros were excluded, and the word 'zenit' was removed from the list because too many of its sentence pairs contained zeros. The file, in .tsv format, contains the following columns:
- 'word': target word
- 'score': semantic change score
4) "RSDO_semanticni-premiki_navodila_v0.pdf": annotation guidelines (in Slovenian)
dc.language.iso	slv
dc.publisher	Jožef Stefan Institute
dc.rights	The MIT License (MIT)
dc.rights.uri	https://opensource.org/licenses/mit-license.php
dc.rights.label	PUB
dc.source.uri	https://slovenscina.eu/
dc.subject	semantic change detection
dc.subject	Slovenian
dc.subject	evaluation
dc.title	Semantic change detection datasets for Slovenian 1.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Matej Martinc matej.martinc@ijs.si Jožef Stefan Institute
sponsor	Ministry of Culture C3340-20-278001 Development of Slovene in a Digital Environment Other
files.count	1
files.size	293946582
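
The description above says the per-word gold scores were obtained by averaging annotator scores over sentence pairs, with zero-labelled examples excluded. A minimal Python sketch of that aggregation is given here for illustration only. It assumes the annotation file has a header row with the documented column names, and it assumes that a sentence pair is dropped as a whole when any annotator chose 0; the record does not state which exclusion rule the authors actually applied, so results may differ from "semantic_shift_scores.tsv". The function name and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict

ANNOTATOR_COLUMNS = ("score_anno1", "score_anno2", "score_anno3")


def word_scores(rows):
    """Average annotator scores per target word, skipping pairs with a 0 label."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        scores = [float(row[col]) for col in ANNOTATOR_COLUMNS]
        # Assumption: the whole pair is dropped if any annotator chose 0
        # ("I can't decide"); the record does not spell out the exact rule.
        if any(score == 0 for score in scores):
            continue
        totals[row["word"]] += sum(scores)
        counts[row["word"]] += len(scores)
    return {word: totals[word] / counts[word] for word in totals}


# Tiny made-up sample in the documented column layout (not real data).
SAMPLE_TSV = (
    "id\tword\tsentence 1997\tsentence 2018\tscore_anno1\tscore_anno2\tscore_anno3\n"
    "1\talfa\ta\tb\t4\t4\t3\n"
    "2\talfa\tc\td\t0\t2\t2\n"
    "3\tbeta\te\tf\t1\t2\t3\n"
)

sample_rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
sample_scores = word_scores(sample_rows)
```

To run it on the real file, the sample string would be replaced by an open handle on "word_usage_annotations_1997_2018.tsv" (UTF-8 encoding is an assumption). Passing quoting=csv.QUOTE_NONE to the reader may be needed if the sentences contain unbalanced quotation marks.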

Files in this item

This item is publicly available and licensed under the MIT License (MIT).

Name: slovenian_semantic_shift_dataset.zip
Size: 280.33 MB
Format: application/zip
Description: Dataset with annotation guidelines (in Slovenian)
MD5: e183282df2e5957b519a87a71d9b63ae
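
The MD5 value above can be used to check that the archive was downloaded intact. The following Python sketch shows one way to do this with the standard library; the helper names and the local file path are illustrative assumptions, not part of the repository's tooling.

```python
import hashlib

# Checksum listed in this record for slovenian_semantic_shift_dataset.zip.
EXPECTED_MD5 = "e183282df2e5957b519a87a71d9b63ae"


def md5_of_stream(stream, chunk_size=1 << 20):
    """Return the hex MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Compare the MD5 of the file at `path` (hypothetical local path) with `expected`."""
    with open(path, "rb") as handle:
        return md5_of_stream(handle) == expected
```

For example, archive_is_intact("slovenian_semantic_shift_dataset.zip") would return True for an undamaged download saved under that name in the current directory.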

File Preview

slovenian_semantic_shift_dataset
- gigafida_to_1997_vs_2018.tsv (834 MB)
- semantic_shift_scores.tsv (1 kB)
- readme.txt (2 kB)
- word_usage_annotations_1997_2018.tsv (963 kB)
- RSDO_semanticni-premiki_navodila_v0.pdf (399 kB)
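
Because the corpus file is listed at 834 MB, it may be more practical to stream it row by row than to load it whole. The sketch below counts texts per publication year using only the Python standard library. It assumes the file has a header row with the column names given in the description ('title', 'publisher', 'date', 'type', 'text') and that fields are not quoted; neither is confirmed by this record. The function name and sample data are hypothetical.

```python
import csv
import io
from collections import Counter

# Full texts can exceed the csv module's default per-field limit (131072
# characters), so raise it; the exact value chosen here is arbitrary.
csv.field_size_limit(2**31 - 1)


def count_texts_by_year(lines):
    """Count rows per value of the 'date' column, reading one row at a time."""
    # Assumption: tab-separated, header row present, no quoting of fields.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row["date"] for row in reader)


# Tiny made-up sample in the documented column layout (not real data).
SAMPLE_CORPUS = (
    "title\tpublisher\tdate\ttype\ttext\n"
    "t1\tp1\t1997\tprint\tfirst text\n"
    "t2\tp2\t2018\tinternet\tsecond text\n"
    "t3\tp1\t1997\tprint\tthird text\n"
)

sample_counts = count_texts_by_year(io.StringIO(SAMPLE_CORPUS))
```

For the real file, the StringIO object would be replaced by open("gigafida_to_1997_vs_2018.tsv", encoding="utf-8", newline=""), where the encoding is again an assumption.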