Slovene Natural Language Inference Dataset SI-NLI

Name: Slovene Natural Language Inference Dataset SI-NLI
License: https://creativecommons.org/licenses/by-nc-sa/4.0/
Keywords: natural language inference

Klemen, Matej; Žagar, Aleš; Čibej, Jaka; Robnik-Šikonja, Marko

dc.contributor.author	Klemen, Matej
dc.contributor.author	Žagar, Aleš
dc.contributor.author	Čibej, Jaka
dc.contributor.author	Robnik-Šikonja, Marko
dc.date.accessioned	2022-11-13T10:01:36Z
dc.date.available	2022-11-13T10:01:36Z
dc.date.issued	2022-11-13
dc.identifier.uri	http://hdl.handle.net/11356/1707
dc.description	SI-NLI (Slovene Natural Language Inference Dataset) contains 5,937 human-created Slovene sentence pairs (premise and hypothesis) that are manually labeled as "entailment", "contradiction", or "neutral". We created the dataset using sentences that appear in the Slovenian reference corpus ccKres (http://hdl.handle.net/11356/1034). Annotators were asked to modify the hypothesis in a candidate pair so that it reflects one of the labels. The dataset is balanced because the annotators created three modifications (entailment, contradiction, neutral) for each candidate sentence pair. The dataset is split into train, validation, and test sets of 4,392, 547, and 998 pairs, respectively. We used Slovenian pre-trained language models to create the splits, thereby ensuring that difficult and easy instances are evenly distributed across all three subsets. The dataset is released in tabular TSV format. The README.txt file describes the attributes. The test set contains only the premise and hypothesis (i.e., no labels), since SI-NLI is integrated into the Slovene evaluation framework SloBENCH (https://slobench.cjvt.si/). If you use the dataset to train your models, please consider submitting your test set predictions to SloBENCH to obtain an evaluation score and see how it compares to others.
dc.language.iso	slv
dc.publisher	Faculty of Computer and Information Science, University of Ljubljana
dc.rights	Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights.label	PUB
dc.subject	natural language inference
dc.title	Slovene Natural Language Inference Dataset SI-NLI
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Matej Klemen Matej.Klemen@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana
contact.person	Aleš Žagar Ales.Zagar@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana
contact.person	Jaka Čibej Jaka.Cibej@ff.uni-lj.si Faculty of Arts, University of Ljubljana
sponsor	Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info	5937 units
files.count	1
files.size	410093
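
The description states that the dataset is released as TSV and that the labeled splits are balanced across the three labels. A minimal sketch of how one might load a split and check that balance is given below. The file name train.tsv and the column name "label" are assumptions made for illustration only; the actual file and attribute names are documented in README.txt inside the archive.

```python
import csv
from collections import Counter
from pathlib import Path


def read_pairs(path):
    """Read a TSV file with a header row into a list of dicts."""
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks inside sentences as ordinary text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)


def label_counts(rows, label_column="label"):
    """Count rows per label value.

    `label_column` is a hypothetical column name. Rows with a missing or
    empty label (as expected for the unlabeled test set) are counted
    under the key None.
    """
    return Counter(row.get(label_column) or None for row in rows)


if __name__ == "__main__":
    train_path = Path("train.tsv")  # hypothetical file name, see README.txt
    if train_path.exists():
        rows = read_pairs(train_path)
        print(len(rows), label_counts(rows))
```

If the assumed names match the released files, the counts for the train and validation splits should be roughly equal across "entailment", "contradiction", and "neutral", while the test split should show no label values at all.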