"Choice of plausible alternatives" datasets in South Slavic dialects DIALECT-COPA

Name: "Choice of plausible alternatives" datasets in South Slavic dialects DIALECT-COPA
License: https://creativecommons.org/licenses/by-sa/4.0/

Ljubešić, Nikola; Kuzman, Taja; Rupnik, Peter; Milosavljević, Stefan; Galant, Nada; Benčina, Sonja; Čibej, Jaka

dc.contributor.author	Ljubešić, Nikola
dc.contributor.author	Kuzman, Taja
dc.contributor.author	Rupnik, Peter
dc.contributor.author	Milosavljević, Stefan
dc.contributor.author	Galant, Nada
dc.contributor.author	Benčina, Sonja
dc.contributor.author	Čibej, Jaka
dc.date.accessioned	2024-04-24T16:05:16Z
dc.date.available	2024-04-24T16:05:16Z
dc.date.issued	2024-04-26
dc.identifier.uri	http://hdl.handle.net/11356/1766
dc.description	The DIALECT-COPA datasets comprise Choice of Plausible Alternatives (COPA) datasets for three South Slavic dialects: (1) COPA-SL-CER for the Cerkno dialect of Slovenian, spoken in the Slovenian Littoral region, specifically from the town of Idrija; (2) COPA-HR-CKM for the Chakavian dialect of Croatian from the northern Adriatic, specifically from the town of Žminj; (3) COPA-SR-TOR for the Torlak dialect from southeastern Serbia, specifically from the town of Lebane. The datasets were translated from the English COPA dataset (https://people.ict.usc.edu/~gordon/copa.html) by native dialect speakers, following the XCOPA dataset translation methodology (https://arxiv.org/abs/2005.00333). A novelty in the DIALECT-COPA translation approach is that both the English original and its translation into the corresponding standard South Slavic language were available to the translator during the translation process. Each instance consists of a premise (My body cast a shadow over the grass), a question (What is the cause? / What happened as a result?), and two choices (The sun was rising; The grass was cut), with a label encoding which of the choices is more plausible according to the annotator or translator (The sun was rising). The datasets follow the same format as the Croatian COPA-HR dataset (http://hdl.handle.net/11356/1404), the Macedonian COPA-MK dataset (http://hdl.handle.net/11356/1687) and the Serbian COPA-SR dataset (http://hdl.handle.net/11356/1708). Each dataset is split into training (400 instances) and validation (100 instances) JSONL files. The test split (500 instances), which is usually part of COPA datasets, has been withheld and can be shared upon request. The reason for this is to prevent the inclusion of the test instances in the training data of future large language models, which would invalidate the benchmark measurements.
The DIALECT-COPA datasets are published as part of the DIALECT-COPA shared task at the VarDial 2024 workshop, where they were used as gold data for evaluating the performance of large language models on South Slavic dialects (https://sites.google.com/view/vardial-2024/shared-tasks/dialect-copa).
dc.language.iso	slv
dc.language.iso	hrv
dc.language.iso	srp
dc.language.iso	ckm
dc.publisher	Jožef Stefan Institute
dc.relation.isreferencedby	https://aclanthology.org/2024.vardial-1.7/
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	https://sites.google.com/view/vardial-2024/shared-tasks/dialect-copa
dc.subject	commonsense reasoning
dc.subject	manual translation
dc.title	"Choice of plausible alternatives" datasets in South Slavic dialects DIALECT-COPA
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds
sponsor	Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info	1500 items
size.info	6 files
size.info	279 kb
files.count	6
files.size	286402
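
The instance structure given in dc.description (premise, question, two choices, label) can be illustrated with a minimal Python sketch for reading one of the JSONL files. The field names (premise, choice1, choice2, question, label, idx), the question values "cause" and "effect", and the 0-based label are assumptions modelled on the usual COPA/XCOPA JSONL layout; this record does not list them, so they should be checked against the downloaded files. The sample line is built from the example in the description and is not copied from the dataset.

```python
import json

# ASSUMPTION: field names and value conventions follow the common
# COPA/XCOPA JSONL layout; this record does not spell them out.
SAMPLE_JSONL = (
    '{"premise": "My body cast a shadow over the grass", '
    '"choice1": "The sun was rising", "choice2": "The grass was cut", '
    '"question": "cause", "label": 0, "idx": 1}\n'
)

# Question wording taken from the dataset description; the mapping from
# "cause"/"effect" to these strings is an assumption.
QUESTION_TEXT = {
    "cause": "What is the cause?",
    "effect": "What happened as a result?",
}


def parse_copa_jsonl(lines):
    """Parse an iterable of JSONL lines into a list of dicts, skipping blanks."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


def plausible_choice(record):
    """Return the text of the choice marked as more plausible (label 0 or 1)."""
    return record["choice1"] if record["label"] == 0 else record["choice2"]


records = parse_copa_jsonl(SAMPLE_JSONL.splitlines())
first = records[0]
print(first["premise"])
print(QUESTION_TEXT[first["question"]])
print(plausible_choice(first))  # The sun was rising
```

To read an actual split, the same function can be given an open file handle, for example `parse_copa_jsonl(open(path, encoding="utf-8"))`, where `path` points to one of the six downloaded files.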