dc.contributor.author Ljubešić, Nikola
dc.contributor.author Kuzman, Taja
dc.contributor.author Rupnik, Peter
dc.contributor.author Milosavljević, Stefan
dc.contributor.author Galant, Nada
dc.contributor.author Benčina, Sonja
dc.contributor.author Čibej, Jaka
dc.date.accessioned 2024-04-24T16:05:16Z
dc.date.available 2024-04-24T16:05:16Z
dc.date.issued 2024-04-26
dc.identifier.uri http://hdl.handle.net/11356/1766
dc.description The DIALECT-COPA datasets comprise Choice of Plausible Alternatives (COPA) datasets for three South Slavic dialects: (1) COPA-SL-CER for the Cerkno dialect of Slovenian, spoken in the Slovenian Littoral region, specifically from the town of Idrija; (2) COPA-HR-CKM for the Chakavian dialect of Croatian from the northern Adriatic, specifically from the town of Žminj; (3) COPA-SR-TOR for the Torlak dialect from southeastern Serbia, specifically from the town of Lebane. The datasets were translated from the English COPA dataset (https://people.ict.usc.edu/~gordon/copa.html) by native dialect speakers, following the XCOPA dataset translation methodology (https://arxiv.org/abs/2005.00333). A novelty of the DIALECT-COPA translation approach is that both the English text and its translation into the corresponding standard South Slavic language were available to the translator during the translation process. Each instance consists of a premise (My body cast a shadow over the grass), a question (What is the cause? / What happened as a result?), and two choices (The sun was rising; The grass was cut), with a label encoding which of the choices is more plausible according to the annotator or translator (The sun was rising). The datasets follow the same format as the Croatian COPA-HR dataset (http://hdl.handle.net/11356/1404), the Macedonian COPA-MK dataset (http://hdl.handle.net/11356/1687) and the Serbian COPA-SR dataset (http://hdl.handle.net/11356/1708). Each dataset is split into training (400 instances) and validation (100 instances) JSONL files. The test split (500 instances), which is usually part of COPA datasets, has been withheld and can be shared upon request. This is to prevent the test instances from being included in the training data of future large language models, which would invalidate the benchmark measurements.
The DIALECT-COPA datasets are published as part of the DIALECT-COPA shared task at the VarDial 2024 workshop, where they were used as gold data for evaluating the performance of large language models on South Slavic dialects (https://sites.google.com/view/vardial-2024/shared-tasks/dialect-copa).
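As a rough illustration of the instance structure described above, the following Python sketch parses JSONL lines and picks out the more plausible choice. The field names (premise, choice1, choice2, question, label, idx) and the convention that label 0 points to choice1 are assumptions based on the usual XCOPA-style layout and are not stated in this record; the sample line is made up from the English example in the description. Check an actual file before relying on them.

```python
import json

# Illustrative sample only: built from the English example in the description.
# Field names and the 0/1 label convention are ASSUMED (XCOPA-style), not
# confirmed by this record.
SAMPLE_LINES = [
    '{"premise": "My body cast a shadow over the grass.", '
    '"choice1": "The sun was rising.", "choice2": "The grass was cut.", '
    '"question": "cause", "label": 0, "idx": 1}',
]


def parse_jsonl(lines):
    """Parse an iterable of JSONL lines into a list of dicts, skipping blanks."""
    return [json.loads(line) for line in lines if line.strip()]


def load_split(path):
    """Read one split, e.g. copa-sl-cer.train.jsonl, from a local path."""
    with open(path, encoding="utf-8") as handle:
        return parse_jsonl(handle)


def plausible_choice(instance):
    """Return the text of the choice the label points to (assumes 0 = choice1)."""
    return instance["choice1"] if instance["label"] == 0 else instance["choice2"]


instances = parse_jsonl(SAMPLE_LINES)
for item in instances:
    print(item["question"], "|", item["premise"], "|", plausible_choice(item))
```

With the files downloaded, load_split("copa-sl-cer.train.jsonl") would be expected to return 400 instances and the corresponding validation file 100, according to the split sizes given in the description.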
dc.language.iso slv
dc.language.iso hrv
dc.language.iso srp
dc.language.iso ckm
dc.publisher Jožef Stefan Institute
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://sites.google.com/view/vardial-2024/shared-tasks/dialect-copa
dc.subject commonsense reasoning
dc.subject manual translation
dc.title "Choice of plausible alternatives" datasets in South Slavic dialects DIALECT-COPA
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info 1500 items
size.info 6 files
size.info 279 kb
files.count 6
files.size 286402


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Total size of all files: 279.69 KB

Name                      Size      Format   Description                      MD5
copa-sl-cer.train.jsonl   72.64 KB  Unknown  Training split of COPA-SL-CER    e7f31b5a4c4f1677d67e77132535a5dc
copa-sl-cer.val.jsonl     18.45 KB  Unknown  Validation split of COPA-SL-CER  09bbfe4d5dce661e748bb798e943f032
copa-hr-ckm.train.jsonl   76.68 KB  Unknown  Training split of COPA-HR-CKM    5c84b2efc1fe43321a8e13013a1279a7
copa-hr-ckm.val.jsonl     19.4 KB   Unknown  Validation split of COPA-HR-CKM  c5a20701d803fee42cba90c7878fc72b
copa-sr-tor.train.jsonl   74.18 KB  Unknown  Training split of COPA-SR-TOR    630aca10135c12d668a68ab1e62375b4
copa-sr-tor.val.jsonl     18.34 KB  Unknown  Validation split of COPA-SR-TOR  9ef13b5ab79bd9e2965f16147fea1816
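
The MD5 values listed for the six files can be used to check that a download is intact. The sketch below uses Python's standard hashlib module; the checksums are copied from the file listing in this record, while the function names and the assumption that the files sit in a single local directory are illustrative only.

```python
import hashlib
from pathlib import Path

# MD5 checksums as listed for the files in this item.
EXPECTED_MD5 = {
    "copa-sl-cer.train.jsonl": "e7f31b5a4c4f1677d67e77132535a5dc",
    "copa-sl-cer.val.jsonl": "09bbfe4d5dce661e748bb798e943f032",
    "copa-hr-ckm.train.jsonl": "5c84b2efc1fe43321a8e13013a1279a7",
    "copa-hr-ckm.val.jsonl": "c5a20701d803fee42cba90c7878fc72b",
    "copa-sr-tor.train.jsonl": "630aca10135c12d668a68ab1e62375b4",
    "copa-sr-tor.val.jsonl": "9ef13b5ab79bd9e2965f16147fea1816",
}


def md5_of_file(path, chunk_size=65536):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    # Assumes the files were saved to the current working directory.
    for name, ok in verify_downloads(".").items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

A file reported as "missing or checksum mismatch" is either absent from the directory or differs from the deposited version, in which case downloading it again is the simplest remedy.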
