dc.contributor.author Kosem, Iztok
dc.contributor.author Arhar Holdt, Špela
dc.contributor.author Zgaga, Karolina
dc.contributor.author Arčon, Tjaša
dc.date.accessioned 2025-12-23T17:06:19Z
dc.date.available 2025-12-23T17:06:19Z
dc.date.issued 2025-12-23
dc.identifier.uri http://hdl.handle.net/11356/2076
dc.description The dataset contains 59,598 collocation-distractor pairs for 2,856 headwords. A distractor is defined as an incorrect alternative to a collocation that can resemble the collocation in meaning and/or form. Headwords and their collocations were obtained from the Collocations Dictionary of Modern Slovene (http://hdl.handle.net/11356/1933), which is part of the Dictionary Database of Slovene (the database is available via API: https://wiki.cjvt.si/books/digital-dictionary-database-of-slovene). To be selected, a collocation had to be manually validated and assigned to one of the senses of the headword. The distractors were obtained with the gpt-4o-2024-08-06 (https://platform.openai.com/docs/models/gpt-4o) model, using the following prompt: “We are preparing a language game where the player will be given a headword, a collocation (combination of the headword and another word) and a distractor (a collocation that has the same headword, but the other word is not a collocate of the main word). For example "huge victory" is a collocation of "victory", but "rotten victory" is not so "rotten is a good distractor. The rules for forming distractors are the following: 1. Distractor has to be a single word. 2. Distractor has to have the same part of speech as the word being replaced (e.g. if the word next to the headword is a noun, the distractor should also be a noun). 3. Distractor should not include sensitive vocabulary, e.g. related to minorities, nationalities, religion, sexual content and similar. 4. Distractor has to be a word that is frequent in the Slovene language. 5. Distractor has to be a word that is completely unlikely to occur with the headword. Return the distractor in the same format as the examples below: Example: hiter - hitre rešitve (hiter + rešitve) - hitre težave (hiter + težava) Example: obljuba - držati obljubo (držati + obljuba) - najti obljubo (najti + obljuba). This is the headword: {headword}. 
This is the collocation: {collocation} ({all_collocation_parts}). The distractor has to be a collocation that contains the headword {headword} but is unlikely to occur with it. Only return the distractor in the correct format with the given headword {headword}. No explanations, no other text.” The evaluation was conducted in two parts, automatic and manual. Six decision labels were used: good, good-possible_collocation, bad-wrong_headword, bad-same_as_collocation, bad-grammar_problem, bad-collocation. The automatic evaluation used frequency information from a data warehouse containing all collocations from the Gigafida 2.0 corpus. If the distractor was found there, it was considered bad (bad-collocation), otherwise good. We also identified the distractors that were identical to the collocation provided (bad-same_as_collocation) and the distractors in which the headword was not kept (bad-wrong_headword). The remaining distractors were manually evaluated, and those with grammatical problems (collocate and headword not agreeing in case, number, gender etc.) were labelled as bad (bad-grammar_problem). All the others were considered good (45,772 out of 59,598, or 77 %); among these, we also annotated those judged to be possible collocations, i.e. combinations that sound plausible in the language, e.g. to party loudly (good-possible_collocation). The dataset also includes the frequency and logDice statistics of the collocations and the distractors in the Gigafida 2.0 reference corpus of Slovene (http://hdl.handle.net/11356/1320).
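The automatic part of the evaluation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `parse_pair` and `auto_label`, the representation of the Gigafida 2.0 collocation warehouse as a set of lemma tuples, and the order of the three checks are all assumptions made for illustration. The only detail taken from the description is the "surface form (lemma + lemma)" format shown in the prompt examples.

```python
import re

# Matches the format used in the prompt examples,
# e.g. "hitre težave (hiter + težava)".
PAIR_PATTERN = re.compile(r"^(?P<surface>.+?) \((?P<parts>.+)\)$")


def parse_pair(text):
    """Split 'surface (lemma + lemma)' into the surface form and a lemma tuple."""
    match = PAIR_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unexpected format: {text!r}")
    parts = tuple(part.strip() for part in match.group("parts").split("+"))
    return match.group("surface"), parts


def auto_label(headword, collocation, distractor, corpus_collocations):
    """Return an automatic 'bad-*' label, or None if manual evaluation is needed.

    corpus_collocations is a hypothetical stand-in for the collocation
    warehouse: a set of lemma tuples attested in the reference corpus.
    """
    _, collocation_parts = parse_pair(collocation)
    _, distractor_parts = parse_pair(distractor)

    if distractor_parts == collocation_parts:
        return "bad-same_as_collocation"
    if headword not in distractor_parts:
        return "bad-wrong_headword"
    if distractor_parts in corpus_collocations:
        return "bad-collocation"
    return None  # left for the manual evaluation step


# Usage with the example from the prompt and an empty stand-in warehouse.
label = auto_label(
    "hiter",
    "hitre rešitve (hiter + rešitve)",
    "hitre težave (hiter + težava)",
    corpus_collocations=set(),
)
print(label)
```

Comparing lemma tuples rather than surface strings is a design choice of this sketch, made because Slovene collocates are inflected and the headword rarely appears in its dictionary form.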
dc.language.iso slv
dc.publisher Faculty of Computer and Information Science, University of Ljubljana
dc.publisher Centre for Language Resources and Technologies, University of Ljubljana
dc.relation.isreferencedby https://elex.link/elex2025/wp-content/uploads/eLex2025-37-KosemArhar-Holdt.pdf
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.source.uri https://www.cjvt.si/llm4dh/en/
dc.subject collocations
dc.subject distractors
dc.subject large language models
dc.subject manual annotation
dc.title Dataset of annotated collocation-distractor pairs COLLDIST
dc.type lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType other
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Iztok Kosem iztok.kosem@fri.uni-lj.si Centre for Language Resources and Technologies, University of Ljubljana
sponsor Public Agency for Scientific Research and Innovation of the Republic of Slovenia GC-0002 Large Language Models for Digital Humanities (LLM4DH) nationalFunds
sponsor University of Ljubljana I0-0022 Network of Research Infrastructure Centres (MRIC) nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
size.info 59598 entries
files.count 1
files.size 1534093


 Files in this item

This item is publicly available and licensed under Creative Commons - Attribution 4.0 International (CC BY 4.0).
Name COLLDIST.zip
Size 1.46 MB
Format application/zip
Description database + Readme
MD5 b05fdc75fa1c7d88c700fcfa883ada92
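The MD5 value listed above can be used to check a downloaded copy of the archive. The following is a minimal sketch using Python's standard `hashlib` module. The function name `md5_of_file` and the local path `COLLDIST.zip` are assumptions. Only the expected digest comes from this record.

```python
import hashlib

# Digest as listed in this record for COLLDIST.zip.
EXPECTED_MD5 = "b05fdc75fa1c7d88c700fcfa883ada92"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path):
    """True if the file at the (assumed) local path has the listed digest."""
    return md5_of_file(path) == EXPECTED_MD5
```

Calling `matches_record("COLLDIST.zip")` on a complete download should return `True`. A mismatch most likely indicates a truncated or corrupted download.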
Archive contents: Readme.txt, COLLDIST-collocations-distractors-annotated.tsv
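A minimal sketch of reading the TSV file directly from the archive with Python's standard library follows. The record does not describe the file's columns, so the sketch makes no assumptions about column names. It does assume that the file is UTF-8 encoded, that its first row is a header, and that it sits at the top level of the archive. The function name `read_tsv_from_zip` is hypothetical. The Readme in the archive should be consulted for the actual layout.

```python
import csv
import io
import zipfile

# File name as shown in the archive preview of this record.
TSV_NAME = "COLLDIST-collocations-distractors-annotated.tsv"


def read_tsv_from_zip(archive, member=TSV_NAME):
    """Return (header, rows) from a TSV member of a zip archive.

    archive may be a path or a binary file-like object.
    Assumes UTF-8 text with the first row as a header.
    """
    with zipfile.ZipFile(archive) as zf:
        with zf.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
            rows = list(reader)
    if not rows:
        raise ValueError(f"{member} is empty")
    return rows[0], rows[1:]
```

For example, `header, rows = read_tsv_from_zip("COLLDIST.zip")` followed by `print(header)` shows the column names before any further processing is written against them.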
