Show simple item record

 
dc.contributor.author Arhar Holdt, Špela
dc.contributor.author Antloga, Špela
dc.contributor.author Gantar, Polona
dc.contributor.author Munda, Tina
dc.contributor.author Robida, Nejc
dc.contributor.author Zgonc, Matjaž
dc.date.accessioned 2025-09-30T16:20:25Z
dc.date.available 2025-09-30T16:20:25Z
dc.date.issued 2025-09-30
dc.identifier.uri http://hdl.handle.net/11356/2052
dc.description DASSLE 1.0 (Dataset of Authentic and Synthetic Slovene Language Errors) comprises 7,385 manually prepared entries, each consisting of a Slovene sentence containing a single, specific language problem, its corrected version, and metadata including both coarse- and fine-grained correction classifications, as well as the source of the example. Language problems are divided into five top-level categories: spelling, orthography, morphology, vocabulary, and syntax. These are further specified using 128 fine-grained error types, aligned with the typology developed for the Šolar 3.0 corpus. The typology is explained at https://wiki.cjvt.si/books/11-developmental-corpus-solar/page/introduction-to-tags and in more detail in the annotation guidelines at https://wiki.cjvt.si/books/11-developmental-corpus-solar/page/annotation-guidelines. The examples in DASSLE 1.0 were sourced from four distinct origins, combining both authentic and synthetic data creation. From Šolar 3.0, the corpus of student writing with teacher-provided corrections, sentences were manually reviewed and edited to contain only one clearly defined language problem. In Gigafida 2.0, the reference corpus of standard written Slovene, examples were either manually corrected or deliberately corrupted to introduce typical deviations from the current norm. Synthetic examples were generated using GPT-4o, which was prompted with authentic sentence pairs; outputs were manually reviewed to select only those most closely resembling natural language use. A small number of examples were collected from Jezikovna svetovalnica, based on real language queries submitted by speakers. The dataset is primarily intended for the development and evaluation of natural language processing tools for automatic error detection and correction for Slovene. It is available in TSV format, accompanied by a README document that describes its contents in more detail.
dc.language.iso slv
dc.publisher Centre for Language Resources and Technologies, University of Ljubljana
dc.rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights.label PUB
dc.source.uri https://www.cjvt.si/llm4dh/en/work-packages/work-package-2/#task-2.3
dc.subject grammatical error correction
dc.subject language problem
dc.subject error annotation
dc.subject evaluation
dc.title Dataset of Authentic and Synthetic Slovene Language Errors DASSLE 1.0
dc.type lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType other
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Špela Arhar Holdt spela.arharholdt@ff.uni-lj.si Centre for Language Resources and Technologies, University of Ljubljana
sponsor ARIS (Slovenian Research and Innovation Agency) GC-0002 LLM4DH: Large Language Models for Digital Humanities nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
size.info 7385 entries
size.info 128 classes
files.count 1
files.size 378996


 Files in this item

Name: DASSLE-v1.zip
Size: 370.11 KB
Format: application/zip
Description: README and TSV data
MD5: 3c669371908b8d468b528b1df27ad461
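
The MD5 value listed for DASSLE-v1.zip can be used to check that a download arrived intact. The following is a minimal sketch using only the Python standard library; the local path `DASSLE-v1.zip` and the helper names `md5_of_file` and `verify_download` are illustrative assumptions, not part of the record.

```python
import hashlib
from pathlib import Path

# MD5 checksum as listed in this record for DASSLE-v1.zip.
EXPECTED_MD5 = "3c669371908b8d468b528b1df27ad461"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Hypothetical local path; adjust to wherever the archive was saved.
    archive = Path("DASSLE-v1.zip")
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found")
```

MD5 is suitable here only as an integrity check against accidental corruption, not as a security guarantee.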
 File preview
  • DASSLE-v1
    • DASSLE-v1-DATA.tsv (1 MB)
    • DASSLE-v1-README.md (3 kB)
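
As a rough sketch of how the TSV data could be read straight from the archive, the snippet below uses only the Python standard library. Several things here are assumptions rather than facts from this record: that the TSV is UTF-8 encoded, that its first row is a header, and that fields are unquoted. The column names are not listed in this record (the README in the archive documents them), so `count_by_column` takes the column name as a parameter, and both function names are hypothetical.

```python
import csv
import io
import zipfile
from collections import Counter


def read_tsv_from_zip(zip_path, suffix="-DATA.tsv"):
    """Read the first archive member whose name ends with `suffix`.

    Assumes a UTF-8, tab-separated file whose first row is a header.
    Returns a list of dicts keyed by the header names.
    """
    with zipfile.ZipFile(zip_path) as archive:
        members = [name for name in archive.namelist() if name.endswith(suffix)]
        if not members:
            raise FileNotFoundError(f"no member ending in {suffix!r} in {zip_path}")
        with archive.open(members[0]) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.DictReader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
            return list(reader)


def count_by_column(rows, column):
    """Count how many rows carry each distinct value in `column`."""
    return Counter(row[column] for row in rows)
```

If the assumptions hold, `len(read_tsv_from_zip("DASSLE-v1.zip"))` should agree with the 7385 entries stated in `size.info`, and calling `count_by_column` on the column that holds the top-level category would give the distribution over the five categories named in the description.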
