Dataset of Authentic and Synthetic Slovene Language Errors DASSLE 1.0

Name: Dataset of Authentic and Synthetic Slovene Language Errors DASSLE 1.0
License: https://creativecommons.org/licenses/by-nc-sa/4.0/

Arhar Holdt, Špela; Antloga, Špela; Gantar, Polona; Munda, Tina; Robida, Nejc; Zgonc, Matjaž

dc.contributor.author	Arhar Holdt, Špela
dc.contributor.author	Antloga, Špela
dc.contributor.author	Gantar, Polona
dc.contributor.author	Munda, Tina
dc.contributor.author	Robida, Nejc
dc.contributor.author	Zgonc, Matjaž
dc.date.accessioned	2025-09-30T16:20:25Z
dc.date.available	2025-09-30T16:20:25Z
dc.date.issued	2025-09-30
dc.identifier.uri	http://hdl.handle.net/11356/2052
dc.description	DASSLE 1.0 (Dataset of Authentic and Synthetic Slovene Language Errors) comprises 7,385 manually prepared entries, each consisting of a Slovene sentence containing a single, specific language problem, its corrected version, and metadata including both coarse- and fine-grained correction classifications, as well as the source of the example. Language problems are divided into five top-level categories: spelling, orthography, morphology, vocabulary, and syntax. These are further specified using 128 fine-grained error types, aligned with the typology developed for the Šolar 3.0 corpus. The typology is explained at https://wiki.cjvt.si/books/11-developmental-corpus-solar/page/introduction-to-tags and in more detail in the annotation guidelines at https://wiki.cjvt.si/books/11-developmental-corpus-solar/page/annotation-guidelines. The examples in DASSLE 1.0 were sourced from four distinct origins, combining both authentic and synthetic data creation. From Šolar 3.0, the corpus of student writing with teacher-provided corrections, sentences were manually reviewed and edited to contain only one clearly defined language problem. In Gigafida 2.0, the reference corpus of standard written Slovene, examples were either manually corrected or deliberately corrupted to introduce typical deviations from the current norm. Synthetic examples were generated using GPT-4o, which was prompted with authentic sentence pairs; outputs were manually reviewed to select only those most closely resembling natural language use. A small number of examples were collected from Jezikovna svetovalnica, based on real language queries submitted by speakers. The dataset is primarily intended for the development and evaluation of natural language processing tools for automatic error detection and correction for Slovene. It is available in TSV format, accompanied by a README document that describes its contents in more detail.
dc.language.iso	slv
dc.publisher	Centre for Language Resources and Technologies, University of Ljubljana
dc.rights	Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights.label	PUB
dc.source.uri	https://www.cjvt.si/llm4dh/en/work-packages/work-package-2/#task-2.3
dc.subject	grammatical error correction
dc.subject	language problem
dc.subject	error annotation
dc.subject	evaluation
dc.title	Dataset of Authentic and Synthetic Slovene Language Errors DASSLE 1.0
dc.type	lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType	other
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Špela Arhar Holdt spela.arharholdt@ff.uni-lj.si Centre for Language Resources and Technologies, University of Ljubljana
sponsor	ARIS (Slovenian Research and Innovation Agency) GC-0002 LLM4DH: Large Language Models for Digital Humanities nationalFunds
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
size.info	7385 entries
size.info	128 classes
files.count	1
files.size	378996
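
The description above says the dataset is a TSV file in which each entry pairs a sentence containing one language problem with its correction, a coarse category, a fine-grained error type, and the source of the example. A minimal sketch of reading such a file and counting entries per category might look as follows. The column names (`incorrect`, `corrected`, `category`, `error_type`, `source`) and the sample rows are assumptions made for illustration only; they are not taken from the dataset, and the accompanying README should be consulted for the actual header names.

```python
import csv
import io
from collections import Counter

# Placeholder rows, NOT real DASSLE entries. The column names are
# hypothetical; check the dataset's README for the actual header.
SAMPLE_TSV = (
    "incorrect\tcorrected\tcategory\terror_type\tsource\n"
    "placeholder sentence 1\tplaceholder correction 1\tspelling\tTYPE-A\tSolar\n"
    "placeholder sentence 2\tplaceholder correction 2\tsyntax\tTYPE-B\tGigafida\n"
    "placeholder sentence 3\tplaceholder correction 3\tspelling\tTYPE-C\tGPT-4o\n"
)


def load_entries(handle):
    """Read tab-separated rows into a list of dicts keyed by header name."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def count_by(entries, column):
    """Count how many entries share each value of the given column."""
    return Counter(row[column] for row in entries)


entries = load_entries(io.StringIO(SAMPLE_TSV))
category_counts = count_by(entries, "category")
source_counts = count_by(entries, "source")
print(category_counts.most_common())
```

To read the downloaded file instead of the in-memory sample, the same function could be called on a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` is wherever the TSV was saved.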