| dc.contributor.author | Arhar Holdt, Špela |
| dc.contributor.author | Antloga, Špela |
| dc.contributor.author | Gantar, Polona |
| dc.contributor.author | Munda, Tina |
| dc.contributor.author | Robida, Nejc |
| dc.contributor.author | Zgonc, Matjaž |
| dc.date.accessioned | 2025-09-30T16:20:25Z |
| dc.date.available | 2025-09-30T16:20:25Z |
| dc.date.issued | 2025-09-30 |
| dc.identifier.uri | http://hdl.handle.net/11356/2052 |
| dc.description | DASSLE 1.0 (Dataset of Authentic and Synthetic Slovene Language Errors) comprises 7,385 manually prepared entries, each consisting of a Slovene sentence containing a single, specific language problem, its corrected version, and metadata including both coarse- and fine-grained correction classifications, as well as the source of the example. Language problems are divided into five top-level categories: spelling, orthography, morphology, vocabulary, and syntax. These are further specified using 128 fine-grained error types, aligned with the typology developed for the Šolar 3.0 corpus. The typology is explained at https://wiki.cjvt.si/books/11-developmental-corpus-solar/page/introduction-to-tags and in more detail in the annotation guidelines at https://wiki.cjvt.si/books/11-developmental-corpus-solar/page/annotation-guidelines. The examples in DASSLE 1.0 were sourced from four distinct origins, combining both authentic and synthetic data creation. From Šolar 3.0, the corpus of student writing with teacher-provided corrections, sentences were manually reviewed and edited to contain only one clearly defined language problem. In Gigafida 2.0, the reference corpus of standard written Slovene, examples were either manually corrected or deliberately corrupted to introduce typical deviations from the current norm. Synthetic examples were generated using GPT-4o, which was prompted with authentic sentence pairs; outputs were manually reviewed to select only those most closely resembling natural language use. A small number of examples were collected from Jezikovna svetovalnica, based on real language queries submitted by speakers. The dataset is primarily intended for the development and evaluation of natural language processing tools for automatic error detection and correction for Slovene. It is available in TSV format, accompanied by a README document that describes its contents in more detail. |
| dc.language.iso | slv |
| dc.publisher | Centre for Language Resources and Technologies, University of Ljubljana |
| dc.rights | Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
| dc.rights.uri | https://creativecommons.org/licenses/by-nc-sa/4.0/ |
| dc.rights.label | PUB |
| dc.source.uri | https://www.cjvt.si/llm4dh/en/work-packages/work-package-2/#task-2.3 |
| dc.subject | grammatical error correction |
| dc.subject | language problem |
| dc.subject | error annotation |
| dc.subject | evaluation |
| dc.title | Dataset of Authentic and Synthetic Slovene Language Errors DASSLE 1.0 |
| dc.type | lexicalConceptualResource |
| metashare.ResourceInfo#ContentInfo.detailedType | other |
| metashare.ResourceInfo#ContentInfo.mediaType | text |
| has.files | yes |
| branding | CLARIN.SI data & tools |
| contact.person | Špela Arhar Holdt spela.arharholdt@ff.uni-lj.si Centre for Language Resources and Technologies, University of Ljubljana |
| sponsor | ARIS (Slovenian Research and Innovation Agency) GC-0002 LLM4DH: Large Language Models for Digital Humanities nationalFunds |
| sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
| size.info | 7385 entries |
| size.info | 128 classes |
| files.count | 1 |
| files.size | 378996 |
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
- Name: DASSLE-v1.zip
- Size: 370.11 KB
- Format: application/zip
- Description: README and TSV data
- MD5: 3c669371908b8d468b528b1df27ad461
- DASSLE-v1
  - DASSLE-v1-DATA.tsv (1 MB)
  - DASSLE-v1-README.md (3 kB)
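
Since the data is distributed as TSV, a minimal Python sketch of how the entries might be tallied per top-level category follows. The column names (`source_sentence`, `corrected_sentence`, `category`) and the sample rows are placeholders invented for illustration, not taken from DASSLE; the actual header is documented in the README shipped with the archive.

```python
import csv
import io
from collections import Counter


def count_by_column(tsv_file, column):
    """Count how many rows carry each value of `column` in a TSV stream with a header row."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row[column] for row in reader)


# Placeholder rows, NOT taken from DASSLE; the column names are assumptions.
sample = io.StringIO(
    "source_sentence\tcorrected_sentence\tcategory\n"
    "placeholder 1\tplaceholder 1 fixed\tspelling\n"
    "placeholder 2\tplaceholder 2 fixed\tsyntax\n"
    "placeholder 3\tplaceholder 3 fixed\tspelling\n"
)
counts = count_by_column(sample, "category")
print(counts.most_common())
```

To run this on the real data, replace `sample` with a file handle for the extracted `DASSLE-v1-DATA.tsv`, for example `open(path, encoding="utf-8", newline="")` (UTF-8 is an assumption here), and pass whichever column name the README gives for the coarse-grained category. If the file matches the record above, the counts should sum to 7,385 entries.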