dc.contributor.author | Ljubešić, Nikola |
dc.contributor.author | Rupnik, Peter |
dc.contributor.author | Perinčić, Tea |
dc.date.accessioned | 2024-03-28T09:37:24Z |
dc.date.available | 2024-03-28T09:37:24Z |
dc.date.issued | 2024-03-05 |
dc.identifier.uri | http://hdl.handle.net/11356/1765 |
dc.description | The Mići Princ "text and speech" dialectal dataset is a word-aligned version of the translation of The Little Prince into various Chakavian micro-dialects, released by the Udruga Calculus and the Peek&Poke museum (http://skupnikatalog.nsk.hr/Record/nsk.NSK01001103632) both as a printed book and as an audio book. The printed book is a translation of Antoine de Saint-Exupéry's "Le Petit Prince". The translation was done by Tea Perinčić and the following additional translators (almost every character in the book uses a different micro-dialect): Davina Ivašić, Annamaria Grus, Maria Luisa Ivašić, Marin Miš, Josip Fafanđel, Glorija Fabijanić Jelović, Vlasta Juretić, Anica Pritchard, Tea Rosić, Dino Marković, Ilinka Babić, Jadranka Ajvaz, Vlado Simičić Vava, Irena Grdinić, and Ivana Marinčić. The audio book is read by Zoran Prodanović Prlja, Davina Ivašić, Josip Fafanđel, Melita Nilović, Glorija Fabijanić Jelović, Albert Sirotich, Tea Rosić, Tea Perinčić, Dino Marković, Iva Močibob, Dražen Turina Šajeta, Vlado Simčić Vava, Ilinka Babić, Melita and Svetozar Nilović, and Ivana Marinčić. The master encoding of this "text and speech" dataset is available as JSON files (e.g. MP_13.json for the thirteenth chapter of the book), which contain the text, the turn-level alignment, and the word-level alignment to the audio. The text and alignment part of the master encoding is available in the MP.json.tgz archive, and the audio part in the MP.wav.tgz archive. Besides this master encoding, an encoding focused on automatic speech recognition (ASR) testing and adaptation is available as well. Chapters 13 and 15 have been selected as testing data, and the text and audio reference files MP_13.asr.json and MP_15.asr.json contain segments split by speaker turns. 
The remainder of the dataset has been prepared in segments of up to 20 seconds, a length well suited to training or fine-tuning current ASR systems. The text and audio reference data are available in the MP.asr.json.tgz archive, while the audio data are available as MP3 files in the MP.mp3.tgz archive. The dataset also includes an encoding for the Exmaralda speech editor (https://exmaralda.org), one file per chapter (e.g. MP_13.exb for the thirteenth chapter), available in the MP.exb.tgz archive. The WAV files from the MP.wav.tgz archive are required if speech data are to be available inside Exmaralda. Speaker information is available in the speakers.json file; each speaker entry has a textual and a Wikidata reference to the location of the micro-dialect, as well as the name of the translator in the printed book and of the reader in the audio book. Fine-tuning the current (March 2024) state-of-the-art ASR model for standard Croatian, whisper-large-v3, on this dataset (https://huggingface.co/classla/whisper-large-v3-mici-princ) reduces the word error rate from 35.43% to 16.83% and the character error rate from 11.54% to 3.95% (in-dataset test data: two seen speakers / micro-dialects, two unseen). |
dc.language.iso | hrv |
dc.language.iso | ckm |
dc.publisher | Jožef Stefan Institute |
dc.relation.replaces | http://hdl.handle.net/11356/1325 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://www.clarin.si/info/k-centre/ |
dc.subject | dialect |
dc.subject | spoken language |
dc.subject | speech recognition |
dc.subject | speech database |
dc.subject | speech transcription |
dc.title | The "Mići Princ" text and speech dataset of Chakavian micro-dialects |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | audio |
has.files | yes |
branding | CLARIN.SI data & tools |
demo.uri | https://huggingface.co/datasets/classla/Mici_Princ |
contact.person | Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds |
sponsor | ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds |
size.info | 79 minutes |
size.info | 11591 words |
size.info | 547 turns |
files.count | 6 |
files.size | 1115944215 |
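
The word and character error rates quoted in the description are conventionally defined as the edit distance between a reference and a hypothesis transcript, divided by the reference length (in words for WER, in characters for CER). The following is a minimal sketch of that standard definition; the record does not say which tool or text normalization produced the reported figures, so the function names and the whitespace tokenization here are assumptions for illustration only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate; assumes plain whitespace tokenization."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over the raw strings (spaces included)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def relative_reduction(before, after):
    """Relative drop of an error rate, as a fraction of the starting value."""
    return (before - after) / before
```

Applied to the figures in the description, `relative_reduction(35.43, 16.83)` is about 0.525 and `relative_reduction(11.54, 3.95)` is about 0.658, i.e. relative reductions of roughly 52% in WER and 66% in CER.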
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name | Size | Format | Description | MD5 |
MP.json.tgz | 222.22 KB | Unknown | Archive with JSON text and alignment info | 984df31d6c5df7027dc188a714a17f15 |
MP.wav.tgz | 991.43 MB | Unknown | Archive with WAV data | f925597cabb7eef1f8a3bb0354db6ead |
MP.asr.json.tgz | 37.6 KB | Unknown | Archive with JSON text and audio reference info for ASR | de303637a01714468ce8015e58664ac1 |
MP.mp3.tgz | 72.35 MB | Unknown | Archive with MP3 audio data for ASR | f59b8c5cf7b3c10da20bb1b30a7566bc |
MP.exb.tgz | 214.03 KB | Unknown | Archive with Exmaralda files | 68ff3eb67399d7a312be48073fd0945b |
speakers.json | 3.11 KB | Unknown | JSON file with speaker information | 0c280fcbdd78b7d04f276d579e9dcd5f |
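
The MD5 checksums listed above can be used to confirm that the six files were downloaded intact. The following is a minimal sketch using only the Python standard library; the file names and checksums are copied from the listing, while the function names and the assumption that all files sit in one local directory are illustrative choices, not part of the record.

```python
import hashlib
from pathlib import Path

# File names and MD5 checksums as given in the file listing of this item.
EXPECTED_MD5 = {
    "MP.json.tgz": "984df31d6c5df7027dc188a714a17f15",
    "MP.wav.tgz": "f925597cabb7eef1f8a3bb0354db6ead",
    "MP.asr.json.tgz": "de303637a01714468ce8015e58664ac1",
    "MP.mp3.tgz": "f59b8c5cf7b3c10da20bb1b30a7566bc",
    "MP.exb.tgz": "68ff3eb67399d7a312be48073fd0945b",
    "speakers.json": "0c280fcbdd78b7d04f276d579e9dcd5f",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to True (match), False (mismatch) or None (not found)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of(path) == expected) if path.is_file() else None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "missing"}
    for name, status in verify_downloads().items():
        print(f"{name}: {labels[status]}")
```

Reading in chunks keeps memory use flat for the large MP.wav.tgz archive; a mismatch usually indicates an interrupted or corrupted download.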