The "Mići Princ" text and speech dataset of Chakavian micro-dialects

Name: The "Mići Princ" text and speech dataset of Chakavian micro-dialects
License: https://creativecommons.org/licenses/by-sa/4.0/

Ljubešić, Nikola; Rupnik, Peter; Perinčić, Tea

dc.contributor.author	Ljubešić, Nikola
dc.contributor.author	Rupnik, Peter
dc.contributor.author	Perinčić, Tea
dc.date.accessioned	2024-03-28T09:37:24Z
dc.date.available	2024-03-28T09:37:24Z
dc.date.issued	2024-03-05
dc.identifier.uri	http://hdl.handle.net/11356/1765
dc.description	The Mići Princ "text and speech" dialectal dataset is a word-aligned version of the translation of The Little Prince into various Chakavian micro-dialects, released by the Udruga Calculus and the Peek&Poke museum (http://skupnikatalog.nsk.hr/Record/nsk.NSK01001103632), both in the form of a printed book and an audio book. The printed book is a translation of Antoine de Saint-Exupéry's "Le Petit Prince". The translation was performed by Tea Perinčić and the following additional translators (almost every character in the book uses a different micro-dialect): Davina Ivašić, Annamaria Grus, Maria Luisa Ivašić, Marin Miš, Josip Fafanđel, Glorija Fabijanić Jelović, Vlasta Juretić, Anica Pritchard, Tea Rosić, Dino Marković, Ilinka Babić, Jadranka Ajvaz, Vlado Simičić Vava, Irena Grdinić, and Ivana Marinčić. The audio book was read by Zoran Prodanović Prlja, Davina Ivašić, Josip Fafanđel, Melita Nilović, Glorija Fabijanić Jelović, Albert Sirotich, Tea Rosić, Tea Perinčić, Dino Marković, Iva Močibob, Dražen Turina Šajeta, Vlado Simčić Vava, Ilinka Babić, Melita and Svetozar Nilović, and Ivana Marinčić. The master encoding of this "text and speech" dataset is available in the form of JSON files (e.g. MP_13.json for the thirteenth chapter of the book), which contain the text, the turn-level alignment, and the word-level alignment to the audio. The text and alignment part of this master encoding is available from the MP.json.tgz archive, and the audio part from the MP.wav.tgz archive. Besides this master encoding, an encoding focused on applications in automatic speech recognition (ASR) testing and adaptation is available as well. Chapters 13 and 15 have been selected as testing data, and the text and audio reference files MP_13.asr.json and MP_15.asr.json contain segments split by speaker turns.
The remainder of the dataset has been prepared in segments of up to 20 seconds in length, well suited to training or fine-tuning current ASR systems. The text and audio reference data are available in the MP.asr.json.tgz archive, while the audio data are available in the form of MP3 files in the MP.mp3.tgz archive. The dataset also includes an encoding for the Exmaralda speech editor (https://exmaralda.org), one file per chapter (e.g. MP_13.exb for the thirteenth chapter), available from the MP.exb.tgz archive. The wav files from the MP.wav.tgz archive are required if speech data are to be available inside Exmaralda. Speaker information is available in the speakers.json file, each speaker having a textual and a Wikidata reference to the location of the micro-dialect, as well as the name of the translator in the printed book and the reader in the audio book. An application of the dataset to fine-tuning the current (March 2024) SotA automatic speech recognition model for standard Croatian, whisper-large-v3 (https://huggingface.co/classla/whisper-large-v3-mici-princ), shows the word error rate dropping from 35.43% to 16.83% and the character error rate dropping from 11.54% to 3.95% (in-dataset test data, two seen speakers / micro-dialects, two unseen).
dc.language.iso	hrv
dc.language.iso	ckm
dc.publisher	Jožef Stefan Institute
dc.relation.isreferencedby	https://www.sdjt.si/wp/wp-content/uploads/2024/09/JT-DH-2024_Ljubesic_Rupnik_Perincic.pdf
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	https://www.clarin.si/info/k-centre/
dc.subject	dialect
dc.subject	spoken language
dc.subject	speech recognition
dc.subject	speech database
dc.subject	speech transcription
dc.title	The "Mići Princ" text and speech dataset of Chakavian micro-dialects
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	audio
has.files	yes
branding	CLARIN.SI data & tools
demo.uri	https://huggingface.co/datasets/classla/Mici_Princ
contact.person	Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor	ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds
size.info	79 minutes
size.info	11591 words
size.info	547 turns
files.count	6
files.size	1115944215
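
The word and character error rates quoted in the description can also be read as relative improvements. The following is a minimal sketch of how such rates are commonly computed (edit distance divided by reference length) and of the relative reduction implied by the reported figures. It is not the authors' evaluation code: the function names are hypothetical, and text normalisation, which affects real scores, is deliberately left out.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit="word"):
    """Word error rate (unit='word') or character error rate (unit='char')."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    else:
        ref, hyp = list(reference), list(hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(before, after):
    """Fraction by which an error rate dropped, relative to its starting value."""
    return (before - after) / before


# Figures reported in the description (percent): WER 35.43 -> 16.83, CER 11.54 -> 3.95
wer_drop = relative_reduction(35.43, 16.83)
cer_drop = relative_reduction(11.54, 3.95)
print(f"relative WER reduction: {wer_drop:.1%}")
print(f"relative CER reduction: {cer_drop:.1%}")
```

Under these assumptions, the reported figures correspond to roughly a halving of the word error rate and a reduction of the character error rate by about two thirds.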