dc.contributor.author Ljubešić, Nikola
dc.contributor.author Rupnik, Peter
dc.contributor.author Perinčić, Tea
dc.date.accessioned 2024-03-28T09:37:24Z
dc.date.available 2024-03-28T09:37:24Z
dc.date.issued 2024-03-05
dc.identifier.uri http://hdl.handle.net/11356/1765
dc.description The Mići Princ "text and speech" dialectal dataset is a word-aligned version of the translation of The Little Prince into various Chakavian micro-dialects, released by the Udruga Calculus and the Peek&Poke museum (http://skupnikatalog.nsk.hr/Record/nsk.NSK01001103632), both in the form of a printed book and an audio book. The printed book is a translation of Antoine de Saint-Exupéry's "Le Petit Prince". The translation was performed by Tea Perinčić and the following additional translators (almost every character in the book uses a different micro-dialect): Davina Ivašić, Annamaria Grus, Maria Luisa Ivašić, Marin Miš, Josip Fafanđel, Glorija Fabijanić Jelović, Vlasta Juretić, Anica Pritchard, Tea Rosić, Dino Marković, Ilinka Babić, Jadranka Ajvaz, Vlado Simičić Vava, Irena Grdinić, and Ivana Marinčić. The audio book was read by Zoran Prodanović Prlja, Davina Ivašić, Josip Fafanđel, Melita Nilović, Glorija Fabijanić Jelović, Albert Sirotich, Tea Rosić, Tea Perinčić, Dino Marković, Iva Močibob, Dražen Turina Šajeta, Vlado Simčić Vava, Ilinka Babić, Melita and Svetozar Nilović, and Ivana Marinčić.
The master encoding of this "text and speech" dataset is available in the form of JSON files (e.g. MP_13.json for the thirteenth chapter of the book), which contain the text, the turn-level alignment, and the word-level alignment to the audio. The text and alignment part of the master encoding is available from the MP.json.tgz archive, and the audio part from the MP.wav.tgz archive.
Besides this master encoding, an encoding focused on automatic speech recognition (ASR) testing and adaptation is available as well. Chapters 13 and 15 have been selected as testing data, and the text and audio reference files MP_13.asr.json and MP_15.asr.json contain segments split by speaker turns. The remainder of the dataset has been prepared in segments of up to 20 seconds, suitable for training / fine-tuning current ASR systems. The text and audio reference data are available in the MP.asr.json.tgz archive, while the audio data are available as MP3 files in the MP.mp3.tgz archive.
The dataset also includes an encoding for the Exmaralda speech editor (https://exmaralda.org), one file per chapter (e.g. MP_13.exb for the thirteenth chapter), available from the MP.exb.tgz archive. The WAV files from the MP.wav.tgz archive are required if speech data are to be available inside Exmaralda. Speaker information is available in the speakers.json file; each speaker has a textual and a Wikidata reference to the location of the micro-dialect, as well as the name of the translator in the printed book and of the reader in the audio book.
Fine-tuning the current (March 2024) state-of-the-art automatic speech recognition model for standard Croatian, whisper-large-v3, on this dataset (fine-tuned model: https://huggingface.co/classla/whisper-large-v3-mici-princ) reduces the word error rate from 35.43% to 16.83% and the character error rate from 11.54% to 3.95% (in-dataset test data, two seen speakers / micro-dialects, two unseen).
dc.language.iso hrv
dc.language.iso ckm
dc.publisher Jožef Stefan Institute
dc.relation.replaces http://hdl.handle.net/11356/1325
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://www.clarin.si/info/k-centre/
dc.subject dialect
dc.subject spoken language
dc.subject speech recognition
dc.subject speech database
dc.subject speech transcription
dc.title The "Mići Princ" text and speech dataset of Chakavian micro-dialects
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType audio
has.files yes
branding CLARIN.SI data & tools
demo.uri https://huggingface.co/datasets/classla/Mici_Princ
contact.person Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds
size.info 79 minutes
size.info 11591 words
size.info 547 turns
files.count 6
files.size 1115944215
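
The description reports word error rate (WER) and character error rate (CER) before and after fine-tuning. The record does not state how the authors normalized or tokenized the text, so the following is only a minimal sketch of the conventional definition (edit distance divided by reference length), assuming plain whitespace tokenization. The function names and the toy strings are hypothetical and are not taken from the dataset.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """WER: word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()  # assumption: whitespace tokenization
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def character_error_rate(reference, hypothesis):
    """CER: character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def relative_reduction(before, after):
    """Relative error reduction, e.g. from a baseline to a fine-tuned model."""
    return (before - after) / before


if __name__ == "__main__":
    # Toy strings for illustration only, not dataset content.
    print(word_error_rate("a b c d", "a x c"))
    print(character_error_rate("abcd", "abxd"))
    # Figures reported in the description above.
    print(round(100 * relative_reduction(35.43, 16.83), 1))
    print(round(100 * relative_reduction(11.54, 3.95), 1))
```

Applied to the reported figures, `relative_reduction` gives a relative drop of roughly 52.5% in WER and 65.8% in CER.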


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name             Size       Format   MD5                               Description
MP.json.tgz      222.22 KB  Unknown  984df31d6c5df7027dc188a714a17f15  Archive with JSON text and alignment info
MP.wav.tgz       991.43 MB  Unknown  f925597cabb7eef1f8a3bb0354db6ead  Archive with WAV data
MP.asr.json.tgz  37.6 KB    Unknown  de303637a01714468ce8015e58664ac1  Archive with JSON text and audio reference info for ASR
MP.mp3.tgz       72.35 MB   Unknown  f59b8c5cf7b3c10da20bb1b30a7566bc  Archive with MP3 audio data for ASR
MP.exb.tgz       214.03 KB  Unknown  68ff3eb67399d7a312be48073fd0945b  Archive with Exmaralda files
speakers.json    3.11 KB    Unknown  0c280fcbdd78b7d04f276d579e9dcd5f  JSON file with speaker information
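
Each file in this item is listed with an MD5 checksum, which can be used to confirm that a download is complete and uncorrupted. The following is a minimal sketch using Python's standard `hashlib` module. The file names and checksums are copied from the listing above; the function names and the assumption that all files sit in one local directory are illustrative only.

```python
import hashlib
from pathlib import Path

# File names and MD5 values as published in this record.
EXPECTED_MD5 = {
    "MP.json.tgz": "984df31d6c5df7027dc188a714a17f15",
    "MP.wav.tgz": "f925597cabb7eef1f8a3bb0354db6ead",
    "MP.asr.json.tgz": "de303637a01714468ce8015e58664ac1",
    "MP.mp3.tgz": "f59b8c5cf7b3c10da20bb1b30a7566bc",
    "MP.exb.tgz": "68ff3eb67399d7a312be48073fd0945b",
    "speakers.json": "0c280fcbdd78b7d04f276d579e9dcd5f",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Return a status ("ok", "mismatch", or "missing") for each listed file."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == expected:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


if __name__ == "__main__":
    # Assumption: the files were downloaded into the current directory.
    for name, status in verify_downloads().items():
        print(f"{name}: {status}")
```

Reading in chunks keeps memory use low for the roughly 1 GB MP.wav.tgz archive.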
