ASR training dataset for Serbian JuzneVesti-SR v1.0

Name: ASR training dataset for Serbian JuzneVesti-SR v1.0
License: https://creativecommons.org/licenses/by-sa/4.0/

Rupnik, Peter; Ljubešić, Nikola

dc.contributor.author	Rupnik, Peter
dc.contributor.author	Ljubešić, Nikola
dc.date.accessioned	2022-08-29T20:09:35Z
dc.date.available	2022-08-29T20:09:35Z
dc.date.issued	2022-08-19
dc.identifier.uri	http://hdl.handle.net/11356/1679
dc.description	The JuzneVesti-SR dataset consists of audio recordings and manual transcripts from the Južne Vesti website and its interview show '15 minuta' (https://www.juznevesti.com/Tagovi/Intervju-15-minuta.sr.html). The audio was processed and aligned to the manual transcripts following the pipeline of the ParlaSpeech-HR dataset (http://hdl.handle.net/11356/1494) as closely as possible. Segments in this dataset range from 2 to 30 seconds in length. The data is split into train, dev, and test sets in an 80:10:10 ratio. As with the ParlaSpeech-HR dataset, two transcriptions are provided: one in raw form (with punctuation, capital letters, and numerals) and one normalised with the same rule-based normaliser used to create ParlaSpeech-HR, which lowercases the text, removes punctuation, and replaces numerals with words. The speaker_info attribute is less detailed than in parliamentary corpora because less data is available in this domain; it covers only the guest name, guest description, host name, and speaker breakdown (when the host or the guest is speaking). The original transcripts were collected with the help of the ReLDI Centre Belgrade.
dc.language.iso	srp
dc.publisher	Jožef Stefan Institute
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	https://github.com/clarinsi/parlaspeech/tree/old-2022/juzne_vesti
dc.subject	speech recordings
dc.subject	speech recognition
dc.subject	automatic speech recognition
dc.subject	speech transcription
dc.subject	speech database
dc.title	ASR training dataset for Serbian JuzneVesti-SR v1.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	audio
has.files	yes
branding	CLARIN.SI data & tools
demo.uri	https://huggingface.co/classla/wav2vec2-xls-r-juznevesti-sr
contact.person	Peter Rupnik peter.rupnik@ijs.si Jožef Stefan Institute
sponsor	Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
size.info	50.55 hours
size.info	10811 entries
files.count	7
files.size	4982593957
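
The description above mentions three processing conventions: segments of 2 to 30 seconds, an 80:10:10 train-dev-test split, and a normalised transcript that is lowercased with punctuation removed. The following is a minimal Python sketch of those conventions, not the code used to build the dataset (that pipeline lives in the repository given in dc.source.uri). All function names are hypothetical, the shuffling and seed are assumptions, and the replacement of numerals with words is deliberately left out because the actual normaliser rules are not reproduced here.

```python
import random
import unicodedata


def in_duration_range(seconds, low=2.0, high=30.0):
    """Return True if a segment length falls in the 2-30 second range."""
    return low <= seconds <= high


def normalise(text):
    """Lowercase the text and drop punctuation characters.

    Assumption: numerals are left untouched; the dataset's own normaliser
    also replaces them with words, which this sketch does not attempt.
    """
    lowered = text.lower()
    # Unicode categories starting with "P" cover all punctuation marks.
    kept = [ch for ch in lowered if not unicodedata.category(ch).startswith("P")]
    # Collapse any runs of whitespace left behind by removed characters.
    return " ".join("".join(kept).split())


def split_80_10_10(items, seed=0):
    """Shuffle items with a fixed seed and split them 80:10:10.

    Assumption: a seeded random shuffle; the original split procedure
    is not described in the record.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_dev = len(shuffled) // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


if __name__ == "__main__":
    print(normalise("Dobar dan, gospodine!"))
    train, dev, test = split_80_10_10(range(10811))
    print(len(train), len(dev), len(test))
```

Integer arithmetic is used for the split sizes so that rounding leftovers always end up in the test portion rather than being lost.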