Show simple item record

 
dc.contributor.author Rupnik, Peter
dc.contributor.author Ljubešić, Nikola
dc.date.accessioned 2022-08-29T20:09:35Z
dc.date.available 2022-08-29T20:09:35Z
dc.date.issued 2022-08-19
dc.identifier.uri http://hdl.handle.net/11356/1679
dc.description The JuzneVesti-SR dataset consists of audio recordings and manual transcripts from the Južne Vesti website and its hosted show '15 minuta' (https://www.juznevesti.com/Tagovi/Intervju-15-minuta.sr.html). The audio processing and the alignment of the audio to the manual transcripts followed the pipeline of the ParlaSpeech-HR dataset (http://hdl.handle.net/11356/1494) as closely as possible. Segments in this dataset range from 2 to 30 seconds in length. The data was split into train, dev, and test sets in an 80:10:10 ratio. As with the ParlaSpeech-HR dataset, two transcriptions are provided: one in raw form (with punctuation, capital letters, and numerals) and one normalised with the same rule-based normaliser that was used to create the ParlaSpeech-HR dataset, in which text is lowercased, punctuation is removed, and numerals are replaced with words. The speaker_info attribute is less detailed than in parliamentary corpora because less data is available in this domain; it covers only the guest name, guest description, host name, and speaker breakdown (when the host or the guest is speaking). Original transcripts were collected with the help of the ReLDI Centre Belgrade (https://reldi.spur.uzh.ch).
dc.language.iso srp
dc.publisher Jožef Stefan Institute
dc.relation.isreplacedby http://hdl.handle.net/11356/1686
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://github.com/clarinsi/parlaspeech/tree/main/juzne_vesti
dc.subject speech recordings
dc.subject speech recognition
dc.subject automatic speech recognition
dc.subject speech transcription
dc.subject speech database
dc.title ASR training dataset for Serbian JuzneVesti-SR v1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType audio
has.files yes
branding CLARIN.SI data & tools
demo.uri https://huggingface.co/classla/wav2vec2-xls-r-juznevesti-sr
contact.person Peter Rupnik peter.rupnik@ijs.si Jožef Stefan Institute
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
size.info 50.55 hours
size.info 10811 entries
files.count 7
files.size 4982593957
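
The normalised transcript form described in dc.description (lowercased, punctuation removed) can be approximated with the sketch below. It is not the rule-based normaliser used for ParlaSpeech-HR: the function name `normalise_sketch` is hypothetical, punctuation is detected through Unicode character categories as an assumption, and the numeral-to-word step is deliberately left out because it needs language-specific rules.

```python
import unicodedata


def normalise_sketch(text: str) -> str:
    """Lowercase text and drop punctuation; numerals are left unchanged."""
    kept = []
    for ch in text.lower():
        # Unicode categories starting with "P" are punctuation marks.
        # Each one is replaced by a space so that words joined by a
        # hyphen or dash do not fuse together.
        if unicodedata.category(ch).startswith("P"):
            kept.append(" ")
        else:
            kept.append(ch)
    # Collapse the whitespace left behind by removed punctuation.
    return " ".join("".join(kept).split())
```

Comparing the output of such a function with the normalised transcripts shipped in the dataset would show where the actual normaliser behaves differently, for example on numerals.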


Files in this item

Name JuzneVesti-SR.tgzaa
Size 1000 MB
Format Unknown
Description Audio files in wav format, chunk 1/5
MD5 68b88b2188b73313498a3e20882c90e8

Name JuzneVesti-SR.tgzab
Size 1000 MB
Format Unknown
Description Audio files in wav format, chunk 2/5
MD5 6707fb94e2a57dc17f19067396eaa07b

Name JuzneVesti-SR.tgzac
Size 1000 MB
Format Unknown
Description Audio files in wav format, chunk 3/5
MD5 c31e035e6474b614480780055568b304

Name JuzneVesti-SR.tgzad
Size 1000 MB
Format Unknown
Description Audio files in wav format, chunk 4/5
MD5 27d7b4a09b9d7dbedcb194abf431a84a

Name JuzneVesti-SR.tgzae
Size 738.95 MB
Format Unknown
Description Audio files in wav format, chunk 5/5
MD5 81b08a2a5d37551f97efa282667f8f54

Name JuzneVesti-SR.v1.0.jsonl
Size 12.82 MB
Format Unknown
Description Transcriptions and metadata in JSON Lines format
MD5 cf4b41d3b59948a312f7035971b73258

Name README.md
Size 3.29 KB
Format Unknown
Description README file in Markdown format
MD5 e43f54467e3d037fb8c80da2f307fe45
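
Because the audio is distributed in five chunks, each download can be checked against its listed MD5 before the chunks are joined. The sketch below is one way to do that in Python. It assumes the chunks are consecutive pieces of a single archive that can be restored by concatenating them in alphabetical order (the `aa` to `ae` suffixes suggest this, but the record does not state it). The helper names and the target file name `JuzneVesti-SR.tgz` are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path

# MD5 values copied from the file list above.
EXPECTED_MD5 = {
    "JuzneVesti-SR.tgzaa": "68b88b2188b73313498a3e20882c90e8",
    "JuzneVesti-SR.tgzab": "6707fb94e2a57dc17f19067396eaa07b",
    "JuzneVesti-SR.tgzac": "c31e035e6474b614480780055568b304",
    "JuzneVesti-SR.tgzad": "27d7b4a09b9d7dbedcb194abf431a84a",
    "JuzneVesti-SR.tgzae": "81b08a2a5d37551f97efa282667f8f54",
}


def md5_of(path: Path, block_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in blocks to limit memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def join_chunks(chunks: list[Path], target: Path) -> None:
    """Concatenate the chunk files, in the given order, into one file."""
    with target.open("wb") as out:
        for chunk in chunks:
            with chunk.open("rb") as part:
                shutil.copyfileobj(part, out)


def verify_and_join(directory: Path, target_name: str = "JuzneVesti-SR.tgz") -> Path:
    """Check every chunk against its listed MD5, then join them (assumed order)."""
    names = sorted(EXPECTED_MD5)  # alphabetical: tgzaa, tgzab, ..., tgzae
    for name in names:
        actual = md5_of(directory / name)
        if actual != EXPECTED_MD5[name]:
            raise ValueError(f"MD5 mismatch for {name}: got {actual}")
    target = directory / target_name
    join_chunks([directory / name for name in names], target)
    return target
```

If the joined file is indeed a gzip-compressed tar archive, as the `.tgz` prefix suggests, it could then be opened with `tarfile.open(target, "r:gz")` from the standard library.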

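The transcriptions and metadata are provided in JSON Lines format, so the file can be inspected one record per line. The sketch below counts the records and the top-level keys without assuming any particular field names, since the record does not document the schema. The function name `summarise_jsonl` is hypothetical, and the code assumes each non-empty line holds one JSON object.

```python
import json
from collections import Counter
from pathlib import Path


def summarise_jsonl(path: Path) -> tuple[int, Counter]:
    """Count records and how often each top-level key occurs."""
    records = 0
    keys = Counter()
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            entry = json.loads(line)
            records += 1
            keys.update(entry.keys())
    return records, keys
```

Calling `summarise_jsonl(Path("JuzneVesti-SR.v1.0.jsonl"))` on the downloaded file would give a record count that can be compared with the 10811 entries reported in size.info, assuming one entry corresponds to one line.
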