dc.contributor.author | Rupnik, Peter |
dc.contributor.author | Ljubešić, Nikola |
dc.date.accessioned | 2022-08-29T20:09:35Z |
dc.date.available | 2022-08-29T20:09:35Z |
dc.date.issued | 2022-08-19 |
dc.identifier.uri | http://hdl.handle.net/11356/1679 |
dc.description | The JuzneVesti-SR dataset consists of audio recordings and manual transcripts of the show '15 minuta', published on the Južne Vesti website (https://www.juznevesti.com/Tagovi/Intervju-15-minuta.sr.html). The audio was processed and aligned to the manual transcripts following the pipeline of the ParlaSpeech-HR dataset (http://hdl.handle.net/11356/1494) as closely as possible. Segments in this dataset range from 2 to 30 seconds in length. The data is split into train, dev, and test sets in an 80:10:10 ratio. As with the ParlaSpeech-HR dataset, two transcriptions are provided: one in raw form (with punctuation, capital letters, and numerals) and one normalised with the same rule-based normaliser that was used to create ParlaSpeech-HR, which lowercases the text, removes punctuation, and replaces numerals with words. The speaker_info attribute is sparser than in parliamentary corpora because less metadata is available in this domain; it covers only the guest name, guest description, host name, and speaker breakdown (when the host or the guest is speaking). The original transcripts were collected with the help of the ReLDI Centre Belgrade (https://reldi.spur.uzh.ch). |
dc.language.iso | srp |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreplacedby | http://hdl.handle.net/11356/1686 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://github.com/clarinsi/parlaspeech/tree/main/juzne_vesti |
dc.subject | speech recordings |
dc.subject | speech recognition |
dc.subject | automatic speech recognition |
dc.subject | speech transcription |
dc.subject | speech database |
dc.title | ASR training dataset for Serbian JuzneVesti-SR v1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | audio |
has.files | yes |
branding | CLARIN.SI data & tools |
demo.uri | https://huggingface.co/classla/wav2vec2-xls-r-juznevesti-sr |
contact.person | Peter Rupnik peter.rupnik@ijs.si Jožef Stefan Institute |
sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
size.info | 50.55 hours |
size.info | 10811 entries |
files.count | 7 |
files.size | 4982593957 |
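The description above mentions a normalised transcript (lowercased, punctuation removed) and an 80:10:10 train-dev-test split. The following is a minimal sketch of both ideas, not the rule-based normaliser or the split procedure actually used for this dataset. The function names `normalise_transcript` and `split_80_10_10` and the `seed` parameter are assumptions made for illustration, and the sketch deliberately leaves numerals untouched rather than replacing them with words.

```python
import random
import unicodedata


def normalise_transcript(text: str) -> str:
    """Lowercase and strip punctuation; numerals are kept as-is (not expanded to words)."""
    lowered = text.lower()
    # Replace every Unicode punctuation character (general category P*) with a space.
    no_punct = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in lowered
    )
    # Collapse the whitespace runs left behind by the removed punctuation.
    return " ".join(no_punct.split())


def split_80_10_10(items, seed=0):
    """Shuffle deterministically, then cut at 80% and 90% (hypothetical procedure)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train_end = n * 8 // 10
    dev_end = n * 9 // 10
    return shuffled[:train_end], shuffled[train_end:dev_end], shuffled[dev_end:]
```

Applied to the 10811 entries reported for this item, the sketch would produce 8648, 1081, and 1082 entries for train, dev, and test. The real split may have been drawn differently (for example per episode), so these counts need not match the released data.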
Files in this item
Name | Size | Format | Description | MD5 |
JuzneVesti-SR.tgzaa | 1000 MB | Unknown | Audio files in wav format, chunk 1/5 | 68b88b2188b73313498a3e20882c90e8 |
JuzneVesti-SR.tgzab | 1000 MB | Unknown | Audio files in wav format, chunk 2/5 | 6707fb94e2a57dc17f19067396eaa07b |
JuzneVesti-SR.tgzac | 1000 MB | Unknown | Audio files in wav format, chunk 3/5 | c31e035e6474b614480780055568b304 |
JuzneVesti-SR.tgzad | 1000 MB | Unknown | Audio files in wav format, chunk 4/5 | 27d7b4a09b9d7dbedcb194abf431a84a |
JuzneVesti-SR.tgzae | 738.95 MB | Unknown | Audio files in wav format, chunk 5/5 | 81b08a2a5d37551f97efa282667f8f54 |
JuzneVesti-SR.v1.0.jsonl | 12.82 MB | Unknown | Transcriptions and metadata in JSON Lines format | cf4b41d3b59948a312f7035971b73258 |
README.md | 3.29 KB | Unknown | README file in Markdown format | e43f54467e3d037fb8c80da2f307fe45 |
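The audio is distributed as five chunks with published MD5 checksums, so a download can be verified before the chunks are reassembled. The sketch below assumes the chunks are plain byte-wise pieces of one archive that can be concatenated in alphabetical order (the `aa` to `ae` suffixes suggest this, but the README shipped with the item is the authority). The helper names `md5_of` and `verify_and_join` and the output file name are hypothetical; the checksums are copied from the file listing above.

```python
import hashlib
import shutil
from pathlib import Path

# MD5 checksums of the audio chunks, copied from the file listing of this record.
EXPECTED_MD5 = {
    "JuzneVesti-SR.tgzaa": "68b88b2188b73313498a3e20882c90e8",
    "JuzneVesti-SR.tgzab": "6707fb94e2a57dc17f19067396eaa07b",
    "JuzneVesti-SR.tgzac": "c31e035e6474b614480780055568b304",
    "JuzneVesti-SR.tgzad": "27d7b4a09b9d7dbedcb194abf431a84a",
    "JuzneVesti-SR.tgzae": "81b08a2a5d37551f97efa282667f8f54",
}


def md5_of(path, block_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_and_join(directory, expected, out_name):
    """Check every chunk against its checksum, then concatenate them in name order."""
    directory = Path(directory)
    for name, checksum in expected.items():
        actual = md5_of(directory / name)
        if actual != checksum:
            raise ValueError(f"MD5 mismatch for {name}: {actual}")
    out_path = directory / out_name
    with open(out_path, "wb") as out:
        # Assumption: alphabetical order of the suffixes is the original chunk order.
        for name in sorted(expected):
            with open(directory / name, "rb") as chunk:
                shutil.copyfileobj(chunk, out)
    return out_path
```

A call such as `verify_and_join(".", EXPECTED_MD5, "JuzneVesti-SR.tgz")` would write the joined archive next to the chunks, after which it could be unpacked with the standard `tarfile` module or `tar`.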