dc.contributor.author Koržinek, Danijel
dc.contributor.author Ljubešić, Nikola
dc.date.accessioned 2024-02-02T10:44:22Z
dc.date.available 2024-02-02T10:44:22Z
dc.date.issued 2024-02-01
dc.identifier.uri http://hdl.handle.net/11356/1686
dc.description The ParlaSpeech-PL dataset is built from the transcripts of parliamentary proceedings available in the Polish part of the ParlaMint corpus, and the parliamentary recordings available from the Polish Parliament's YouTube channel. The corpus consists of audio segments that correspond to specific sentences in the transcripts. The transcript contains word-level alignments to the recordings, allowing for simple further segmentation of long sentences into shorter segments for ASR and other memory-sensitive applications. Each segment has a reference to the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859) via utterance IDs and character offsets. All the speaker information from the ParlaMint corpus is available via the "speaker_info" key.
dc.language.iso pol
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby https://aclanthology.org/2022.parlaclarin-1.16
dc.relation.replaces http://hdl.handle.net/11356/1679
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://www.clarin.eu/parlamint
dc.subject parliamentary debates
dc.subject speech recordings
dc.subject speech database
dc.subject speech recognition
dc.subject automatic speech recognition
dc.subject speech transcription
dc.title Parliamentary spoken corpus of Polish ParlaSpeech-PL 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType audio
has.files yes
branding CLARIN.SI data & tools
contact.person Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor CLARIN ERIC - ParlaMint: Towards Comparable Parliamentary Corpora Other
sponsor ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds
size.info 535465 entries
size.info 3635354 seconds
size.info 1010 hours
files.count 4
files.size 63067921190
featuredService.kontext search|https://www.clarin.si/kontext/query?corpname=parlaspeech_pl
featuredService.noske search|https://www.clarin.si/ske/#dashboard?corpname=parlaspeech_pl


 Files in this item

This item is publicly available and licensed under:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name: ParlaSpeech-PL.v1.0.jsonl.gz
Size: 124.96 MB
Format: application/gzip
Description: Corpus text in gzipped JSON Lines format
MD5: a186d99cf96f15be6898cf86bc261f34

Name: ParlaSpeech-PL.v1.0.part1.tgz
Size: 27.87 GB
Format: Unknown
Description: Speech in FLAC format, part 1
MD5: cbef0242706ee876bd27e7e151c69ba2

Name: ParlaSpeech-PL.v1.0.part2.tgz
Size: 30.74 GB
Format: Unknown
Description: Speech in FLAC format, part 2
MD5: 95332c745a4c79a56dcc78bc34b30cb1

Name: README.txt
Size: 1 KB
Format: Text file
Description: Description of the corpus format
MD5: 53d3b9c770e2ed6f4cbff71b6d4f267e
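
Since the two audio archives are close to 30 GB each, it may be worth checking the downloads against the MD5 values listed for the files in this item. The following is a minimal sketch using only the Python standard library; the function names md5_of and verify are illustrative and not part of the corpus distribution, and the files are assumed to sit together in one local directory under their original names.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file listing of this item.
EXPECTED_MD5 = {
    "ParlaSpeech-PL.v1.0.jsonl.gz": "a186d99cf96f15be6898cf86bc261f34",
    "ParlaSpeech-PL.v1.0.part1.tgz": "cbef0242706ee876bd27e7e151c69ba2",
    "ParlaSpeech-PL.v1.0.part2.tgz": "95332c745a4c79a56dcc78bc34b30cb1",
    "README.txt": "53d3b9c770e2ed6f4cbff71b6d4f267e",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(download_dir):
    """Map each listed file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(download_dir) / name
        results[name] = path.is_file() and md5_of(path) == expected
    return results
```

Calling verify(".") in the download directory returns a dictionary of file names and booleans; a False value means the file is missing or its checksum differs from the listed one.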
 File Preview  
Parliamentary spoken corpus of Polish ParlaSpeech-PL v1.0
http://hdl.handle.net/11356/1686

The ParlaSpeech-PL.v1.0.jsonl (JSON lines) file consists of entries with the following attributes:

id: ParlaMint utterance ID with zero-based character offsets pointing to the specific part of the utterance
words: list of character and millisecond offsets to specific words in the transcript, especially useful for further segmentation of each entry
audio: path to the FLAC file (available from the part*.tgz files), the folder name corresponding to the YouTube video ID
audio_length: length of the recording in seconds
text: transcript of the audio
text_start: starting character position in the original ParlaMint 4.0 utterance
text_end: ending character position in the original ParlaMint 4.0 utterance
audio_start: starting millisecond position in the original YouTube video
audio_end: ending millisecond position in the original YouTube video
speaker_info: full information on the speaker (and speech) fro . . .
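
As an illustration of the attributes listed above, the following sketch streams the gzipped JSON Lines file and sums the audio_length values. It uses only the Python standard library. The function names iter_entries and summarize are hypothetical, and the code assumes that audio_length is stored as a number of seconds, as the attribute description suggests. The internal layout of the words attribute is not shown in this preview, so the sketch does not touch it.

```python
import gzip
import json


def iter_entries(path):
    """Yield one dict per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(path):
    """Return (number of entries, total audio length in seconds)."""
    count = 0
    total_seconds = 0.0
    for entry in iter_entries(path):
        count += 1
        # Assumption: audio_length is numeric and expressed in seconds.
        total_seconds += float(entry["audio_length"])
    return count, total_seconds
```

If the assumptions hold, summarize("ParlaSpeech-PL.v1.0.jsonl.gz") should give values close to the 535465 entries and 3635354 seconds stated in the item record. Streaming line by line avoids loading the whole file into memory.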
                                            
