dc.contributor.author | Dobranić, Filip |
dc.date.accessioned | 2024-10-07T13:08:20Z |
dc.date.available | 2024-10-07T13:08:20Z |
dc.date.issued | 2024-10-03 |
dc.identifier.uri | http://hdl.handle.net/11356/1945 |
dc.description | This is a corpus of 1915 "jokes of the day" ("šala dneva") published by the Slovenian news portal 24ur.com. The jokes were scraped from the portal's archive on September 18th, 2024. The initial list is lightly curated: shorter texts found in the original collection were removed from the corpus since they appear to be illustration captions without the accompanying illustrations. Readers of the news portal vote on the jokes with thumbs up and thumbs down buttons. The voting results are included as metadata with each joke. Several jokes have been published more than once. Each joke (distinguished based on exact text matches) is identified by a hash of its text and carries a list of voting results, one for every instance of its publication. The normalised_text field contains text with punctuation corrections. For now, this is limited to replacing '' (two consecutive apostrophes U+0027) with " (a single straight/dumb/vertical quotation mark U+0022). The former (two apostrophes) is consistently used in place of the latter in the original corpus. Based on the name ("Šala dneva", i.e. "Joke of the day") and the observed frequency of posting during September 2024, we assume each entry corresponds to one day, starting from the day of data collection and counting backwards. Each voting event has an associated estimated publication date calculated under this assumption. The jokes are linguistically annotated with CLASSLA-Stanza (https://github.com/clarinsi/classla), using the models for standard Slovenian. 
The JSONL file contains entries representing individual jokes containing:
- a hash of the original joke text used for duplicate identification (key: hash)
- original scraped text (key: original_text)
- normalised text (key: normalised_text)
- linguistically annotated normalised text in CoNLL-U format (key: processed_text)
- a list of vote objects containing joke vote metadata (key: votes)
- votes for (key: votes.for)
- votes against (key: votes.against)
- estimated dates of joke publication and voting (key: estimated_date)
The corpus contains 16658 sentences, 129063 tokens, and 662 recognised named entities. |
dc.language.iso | slv |
dc.publisher | Institute of Contemporary History |
dc.rights | Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-nc/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://dihur.si/muki/humor |
dc.subject | jokes |
dc.title | Corpus of daily jokes from the 24ur.com portal Šale24 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Filip Dobranić filip.dobranic@inz.si Institute of Contemporary History |
size.info | 16658 sentences |
size.info | 129063 tokens |
size.info | 1915 texts |
files.count | 1 |
files.size | 1782120 |
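The normalisation rule and the date-estimation assumption given in the description can be illustrated with a minimal Python sketch. It is not the author's code: the function names are hypothetical, and it assumes that entries are ordered newest first and that the first entry falls on the collection date itself (September 18th, 2024).

```python
from datetime import date, timedelta

# Scrape date stated in the description.
COLLECTION_DATE = date(2024, 9, 18)


def normalise_text(text: str) -> str:
    """Replace two consecutive apostrophes (U+0027 U+0027) with one U+0022."""
    return text.replace("''", '"')


def estimate_dates(n_entries: int, start: date = COLLECTION_DATE) -> list[date]:
    """Assign one day per entry, counting backwards from `start`.

    Assumption (not confirmed by the source): entry 0 is the newest
    and corresponds to `start` itself.
    """
    return [start - timedelta(days=i) for i in range(n_entries)]


normalised = normalise_text("''Dober dan,'' je rekel.")
first_three = estimate_dates(3)
```

Under these assumptions the third entry in the archive would be dated two days before the collection date; any gap in the portal's actual posting schedule would shift the estimates.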
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
- Name: sale_annotated.zip
- Size: 1.7 MB
- Format: application/zip
- Description: Compressed jsonl file with 1915 jokes and metadata
- MD5: 589039ff5ed4a68e79e2f5a42afff1f7