dc.contributor.author | Kuzman, Taja |
dc.contributor.author | Ljubešić, Nikola |
dc.date.accessioned | 2024-12-04T09:32:47Z |
dc.date.available | 2024-12-04T09:32:47Z |
dc.date.issued | 2024-12-02 |
dc.identifier.uri | http://hdl.handle.net/11356/1991 |
dc.description | The multilingual IPTC Media Topic dataset EMMediaTopic 1.0 is a collection of news articles in Catalan, Croatian, Greek, and Slovenian, automatically annotated with the 17 top-level topic labels from the IPTC NewsCodes Media Topic hierarchical schema. The texts were annotated by the GPT-4o large language model, accessed via the OpenAI API (https://openai.com/index/hello-gpt-4o/). Evaluation against a manually annotated test set showed that the model consistently achieves high performance, with an average macro-F1 score of 0.731 and a micro-F1 score of 0.722. Additionally, an assessment of inter-annotator agreement on the test set showed that the reliability of the GPT model as a data annotator is comparable to that of human annotators.
The EMMediaTopic dataset consists of 21,000 texts, divided into a training set (20,000 instances) and a development set (1,000 instances), both of which have an identical distribution of labels. The dataset comprises news articles from the Catalan (ca), Croatian (hr), Greek (el), and Slovenian (sl) MaCoCu-Genre corpora (http://hdl.handle.net/11356/1969). For each language, a random sample of 5,250 texts classified under the "News" genre was extracted from the web corpus. Because the XLM-RoBERTa model fine-tuned on this dataset accepts only a limited input length, the texts were truncated to the first 512 words.
The dataset employs the following 17 top-level IPTC NewsCodes Media Topic (https://cv.iptc.org/newscodes/mediatopic) labels: 'arts, culture, entertainment and media', 'conflict, war and peace', 'crime, law and justice', 'disaster, accident and emergency incident', 'economy, business and finance', 'education', 'environment', 'health', 'human interest', 'labour', 'lifestyle and leisure', 'politics', 'religion', 'science and technology', 'society', 'sport', and 'weather'.
The EMMediaTopic dataset is provided in JSONL format, where each text is accompanied by the following metadata: document_id (document ID from the MaCoCu-Genre corpus), lang (language code: ca, el, hr, or sl), GPT-IPTC-label (GPT-assigned IPTC topic label), and split (train or dev). This dataset was used for the development of the Multilingual IPTC news topic classifier (https://huggingface.co/classla/multilingual-IPTC-news-topic-classifier), a fine-tuned Transformer-based XLM-RoBERTa model that can be applied to any of the languages included in the XLM-RoBERTa pretraining dataset. |
dc.language.iso | hrv |
dc.language.iso | slv |
dc.language.iso | ell |
dc.language.iso | cat |
dc.publisher | Jožef Stefan Institute |
dc.relation | info:eu-repo/grantAgreement/EC/HE/101129751 |
dc.relation.isreferencedby | https://doi.org/10.48550/arXiv.2411.19638 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://emma.ijs.si/en/project-plans/ |
dc.subject | topic classification |
dc.subject | IPTC Media Topic |
dc.subject | news corpus |
dc.title | Multilingual IPTC Media Topic dataset EMMediaTopic 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Taja Kuzman taja.kuzman@ijs.si Jožef Stefan Institute |
sponsor | ARRS (Slovenian Research Agency) L2-50070 Embeddings-based techniques for Media Monitoring Applications nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
sponsor | European Union EC/HE/101129751 OSCARS - O.S.C.A.R.S. - Open Science Clusters’ Action for Research and Society euFunds info:eu-repo/grantAgreement/EC/HE/101129751 |
size.info | 21000 texts |
size.info | 5449383 words |
files.count | 1 |
files.size | 74766275 |
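
The description above states that each JSONL line carries the metadata fields document_id, lang, GPT-IPTC-label, and split. The following is a minimal sketch of how such a file could be read and summarised, not an official loader. The local file path and the helper names load_records and count_by are assumptions made for illustration; the key that holds the article text is not named in this record, so the sketch does not touch it.

```python
import json
from collections import Counter
from pathlib import Path


def load_records(path):
    """Read a JSONL file: one JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def count_by(records, key):
    """Count records per value of a metadata field such as 'split' or 'lang'."""
    return Counter(record[key] for record in records)


if __name__ == "__main__":
    # Assumed download location; adjust to wherever the file was saved.
    path = Path("EMMediaTopic-1.0.jsonl")
    if path.exists():
        records = load_records(path)
        print(count_by(records, "split"))           # train vs. dev
        print(count_by(records, "lang"))            # ca, el, hr, sl
        print(count_by(records, "GPT-IPTC-label"))  # 17 top-level topics
```

If the file matches the description, the split counts should come to 20,000 train and 1,000 dev instances, and each language should account for 5,250 texts.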
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name | EMMediaTopic-1.0.jsonl |
Size | 71.3 MB |
Format | Unknown |
Description | EMMediaTopic 1.0 dataset |
MD5 | 017183c9be08151b792e4a29c434ff8c |
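
The MD5 checksum listed for EMMediaTopic-1.0.jsonl can be used to confirm that a download is complete and uncorrupted. The sketch below shows one way to do this with Python's standard hashlib module. The local file path and the helper name md5_of_file are assumptions for illustration; the expected digest is copied from this record.

```python
import hashlib
from pathlib import Path

# Checksum as listed in this record for EMMediaTopic-1.0.jsonl.
EXPECTED_MD5 = "017183c9be08151b792e4a29c434ff8c"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks so the whole file never sits in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Assumed download location; adjust to wherever the file was saved.
    path = Path("EMMediaTopic-1.0.jsonl")
    if path.exists():
        actual = md5_of_file(path)
        print("checksum matches" if actual == EXPECTED_MD5 else "checksum MISMATCH")
```

Reading in chunks keeps memory use flat for the roughly 71 MB file; a mismatch usually means the download was truncated or altered and should be repeated.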