dc.contributor.author Kuzman, Taja
dc.contributor.author Ljubešić, Nikola
dc.date.accessioned 2024-12-04T09:32:47Z
dc.date.available 2024-12-04T09:32:47Z
dc.date.issued 2024-12-02
dc.identifier.uri http://hdl.handle.net/11356/1991
dc.description The multilingual IPTC Media Topic dataset EMMediaTopic 1.0 is a collection of news articles in Catalan, Croatian, Greek, and Slovenian, automatically annotated with the 17 top-level topic labels from the IPTC NewsCodes Media Topic hierarchical schema. The texts were annotated by the GPT-4o large language model, accessed via the OpenAI API (https://openai.com/index/hello-gpt-4o/). Evaluation against a manually annotated test set showed that the model consistently achieves high performance, with an average macro-F1 score of 0.731 and a micro-F1 score of 0.722. Additionally, assessments of inter-annotator agreement on the test set revealed that the reliability of the GPT model used as a data annotator is comparable to that of human annotators.
The EMMediaTopic dataset consists of 21,000 texts, divided into a training set (20,000 instances) and a development set (1,000 instances), both of which have an identical distribution of labels. The dataset comprises news articles from the Catalan (ca), Croatian (hr), Greek (el), and Slovenian (sl) MaCoCu-Genre corpora (http://hdl.handle.net/11356/1969). For each language, a random sample of 5,250 texts classified under the "News" genre was extracted from the web corpus. Because of the input-length limit of the XLM-RoBERTa model fine-tuned on this dataset, the texts were truncated to the first 512 words.
The dataset employs the following 17 top-level IPTC NewsCodes Media Topic (https://cv.iptc.org/newscodes/mediatopic) labels: 'arts, culture, entertainment and media', 'conflict, war and peace', 'crime, law and justice', 'disaster, accident and emergency incident', 'economy, business and finance', 'education', 'environment', 'health', 'human interest', 'labour', 'lifestyle and leisure', 'politics', 'religion', 'science and technology', 'society', 'sport', and 'weather'.
The EMMediaTopic dataset is provided in JSONL format, where each text is accompanied by the following metadata: document_id (document ID from the MaCoCu-Genre corpus), lang (language code: ca, el, hr, or sl), GPT-IPTC-label (GPT-assigned IPTC topic label), and split (train or dev). This dataset was used for the development of the Multilingual IPTC news topic classifier (https://huggingface.co/classla/multilingual-IPTC-news-topic-classifier), a fine-tuned Transformer-based XLM-RoBERTa model that can be applied to any of the languages included in the XLM-RoBERTa pretraining dataset.
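The JSONL layout described above can be inspected with a few lines of Python. The following is a minimal sketch, not part of the official release: it relies only on the `split` and `GPT-IPTC-label` fields named in the description, and the helper names (`label_distribution`, `relative_frequencies`, `summarize_file`) are hypothetical. It tallies labels per split, which is one way to check the statement that the training and development sets share the same label distribution.

```python
import json
from collections import Counter


def label_distribution(lines):
    """Count GPT-IPTC-label values per split from an iterable of JSONL lines."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        split = record["split"]            # "train" or "dev"
        label = record["GPT-IPTC-label"]   # one of the 17 top-level labels
        counts.setdefault(split, Counter())[label] += 1
    return counts


def relative_frequencies(counter):
    """Turn absolute label counts into shares of the split."""
    total = sum(counter.values())
    return {label: n / total for label, n in counter.items()}


def summarize_file(path):
    """Print the size and label shares of each split (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        dist = label_distribution(f)
    for split, counter in dist.items():
        print(split, sum(counter.values()))
        for label, share in sorted(relative_frequencies(counter).items()):
            print(f"  {label}: {share:.3f}")
```

Calling `summarize_file("EMMediaTopic-1.0.jsonl")` on the downloaded file should list the two splits with their label shares; the path is an assumption about where the file was saved.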
dc.language.iso hrv
dc.language.iso slv
dc.language.iso ell
dc.language.iso cat
dc.publisher Jožef Stefan Institute
dc.relation info:eu-repo/grantAgreement/EC/HE/101129751
dc.relation.isreferencedby https://doi.org/10.48550/arXiv.2411.19638
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://emma.ijs.si/en/project-plans/
dc.subject topic classification
dc.subject IPTC Media Topic
dc.subject news corpus
dc.title Multilingual IPTC Media Topic dataset EMMediaTopic 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Taja Kuzman taja.kuzman@ijs.si Jožef Stefan Institute
sponsor ARRS (Slovenian Research Agency) L2-50070 Embeddings-based techniques for Media Monitoring Applications nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor European Union EC/HE/101129751 OSCARS - O.S.C.A.R.S. - Open Science Clusters’ Action for Research and Society euFunds info:eu-repo/grantAgreement/EC/HE/101129751
size.info 21000 texts
size.info 5449383 words
files.count 1
files.size 74766275


 Files in this item

This item is publicly available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name EMMediaTopic-1.0.jsonl
Size 71.3 MB
Format Unknown
Description EMMediaTopic 1.0 dataset
MD5 017183c9be08151b792e4a29c434ff8c
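
The MD5 value listed above can be used to confirm that a download is complete and unmodified. The sketch below is one possible way to do this with Python's standard `hashlib` module; the function names `md5_of_file` and `verify_download` are hypothetical, and the file is read in chunks so that the roughly 71 MB file does not need to fit in memory at once.

```python
import hashlib

# MD5 checksum as listed in the file record above.
EXPECTED_MD5 = "017183c9be08151b792e4a29c434ff8c"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify_download("EMMediaTopic-1.0.jsonl")` returns `True` only if the local copy matches the published checksum; the local path is an assumption.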