Multilingual IPTC Media Topic dataset EMMediaTopic 1.0

Name: Multilingual IPTC Media Topic dataset EMMediaTopic 1.0
License: https://creativecommons.org/licenses/by-sa/4.0/

Kuzman, Taja; Ljubešić, Nikola

dc.contributor.author	Kuzman, Taja
dc.contributor.author	Ljubešić, Nikola
dc.date.accessioned	2024-12-04T09:32:47Z
dc.date.available	2024-12-04T09:32:47Z
dc.date.issued	2024-12-02
dc.identifier.uri	http://hdl.handle.net/11356/1991
dc.description	The multilingual IPTC Media Topic dataset EMMediaTopic 1.0 is a collection of news articles in Catalan, Croatian, Greek, and Slovenian, automatically annotated with the 17 top-level topic labels from the IPTC NewsCodes Media Topic hierarchical schema. The texts were annotated by the GPT-4o large language model, accessed via the OpenAI API (https://openai.com/index/hello-gpt-4o/). Evaluation against a manually annotated test set showed that the model achieves an average macro-F1 score of 0.731 and a micro-F1 score of 0.722. Additionally, assessments of inter-annotator agreement on the test set revealed that the reliability of the GPT model used as a data annotator is comparable to that of human annotators. The EMMediaTopic dataset consists of 21,000 texts, divided into a training set (20,000 instances) and a development set (1,000 instances), which have an identical distribution of labels. The dataset comprises news articles from the Catalan (ca), Croatian (hr), Greek (el), and Slovenian (sl) MaCoCu-Genre corpora (http://hdl.handle.net/11356/1969). For each language, a random sample of 5,250 texts classified under the "News" genre was extracted from the web corpus. Due to the input-length limitations of the XLM-RoBERTa model fine-tuned on this dataset, the texts were truncated to the first 512 words. The dataset employs the following 17 top-level IPTC NewsCodes Media Topic (https://cv.iptc.org/newscodes/mediatopic) labels: 'arts, culture, entertainment and media', 'conflict, war and peace', 'crime, law and justice', 'disaster, accident and emergency incident', 'economy, business and finance', 'education', 'environment', 'health', 'human interest', 'labour', 'lifestyle and leisure', 'politics', 'religion', 'science and technology', 'society', 'sport', and 'weather'.
The EMMediaTopic dataset is provided in JSONL format, where each text is accompanied by the following metadata: document_id (document ID from the MaCoCu-Genre corpus), lang (language code: ca, el, hr, or sl), GPT-IPTC-label (GPT-assigned IPTC topic label), and split (train or dev). This dataset was used for the development of the Multilingual IPTC news topic classifier (https://huggingface.co/classla/multilingual-IPTC-news-topic-classifier), a fine-tuned Transformer-based XLM-RoBERTa model that can be applied to any of the languages included in the XLM-RoBERTa pretraining dataset.
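The JSONL layout described above can be read with a few lines of standard-library Python. The following is a minimal sketch, not part of the official distribution: it assumes that each line is a JSON object carrying the four documented metadata keys and that the label strings match the list in the description verbatim. The document IDs in the sample are hypothetical, and the key holding the article text is not named in this record, so the sketch does not touch it.

```python
import io
import json
from collections import Counter

# Values taken from the dataset description; exact spelling in the file is an assumption.
LANGS = {"ca", "el", "hr", "sl"}
SPLITS = {"train", "dev"}
IPTC_LABELS = {
    "arts, culture, entertainment and media", "conflict, war and peace",
    "crime, law and justice", "disaster, accident and emergency incident",
    "economy, business and finance", "education", "environment", "health",
    "human interest", "labour", "lifestyle and leisure", "politics",
    "religion", "science and technology", "society", "sport", "weather",
}
REQUIRED_KEYS = ("document_id", "lang", "GPT-IPTC-label", "split")


def read_records(lines):
    """Yield one dict per non-empty JSONL line, checking the documented fields."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing keys {missing}")
        if record["lang"] not in LANGS:
            raise ValueError(f"line {line_no}: unexpected lang {record['lang']!r}")
        if record["split"] not in SPLITS:
            raise ValueError(f"line {line_no}: unexpected split {record['split']!r}")
        if record["GPT-IPTC-label"] not in IPTC_LABELS:
            raise ValueError(f"line {line_no}: unexpected label")
        yield record


def count_labels(records):
    """Count records per (split, lang, label) combination."""
    counts = Counter()
    for record in records:
        counts[(record["split"], record["lang"], record["GPT-IPTC-label"])] += 1
    return counts


# Hypothetical in-memory sample standing in for the downloaded file.
sample = io.StringIO(
    '{"document_id": "doc-1", "lang": "sl", "GPT-IPTC-label": "sport", "split": "train"}\n'
    '{"document_id": "doc-2", "lang": "hr", "GPT-IPTC-label": "politics", "split": "train"}\n'
    '{"document_id": "doc-3", "lang": "sl", "GPT-IPTC-label": "sport", "split": "dev"}\n'
    '\n'
    '{"document_id": "doc-4", "lang": "el", "GPT-IPTC-label": "health", "split": "train"}\n'
)
counts = count_labels(read_records(sample))
for key, n in sorted(counts.items()):
    print(key, n)
```

To run the same check on the downloaded file, pass an open file handle (opened with encoding="utf-8") to read_records instead of the in-memory sample. If the label strings in the file differ from the description, relax or adapt the label check.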
dc.language.iso	hrv
dc.language.iso	slv
dc.language.iso	ell
dc.language.iso	cat
dc.publisher	Jožef Stefan Institute
dc.relation	info:eu-repo/grantAgreement/EC/HE/101129751
dc.relation.isreferencedby	https://doi.org/10.1109/ACCESS.2025.3544814
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	https://emma.ijs.si/en/project-plans/
dc.subject	topic classification
dc.subject	IPTC Media Topic
dc.subject	news corpus
dc.title	Multilingual IPTC Media Topic dataset EMMediaTopic 1.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Taja Kuzman taja.kuzman@ijs.si Jožef Stefan Institute
sponsor	ARRS (Slovenian Research Agency) L2-50070 Embeddings-based techniques for Media Monitoring Applications nationalFunds
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	European Union EC/HE/101129751 OSCARS - O.S.C.A.R.S. - Open Science Clusters’ Action for Research and Society euFunds info:eu-repo/grantAgreement/EC/HE/101129751
size.info	21000 texts
size.info	5449383 words
files.count	1
files.size	74766275