dc.contributor.author |
Kosem, Iztok |
dc.contributor.author |
Čibej, Jaka |
dc.contributor.author |
Dobrovoljc, Kaja |
dc.contributor.author |
Erjavec, Tomaž |
dc.contributor.author |
Ljubešić, Nikola |
dc.contributor.author |
Ponikvar, Primož |
dc.contributor.author |
Šinkec, Mihael |
dc.contributor.author |
Krek, Simon |
dc.date.accessioned |
2024-12-04T09:17:27Z |
dc.date.available |
2024-12-04T09:17:27Z |
dc.date.issued |
2024-12-04 |
dc.identifier.uri |
http://hdl.handle.net/11356/2001 |
dc.description |
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 76 publishers. Trendi 2024-11 covers the period from January 2019 to November 2024, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320).
The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf).
An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics).
The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through the CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem (iztok.kosem@ijs.si).
This version adds texts from November 2024. |
dc.language.iso |
slv |
dc.publisher |
Jožef Stefan Institute |
dc.publisher |
Centre for Language Resources and Technologies, University of Ljubljana |
dc.relation.isreferencedby |
http://euralex.org/wp-content/themes/euralex/proceedings/Euralex%202022/EURALEX2022_Pr_p230-239_Kosem.pdf |
dc.relation.isreferencedby |
https://nl.ijs.si/jtdh22/pdf/JTDH2022_Kosem-et-al_Spremljevalni-korpus-Trendi.pdf |
dc.relation.isreferencedby |
https://doi.org/10.4312/slo2.0.2023.1.161-188 |
dc.relation.replaces |
http://hdl.handle.net/11356/1981 |
dc.relation.isreplacedby |
http://hdl.handle.net/11356/2007 |
dc.source.uri |
https://sled.ijs.si/ |
dc.subject |
monitor corpus |
dc.subject |
news corpus |
dc.subject |
universal dependencies |
dc.subject |
temporal trends |
dc.subject |
topic attribution |
dc.title |
Monitor corpus of Slovene Trendi 2024-11 |
dc.type |
corpus |
metashare.ResourceInfo#ContentInfo.mediaType |
text |
has.files |
no |
branding |
CLARIN.SI data & tools |
contact.person |
Iztok Kosem iztok.kosem@ijs.si Jožef Stefan Institute |
sponsor |
Ministry of Culture of the Republic of Slovenia JR-infrastruktura-SJ-2021-2022 SLED - Monitor corpus of Slovene and related resources nationalFunds |
sponsor |
University of Ljubljana I0-0022 Network of Research Infrastructure Centres (MRIC) nationalFunds |
sponsor |
ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
size.info |
1014924429 tokens |
size.info |
850327266 words |
size.info |
2583568 texts |
files.count |
0 |
files.size |
0 |