dc.contributor.author Purver, Matthew
dc.contributor.author Shekhar, Ravi
dc.contributor.author Pranjić, Marko
dc.contributor.author Pollak, Senja
dc.contributor.author Martinc, Matej
dc.date.accessioned 2021-05-19T16:47:00Z
dc.date.available 2021-05-19T16:47:00Z
dc.date.issued 2021-04-19
dc.identifier.uri http://hdl.handle.net/11356/1410
dc.description The 24sata news portal consists of a main portal with daily news and several smaller portals covering specific topics, such as automotive news, health, culinary content, and lifestyle advice. The dataset contains over 650,000 articles in Croatian from 2007 to 2019, together with their assigned tags.

Description of the Dataset

The dataset consists of 11 columns and 657806 rows. Each row represents a single news article published on the 24sata news portals. Besides 'www.24sata.hr', the biggest news portal, articles from other niche portals affiliated with 24sata are also included.

Columns:

'article_id' - Public id of the article on the news site. The article can be accessed by concatenating the site URL and the article_id. For example, the article with article_id 614684 is available at 'www.24sata.hr/--614684'. This id is not unique across the dataset on its own: articles from different portals can share the same article_id.
'site' - The portal the article came from. There are eight different portals, ranging from daily news to more focused portals about automotive technologies and trends, health and wellness, culinary trends and recipes, or lifestyle advice.
'title' - The title of the news article.
'lead' - Lead text, a short introduction to the content of the article. Can be empty.
'content' - The content of the news article, containing the bulk of the text. Can be empty if the whole article fits in the lead text.
'tags' - Tags, zero or more, separated with a '|' character. Article tags are chosen by the author of the article.
'section' - The main section of the news portal where the article was posted (does not need to be set). The most frequent section is 'Vijesti' (News).
'subsection' - The subsection of the section where the article was posted (does not need to be set). Each section can have multiple subsections.
'authors' - Article authors, zero or more, separated with a '|' character. Authors are not required to sign their articles, so this can be empty.
'published_from' - The date when the article appeared on the portal. Journalists can write an article in advance and pick a future date and time for it to appear on the site, so 'published_from' can be much later than 'date_created'.
'date_created' - The date when the article was originally written. For all articles published before 2nd Feb 2010, 'date_created' is set to 2nd Feb 2010, the date when the portal was redesigned and the database of news articles was recreated.
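The column conventions described above (the '|' separator for 'tags' and 'authors', the article URL pattern, and the non-unique 'article_id') can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the maintainers' own loader: it assumes the CSV is comma-delimited with a header row naming the 11 columns, that the 'site' column holds the portal host name, and that the '/--' URL pattern shown for 'www.24sata.hr' also applies to the other portals. The helper names and the sample row are hypothetical.

```python
import csv
import io

# Hypothetical one-row sample in the documented column layout. Only the
# article_id, site, and section values come from the description; the
# remaining values are placeholders, and the date format is not specified.
SAMPLE = (
    "article_id,site,title,lead,content,tags,section,subsection,"
    "authors,published_from,date_created\n"
    "614684,www.24sata.hr,Example title,Example lead,,tag one|tag two,"
    "Vijesti,,,placeholder,placeholder\n"
)


def split_multi(value):
    """Split a '|'-separated field; an empty field yields an empty list."""
    return [part for part in value.split("|") if part]


def article_url(site, article_id):
    """Build the article address from the site and the id.

    Assumption: the '/--' pattern from the documented example holds
    for every portal, and 'site' contains the host name.
    """
    return f"{site}/--{article_id}"


def article_key(row):
    """Pair the id with the site, since article_id alone can repeat
    across portals. That the pair is unique is an assumption."""
    return (row["site"], row["article_id"])


def read_articles(handle):
    """Yield rows as dicts with 'tags' and 'authors' parsed into lists."""
    for row in csv.DictReader(handle):
        row["tags"] = split_multi(row["tags"])
        row["authors"] = split_multi(row["authors"])
        yield row


rows = list(read_articles(io.StringIO(SAMPLE)))
print(rows[0]["tags"], article_url(rows[0]["site"], rows[0]["article_id"]))
```

To read the real file, the in-memory sample would be replaced by something like `open("Styria-articles.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is also an assumption. Because `read_articles` is a generator, the 1.26 GB file can be processed row by row without loading it all at once.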
dc.language.iso hrv
dc.publisher Styria Media Group
dc.relation info:eu-repo/grantAgreement/EC/H2020/825153
dc.relation.isreferencedby https://www.aclweb.org/anthology/2021.hackashop-1.14.pdf
dc.rights Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.rights.label PUB
dc.source.uri http://embeddia.eu/
dc.subject news corpus
dc.subject croatian news articles
dc.subject 24sata news articles
dc.title 24sata news article archive 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Matthew Purver m.purver@qmul.ac.uk Queen Mary University
contact.person Marko Pranjic marko@entropia.hr Styria Media Group
sponsor European Union EC/H2020/825153 EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media euFunds info:eu-repo/grantAgreement/EC/H2020/825153
size.info 657806 texts
files.count 2
files.size 1355765049


 Files in this item

Name                 Size     Format    Description          MD5
Readme.md            5 KB     Unknown   Readme               e49e192aad9c473c9c873649962b4d1e
Styria-articles.csv  1.26 GB  CSV file  Styria Article Data  e7f0997d1e1361f473aaab6296e33818
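
The MD5 values listed for the two files can be used to check a download. The following Python sketch computes the digest in chunks, so the 1.26 GB CSV does not need to fit in memory. The function names and the assumption that the files are saved locally under their listed names are illustrative only.

```python
import hashlib

# MD5 checksums as listed in this record.
EXPECTED_MD5 = {
    "Readme.md": "e49e192aad9c473c9c873649962b4d1e",
    "Styria-articles.csv": "e7f0997d1e1361f473aaab6296e33818",
}


def md5_of_stream(stream, chunk_size=1 << 20):
    """Return the hex MD5 digest of a binary stream, read chunk by chunk."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, name):
    """Compare a local file's MD5 with the value listed for `name`."""
    with open(path, "rb") as handle:
        return md5_of_stream(handle) == EXPECTED_MD5[name]
```

For example, `verify_download("Styria-articles.csv", "Styria-articles.csv")` returns True only if the local copy matches the listed checksum.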
