# 24sata Articles Dataset (version 1.0)

This is a CSV dataset created from the news archive of 24sata, the biggest Croatian news publisher. The dataset reflects the state of the database at the end of the day on 2019-06-30 and includes all articles published up to that point. The earliest articles date from 2006, but a significant volume of published articles starts only in April 2007.

## Description of the Dataset

The dataset consists of 11 columns and 657806 rows. Each row represents a single news article published on one of the 24sata news portals. Besides 'www.24sata.hr', the biggest news portal, articles from other niche portals affiliated with 24sata are also included.

Columns:

* 'article_id' - Public id of the article on the news site. The article can be accessed by concatenating the site URL and the article_id. For example, the article with article_id 614684 is available at 'www.24sata.hr/--614684'. This id is not, by itself, unique across the dataset: articles from different portals can share the same article_id.
* 'site' - The location of the portal where the article came from. There are eight different portals, ranging from daily news to more focused portals about automotive technologies and trends, health and wellness, culinary trends and recipes, and lifestyle advice.
* 'title' - The title of the news article.
* 'lead' - Lead text, a short introduction to the content of the article. Can be empty.
* 'content' - The content of the news article, containing the bulk of the text. Can be empty if the whole article fits in the lead text.
* 'tags' - Tags, zero or more, separated with a '|' character. Article tags are chosen by the author of the article.
* 'section' - The main section of the news portal where the article was posted (does not need to be set). The most frequent section is 'Vijesti' (News).
* 'subsection' - The subsection of the section where the article was posted (does not need to be set).
  Each section can have multiple subsections.
* 'authors' - Article authors, zero or more, separated with a '|' character. Authors are not required to sign their articles, so this can be empty.
* 'published_from' - The date when the article appeared on the portal. Journalists can write an article in advance and pick a future date and time when it will appear on the site. Because of this, 'published_from' can be much later than 'date_created'.
* 'date_created' - The date when the article was originally written. For all articles published before 2 February 2010, 'date_created' is set to 2 February 2010: this is the date when the portal was redesigned and the database with news articles was recreated.

## Distribution by years

The dataset contains articles from 2007-2019 ('published_from' column), with 4 additional articles from 2006. The distribution is as follows.

| Year | Articles |
|------|----------|
| 2006 | 4 |
| 2007 | 22848 |
| 2008 | 42947 |
| 2009 | 47929 |
| 2010 | 45672 |
| 2011 | 41301 |
| 2012 | 44531 |
| 2013 | 50424 |
| 2014 | 53062 |
| 2015 | 59133 |
| 2016 | 73005 |
| 2017 | 66887 |
| 2018 | 72951 |
| 2019 | 37112 |

## Notes on Preprocessing

Articles are stored in the database as an HTML element together with any JavaScript needed to render the article. We remove all HTML tags, JavaScript and embedded content from the article and retain only the text.

For further preprocessing we recommend replacing:

* no-break space (u'\u00A0') with regular space
* zero width space (u'\u200B') with regular space
* en dash (u'\u2013') with minus sign
* em dash (u'\u2014') with minus sign
* horizontal bar (u'\u2015') with minus sign

For the remaining rare characters, it is simplest to replace them with a space character (if they occur between non-whitespace characters) or just drop them.

## Licence

This dataset is available as a part of the EMBEDDIA project (http://www.embeddia.eu). The copyright remains with the publisher; see the accompanying licence for conditions on use and distribution.
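## Example: applying the recommended replacements

The character replacements listed in the preprocessing notes can be sketched in Python as follows. This is only a minimal illustration, not the procedure used to build the dataset. The function name `normalize_text` is hypothetical, and the "minus sign" is assumed here to mean the ASCII hyphen-minus (U+002D). Handling of the remaining rare characters is left out, because the notes do not say which characters count as rare.

```python
# Replacements recommended in the preprocessing notes.
# Assumption: "minus sign" is taken to be the ASCII hyphen-minus "-".
REPLACEMENTS = {
    "\u00A0": " ",  # no-break space -> regular space
    "\u200B": " ",  # zero width space -> regular space
    "\u2013": "-",  # en dash -> minus sign
    "\u2014": "-",  # em dash -> minus sign
    "\u2015": "-",  # horizontal bar -> minus sign
}

# str.translate expects a table keyed by code point; str.maketrans builds it.
_TABLE = str.maketrans(REPLACEMENTS)


def normalize_text(text: str) -> str:
    """Apply the single-character replacements to one text field."""
    return text.translate(_TABLE)
```

The same function can be applied to the 'title', 'lead' and 'content' columns after loading the CSV.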
## Associated publication:

    @inproceedings{HackashopResources2021,
        title = "{EMBEDDIA} Tools, Datasets and Challenges: {R}esources and Hackathon Contributions",
        author = "Pollak, Senja and
          Robnik {\v S}ikonja, Marko and
          Purver, Matthew and
          Boggia, Michele and
          Shekhar, Ravi and
          Pranji{\'c}, Marko and
          Salmela, Salla and
          Krustok, Ivar and
          Paju, Tarmo and
          Linden, Carl-Gustav and
          Lepp{\"a}nen, Leo and
          Zosa, Elaine and
          Ul{\v c}ar, Matej and
          Freienthal, Linda and
          Traat, Silver and
          Cabrera-Diego, Luis Adri{\'a}n and
          Martinc, Matej and
          Lavra{\v c}, Nada and
          {\v S}krlj, Bla{\v z} and
          {\v Z}nidar{\v s}i{\v c}, Martin and
          Pelicon, Andra{\v z} and
          Koloski, Boshko and
          Podpe{\v c}an, Vid and
          Kranjc, Janez and
          Sheehan, Shane and
          Boros, Emanuela and
          Moreno, Jose and
          Doucet, Antoine and
          Toivonen, Hannu",
        booktitle = "Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation",
        month = apr,
        year = "2021",
        publisher = "Association for Computational Linguistics",
    }

## Contact:

Marko Pranjić: marko@entropia.hr
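## Example: reading the columns

The column conventions described above ('|'-separated 'tags' and 'authors', article URLs built from 'site' and 'article_id', and 'article_id' being unique only within a portal) can be sketched in Python as follows. This is a hedged illustration, not an official loader. All function names are hypothetical. It assumes that the CSV is comma-delimited with a header row naming the columns, and that 'site' holds a host name such as 'www.24sata.hr'.

```python
import csv
from typing import Dict, Iterable, Iterator, List, Tuple


def split_multi(value: str) -> List[str]:
    """Split a '|'-separated field ('tags' or 'authors'); empty field -> []."""
    return [part for part in value.split("|") if part]


def article_url(site: str, article_id: str) -> str:
    """Concatenate site and article_id, as in 'www.24sata.hr/--614684'."""
    return f"{site}/--{article_id}"


def article_key(article: Dict[str, object]) -> Tuple[object, object]:
    """article_id alone is not unique, so pair it with the site."""
    return (article["site"], article["article_id"])


def read_articles(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per article from an open CSV file or any iterable of lines.

    Assumption: the first line is a header with the column names.
    """
    for row in csv.DictReader(lines):
        article: Dict[str, object] = dict(row)
        article["tags"] = split_multi(row.get("tags") or "")
        article["authors"] = split_multi(row.get("authors") or "")
        yield article
```

To read from disk, pass a file opened with `open(path, newline="", encoding="utf-8")`, where `path` points to the dataset file; the encoding is an assumption.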