dc.contributor.author Cinelli, Matteo
dc.contributor.author Pelicon, Andraž
dc.contributor.author Mozetič, Igor
dc.contributor.author Quattrociocchi, Walter
dc.contributor.author Kralj Novak, Petra
dc.contributor.author Zollo, Fabiana
dc.date.accessioned 2021-10-04T09:31:59Z
dc.date.available 2021-10-04T09:31:59Z
dc.date.issued 2021-10-01
dc.identifier.uri http://hdl.handle.net/11356/1450
dc.description We present an Italian YouTube dataset manually annotated for hate speech types and targets. The comments to be annotated were sampled from the Italian YouTube comments on videos about the Covid-19 pandemic in the period from January 2020 to May 2020. Two sets were annotated: a training set with 59,870 comments (IMSyPP_IT_YouTube_comments_train.csv) and an evaluation set with 10,536 comments (IMSyPP_IT_YouTube_comments_evaluation.csv). The dataset was annotated by 8 annotators, with each comment annotated by two annotators. It was used to train a classification model for hate speech type detection, which is publicly available at https://huggingface.co/IMSyPP/hate_speech_it.
The dataset consists of the following fields:
ID_Commento - YouTube ID of the comment
ID_Video - YouTube ID of the video under which the comment was posted
Testo - text of the comment
Tipo - type of hate speech
Target - the target of hate speech
Additionally, we have included the Italian YouTube data (SR_YT_comments.csv), which was collected in the same period as the training data and annotated using the aforementioned model. The automatically labeled data was used to analyze the relationship between hate speech and misinformation on Italian YouTube. The results of this analysis are presented in the associated paper.
The analyzed data are represented with the following fields:
ID_Commento - YouTube ID of the comment
Label - label automatically assigned by the model
is_questionable - the type of channel from which the comment was collected; channels are categorized as spreading either reliable or questionable information
dc.language.iso ita
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby https://arxiv.org/abs/2105.14005
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri http://imsypp.ijs.si/
dc.subject hate speech
dc.subject misinformation
dc.subject YouTube
dc.title Italian YouTube Hate Speech Corpus
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Andraž Pelicon Andraz.Pelicon@ijs.si IJS
sponsor European Union’s Rights, Equality and Citizenship Programme 875263 IMSyPP - Innovative Monitoring Systems and Prevention Policies of Online Hate Speech Other
files.count 3
files.size 123337660
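
The field list in dc.description suggests how the annotated files might be read. The following is a minimal sketch, not the authors' tooling: it assumes the CSV files are UTF-8, comma-delimited, and carry a header row with exactly the column names listed above, none of which this record confirms. The function name count_types and the constant EXPECTED_FIELDS are hypothetical.

```python
import csv
from collections import Counter

# Column names as listed in dc.description; the presence of a header row
# with these exact names is an assumption, not confirmed by the record.
EXPECTED_FIELDS = ["ID_Commento", "ID_Video", "Testo", "Tipo", "Target"]


def count_types(path, delimiter=","):
    """Count how often each value of the Tipo column occurs in one file."""
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        header = reader.fieldnames or []
        missing = [name for name in EXPECTED_FIELDS if name not in header]
        if missing:
            # Fail early if the file layout differs from the assumption.
            raise ValueError(f"missing columns: {missing}")
        for row in reader:
            counts[row["Tipo"]] += 1
    return counts
```

Calling count_types("IMSyPP_IT_YouTube_comments_train.csv") would, under these assumptions, return one count per hate speech type. Because each comment was annotated by two annotators, the number of rows may differ from the number of distinct ID_Commento values; the record does not say how the two annotations are stored.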


 Files in this item

This item is publicly available and licensed under Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
Name: IMSyPP_IT_YouTube_comments_train.csv
Size: 28.97 MB
Format: CSV file
Description: Italian Hate Speech YouTube Dataset - Training Set
MD5: 9fb6156d0466c818434815f2327feb59

Name: IMSyPP_IT_YouTube_comments_evaluation.csv
Size: 5.25 MB
Format: CSV file
Description: Italian Hate Speech YouTube Dataset - Evaluation Set
MD5: 3a798d42202bad3b76b769615a53149c

Name: SR_YT_comments.csv
Size: 83.4 MB
Format: CSV file
Description: Automatically Labeled Italian YouTube Comments
MD5: eb3807a634bfd7765c214eb213ee1651
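
The MD5 values listed for the three files can be used to check a download for corruption. A minimal sketch using Python's standard hashlib module follows; the helper names md5_of_file and matches_published are hypothetical, and the sketch assumes the files were saved locally under their published names.

```python
import hashlib

# Checksums copied from the file listing in this record.
PUBLISHED_MD5 = {
    "IMSyPP_IT_YouTube_comments_train.csv": "9fb6156d0466c818434815f2327feb59",
    "IMSyPP_IT_YouTube_comments_evaluation.csv": "3a798d42202bad3b76b769615a53149c",
    "SR_YT_comments.csv": "eb3807a634bfd7765c214eb213ee1651",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, name):
    """Compare a local file's digest with the published value for `name`."""
    return md5_of_file(path) == PUBLISHED_MD5[name]
```

For example, matches_published("SR_YT_comments.csv", "SR_YT_comments.csv") should return True for an intact download. MD5 is suitable here only as an integrity check against accidental corruption, not as protection against deliberate tampering.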
