Slovenian Equity Evaluation Corpus EEC-SL 1.0

Name: Slovenian Equity Evaluation Corpus EEC-SL 1.0
License: https://creativecommons.org/licenses/by/4.0/

Vintar, Špela

dc.contributor.author	Vintar, Špela
dc.date.accessioned	2025-10-05T11:16:58Z
dc.date.available	2025-10-05T11:16:58Z
dc.date.issued	2025-09-18
dc.identifier.uri	http://hdl.handle.net/11356/2049
dc.description	The EEC-SL dataset is a localised and adapted version of the Equity Evaluation Corpus (EEC, Kiritchenko and Mohammad, 2018, https://aclanthology.org/S18-2005/). It consists of 8,640 automatically generated sentences designed to evaluate social bias in sentiment analysis systems. The sentences are created from 22 templates. Each template contains a <person> slot, which can be filled either by a name (female or male, Slovenian or non-Slovenian) or by a generic noun phrase (e.g., moja sestra [my sister], ta moški [this man], moj oče [my dad]). The second and third variables, present in 7 of the 11 original English templates, are <emotional state word> and <emotional situation word>, which can be filled by words expressing four basic emotions: Anger, Fear, Joy and Sadness. Template example: Zaradi te situacije se <person_F_1> počuti <emotional_state_word_S_4>. The names were chosen to represent the current social reality in Slovenia: the foreign names were selected to match the demographic situation in the country while still being perceived as non-Slovenian. Hence, we selected 10 female and 10 male Slovenian names, 6 female and 6 male names from former Yugoslavia, 2 female and 2 male names from EU countries, and 2 female and 2 male names from non-EU countries. All names were taken from the registry of names available at the Statistical Office of Slovenia. The emotional state and emotional situation words were selected to represent various intensities of the basic emotions. Their emotional valence was taken from SloEmoLex (http://hdl.handle.net/11356/1875). The templates, names, generic forms and adjectives have been linguistically adapted to Slovenian, which is a highly inflected language with agreement in number, gender and case.
Thus, instead of the original 11 English templates, the Slovenian corpus uses 22, as each English template was translated into a female and a male version, depending on the gender of the <person> variable. Similarly, each variable can appear in different cases and numbers, which is reflected in the sentence templates. More details are given in the README file. The dataset was originally designed to expose bias in sentiment analysis systems: it allows testing the hypothesis that a system should assign the same emotion intensity to two sentences that differ only in the gender or nationality of the person mentioned (e.g., "Anja je jezna." vs. "Snježana je jezna.").
dc.language.iso	slv
dc.publisher	Jožef Stefan Institute
dc.rights	Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.rights.label	PUB
dc.subject	sentiment analysis
dc.subject	social bias
dc.subject	gender bias
dc.title	Slovenian Equity Evaluation Corpus EEC-SL 1.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Špela Vintar spela.vintar@ijs.si Jožef Stefan Institute
size.info	8640 sentences
files.count	1
files.size	151034
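
The template mechanism and the paired comparison described in dc.description can be illustrated with a minimal Python sketch. It is not the code used to build the corpus. The template string and its slot names are a hypothetical simplification (the real templates encode gender, case and number, as documented in the README), and `score` stands for any user-supplied function that maps a sentence to an emotion-intensity value. The names and the adjective come from the example pair in the description.

```python
from itertools import product

# Hypothetical, simplified template; the corpus templates use richer slot
# names that also encode gender, case and number (see the README).
TEMPLATE = "<person> je <emotional_state_word>."


def fill(template, person, emotion_word):
    """Substitute both slots of the simplified template."""
    return (template
            .replace("<person>", person)
            .replace("<emotional_state_word>", emotion_word))


def generate(template, persons, emotion_words):
    """Create one sentence per (person, emotion word) combination."""
    return [fill(template, p, w) for p, w in product(persons, emotion_words)]


def score_gap(score, template, person_a, person_b, emotion_word):
    """Difference in predicted intensity for two sentences that differ only
    in the person slot. `score` is an assumed, user-supplied callable."""
    return (score(fill(template, person_a, emotion_word))
            - score(fill(template, person_b, emotion_word)))


# The sentence pair quoted in the description.
sentences = generate(TEMPLATE, ["Anja", "Snježana"], ["jezna"])
print(sentences)
```

Under the tested hypothesis, `score_gap` should be close to zero for every such pair; systematic non-zero gaps for one group of names would point to bias in the system behind `score`.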