Title: Slovenian Semantic Lexicon sloWNet-USAS 1.0
Author: Mojca Brglez, Faculty of Arts, University of Ljubljana; Kristina Pahor de Maiti Tekavčič, Institute for Contemporary History
Contact: mojca.brglez@ff.uni-lj.si

About:
This lexicon extends the Slovenian semantic lexicon sloWNet 3.1 (http://hdl.handle.net/11356/1026) with semantic tags that follow the USAS ontology. The USAS ontology (Piao et al., 2005) is part of the UCREL semantic analysis system and is used for general language semantic description (https://ucrel.lancs.ac.uk/usas/). It consists of 21 major semantic fields (e.g., PHYSICAL ATTRIBUTES [O4]) and more than 400 semantic subcategories (e.g., Temperature [O4.6], Temperature : Cold [O4.6-]). The lexicon contains 41,135 entries in a tabular format. Please note that the tags were automatically mapped and have not (yet) been manually validated.

Structure:
The lexicon is a tab-separated file containing the following columns:
- LEXEME = lemma or multi-word expression from sloWNet
- POS = part-of-speech; one of: n (noun), v (verb), a (adjective), b (adverb)
- TAG = the tag of the lexeme's most basic/literal semantic domain
- TAG_DESCRIPTION = the Slovenian description of the tag
- match_type = the source of the mapping; see ALGORITHM below
- ALL_CANDIDATE_TAGS = all the candidate tags from English words that the algorithm chose from, separated by a semicolon (;)
- ALL_CANDIDATE_TAG_DESCRIPTIONS = the Slovenian descriptions of the candidate tags, separated by a semicolon (;)

ALGORITHM:
[1] Perfect first sense mapping:
match_type:
First, for a given Slovenian word and its part-of-speech (POS), the algorithm maps candidate semantic tags from the English words which (a) have the same POS as the Slovenian word, and (b) have the given Slovenian word as a translation of their first sense (sense 1 in WordNet). If mapping(s) exist, the algorithm proceeds to step [7] to assign the semantically most similar tag.

[2] Perfect other senses mapping:
match_type:
If no mapping is found in step 1, the algorithm continues on to find mappings for all other senses (sense 2+). If such mapping(s) exist, the algorithm proceeds to step [7] to assign the semantically most similar tag.

[3] Partial first sense mapping:
match_type:
If no mapping is found in step 2, the algorithm continues on to find sense 1 mappings for English words with a different POS from that of the given Slovenian word. If such mapping(s) exist, the algorithm proceeds to step [7] to assign the semantically most similar tag.

[4] Partial other senses mapping:
match_type:
If no mapping is found in step 3, the algorithm continues on to find mappings of all other senses (sense 2+) for English words with a different POS from that of the given Slovenian word. If such mapping(s) exist, the algorithm proceeds to step [7] to assign the semantically most similar tag.

[5] Related words mapping:
match_type:
If not even a partial mapping is found (the English word is not included in the USAS lexicon), the algorithm searches the morphological lexicon (Čibej et al., 2020; http://hdl.handle.net/11356/1386) to find a morphologically related word (i.e., sharing the same semantic root) of the same POS (e.g., noun kamen > noun kamenje). It then repeats the search for the appropriate English words as in steps 1-4, collects the semantic tags of these words in the USAS lexicon and continues to step 7.
match_type:
If a morphologically related word of the same POS is not found, the algorithm extends the search to related words of a different POS and repeats steps 1-4.

[6] Nearest neighbours mapping:
match_type:
If no mappings are found, the algorithm uses CLARIN fastText word embeddings (Ljubešić and Erjavec, 2018; http://hdl.handle.net/11356/1204) to find the top 3 nearest neighbours of the candidate word among the words present in sloWNet, considering only neighbours with the same part-of-speech, and collects their candidate tags according to steps 1-5.
After collecting the candidate tags, it proceeds to step 7.

[7] Closest tag selection:
When a mapping (or multiple mappings) is found, the algorithm tries to ascribe the most relevant and most basic domain to a given Slovenian word from all the collected tag candidates. To achieve that, we use CLARIN fastText word embeddings to embed both the word and the semantic tag descriptions, and compute cosine similarity scores between the word and the individual tags. The final semantic tag is the one that is most similar to the word in terms of cosine similarity.

USAS TAG CONCRETENESS SCORES:
Additionally, in order to facilitate metaphor analysis, the USAS_sl_conc.tsv tagset also includes concreteness scores of semantic categories. The category score is calculated by averaging the English concreteness ratings (Köper and Schulte im Walde, 2017; https://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/abst-ratings-en/) of all the words that have this particular tag as their first semantic tag in the English USAS lexicon. The concreteness scores were manually checked and, when a subdomain score deviated largely from that of the relevant top-level domain, manually amended.

References:
- Piao, Scott S.L., Archer, Dawn, Mudraya, Olga, Rayson, Paul, Garside, Roger, McEnery, Tony, and Wilson, Andrew, 2005, A Large Semantic Lexicon for Corpus Annotation. In Proceedings of the Corpus Linguistics 2005 conference, July 14-17, Birmingham, UK. Proceedings from the Corpus Linguistics Conference Series 1: 1, https://ucrel.lancs.ac.uk/people/paul/publications/cl2005_estlex.pdf.
- Čibej, Jaka; Arhar Holdt, Špela and Krek, Simon, 2020, List of word relations from the Sloleks 2.0 lexicon 1.0, Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1386.
- Ljubešić, Nikola and Erjavec, Tomaž, 2018, Word embeddings CLARIN.SI-embed.sl 1.0, Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1204.
- Köper, Maximilian, and Schulte im Walde, Sabine, 2017, Improving Verb Metaphor Detection by Propagating Abstractness to Words, Phrases and Individual Senses. In Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications (SENSE). Valencia, Spain.
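
Example (illustrative sketch, not the authors' code):
The closest tag selection described in step [7] of the ALGORITHM section can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the function names (cosine, embed_text, select_closest_tag) are hypothetical, the 3-dimensional vectors are invented toy values rather than real CLARIN.SI embeddings, the Slovenian tag descriptions are placeholders rather than actual TAG_DESCRIPTION values, and averaging token vectors to embed a multi-word description is an assumption, since this README does not say how descriptions are embedded.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def embed_text(text, vectors):
    """Average the vectors of the known tokens in `text`.

    Averaging is an assumed strategy for this sketch; returns None
    when no token of `text` has a vector.
    """
    known = [vectors[tok] for tok in text.lower().split() if tok in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def select_closest_tag(word, candidates, vectors):
    """Return the candidate tag whose description is most similar to `word`.

    `candidates` maps a tag code to its description. Returns None if the
    word or every description lacks a vector.
    """
    word_vec = embed_text(word, vectors)
    if word_vec is None:
        return None
    best_tag, best_score = None, float("-inf")
    for tag, description in candidates.items():
        desc_vec = embed_text(description, vectors)
        if desc_vec is None:
            continue
        score = cosine(word_vec, desc_vec)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag


# Toy vectors, invented for illustration only.
TOY_VECTORS = {
    "led": [0.9, 0.1, 0.0],
    "temperatura": [0.5, 0.5, 0.0],
    "hladno": [1.0, 0.0, 0.0],
}

# Tag codes are taken from the About section; descriptions are placeholders.
TOY_CANDIDATES = {
    "O4.6": "temperatura",
    "O4.6-": "temperatura hladno",
}

closest = select_closest_tag("led", TOY_CANDIDATES, TOY_VECTORS)
print(closest)
```

With real data, the candidates would come from the ALL_CANDIDATE_TAGS and ALL_CANDIDATE_TAG_DESCRIPTIONS columns and the vectors from the embeddings cited above; the sketch only shows the selection logic.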