SenseGraph is a graph-like structure of word senses for the most common words of the standard Croatian language, obtained from human-provided lexical substitutes for target words in context. SenseGraph is encoded in the Lexical Markup Framework (LMF; ISO 24613:2008) format.
SenseGraph consists of SenseCells, which are clusters of same-sense words obtained by grouping words according to the similarity of their lexical substitution sets and of the contexts they appear in. SenseCells can be thought of as synsets in standard computational lexicographic terminology, although they exhibit more variability, which can be attributed to sense modulations in specific contexts. SenseCells are linked to each other based on loose semantic relatedness.
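To make the grouping idea concrete, here is a minimal sketch of clustering by overlap of substitute sets. It is not the method used to build SenseGraph: the Jaccard measure, the single-link merging, the 0.5 threshold, the function names, and the toy substitutes for the word "list" are all assumptions chosen for illustration, and context similarity is left out entirely.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_substitutes(substitutes, threshold=0.5):
    """Group sentence ids whose substitute sets overlap enough (single-link).

    `substitutes` maps a sentence id to the set of substitutes proposed
    for the target word in that sentence. The threshold is an assumption.
    """
    ids = list(substitutes)
    parent = {i: i for i in ids}

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair of sentences that is similar enough.
    for a, b in combinations(ids, 2):
        if jaccard(substitutes[a], substitutes[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())


# Toy data, invented for illustration only (not taken from the resource):
# substitutes that annotators might give for "list" in four sentences.
toy = {
    "s1": {"papir", "stranica"},
    "s2": {"papir", "stranica", "arak"},
    "s3": {"novine", "dnevnik"},
    "s4": {"novine", "časopis", "dnevnik"},
}
clusters = cluster_by_substitutes(toy)
print(clusters)
```

With this toy input, sentences s1 and s2 share two of three substitutes and fall into one group, as do s3 and s4, while the two groups share nothing and stay apart. Each resulting group plays the role of one candidate SenseCell.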
In total, the resource covers 649 Croatian words across three parts of speech: nouns, verbs, and adjectives. More specifically, it contains 4,172 sentences for 230 nouns, 3,288 sentences for 200 verbs, and 4,116 sentences for 219 adjectives. The sentences were sampled from the SETimes.HR and hrWaC corpora and then clustered using a lexical-substitution-based clustering method, yielding 2,877 SenseCells.
Total number of sentences: 11,576
Total number of SenseCells: 2,877
Total number of words: 649
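As a quick consistency check, the per-part-of-speech figures given above can be summed and compared with the reported totals. The dictionary layout and variable names are only illustrative; the numbers are the ones stated in this description.

```python
# Per-part-of-speech counts as stated in the description.
counts = {
    "noun": {"words": 230, "sentences": 4172},
    "verb": {"words": 200, "sentences": 3288},
    "adjective": {"words": 219, "sentences": 4116},
}

total_words = sum(c["words"] for c in counts.values())
total_sentences = sum(c["sentences"] for c in counts.values())

print(total_words)      # 649
print(total_sentences)  # 11576
```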