This file explains the two files of the HyperVerb database. Its content is copied from the HyperVerb wiki.

########### modified wiki pages #########
- http://hyper-verb.ung.si/wiki/doku.php?id=start

##################### http://hyper-verb.ung.si/wiki/doku.php?id=material_in_the_databases ###########

Material in the databases

We give a description of the material included in the two sub-bases.

BCS

The verb selection for BCS was conducted using the corpora srWaC, hrWaC, bsWaC and meWaC, all of which are part of Clarin.si’s infrastructure, which uses NoSketch Engine to search and analyze different corpora. The criterion was frequency: the 3000 most frequent verbs from each of the corpora were included. The four BCS corpora overlap substantially, which is why the number of included verbs is not 12000, as would be expected without any overlap, but 5300, with a number of verbs repeated in regional variants.

Different shapes that the same verb has in two or more of the varieties were introduced as separate entries and annotated as variants of one verb. Some typical examples of variants are ekavian and ijekavian versions (e.g. verovati and vjerovati 'to believe'), or versions emerging from using different native suffixes to adopt borrowed verbs (e.g. lajk-a-ti and lajk-ova-ti 'to like (on social media)').

Slovenian

The list of the 3000 most common Slovenian verbs was made using Clarin.si’s infrastructure, which uses NoSketch Engine to search and analyze different corpora. For the purposes of this database, we used the Gigafida 2.0 corpus. The wiki page links to general information about the corpus and to its website.

Some general notes

Items that got on the list due to mistakes in annotation in the corpus were excluded from our list and replaced by the next verb on the list of most common verbs. One such example from Slovenian is ‘Hoče’. Hoče is indeed the 3rd person singular form of the verb hoteti, but it is also the name of a Slovenian municipality. Since hoteti ‘to want’ was independently on the list, the form ‘Hoče’ was excluded.

The list of verbs includes several homophonous verbs. Since the corpus is not annotated for meaning, homophonous verbs are counted as one verb. For example, in Slovenian, the verb brati can mean ‘read’ or ‘gather, collect’. In such cases the annotators annotated the verb for the properties associated with what they took to be the more frequent use of the verb. The same goes for prefixed versions (prebrati ‘to finish reading’ or ‘pick through’), but note that not all meanings appear with all prefixes (odbrati only means ‘collect some items from a set, separate’).

#################### http://hyper-verb.ung.si/wiki/doku.php?id=ra ##########

Root Allomorphy

In what follows we describe how root allomorphy is annotated in the Database.

Root allomorphy is taken to be any unpredictable difference in the root (i.e. the material between the prefix and the theme vowel) between the non-finite and the finite form of the verb. Defective verbs are the exception to the 'unpredictability' requirement. Cases of predictable/productive phonological allomorphy (such as pisati, pišemo 'to write, we write' in BCS and Slovenian) are not considered cases of root allomorphy here. We give examples of predictable allomorphy below.

Root allomorphy is annotated in two columns:
- Column Root allomorphy (y/n): 1 is entered if there are any unpredictable differences (i.e. instances of root allomorphy).
- Column Root allomorphs (list): The allomorphs of a verb are listed for verbs that exhibit root allomorphy.

In order to determine whether a verb exhibits root allomorphy, we compared the finite and the non-finite forms of that verb (and not all verbs that have the same root). This is shown below:

Brati 'to read' (BCS and Slovenian)
Compare brati 'to read' with beremo 'we read'.
Root allomorphy (y/n): 1; Root allomorphs (list): br, ber

Vzeti 'to take.pfv' (Slovenian)
Compare vzeti 'to take' with vzamemo 'we take'. (Crucially, do not compare vzeti with jemati 'to take.impf'.)
Root allomorphy (y/n): 0

Jemati 'to take.impf' (Slovenian)
Compare jemati 'to take' with jemljemo 'we take'. (Crucially, do not compare jemati with vzeti 'to take.pfv'.)
Root allomorphy (y/n): 0

Examples of root allomorphy in BCS:
Infinitive -- 1.pl -- Gloss -- Root allomorphs (list)
početi -- počnemo -- start -- če, čn
razneti -- raznese -- blow up -- ne, nes
žvakati -- žvaćemo -- chew -- žvak, žvać
doći -- dođemo -- come -- č, đ

Examples of root allomorphy in Slovenian:
Infinitive -- 1.pl -- Gloss -- Root allomorphs (list)
početi -- počnemo -- do -- č, čn
obiti -- obidemo -- go-around -- i, id
klati -- koljemo -- slaughter -- kl, kolj
gnati -- ženemo -- goad -- gn, žen

Examples of predictable allomorphy that we do not count as instances of root allomorphy (and which are therefore marked with a 0 in the Root allomorphy (y/n) column):
- peći, pečemo 'to bake, we bake' (kt to ć is productive in BCS)
- piti, pijemo 'to drink, we drink' (j epenthesis is productive)
- obuti, obujemo 'to put on shoes, we put on shoes' (j epenthesis is productive)
- gristi, grizemo 'to bite, we bite' (voicing assimilation is productive in Slovenian)
- pisati, pišemo 'to write, we write' (pisjemo to pišemo is a productive rule)
- krasti, krademo 'to steal, we steal' (kradti to krasti is a productive rule)
- kupovati, kupujemo 'to buy.impf, we buy.impf' (the ova/uje alternation is not counted as part of the root)

Side note: Suppletion

Some verbs have suppletive forms. In the Slovenian base, these are marked with 1 in the Root allomorphy (y/n) column, but in the Root allomorphs (list) column we list the infinitive, the 1sg./pl. present tense form and the l-participle. E.g., biti ‘to be’ has the forms: biti, sem/je, bil.

######################## URL: http://hyper-verb.ung.si/wiki/doku.php?id=theme_vowels ###########

Theme vowels

In what follows we describe how theme vowels were annotated in the Database.
The following theme vowel classes were annotated for BCS:

Theme Vowel Class -- Example (inf, pres.1pl, gloss) -- N Verbs (total = 5300) -- % of all verbs
a/a -- pitati, pitamo ‘ask’ / prov(j)eravati, prov(j)eravamo ‘check’ / dominirati, dominiramo ‘dominate’ -- 1702 -- 32.1%
i/i -- visiti, visimo ‘hang’ / graditi, gradimo ‘build’ / odlaziti, odlazimo ‘leave’ -- 1603 -- 30.2%
a/je -- plakati, plačemo (plak-je-mo) ‘cry’ / treptati, trepćemo (trept-je-mo) ‘blink’ / formulisati, formulišemo (formul-is-je-mo) ‘formulate’ / ispitivati, ispitujemo ‘question’ / dod(j)eljivati, dod(j)eljujemo ‘assign’ / kovati, kujemo ‘mint’ / rezultovati, rezultujemo ‘result’ / davati, dajemo ‘give’ -- 1021 -- 19.3%
∅/e -- pasti, pasemo ‘graze’ / bosti (← bod-∅-ti), bodemo ‘stab’ / piti, pijemo ‘drink’ / čuti, čujemo ‘hear’ / umr(ij)eti, umremo ‘die’ -- 315 -- 5.9%
u/e -- gurnuti, gurnemo ‘push’ / brinuti, brinemo ‘worry’ -- 258 -- 4.9%
(j)e/i -- gor(j)eti, gorimo ‘burn’ / crven(j)eti, crvenimo ‘become red’ / zreti, zrimo ‘ripen’ -- 184 -- 3.5%
∅/ne -- stati, stanemo ‘stop’ / pasti (← pad-∅-ti), padnemo ‘fall’ -- 126 -- 2.4%
a/i -- strujati, strujimo ‘flow’ / ležati, ležimo ‘lie’ / zaspati, zaspimo ‘fall asleep’ -- 62 -- 1.2%
(j)e/(ij)e -- sm(j)eti, sm(ij)emo ‘be permitted’ -- 17 -- 0.3%
a/e -- (h)rvati, (h)rvemo ‘wrestle’ / grebati, grebemo ‘scratch’ -- 9 -- 0.2%

The following theme vowel classes were annotated for Slovenian (3000 verbs, defective verbs excluded from the count):

Theme Vowel Class -- Example (inf, pres.1pl, gloss) -- N Verbs (total = 3000) -- % of all verbs
a/a -- delati, delamo ‘work’ / kopirati, kopiramo ‘copy’ / prepoznavati, prepoznavamo ‘recognise’ / dozdevati, dozdevamo ‘seem’ / pogovarjati, pogovarjamo ‘talk’ -- 1045 -- 34.83%
i/i -- deliti, delimo ‘share’ / graditi, gradimo ‘build’ / vstopiti, vstopimo ‘enter’ -- 863 -- 28.77%
a/je -- orati, orjemo ‘plough’ / sijati, sijemo ‘shine’ / prikazovati, prikazujemo ‘show’ / vzdrževati (← vzdrž-ov-a-ti), vzdrž-u-je-mo ‘abstain’ / skakati, skačemo (← skak-je-mo) ‘jump’ / pisati, pišemo (← pis-je-mo) ‘write’ / iskati, iščemo (← isk-je-mo) ‘seek’ -- 378 -- 12.60%
∅/e -- pasti, pasemo ‘graze’ / pasti (← pad-∅-ti), pademo ‘fall’ / odpreti, odpremo ‘open’ -- 285 -- 9.5%
i/e -- miniti, minemo ‘pass’ / iztegniti, iztegnemo ‘stretch out’ -- 144 -- 4.8%
e/i -- zveneti, zvenimo ‘sound’ / zoreti, zorimo ‘ripen’ -- 127 -- 4.23%
a/i -- bežati, bežimo ‘flee’ / bati, bojimo ‘be afraid’ -- 36 -- 1.20%
a/e -- brati, beremo ‘read’ / izzvati, izzovemo ‘challenge’ -- 46 -- 1.53%
∅/ne -- stati, stanemo ‘cost’ / pričeti, pričnemo ‘start’ / odeti, odenemo ‘cover up’ -- 27 -- 0.90%
e/e -- umeti, umemo ‘understand’ / vedeti, vemo ‘know’ -- 47 -- 1.57%

Given that the annotated theme vowel classes deviate from previously proposed theme vowel systems (see, for example, Toporišič 2004 for Slovenian and Pranjković & Silić 2005 for BCS), we add the guidelines that were used for annotation.

For the purposes of the database, a theme vowel is taken to be the affix before the inflectional ending, which reflects the conjugation class of a group of verbs. Since the theme vowel can depend on the finiteness of the verb form, theme vowel classes are represented as pairs: the first vowel of the pair corresponds to the theme vowel in a non-finite form (we give the infinitive), the second to the theme vowel in a finite form (for example the 1st person plural form, as in the tables above, or the 3rd person singular form, which is listed in the database). Note that in BCS the division does not fully match the finite vs. non-finite opposition, but we approximate it this way.

In order to determine the theme vowels, the conjugation of the verb and the root were considered (here root is used purely descriptively, as the pre-theme-vowel part of the verb, which can in fact be a morphologically complex unit). We give some examples below (these examples apply to both BCS and Slovenian).
gledati
gled-a-ti: gled- is taken to be the root as it can be found in other contexts (pogled ‘look, view.n’)
gled-a-ti: -ti is the infinitival ending
gled-a-mo: -mo is the 1st person plural ending
Theme vowels: a in non-finite form, a in finite form; the theme vowel class is a, a

držati
drž-a-ti: drž- is taken to be the root as it can be found in other contexts (Slovenian: drž-a, drž-e ‘posture.nom, gen’) and is also the part of the verb that does not change in the conjugation
drž-a-ti: -ti is the infinitival ending
drž-i-mo: -mo is the 1st person plural ending
Theme vowels: a in non-finite form, i in finite form; the theme vowel class is a, i

pisati
pis-a-ti: pis- is taken to be the root as it can be found in other contexts (pismo ‘letter.n’)
pis-a-ti: -ti is the infinitival ending
piš-e-mo: -mo is the 1st person plural ending
But: the theme vowel e would not induce the phonological change of the root; rather, this phonological change indicates that the theme vowel is -je
Theme vowels: a in non-finite form, je in finite form; the theme vowel class is a, je

The general guideline for annotation and theme vowel classification was to use as few classes as possible to capture as much data as possible. Based on this, some items which are sometimes taken to be theme vowels were reconsidered and reanalyzed as a combination of a verbal affix and a theme vowel. We list these items for both sub-bases.

BCS
1. -irati, -iramo (as in kopirati, kopiramo ‘to copy, we copy’): -ir-a-ti, -ir-a-mo; theme vowel class a, a
2. -nuti, -nemo (as in skoknuti, skoknemo ‘to jump, we jump’): -n-u-ti, -n-e-mo; theme vowel class u, e
3. -ovati, -ujemo (as in verovati, verujemo ‘to believe, we believe’): -ov-a-ti, -u-je-mo; theme vowel class a, je
4. -ivati, -ujemo (as in ukazivati, ukazujemo ‘to appear, we appear’): -iv-a-ti, -u-je-mo; theme vowel class a, je
5. -vati, -jemo (as in prodavati, prodajemo ‘to sell, we sell’): -v-a-ti, -je-mo; theme vowel class a, je
6. -isati, -išemo (as in spekulisati, spekulišemo ‘to speculate, we speculate’): -is-a-ti, -is-je-mo (only -je- will give the surface form -iš-e-mo); theme vowel class a, je

Slovenian
1. -irati, -iramo (as in organizirati, organiziramo ‘to organize, we organize’): -ir-a-ti, -ir-a-mo; theme vowel class a, a
2. -avati, -avamo (as in prepoznavati, prepoznavamo ‘to recognise, we recognise’): -av-a-ti, -av-a-mo; theme vowel class a, a
3. -evati, -evamo (as in dozdevati, dozdevamo ‘to seem, we seem’): -ev-a-ti, -ev-a-mo; theme vowel class a, a
4. -jati, -jamo (as in pogovarjati, pogovarjamo ‘to talk, we talk’): -j-a-ti, -j-a-mo; theme vowel class a, a
5. -niti, -nemo (as in kihniti, kihnemo ‘to sneeze, we sneeze’): -n-i-ti, -n-e-mo; theme vowel class i, e
6. -ovati/-evati, -ujemo (as in oblikovati, oblikujemo ‘to design, we design’; razmnoževati, razmnožujemo ‘to duplicate, we duplicate’): -ov/ev-a-ti, -u-je-mo; theme vowel class a, je

Some special cases

There are verbs that could be classified as belonging to several different theme vowel classes. While this is very much a question of an adequate analysis, some (pre-theoretic) decisions were made for the purposes of the annotation.

a, je or ∅, e class
Verbs with root allomorphy (e.g. klati, koljemo ‘to slaughter, we slaughter’): these verbs can be analysed as being in the a, je class (kl-a-ti, kol-je-mo) or in the ∅, e class (kla-∅-ti, kolj-e-mo). We add these verbs to the most frequent class they fit into. That is, since a, je is the third biggest class in both sub-bases, these verbs were annotated as a, je verbs.

a, je or a, e
In verbs with r or j in the root (e.g. orati, orjemo ‘to plough, we plough’ and smejati, smejemo (SLO) / smijati, smijemo (BCS) ‘to laugh, we laugh’) it is unclear whether the theme vowel class is a, je or a, e, as both r and j are known to be able to absorb j. These cases were always annotated as a, je verbs.
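The suffix reanalyses listed above for the two sub-bases can be sketched as a simple lookup. This is only an illustration, not part of the database tooling: the names `SUFFIX_CLASSES` and `suffix_class` are hypothetical, the suffix pairs and classes are copied from the two lists, and plain string matching on endings is an assumption that cannot replace the morphological analysis the annotators performed.

```python
# Hypothetical lookup table: (infinitive ending, 1pl present ending, theme vowel class).
# The pairs are copied from the BCS and Slovenian lists; order matters, because
# longer endings must be tried before the shorter ones they contain.
SUFFIX_CLASSES = {
    "BCS": [
        ("irati", "iramo", "a, a"),
        ("nuti", "nemo", "u, e"),
        ("ovati", "ujemo", "a, je"),
        ("ivati", "ujemo", "a, je"),
        ("isati", "išemo", "a, je"),
        ("vati", "jemo", "a, je"),
    ],
    "Slovenian": [
        ("irati", "iramo", "a, a"),
        ("avati", "avamo", "a, a"),
        ("evati", "evamo", "a, a"),
        ("jati", "jamo", "a, a"),
        ("niti", "nemo", "i, e"),
        ("ovati", "ujemo", "a, je"),
        ("evati", "ujemo", "a, je"),
    ],
}


def suffix_class(language, infinitive, present_1pl):
    """Return the class of the first suffix pair that matches both forms, else None."""
    for inf_ending, pres_ending, tv_class in SUFFIX_CLASSES[language]:
        if infinitive.endswith(inf_ending) and present_1pl.endswith(pres_ending):
            return tv_class
    return None


print(suffix_class("BCS", "verovati", "verujemo"))          # a, je
print(suffix_class("Slovenian", "kihniti", "kihnemo"))      # i, e
print(suffix_class("Slovenian", "dozdevati", "dozdevamo"))  # a, a
```

A verb that matches none of the listed pairs (for example gledati, gledamo) returns None and still has to be classified from its root and conjugation, as in the worked examples.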
e, e or ∅, e
For verbs which in BCS have (ij)e and in Slovenian have e in the infinitive (umr(ij)eti (BCS) / umreti (SLO) ‘to die’), the past participle was also considered. If the verb has ∅ in the past participle (umr-∅-l-a ‘die.pst-ptc.fem’ in both) and e in the finite form (umr-e-mo ‘we die’), it is annotated as a ∅, e verb. This separates such verbs from verbs in which e also appears in the past participle (e.g. sporazumeti ‘to understand’, sporazumela ‘understand.pst-ptc.fem’, sporazumemo ‘we understand’), which were annotated as e, e verbs.

######################## http://hyper-verb.ung.si/wiki/doku.php?id=prosodic_prominence ############

Prosodic prominence

For both Slovenian and BCS, prosodic prominence is marked across 8 columns: 4 for the infinitive and 4 for the 1st person plural present tense form.

In order to annotate prominence (in Slovenian: stress, in BCS: High tone), each verb is cut into 4 parts, starting from the end of the word form. This is shown in the schema below for the verb pogledati, pogledamo ‘to look, we look’, which has the same meaning and prominence placement in both Slovenian and BCS (note that the stress placement is not the same, as the stress is on po in BCS). The last line shows the annotation as found in the database.

verb infinitive (po-gled-a-ti)
4 = all preceding syllables -- po
3 = base-final syllable -- gled
2 = tv -- a
1 = inf -- ti

verb 1st person plural (po-gled-a-mo)
4 = all preceding syllables -- po
3 = base-final syllable -- gled
2 = tv -- a
1 = 1pl.present -- mo

As evident from the schema, the order in which we go through the verb is reversed: we start with the ending, followed by the theme vowel and then the base, which is split into the base-final syllable and all the preceding syllables. (This is because base prominence is almost always on the last syllable of the base.)

The stressed part of the verb is marked with 1. Since there is some variation between speakers, some verbs are marked with 1 and 2.
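The four-slot scheme can be sketched in a few lines. This is an illustration only: the function name `encode_prominence` and the use of Python lists for cells are assumptions, since the text does not specify how a cell holding more than one number is stored in the database files.

```python
# Slot numbers follow the schema above, counted from the end of the word form.
SLOTS = {
    1: "ending",
    2: "theme vowel",
    3: "base-final syllable",
    4: "all preceding syllables",
}


def encode_prominence(prominent_slots):
    """Encode one word form.

    prominent_slots lists the prominent slot of each attested variant, most
    common first: [3] means a single pattern with prominence on the base-final
    syllable, [4, 2] means variant 1 is prominent in slot 4 and variant 2 in slot 2.
    Returns a dict slot -> list of variant numbers, with [0] for an unmarked slot.
    """
    cells = {slot: [] for slot in SLOTS}
    for variant, slot in enumerate(prominent_slots, start=1):
        if slot not in SLOTS:
            raise ValueError(f"slot must be 1-4, got {slot}")
        cells[slot].append(variant)
    return {slot: marks or [0] for slot, marks in cells.items()}


print(encode_prominence([3]))     # {1: [0], 2: [0], 3: [1], 4: [0]}
print(encode_prominence([4, 2]))  # {1: [0], 2: [2], 3: [0], 4: [1]}
```

Each verb needs two such encodings, one for the infinitive and one for the 1st person plural present tense form, which together give the 8 columns.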
We give more (language-)specific instructions below.

General instructions with examples

We underline the prominent vowel for BCS and use caps for Slovenian.

General instructions with examples: Infinitives
Prominence marked on 1: The prominence is on the inflection, e.g., archaic Slovenian cves-∅-tI ‘to bloom’, Serbo-Croatian naći ‘to find’.
Prominence marked on 2: The prominence falls on the theme vowel, e.g., Slovenian igr-A-ti 'to play', Serbo-Croatian grebati 'to claw'.
Prominence marked on 3: The prominence falls on the final syllable of the base, e.g., Slovenian glEd-a-ti 'to watch', Serbo-Croatian gledati 'to watch'.
Prominence marked on 4: The prominence falls on a non-final syllable of the stem, e.g., Slovenian mAlic-a-ti 'to eat a snack', Serbo-Croatian dìrektorovati direktorov-a-ti 'to act as a director'.

General instructions with examples: 1st person plural, present tense
Prominence marked on 1: The prominence is on the inflection, e.g., Slovenian vemO 'we know', Serbo-Croatian znamo 'we know'.
Prominence marked on 2: The prominence falls on the theme vowel, e.g., Slovenian igr-A-mo 'we play', Serbo-Croatian računamo 'we calculate'.
Prominence marked on 3: The prominence falls on the final syllable of the stem, e.g., Slovenian glEd-a-mo 'we watch', Serbo-Croatian gledamo 'we watch'.
Prominence marked on 4: The prominence falls on a non-final syllable of the stem, e.g., Slovenian mAlic-a-mo 'we snack', Serbo-Croatian direktoru-je-mo 'we act as a director'.

The example below shows how prominence is marked (for BCS and Slovenian) for the verb gledati 'to watch'.

infinitive 'gledati': 1 = 0 -- 2 = 0 -- 3 = 1 -- 4 = 0
1st person plural 'gledamo': 1 = 0 -- 2 = 0 -- 3 = 1 -- 4 = 0

Language specific instructions

Language specific instructions: BCS

For BCS, we count as prominent the syllable that carries the (rightmost) high tone in the word. For the purposes of the database, we transformed the traditional accent representations to those relevant for us.
We underline the prominent vowel.

If there is a falling accent (kȕća 'house', prȃvda 'justice'), the stressed syllable is prominent (e.g. pȁdati is padati 'to fall', pȁdām is padam 'I fall', mȍlīm is molim 'I ask', kûmīm is kumim, kȃrtati is kartati, kȃrtām is kartam).

If there is a rising accent (rúka 'arm', nòga 'leg'), the syllable following the stressed syllable is prominent (gráditi is graditi 'to build', lòmiti is lomiti 'to break', lòmīm is lomim 'I break', nàpadati is napadati 'to attack', nàpadām is napadam 'I attack', mòliti is moliti 'to ask', zàmolim is zamolim 'I ask').

If a form can have more than one common prosodic shape, all are entered: the first using 1, the next using 2, etc. This means that some cells will contain multiple numbers. Consider the verb kòristiti and korístiti 'to use': 1 is used to describe kòristiti (1 is entered in all the relevant columns) and 2 to describe korístiti (2 is entered in all the relevant columns). This means that column 3 for Present Tense Prominence has a 1 and a 2 (because both forms have prominence on the same syllable: kòristīmo and kòrīstīmo).

infinitive 'koristiti': 1 = 0 -- 2 = 2 -- 3 = 1 -- 4 = 0
1st person plural 'koristimo': 1 = 0 -- 2 = 0 -- 3 = 1, 2 -- 4 = 0

As for Length, column 3 for the Infinitive and for the Present Tense will have a 2, because only the version korístiti has a long syllable at the end of the base.

Language specific instructions: Slovenian

Slovenian verbs were only annotated for the placement of stress and not for, e.g., tone. The annotation reflects the intuition of the annotators (native speakers of Slovenian) and the prominence marked in SSKJ2. When the two do not match, the verb is annotated as having two stress patterns: the one marked with 1 is assumed (by the annotators) to be more common than the one marked with 2. If inflectional stress (i.e., position 1) is possible only in some persons and is perceived as marked, this pattern was ignored.
The example below shows the annotation for the verb vprAš-a-ti/vpraš-A-ti 'to ask' (stress on the root was considered more common). There is no variation in the present form in this case.

infinitive 'vprašati': 1 = 0 -- 2 = 2 -- 3 = 0 -- 4 = 1
1st person plural 'vprašamo': 1 = 0 -- 2 = 0 -- 3 = 1 -- 4 = 0

########################### http://hyper-verb.ung.si/wiki/doku.php?id=length ###########

Length

This section is only relevant for BCS.

Vowel length is annotated across 8 columns: 4 for the infinitive and 4 for a present tense form. In order to annotate length, each verb is cut into 4 parts, starting from the end. This is shown in the schema below for poigravati, poigravamo ‘to play, we play’ and zavisiti, zavisimo ‘to depend, we depend’.

infinitive 'poigravati':
4 = all preceding syllables -- poi -- 0
3 = base-final syllable -- graav -- 1
2 = tv -- a -- 0
1 = inf -- ti -- 0

1st person plural 'poigravamo':
4 = all preceding syllables -- poi -- 0
3 = base-final syllable -- graav -- 1
2 = tv -- aa -- 1
1 = 1pl.present -- mo -- 0

infinitive 'zavisiti':
4 = all preceding syllables -- zaa -- 1
3 = base-final syllable -- vis -- 0
2 = tv -- i -- 0
1 = inf -- ti -- 0

1st person plural 'zavisimo':
4 = all preceding syllables -- zaa -- 1
3 = base-final syllable -- vis -- 0
2 = tv -- ii -- 1
1 = 1pl.present -- mo -- 0

For the purposes of the database we transformed the traditional prosodic representations to those relevant for us. We underline the long vowel. Any syllable carrying one of the following diacritics is long:

Diacritic -- Example
ˆ -- mȃjka is majka 'mother'
´ -- rúka is ruka 'hand'
ˉ -- nàpadām is napadam 'I attack'

Infinitive
Length marked on 1: The instruction was to enter 1 if the infinitive ending had a long vowel, which was never the case.
Length marked on 2: The instruction was to enter 1 if the infinitive theme vowel was long, which was never the case.
Length marked on 3: We entered 1 if the final syllable of the base is long, e.g., poigrávati – poigraav-a-ti 'to play'.
Length marked on 4: We entered 1 if a non-final syllable of the base is long, e.g., závisiti – zaavis-i-ti 'to depend'.

Present tense
Length marked on 1: The instruction was to enter 1 if the 1st person plural ending had a long vowel, which was never the case.
Length marked on 2: The instruction was to enter 1 if the present-tense theme vowel was long, e.g., ȉgrāmo – igr-aa-mo 'we play', závisīmo – zaavis-ii-mo 'we depend', poìgrāvāmo – poigraav-aa-mo 'we play'. This was always the case.
Length marked on 3: We entered 1 if the final syllable of the base is long, e.g., poìgrāvāmo – poigraav-aa-mo 'we play'.
Length marked on 4: We entered 1 if a non-final syllable of the base is long, e.g., závisīmo – zaavis-ii-mo 'we depend'.

If there are more possibilities, all are entered.
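The conversion from traditional accent marks to length and prominence can be sketched as follows. This is a minimal illustration, not the procedure actually used by the annotators. The function names are hypothetical, and the code assumes that every syllable nucleus is one of a, e, i, o, u (syllabic r is ignored), that a word carries at most one accent mark, and that the long falling accent may be encoded either as an inverted breve or as a circumflex, since both appear in the examples. Assigning the resulting syllables to the four columns still requires the morphological segmentation described above.

```python
import unicodedata

VOWELS = set("aeiou")  # assumption: syllabic r is not treated as a nucleus

DOUBLE_GRAVE = "\u030f"    # short falling, as in kȕća
INVERTED_BREVE = "\u0311"  # long falling, as in mȃjka
CIRCUMFLEX = "\u0302"      # alternative encoding of long falling, as in kûmīm
GRAVE = "\u0300"           # short rising, as in nòga
ACUTE = "\u0301"           # long rising, as in rúka
MACRON = "\u0304"          # length without accent, as in nàpadām

LONG = {INVERTED_BREVE, CIRCUMFLEX, ACUTE, MACRON}
FALLING = {DOUBLE_GRAVE, INVERTED_BREVE, CIRCUMFLEX}
RISING = {GRAVE, ACUTE}


def nucleus_marks(word):
    """Return, for each vowel in order, the set of combining marks it carries."""
    marks = []
    on_nucleus = False
    for ch in unicodedata.normalize("NFD", word.lower()):
        if unicodedata.combining(ch):
            # Marks on consonants (the acute of ć, the caron of č, š, ž) are skipped.
            if on_nucleus:
                marks[-1].add(ch)
        else:
            on_nucleus = ch in VOWELS
            if on_nucleus:
                marks.append(set())
    return marks


def long_syllables(word):
    """One boolean per syllable: True if the nucleus carries a length diacritic."""
    return [bool(m & LONG) for m in nucleus_marks(word)]


def prominent_syllable(word):
    """0-based index of the prominent syllable, or None if the word has no accent mark.

    Falling accent: the accented syllable itself is prominent.
    Rising accent: the syllable after the accented one is prominent.
    """
    for index, m in enumerate(nucleus_marks(word)):
        if m & FALLING:
            return index
        if m & RISING:
            return index + 1
    return None


print(long_syllables("poìgrāvāmo"))   # [False, False, True, True, False]
print(long_syllables("rúka"))         # [True, False]
print(prominent_syllable("gráditi"))  # 1
print(prominent_syllable("nàpadām"))  # 1
```

Normalizing to NFD first means the same code handles precomposed and decomposed input, and tracking whether the last base letter was a vowel keeps the acute accent of ć from being misread as a long rising accent.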