dc.contributor.author Arhar Holdt, Špela
dc.contributor.author Krek, Simon
dc.contributor.author Dobrovoljc, Kaja
dc.contributor.author Erjavec, Tomaž
dc.contributor.author Gantar, Polona
dc.contributor.author Čibej, Jaka
dc.contributor.author Pori, Eva
dc.contributor.author Terčon, Luka
dc.contributor.author Munda, Tina
dc.contributor.author Žitnik, Slavko
dc.contributor.author Robida, Nejc
dc.contributor.author Blagus, Neli
dc.contributor.author Može, Sara
dc.contributor.author Ledinek, Nina
dc.contributor.author Holz, Nanika
dc.contributor.author Zupan, Katja
dc.contributor.author Kuzman, Taja
dc.contributor.author Kavčič, Teja
dc.contributor.author Škrjanec, Iza
dc.contributor.author Marko, Dafne
dc.contributor.author Jezeršek, Lucija
dc.contributor.author Zajc, Anja
dc.date.accessioned 2022-12-05T11:47:15Z
dc.date.available 2022-12-05T11:47:15Z
dc.date.issued 2022-12-05
dc.identifier.uri http://hdl.handle.net/11356/1747
dc.description The SUK training corpus contains about 1 million tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation; some parts also contain further manually verified annotations. The morphosyntactic tags and (where present) syntactic dependencies are included both in the JOS/MULTEXT-East framework and in the framework of Universal Dependencies. The corpus is composed of several parts:
* ssj500k-syn (200,320 words): the syntactically annotated part of the updated ssj500k corpus 2.3 (http://hdl.handle.net/11356/1434); also contains named entity, verbal multiword expression, and semantic role label annotations;
* ssj500k-tag (299,927 words): the PoS-tagged part of the updated ssj500k corpus 2.3 (http://hdl.handle.net/11356/1434); also contains verbal multiword expression annotations;
* Ambiga (13,929 words): a corpus constructed to contain many words that are potentially ambiguous in lemma or PoS, in order to help train taggers and lemmatizers;
* ElexisWSD (27,091 words): the Slovenian part of the "Parallel sense-annotated corpus ELEXIS-WSD 1.0" (http://hdl.handle.net/11356/1674) with manually checked lemmatisation, PoS tagging, and syntactic parses; also contains named entity and semantic role label annotations;
* SentiCoref (340,401 words): the "Slovene corpus for aspect-based sentiment analysis - SentiCoref 1.0" (http://hdl.handle.net/11356/1285) with manually checked lemmatisation and PoS tagging; also contains named entity and coreference chain annotations.
The annotations follow: (1) the MULTEXT-East V6 morphosyntactic specifications for Slovene, https://nl.ijs.si/ME/V6/msd/, (2) the JOS dependency schema, https://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf, (3) the Universal Dependencies morphosyntactic specifications and syntactic dependencies for Slovene-SSJ, https://universaldependencies.org/, (4) the Janes annotation guidelines for Slovenian named entities, https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf, (5) the Guidelines of the PARSEME shared task on verbal multiword expressions, http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/. The vocabulary of (1) is provided in the back element of the TEI-encoded corpus, and those of (3)-(5) as taxonomies in its teiHeader. The semantic role labels are also documented in the teiHeader. Compared with the previous version, ssj500k 2.3, this version has significantly more text, corrects various annotation errors, annotates more text with syntactic parses, adds new types of annotation, updates the TEI encoding, provides CoNLL-U files with text metadata, and distinguishes UD-type CoNLL-U files from JOS-type CoNLL-U files.
dc.language.iso slv
dc.publisher Centre for Language Resources and Technologies, University of Ljubljana
dc.relation.replaces http://hdl.handle.net/11356/1434
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://rsdo.slovenscina.eu/en/language-resources
dc.subject part-of-speech tagging
dc.subject dependency treebank
dc.subject parsing
dc.subject named entities
dc.subject tokenisation
dc.subject manual annotation
dc.subject TEI
dc.subject verbal multiword expressions
dc.subject semantic role labelling
dc.subject CoNLL-U
dc.subject coreference resolution
dc.title Training corpus SUK 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Simon Krek simon.krek@guest.arnes.si Jožef Stefan Institute
contact.person Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute
contact.person Špela Arhar Holdt Spela.ArharHoldt@ff.uni-lj.si Centre for Language Resources and Technologies, University of Ljubljana
sponsor Ministry of Education, Science and Sport 3311-08-986003 Communication in Slovene Other
sponsor ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds
sponsor ARRS (Slovenian Research Agency) J6-8256 New grammar of contemporary standard Slovene: sources and methods nationalFunds
sponsor ARRS (Slovenian Research Agency) MR-37487 Young Researcher Programme nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor Ministry of Culture C3340-20-278001 Development of Slovene in a Digital Environment Other
size.info 2913 texts
size.info 48594 sentences
size.info 881668 words
size.info 1025639 tokens
files.count 2
files.size 45235424
featuredService.kontext search|https://www.clarin.si/kontext/query?corpname=suk10
featuredService.noske search|https://www.clarin.si/ske/#dashboard?corpname=suk10&struct_attr_stats=1
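
The description above says the corpus is distributed as CoNLL-U files with text metadata. As a rough illustration of how such a file might be read, the sketch below splits CoNLL-U text into sentences using only the standard 10-column layout and `# key = value` comment lines. The function name `read_conllu` and the sample sentence are hypothetical and not taken from the corpus; columns not needed for the illustration are left as `_`, and no claim is made about which metadata keys the SUK files actually carry.

```python
from typing import Dict, Iterable, Iterator

# The ten standard CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one dict per sentence: {'meta': {...}, 'tokens': [...]}."""
    meta, tokens = {}, []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                yield {"meta": meta, "tokens": tokens}
            meta, tokens = {}, []
        elif line.startswith("#"):
            # Comment lines of the form "# key = value" carry metadata.
            key, sep, value = line[1:].partition("=")
            if sep:
                meta[key.strip()] = value.strip()
        else:
            cols = line.split("\t")
            if len(cols) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(cols)}")
            tokens.append(dict(zip(FIELDS, cols)))
    if tokens:  # the file may not end with a blank line
        yield {"meta": meta, "tokens": tokens}


# Invented sample for illustration only (not from SUK).
SAMPLE = (
    "# sent_id = example.1\n"
    "# text = To je test.\n"
    "1\tTo\tta\tPRON\t_\t_\t3\tnsubj\t_\t_\n"
    "2\tje\tbiti\tAUX\t_\t_\t3\tcop\t_\t_\n"
    "3\ttest\ttest\tNOUN\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
```

For a real file, the same function could be fed an open file handle, for example `read_conllu(open("ssj500k-syn.ud.conllu", encoding="utf-8"))`, assuming the archive has been unpacked into the working directory.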


 Files in this item

 Download all files in item (43.14 MB)
This item is Publicly Available and licensed under Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
Name SUK.TEI.zip
Size 19.94 MB
Format application/zip
Description Corpus in TEI format
MD5 d9777bc76a5db135b6e9ee3cf12c9e71
File preview
  • SUK.TEI
    • SUK.xml (65 kB)
    • senticoref.xml (62 MB)
    • schema
      • tei_clarin.zip (87 kB)
      • tei_clarin.rnc (282 kB)
      • tei_clarin_schema.xml (3 kB)
      • tei_clarin.dtd (229 kB)
      • tei_clarin_doc.html (7 MB)
      • tei_clarin.rng (579 kB)
    • ssj500k-tag.xml (42 MB)
    • elexiswsd.xml (11 MB)
    • ssj500k-syn.xml (69 MB)
    • mte-fvlib.xml (1 MB)
    • 00README.txt (1 kB)
    • ambiga.xml (2 MB)
Name SUK.CoNLL-U.zip
Size 23.2 MB
Format application/zip
Description Corpus in CoNLL-U format
MD5 66b3db82d3356bbf80746d0b94e75d16
File preview
  • SUK.CoNLL-U
    • senticoref.ud.conllu (28 MB)
    • ambiga.jos.conllu (1 MB)
    • tei2conllu.xsl (26 kB)
    • ssj500k-syn.ud.conllu (17 MB)
    • ssj500k-tag.jos.conllu (31 MB)
    • elexiswsd.ud.conllu (2 MB)
    • ambiga.ud.conllu (1 MB)
    • ssj500k-tag.ud.conllu (23 MB)
    • senticoref.jos.conllu (36 MB)
    • 00README.txt (1 kB)
    • elexiswsd.jos.conllu (3 MB)
    • ssj500k-syn.jos.conllu (22 MB)
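
The MD5 values listed for the two archives can be used to check a download for corruption. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `check_downloads` are hypothetical, and the code assumes the archives were saved under their listed names in a single directory.

```python
import hashlib
from pathlib import Path

# Checksums as listed in this record.
EXPECTED_MD5 = {
    "SUK.TEI.zip": "d9777bc76a5db135b6e9ee3cf12c9e71",
    "SUK.CoNLL-U.zip": "66b3db82d3356bbf80746d0b94e75d16",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(directory="."):
    """Map each archive name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in check_downloads().items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

Reading in chunks keeps memory use low even for larger archives; a mismatch only indicates that the local file differs from the one described here, not why.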
