dc.contributor.author Lenardič, Jakob
dc.contributor.author Čibej, Jaka
dc.contributor.author Arhar Holdt, Špela
dc.contributor.author Erjavec, Tomaž
dc.contributor.author Fišer, Darja
dc.contributor.author Ljubešić, Nikola
dc.contributor.author Zupan, Katja
dc.contributor.author Dobrovoljc, Kaja
dc.date.accessioned 2022-12-06T13:46:18Z
dc.date.available 2022-12-06T13:46:18Z
dc.date.issued 2022-12-06
dc.identifier.uri http://hdl.handle.net/11356/1732
dc.description Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 15,000 short texts (190,000 words), mostly tweets but also blogs, forums and news comments. The corpus is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity annotation of non-standard Slovene. As the corpus has been carefully manually annotated, it is also suitable for detailed linguistic explorations which require highly accurate and reliable annotations. The corpus is composed of two parts: the older (texts up to 2016) and smaller (65,000 words) Janes-Tag 2.1, and the newer, tweet-only (2022, 125,000 words) Janes-RSDO. Only the Janes-Tag 2.1 part is annotated with named entities and with a classification of the texts according to their estimated technical (T1-T3) and linguistic (L1-L3) standardness. The data is available in the source TEI encoding and in the derived CoNLL-U format. Both formats contain JOS/MULTEXT-East morphosyntactic descriptions as well as Universal Dependencies morphological features. Compared to the previous version, this one corrects some errors, updates the encoding, and adds Janes-RSDO. The first version of this corpus is described in: FIŠER, Darja, LJUBEŠIĆ, Nikola, ERJAVEC, Tomaž. 2020. The Janes project: language resources and tools for Slovene user generated content. Language Resources and Evaluation. https://doi.org/10.1007/s10579-018-9425-z Note that a related corpus, Janes-Norm 3.0 (http://hdl.handle.net/11356/1733), is also available. It contains Janes-Tag 3.0 and an additional subcorpus with manually checked sentences, tokens and normalised words but only automatically assigned lemmas and MULTEXT-East MSDs.
dc.language.iso slv
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby https://nl.ijs.si/janes/viri/rocno-oznaceni-korpusi/#Janes-Tag
dc.relation.isreferencedby https://doi.org/10.1007/s10579-018-9425-z
dc.relation.replaces http://hdl.handle.net/11356/1238
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://nl.ijs.si/janes/
dc.subject computer-mediated communication
dc.subject tokenisation
dc.subject word normalisation
dc.subject part-of-speech tagging
dc.subject lemmatisation
dc.subject manual annotation
dc.subject TEI
dc.subject named entities
dc.title CMC training corpus Janes-Tag 3.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Jakob Lenardič jakob.lenardic@inz.si Institute of Contemporary History
sponsor ARRS (Slovenian Research Agency) J6-6842 JANES: Resources, Tools and Methods for the Research of Nonstandard Internet Slovene nationalFunds
sponsor ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds
sponsor Swiss National Science Foundation 160501 ReLDI Other
sponsor ARRS (Slovenian Research Agency) MR-37487 Young Researcher Programme nationalFunds
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor Ministry of Culture C3340-20-278001 Development of Slovene in a Digital Environment Other
size.info 14913 texts
size.info 217774 tokens
size.info 190268 words
files.count 2
files.size 9054146
featuredService.kontext search|https://www.clarin.si/kontext/query?corpname=janes_tag30
featuredService.noske search|https://www.clarin.si/ske/#dashboard?corpname=janes_tag30&struct_attr_stats=1
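
The description above states that the corpus is distributed in a derived CoNLL-U format. The following is a minimal sketch of how such a file could be read in Python, assuming the standard ten-column, tab-separated CoNLL-U layout with blank lines between sentences. The names `read_conllu` and `count_tokens` are illustrative and are not part of the corpus distribution; the counting rule (plain integer IDs only) is an assumption and is not claimed to reproduce the size.info figures.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Standard CoNLL-U column names (assumed layout, not taken from the corpus README).
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata; skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            # Skip rows that do not have exactly ten columns.
            continue
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence


def count_tokens(sentences: Iterable[List[Dict[str, str]]]) -> Tuple[int, int]:
    """Return (sentence count, token count), counting only plain integer IDs."""
    n_sentences = 0
    n_tokens = 0
    for sent in sentences:
        n_sentences += 1
        # Multiword ranges ("1-2") and empty nodes ("1.1") are not counted.
        n_tokens += sum(1 for tok in sent if tok["id"].isdigit())
    return n_sentences, n_tokens


if __name__ == "__main__":
    # File name as listed in the CoNLL-U archive preview.
    with open("janes-tag.ud.conllu", encoding="utf-8") as fh:
        print(count_tokens(read_conllu(fh)))
```

Reading sentence by sentence keeps memory use low, which may matter for the larger files in the archive.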


 Files in this item

 Download all files in item (8.63 MB)
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name Janes-Tag.3.0.TEI.zip
Size 2.74 MB
Format application/zip
Description Corpus in TEI format
MD5 66ff9bc7b8c1d0147a77314bfaf3f4c8
File Preview
  • Janes-Tag.3.0.TEI
    • janes-rsdo.xml (14 MB)
    • janes-tag.xml (7 MB)
    • schema
      • tei_clarin_example.xml (48 kB)
      • tei_clarin.rnc (321 kB)
      • tei_clarin_schema.xml (70 kB)
      • README.md (525 B)
      • tei_clarin.rng (683 kB)
    • Janes-Tag.3.0.xml (16 kB)
    • 00README.txt (1 kB)
Name Janes-Tag.3.0.CoNLL-U.zip
Size 5.9 MB
Format application/zip
Description Corpus in CoNLL-U format
MD5 e1166f9047438817b558d435f55def5b
File Preview
  • Janes-Tag.3.0.CoNLL-U
    • janes-tag.jos.conllu (7 MB)
    • janes-tag.ud.conllu (5 MB)
    • janes-rsdo.jos.connlu (14 MB)
    • tei2conllu.xsl (26 kB)
    • janes-rsdo.ud.connlu (11 MB)
    • 00README.txt (1 kB)
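
The record lists an MD5 value for each of the two archives. As a minimal sketch, a downloaded copy could be checked against those values with Python's standard `hashlib` module. The helper names `md5_of_file` and `verify_downloads` are illustrative, and the code assumes the archives were saved under the file names shown in this record.

```python
import hashlib
from pathlib import Path
from typing import Dict

# File names and MD5 values as listed in this item record.
EXPECTED_MD5 = {
    "Janes-Tag.3.0.TEI.zip": "66ff9bc7b8c1d0147a77314bfaf3f4c8",
    "Janes-Tag.3.0.CoNLL-U.zip": "e1166f9047438817b558d435f55def5b",
}


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory=".") -> Dict[str, bool]:
    """Map each expected archive name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify_downloads().items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

An MD5 match indicates that the download was not corrupted in transfer; it is not a guarantee against deliberate tampering.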
