dc.contributor.author Erjavec, Tomaž
dc.contributor.author Fišer, Darja
dc.contributor.author Čibej, Jaka
dc.contributor.author Arhar Holdt, Špela
dc.contributor.author Ljubešić, Nikola
dc.contributor.author Zupan, Katja
dc.contributor.author Dobrovoljc, Kaja
dc.date.accessioned 2019-09-12T11:14:01Z
dc.date.available 2019-09-12T11:14:01Z
dc.date.issued 2019-09-11
dc.identifier.uri http://hdl.handle.net/11356/1238
dc.description Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity annotation of non-standard Slovene. As the corpus has been carefully manually annotated, it is also suitable for detailed linguistic explorations which require highly accurate and reliable annotations.
As an update to version 2.0, this version corrects some minor errors in NER annotation and adds, alongside the MULTEXT-East morphosyntactic descriptions, Universal Dependencies morphological features and a version of the corpus in CoNLL-U format. The UD features are also included in the vert file.
The first version of this corpus is described in:
ERJAVEC, Tomaž, ČIBEJ, Jaka, ARHAR HOLDT, Špela, LJUBEŠIĆ, Nikola, FIŠER, Darja. 2016. Gold-standard datasets for annotation of Slovene computer-mediated communication. In Proceedings of RASLAN 2016: Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016, pp. 29-40, https://nlp.fi.muni.cz/raslan/raslan16.pdf
FIŠER, Darja, LJUBEŠIĆ, Nikola, ERJAVEC, Tomaž. 2018. The Janes project: language resources and tools for Slovene user generated content. Language Resources & Evaluation. https://rdcu.be/7RX4
Note that a related corpus, Janes-Norm, is also available, cf. http://hdl.handle.net/11356/1084.
dc.language.iso slv
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby http://nl.ijs.si/janes/viri/rocno-oznaceni-korpusi/#Janes-Tag
dc.relation.isreferencedby https://rdcu.be/7RX4
dc.relation.replaces http://hdl.handle.net/11356/1123
dc.relation.isreplacedby http://hdl.handle.net/11356/1732
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri http://nl.ijs.si/janes/
dc.subject computer-mediated communication
dc.subject tokenisation
dc.subject word normalisation
dc.subject part-of-speech tagging
dc.subject lemmatisation
dc.subject manual annotation
dc.subject TEI
dc.subject named entities
dc.title CMC training corpus Janes-Tag 2.1
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute
sponsor ARRS (Slovenian Research Agency) J6-6842 JANES: Resources, Tools and Methods for the Research of Nonstandard Internet Slovene nationalFunds
sponsor ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds
sponsor Swiss National Science Foundation 160501 ReLDI Other
sponsor ARRS (Slovenian Research Agency) MR-37487 Young Researcher Programme nationalFunds
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info 2958 texts
size.info 75276 tokens
files.count 7
files.size 5955006
featuredService.kontext Search|https://www.clarin.si/kontext/first_form?corpname=janes_tag
featuredService.noske Search|https://www.clarin.si/ske/#dashboard?corpname=janes_tag&struct_attr_stats=1
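
The description above notes that this version provides the corpus in CoNLL-U format with Universal Dependencies morphological features. The following is a minimal sketch of reading such a file with the Python standard library only. It assumes the standard ten tab-separated CoNLL-U columns; the helper names `read_conllu` and `parse_feats` are hypothetical, and the sample sentence is invented for illustration and is not taken from the corpus.

```python
import io

# Column names of the standard CoNLL-U format, in order.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def parse_feats(feats):
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def read_conllu(stream):
    """Yield sentences as lists of token dicts (column values kept as strings)."""
    sentence = []
    for line in stream:
        line = line.rstrip("\n")
        if not line:                      # a blank line ends a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):          # sentence-level comment/metadata
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        token = dict(zip(CONLLU_FIELDS, cols))
        # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if "-" in token["id"] or "." in token["id"]:
            continue
        sentence.append(token)
    if sentence:                          # file may lack a final blank line
        yield sentence


# Invented sample (NOT from Janes-Tag); unknown columns are left as '_'.
SAMPLE = "\n".join([
    "# sent_id = example.1",
    "# text = To je test.",
    "\t".join(["1", "To", "ta", "PRON", "_",
               "Case=Nom|Gender=Neut|Number=Sing|PronType=Dem",
               "_", "_", "_", "_"]),
    "\t".join(["2", "je", "biti", "AUX", "_",
               "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
               "_", "_", "_", "_"]),
    "\t".join(["3", "test", "test", "NOUN", "_",
               "Case=Nom|Gender=Masc|Number=Sing",
               "_", "_", "_", "SpaceAfter=No"]),
    "\t".join(["4", ".", ".", "PUNCT", "_", "_", "_", "_", "_", "_"]),
    "",
])

if __name__ == "__main__":
    for sent in read_conllu(io.StringIO(SAMPLE)):
        for tok in sent:
            print(tok["form"], tok["upos"], parse_feats(tok["feats"]))
```

To read an actual file from the unpacked Janes-Tag.conllu.zip, the `io.StringIO(SAMPLE)` argument would be replaced with `open(path, encoding="utf-8")`; the file name inside the archive is not listed in this record, so it has to be checked after unpacking.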


 Files in this item

This item is publicly available and licensed under Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
Name: Janes-Tag.TEI.zip
Size: 2.84 MB
Format: application/zip
Description: Corpus in TEI format
MD5: 041425def8d389f6bb972c45ac0b630d
Contents:
  • Janes-Tag.TEI
    • msd-fslib-sl.xml (487 kB)
    • janes.tag.xml (13 kB)
    • Schema
      • tei_clarin_doc.xml (7 MB)
      • tei_clarin.zip (87 kB)
      • tei_clarin.rnc (282 kB)
      • tei_clarin_schema.xml (3 kB)
      • tei_clarin.dtd (229 kB)
      • tei_clarin_doc.html (7 MB)
      • tei_clarin.rng (579 kB)
    • 00README.txt (206 B)
    • janes.tag.body.xml (13 MB)
Name: Janes-Tag.vert.zip
Size: 1.02 MB
Format: application/zip
Description: Corpus in derived vertical (Sketch Engine / CQP) format
MD5: 8499c9058d8d8ea9eb66b8498dd3536e
Name: Janes-Tag.conllu.zip
Size: 879 KB
Format: application/zip
Description: Corpus in derived CoNLL-U format
MD5: f2919990c4abe52a663e23f86c0c25f4
Name: Janes-smernice-v1.0.pdf
Size: 339.69 KB
Format: PDF
Description: Annotation Guidelines - main (in Slovene)
MD5: 39845f938d68b0e3330259eab31c6043
Name: SlovenianNER-eng-v1.0.pdf
Size: 229.47 KB
Format: PDF
Description: Annotation Guidelines - named entities (in English)
MD5: dd055da0cfe8a914485899dc0d41c2a2
Name: SlovenianNER-slv-v1.0.pdf
Size: 201.77 KB
Format: PDF
Description: Annotation Guidelines - named entities (in Slovene)
MD5: 9e6eab070ef053dc005fd145ec0827aa
Name: RASLAN16-Janes.pdf
Size: 210.85 KB
Format: PDF
Description: RASLAN'16 paper describing the corpus
MD5: 7487b904191a41f8cb38bbdfe12ba14e
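
Each file in this item is listed with an MD5 checksum, which can be used to check that a download is complete and unmodified. The following is a minimal sketch of such a check in Python, assuming the files have been saved under their listed names in a single directory. The helper names `md5_of_file` and `verify_downloads` are hypothetical; the checksums are copied from the listing above.

```python
import hashlib
from pathlib import Path

# MD5 checksums as listed in this item record.
EXPECTED_MD5 = {
    "Janes-Tag.TEI.zip": "041425def8d389f6bb972c45ac0b630d",
    "Janes-Tag.vert.zip": "8499c9058d8d8ea9eb66b8498dd3536e",
    "Janes-Tag.conllu.zip": "f2919990c4abe52a663e23f86c0c25f4",
    "Janes-smernice-v1.0.pdf": "39845f938d68b0e3330259eab31c6043",
    "SlovenianNER-eng-v1.0.pdf": "dd055da0cfe8a914485899dc0d41c2a2",
    "SlovenianNER-slv-v1.0.pdf": "9e6eab070ef053dc005fd145ec0827aa",
    "RASLAN16-Janes.pdf": "7487b904191a41f8cb38bbdfe12ba14e",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each expected file name to 'ok', 'mismatch' or 'missing'."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == checksum:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


if __name__ == "__main__":
    # Assumes the downloads are in the current working directory.
    for name, status in verify_downloads(".").items():
        print(f"{status:8} {name}")
```

MD5 is adequate here for detecting corrupted or truncated downloads; it is not a safeguard against deliberate tampering.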