
 
dc.contributor.author Ljubešić, Nikola
dc.contributor.author Erjavec, Tomaž
dc.contributor.author Batanović, Vuk
dc.contributor.author Miličević, Maja
dc.contributor.author Samardžić, Tanja
dc.date.accessioned 2023-04-07T15:31:38Z
dc.date.available 2023-04-07T15:31:38Z
dc.date.issued 2023-04-07
dc.identifier.uri http://hdl.handle.net/11356/1793
dc.description ReLDI-NormTagNER-hr 3.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Croatian. Each tweet also carries automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). This version corrects various annotation errors and encodes the dataset in the CoNLL-U-Plus format, in line with other manually annotated linguistic datasets for Croatian and Serbian. The continuous improvement of this dataset is led by the CLASSLA knowledge centre for South Slavic languages (https://www.clarin.si/info/k-centre/) and the ReLDI Centre Belgrade (https://reldi.spur.uzh.ch).
dc.language.iso hrv
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby http://dx.doi.org/10.4312/slo2.0.2016.2.156-188
dc.relation.replaces http://hdl.handle.net/11356/1241
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://github.com/reldi-data/reldi-normtagner-hr
dc.subject computer-mediated communication
dc.subject tokenisation
dc.subject word normalisation
dc.subject part-of-speech tagging
dc.subject lemmatisation
dc.subject named entities
dc.subject manual annotation
dc.subject TEI
dc.title Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds
size.info 3871 texts
size.info 7939 sentences
size.info 89855 tokens
files.count 4
files.size 8956597
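
The description above states that the dataset is encoded in the CoNLL-U-Plus format. As a rough illustration of how such a file might be read, the Python sketch below parses the "# global.columns" header that the CoNLL-U-Plus convention places on the first line and maps each tab-separated token line onto those column names. The function name read_conllup, the column names ID, FORM and NORM, and the sample tokens are assumptions made up for this illustration; the record does not list the actual columns of reldi-normtagner-hr.conllup, so check the file's header before relying on any field name.

```python
import io


def read_conllup(stream):
    """Yield (comments, tokens) pairs, one per sentence, from a CoNLL-U-Plus stream.

    Each token is a dict keyed by the names found in the '# global.columns' header.
    """
    columns = None
    comments, tokens = [], []
    for raw in stream:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns ="):
            # Header line: column names are whitespace-separated after the '='.
            columns = line.split("=", 1)[1].split()
        elif not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                yield comments, tokens
            comments, tokens = [], []
        elif line.startswith("#"):
            # Sentence-level comment such as '# text = ...'.
            comments.append(line[1:].strip())
        else:
            if columns is None:
                raise ValueError("missing '# global.columns' header")
            tokens.append(dict(zip(columns, line.split("\t"))))
    if tokens:
        # Final sentence when the stream does not end with a blank line.
        yield comments, tokens


# Invented sample, NOT taken from the corpus; column names are hypothetical.
SAMPLE = (
    "# global.columns = ID FORM NORM\n"
    "# text = nez jel\n"
    "1\tnez\tne znam\n"
    "2\tjel\tje li\n"
    "\n"
)

for sentence_comments, sentence_tokens in read_conllup(io.StringIO(SAMPLE)):
    print(sentence_comments, [token["FORM"] for token in sentence_tokens])
```

For the real file, the same function could be given an open text handle, for example open("reldi-normtagner-hr.conllup", encoding="utf-8"), assuming the file follows the header convention described here.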


 Files in this item

This item is publicly available and licensed under Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
Name: reldi-normtagner-hr.conllup
Size: 7.47 MB
Format: Unknown
Description: CoNLL-U-Plus dataset
MD5: 16e73df1872f953b9e8be7bb6e48671c

Name: reldi-normtagner-hr-train.conllu.gz
Size: 875.55 KB
Format: application/gzip
Description: CoNLL-U morphosyntax training dataset
MD5: 8461e7854d7443a3e046679f575b6fe1

Name: reldi-normtagner-hr-dev.conllu.gz
Size: 109.12 KB
Format: application/gzip
Description: CoNLL-U morphosyntax development dataset
MD5: 87220592f73b929a9b5909ea97be6250

Name: reldi-normtagner-hr-test.conllu.gz
Size: 110.76 KB
Format: application/gzip
Description: CoNLL-U morphosyntax test dataset
MD5: 6df74d4b761900f7369072983dff3f2e
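
Each file in the listing above comes with an MD5 checksum, which can be used to confirm that a download is intact. The following Python sketch shows one way to do that with the standard hashlib module. The helper names md5_of_file and verify_downloads, and the assumption that the files sit in a single local directory under their original names, are illustrative and not part of the record.

```python
import hashlib
import os

# File names and checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "reldi-normtagner-hr.conllup": "16e73df1872f953b9e8be7bb6e48671c",
    "reldi-normtagner-hr-train.conllu.gz": "8461e7854d7443a3e046679f575b6fe1",
    "reldi-normtagner-hr-dev.conllu.gz": "87220592f73b929a9b5909ea97be6250",
    "reldi-normtagner-hr-test.conllu.gz": "6df74d4b761900f7369072983dff3f2e",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory):
    """Map each expected file name to True (match), False (mismatch) or None (file missing)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = os.path.join(directory, name)
        if os.path.exists(path):
            results[name] = md5_of_file(path) == expected
        else:
            results[name] = None
    return results


if __name__ == "__main__":
    # Hypothetical usage: check the files in the current working directory.
    for name, status in verify_downloads(".").items():
        label = {True: "OK", False: "MISMATCH", None: "missing"}[status]
        print(f"{name}: {label}")
```

A mismatch usually means the download was truncated or the file was modified after retrieval, in which case fetching it again from the handle in dc.identifier.uri is the simplest remedy.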
