
 
dc.contributor.author Ljubešić, Nikola
dc.contributor.author Erjavec, Tomaž
dc.contributor.author Batanović, Vuk
dc.contributor.author Miličević, Maja
dc.contributor.author Samardžić, Tanja
dc.date.accessioned 2023-04-07T15:38:52Z
dc.date.available 2023-04-07T15:38:52Z
dc.date.issued 2023-04-07
dc.identifier.uri http://hdl.handle.net/11356/1794
dc.description ReLDI-NormTagNER-sr 3.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Serbian. Each tweet also carries automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). This version corrects various annotation mistakes and is now encoded in the CoNLL-U-Plus format, like the other linguistic training datasets for Croatian and Serbian. The continuous improvement of this dataset is led by the CLASSLA knowledge centre for South Slavic languages (https://www.clarin.si/info/k-centre/) and the ReLDI Centre Belgrade (https://reldi.spur.uzh.ch).
dc.language.iso srp
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby http://dx.doi.org/10.4312/slo2.0.2016.2.156-188
dc.relation.replaces http://hdl.handle.net/11356/1240
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://github.com/reldi-data/reldi-normtagner-sr
dc.subject computer-mediated communication
dc.subject tokenisation
dc.subject word normalisation
dc.subject part-of-speech tagging
dc.subject lemmatisation
dc.subject named entities
dc.subject manual annotation
dc.subject TEI
dc.title Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds
size.info 3748 texts
size.info 6899 sentences
size.info 92271 tokens
files.count 4
files.size 9233057
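
The description states that the corpus is encoded in CoNLL-U-Plus. The following is a minimal reading sketch, assuming the file follows the general CoNLL-U Plus convention: a leading `# global.columns = ...` comment names the columns, token lines are tab-separated, and a blank line ends each sentence. The function name `read_conllup` and the column names and tokens in `SAMPLE` are invented for illustration; they are not taken from this corpus, whose actual columns should be read from the file's own header.

```python
import io
from typing import Dict, Iterator, List, TextIO


def read_conllup(stream: TextIO) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of {column name: value} dicts."""
    columns: List[str] = []
    sentence: List[Dict[str, str]] = []
    for raw in stream:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns"):
            # The header comment names the columns, separated by whitespace.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            # Other comment lines (sentence or document metadata) are skipped here.
            continue
        elif not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        else:
            if not columns:
                raise ValueError("token line found before '# global.columns'")
            sentence.append(dict(zip(columns, line.split("\t"))))
    if sentence:
        # The last sentence may lack a trailing blank line.
        yield sentence


# Invented sample with hypothetical column names, NOT taken from the corpus.
SAMPLE = (
    "# global.columns = ID FORM NORM\n"
    "# sent_id = example.1\n"
    "1\tnzm\tne znam\n"
    "2\t.\t.\n"
    "\n"
    "1\tok\tokej\n"
)

sentences = list(read_conllup(io.StringIO(SAMPLE)))
```

For the real file, the same function could be called on `open("reldi-normtagner-sr.conllup", encoding="utf-8")`, assuming the file has been downloaded to the working directory.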


 Files in this item

 Total size of all files in item: 8.81 MB
This item is publicly available and licensed under Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
Name reldi-normtagner-sr.conllup
Size 7.75 MB
Format Unknown
Description CoNLL-U-Plus dataset
MD5 c6f3799e38de1209e6bd6b59e888ec11

Name reldi-normtagner-sr-train.conllu.gz
Size 868.31 KB
Format application/gzip
Description CoNLL-U morphosyntax training dataset
MD5 eb8e4339d01d2c301c1a038e59fe74c4

Name reldi-normtagner-sr-dev.conllu.gz
Size 109.69 KB
Format application/gzip
Description CoNLL-U morphosyntax development dataset
MD5 a2a03ada14d4a914c70e497e169ea28b

Name reldi-normtagner-sr-test.conllu.gz
Size 107.73 KB
Format application/gzip
Description CoNLL-U morphosyntax test dataset
MD5 a8010fe58b49afd411a9b53c8fe03346
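
The MD5 checksums listed for the four files can be used to check that a download is intact. The following is a minimal sketch using Python's standard `hashlib`, assuming the files were saved under their listed names in a single directory. The helper names `md5_of` and `verify` are illustrative and not part of any repository tooling.

```python
import hashlib
from pathlib import Path
from typing import Dict, Union

# File names and MD5 values copied from the file listing of this item.
EXPECTED_MD5 = {
    "reldi-normtagner-sr.conllup": "c6f3799e38de1209e6bd6b59e888ec11",
    "reldi-normtagner-sr-train.conllu.gz": "eb8e4339d01d2c301c1a038e59fe74c4",
    "reldi-normtagner-sr-dev.conllu.gz": "a2a03ada14d4a914c70e497e169ea28b",
    "reldi-normtagner-sr-test.conllu.gz": "a8010fe58b49afd411a9b53c8fe03346",
}


def md5_of(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory: Union[str, Path]) -> Dict[str, bool]:
    """Map each listed file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of(path) == expected
    return results
```

Calling `verify(".")` in the download directory returns one boolean per file; a `False` value means the file is missing or its checksum differs from the listed one.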
