dc.contributor.author Batanović, Vuk
dc.contributor.author Ljubešić, Nikola
dc.contributor.author Samardžić, Tanja
dc.contributor.author Erjavec, Tomaž
dc.date.accessioned 2023-07-22T14:28:21Z
dc.date.available 2023-07-22T14:28:21Z
dc.date.issued 2023-06-13
dc.identifier.uri http://hdl.handle.net/11356/1843
dc.description The SETimes.SR training corpus contains around 100,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, and named entities. The annotation formalisms followed in the SETimes.SR corpus are (1) the MULTEXT-East V6 morphosyntactic specifications for the Serbo-Croatian macro-language, http://nl.ijs.si/ME/V6/msd/, (2) the UDv2 Guidelines, http://universaldependencies.org/guidelines.html, and (3) the Janes annotation guidelines for named entities, http://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf. The differences from the previous version of the corpus are (1) the extension of the corpus with 502 sentences from various news sources and (2) improvements in the annotations of the corpus. The continuous improvement of this dataset is led by the CLASSLA knowledge centre for South Slavic languages (https://www.clarin.si/info/k-centre/) and the ReLDI Centre Belgrade (https://reldi.spur.uzh.ch).
dc.language.iso srp
dc.publisher Regional Linguistic Data Initiative Centre ReLDI
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby http://www.aclweb.org/anthology/W17-1407
dc.relation.replaces http://hdl.handle.net/11356/1200
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://github.com/reldi-data/SETimes.SRPlus
dc.subject part-of-speech tagging
dc.subject dependency treebank
dc.subject parsing
dc.subject named entities
dc.subject tokenisation
dc.subject manual annotation
dc.subject TEI
dc.title Serbian linguistic training corpus SETimes.SR 2.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor Swiss National Science Foundation 160501 ReLDI Other
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info 176 texts
size.info 4384 sentences
size.info 97673 tokens
files.count 4
files.size 9861171
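
The size.info figures above can be sanity-checked against a downloaded file. The following is a minimal sketch, not the procedure used to produce this record: it assumes the standard CoNLL-U layout (comment lines starting with "#", blank lines between sentences, tab-separated token lines), and the function names count_conllu and count_conllu_gz are hypothetical. Whether the record's counts include or exclude multiword-token ranges is also an assumption; the sketch skips them.

```python
import gzip


def count_conllu(lines):
    """Count sentences and tokens in an iterable of CoNLL-U lines.

    Assumption: range IDs such as "1-2" (multiword tokens) and decimal
    IDs such as "1.1" (empty nodes) are not counted as tokens.
    """
    sentences = 0
    tokens = 0
    in_sentence = False
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if in_sentence:
                sentences += 1
                in_sentence = False
            continue
        if line.startswith("#"):
            # Comment or metadata line (e.g. sent_id, global.columns).
            continue
        token_id = line.split("\t", 1)[0]
        if "-" in token_id or "." in token_id:
            continue
        tokens += 1
        in_sentence = True
    if in_sentence:
        # The file may end without a trailing blank line.
        sentences += 1
    return sentences, tokens


def count_conllu_gz(path):
    """Count sentences and tokens in a gzip-compressed CoNLL-U file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_conllu(handle)
```

For example, calling count_conllu_gz on each of the train, dev, and test files and summing the results gives totals that can be compared with the sentence and token counts listed in size.info.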


 Files in this item

This item is publicly available and licensed under Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
Name                         Size       Format            Description                  MD5
set.sr.plus.conllup          8.21 MB    Unknown           CoNLL-U-Plus dataset         5e5f3c9583418bbff1cbfc5e344bb21e
set.sr.plus-train.conllu.gz  924.97 KB  application/gzip  CoNLL-U training dataset     39bfa2630c045e4d94aa7ee33e71a9d4
set.sr.plus-dev.conllu.gz    150.68 KB  application/gzip  CoNLL-U development dataset  f890061031a2a51e58e946ed1eefbf59
set.sr.plus-test.conllu.gz   142.91 KB  application/gzip  CoNLL-U test dataset         96b52b7b589e6b765b5e13f1c885abd8
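
The MD5 values listed for the four files can be used to check that a download arrived intact. The following is a minimal sketch using Python's standard hashlib module; the helper names md5_of_file and verify_download are hypothetical, and the sketch assumes the files were saved under the names shown in this record.

```python
import hashlib

# Checksums copied from the file listing in this record.
EXPECTED_MD5 = {
    "set.sr.plus.conllup": "5e5f3c9583418bbff1cbfc5e344bb21e",
    "set.sr.plus-train.conllu.gz": "39bfa2630c045e4d94aa7ee33e71a9d4",
    "set.sr.plus-dev.conllu.gz": "f890061031a2a51e58e946ed1eefbf59",
    "set.sr.plus-test.conllu.gz": "96b52b7b589e6b765b5e13f1c885abd8",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, name):
    """Return True if the file at `path` matches the checksum listed for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, verify_download("set.sr.plus-dev.conllu.gz", "set.sr.plus-dev.conllu.gz") returns True only if the local copy matches the checksum published here. MD5 is suitable for detecting transfer corruption, not for guarding against deliberate tampering.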
