dc.contributor.author | Krek, Simon |
dc.contributor.author | Erjavec, Tomaž |
dc.contributor.author | Dobrovoljc, Kaja |
dc.contributor.author | Može, Sara |
dc.contributor.author | Ledinek, Nina |
dc.contributor.author | Holz, Nanika |
dc.date.accessioned | 2015-05-17T19:14:37Z |
dc.date.available | 2015-05-17T19:14:37Z |
dc.date.issued | 2013-09-30 |
dc.identifier.uri | http://hdl.handle.net/11356/1029 |
dc.description | The ssj500k training corpus is based on two training corpora built within the JOS project (https://nl.ijs.si/jos/). It contains the jos100k corpus and additional material from the jos1M corpus, forming a training corpus of 500,000 words that has been manually checked and annotated on the levels of tokenization, segmentation, morphosyntactic tagging, syntactic dependency parsing and named entities. The ssj500k corpus uses the JOS morphosyntactic tagset with 1,902 tags and dependencies with 10 labels. The part of the corpus annotated with dependency relations contains 11,411 sentences; named entities are annotated in the original jos100k part of the corpus. |
dc.language.iso | slv |
dc.publisher | Centre for Language Resources and Technologies, University of Ljubljana |
dc.relation.isreplacedby | http://hdl.handle.net/11356/1052 |
dc.rights | Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-nc-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | http://eng.slovenscina.eu/tehnologije/ucni-korpus |
dc.subject | tagging |
dc.subject | dependency treebank |
dc.subject | parsing |
dc.subject | named entities |
dc.subject | tokenisation |
dc.subject | manual annotation |
dc.subject | TEI |
dc.title | Training corpus ssj500k 1.3 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
hidden | hidden |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Simon Krek, simon.krek@guest.arnes.si, Jožef Stefan Institute |
sponsor | Ministry of Education, Science and Sport; 3311-08-986003; Communication in Slovene; Other |
size.info | 500295 words |
size.info | 586248 tokens |
files.count | 3 |
files.size | 18564568 |
This item is Publicly Available and licensed under: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

Files in this item (17.7 MB in total)

Name | Size | Format | Description | MD5 |
ssj500kv1_3.zip | 7.81 MB | application/zip | Corpus encoded in TEI-like format with annotations in Slovenian | 9a29d9b0f97f521249c5cf6f0990426d |
ssj500kv1_3-en.tei.zip | 7.81 MB | application/zip | Corpus encoded in TEI-like format with annotations in English | b38204a484d114a2499581c3cce3a3e1 |
ssj500kv1_3-sl.conll.zip | 2.08 MB | application/zip | Corpus encoded in CoNLL-X format | ae86a11faa2c51a844e47479838c3d25 |
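
The MD5 values listed for the three archives can be used to check that a download arrived intact. The following is a minimal sketch in Python, assuming the archives were saved under their listed names in a single directory; the function names `md5_of_file` and `verify_downloads` are illustrative and not part of the repository.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file listing of this item.
EXPECTED_MD5 = {
    "ssj500kv1_3.zip": "9a29d9b0f97f521249c5cf6f0990426d",
    "ssj500kv1_3-en.tei.zip": "b38204a484d114a2499581c3cce3a3e1",
    "ssj500kv1_3-sl.conll.zip": "ae86a11faa2c51a844e47479838c3d25",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each listed file name to True (match), False (mismatch) or None (not found).

    Assumption: the archives keep their listed names and sit directly in `directory`.
    """
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = md5_of_file(path) == expected if path.is_file() else None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "missing"}
    for name, ok in verify_downloads().items():
        print(f"{name}: {labels[ok]}")
```

A mismatch usually points to an incomplete or corrupted download, in which case the archive should be fetched again before unpacking.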