
 
dc.contributor.author Kosem, Iztok
dc.contributor.author Pori, Eva
dc.contributor.author Žagar, Aleš
dc.contributor.author Arhar Holdt, Špela
dc.date.accessioned 2022-10-31T09:32:57Z
dc.date.available 2022-10-31T09:32:57Z
dc.date.issued 2022-10-13
dc.identifier.uri http://hdl.handle.net/11356/1693
dc.description ccUčbeniki includes 32 openly available textbooks for Slovenian primary and secondary education, published by the Slovenian National Education Institute in 2014-2015. The textbooks, prepared by various authors, cover different subjects, as documented in the ccucbeniki-metadata file. The corpus was linguistically annotated with the CLASSLA v1.1.1 pipeline (https://github.com/clarinsi/classla/) at the levels of tokenization, sentence segmentation, lemmatization, MULTEXT-East v6 MSD tags (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html), JOS dependency syntax (https://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf), and named entities (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). The aim is to provide comparably annotated, pedagogically relevant corpora that can be used for different tasks in language didactics and NLP. The corpus is available in CoNLL-U and vertical formats. The CoNLL-U format contains one document per file (with the text metadata in a separate TSV file), while the vertical format contains all documents concatenated in one large file. The registry file ccucbeniki.regi for the vertical format is compatible with the LIST 1.2 corpus extraction tool (http://hdl.handle.net/11356/1276).
dc.language.iso slv
dc.publisher Centre for Language Resources and Technologies, University of Ljubljana
dc.rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights.label PUB
dc.source.uri https://www.cjvt.si/prop/
dc.subject textbook corpus
dc.subject student reading
dc.subject language didactics
dc.title Corpus of Slovenian textbooks ccUčbeniki 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Špela Arhar Holdt Spela.ArharHoldt@ff.uni-lj.si Centre for Language Resources and Technologies, University of Ljubljana
sponsor ARRS J7-3159 Empirical foundations for digitally-supported development of writing skills nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor University of Ljubljana I0-0022 Network of research and infrastructural centres nationalFunds
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info 32 texts
size.info 2181602 tokens
files.count 2
files.size 47062549
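
The description above states that the CoNLL-U release holds one document per file. As a rough illustration of how such a file could be read, the following is a minimal sketch that assumes the standard ten-column, tab-separated CoNLL-U layout. The function names (read_conllu, lemma_frequencies) and the two sample sentences are hypothetical and invented for illustration; they are not taken from the corpus, and annotation columns other than the form and lemma are left as "_" placeholders.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List

# Column names of the standard CoNLL-U layout (ten tab-separated fields).
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines (e.g. sent_id, text) are skipped in this sketch.
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            continue  # ignore malformed lines rather than fail
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges and empty nodes
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:
        yield sentence


def lemma_frequencies(lines: Iterable[str]) -> Counter:
    """Count how often each lemma occurs in a CoNLL-U stream."""
    counts: Counter = Counter()
    for sentence in read_conllu(lines):
        for token in sentence:
            counts[token["lemma"]] += 1
    return counts


def _row(idx: int, form: str, lemma: str) -> str:
    # Hypothetical helper: build a row with placeholder annotation columns.
    return "\t".join([str(idx), form, lemma] + ["_"] * 7)


# Invented sample, NOT from ccUčbeniki.
SAMPLE = "\n".join([
    "# sent_id = demo.1",
    _row(1, "To", "ta"), _row(2, "je", "biti"),
    _row(3, "knjiga", "knjiga"), _row(4, ".", "."),
    "",
    "# sent_id = demo.2",
    _row(1, "Knjiga", "knjiga"), _row(2, "je", "biti"),
    _row(3, "nova", "nov"), _row(4, ".", "."),
])

if __name__ == "__main__":
    print(lemma_frequencies(SAMPLE.splitlines()).most_common(3))
```

For a real document, the same functions could be applied to an open file handle, for example one of the extracted ucb*.conllu files opened with encoding="utf-8"; the exact path inside the archive is an assumption and should be checked after unpacking.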


 Files in this item

 Download all files in item (44.88 MB)
Name ccucbeniki.conllu.zip
Size 21.6 MB
Format application/zip
Description Corpus in CoNLL-U format with TSV metadata
MD5 d3eda8559a3d584074ec62c584eb6aa0
 File Preview  
  • ccucbeniki.conllu
    • ucb6.conllu (4 MB)
    • ucb1.conllu (4 MB)
    • ucb32.conllu (2 MB)
    • ucb30.conllu (3 MB)
    • ucb26.conllu (19 MB)
    • ucb21.conllu (1 MB)
    • ucb29.conllu (2 MB)
    • ucb24.conllu (5 MB)
    • ucb15.conllu (2 MB)
    • ucb10.conllu (5 MB)
    • ucb18.conllu (13 MB)
    • ucb13.conllu (3 MB)
    • ucb9.conllu (2 MB)
    • ucb4.conllu (11 MB)
    • ucb7.conllu (3 MB)
    • ucb2.conllu (3 MB)
    • ucb5.conllu (2 MB)
    • ucb31.conllu (4 MB)
    • ucb27.conllu (3 MB)
    • ucb22.conllu (3 MB)
    • ucb25.conllu (4 MB)
    • ucb20.conllu (3 MB)
    • ucb16.conllu (10 MB)
    • ucb11.conllu (4 MB)
    • ucb28.conllu (5 MB)
    • ucb19.conllu (5 MB)
    • ucb14.conllu (2 MB)
    • ucb23.conllu (5 MB)
    • ucb17.conllu (4 MB)
    • ucb12.conllu (2 MB)
    • ucb8.conllu (2 MB)
    • ucb3.conllu (8 MB)
    • ccucbeniki-metadata.tsv (5 kB)
Name ccucbeniki.vert.zip
Size 23.28 MB
Format application/zip
Description Corpus in vertical format with LIST-type registry file
MD5 bd37fd936ad8cd093eadd8e2c9be372d
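
The MD5 values listed for the two archives can be used to check a download for corruption. The following is a minimal sketch using Python's standard hashlib module; the helper names (md5_of_file, verify_download) are hypothetical, and the sketch assumes the archives were saved under the file names shown in this record.

```python
import hashlib

# Checksums as published in this record.
EXPECTED_MD5 = {
    "ccucbeniki.conllu.zip": "d3eda8559a3d584074ec62c584eb6aa0",
    "ccucbeniki.vert.zip": "bd37fd936ad8cd093eadd8e2c9be372d",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Reading in chunks keeps memory use flat for large archives.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_md5: str) -> bool:
    """Compare a file's MD5 digest with the expected value (case-insensitive)."""
    return md5_of_file(path) == expected_md5.lower()


if __name__ == "__main__":
    import os

    for name, expected in EXPECTED_MD5.items():
        if os.path.exists(name):
            status = "OK" if verify_download(name, expected) else "MISMATCH"
            print(f"{name}: {status}")
        else:
            print(f"{name}: not found in the current directory")
```

MD5 is adequate here for detecting accidental corruption during transfer; it is not a safeguard against deliberate tampering.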
