Corpus of Slovenian textbooks ccUčbeniki 1.0

Name: Corpus of Slovenian textbooks ccUčbeniki 1.0
License: https://creativecommons.org/licenses/by-nc-sa/4.0/

Kosem, Iztok; Pori, Eva; Žagar, Aleš; Arhar Holdt, Špela

dc.contributor.author	Kosem, Iztok
dc.contributor.author	Pori, Eva
dc.contributor.author	Žagar, Aleš
dc.contributor.author	Arhar Holdt, Špela
dc.date.accessioned	2022-10-31T09:32:57Z
dc.date.available	2022-10-31T09:32:57Z
dc.date.issued	2022-10-13
dc.identifier.uri	http://hdl.handle.net/11356/1693
dc.description	ccUčbeniki includes 32 openly available textbooks for Slovenian primary and secondary education, published by the Slovenian National Education Institute in 2014-2015. The textbooks, prepared by various authors, cover different subjects, as documented in the ccucbeniki-metadata file. The corpus was linguistically annotated with the CLASSLA v1.1.1 pipeline (https://github.com/clarinsi/classla/) at the levels of tokenization, sentence segmentation, lemmatization, MULTEXT-East v6 MSD tags (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html), JOS dependency syntax (https://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf), and named entities (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). The aim is to provide comparably annotated, pedagogically relevant corpora that can be used for different tasks in language didactics and NLP. The corpus is available in CoNLL-U and vertical formats. The CoNLL-U format contains one document per file, with text metadata provided separately as a TSV file; the vertical format contains all documents concatenated in one large file. The registry file ccucbeniki.regi for the vertical format is compatible with the LIST 1.2 corpus extraction tool (http://hdl.handle.net/11356/1276).
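The per-document CoNLL-U files described above can be read with the Python standard library alone. The following is a minimal sketch, not part of the corpus distribution: the function name `read_conllu` is hypothetical, the ten column names follow the general CoNLL-U specification, and the sample sentence is invented for illustration rather than taken from the corpus. Which annotation layer ends up in which column of the actual files should be checked against the files themselves.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List

# The ten tab-separated columns defined by the CoNLL-U specification.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) are skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if "-" in cols[0] or "." in cols[0]:
            # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            continue
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:
        # Handle a final sentence that is not followed by a blank line.
        yield sentence


# Invented sample for illustration only; unknown annotations are left as "_".
SAMPLE = (
    "# sent_id = example.1\n"
    "# text = To je knjiga.\n"
    "1\tTo\tta\t_\t_\t_\t_\t_\t_\t_\n"
    "2\tje\tbiti\t_\t_\t_\t_\t_\t_\t_\n"
    "3\tknjiga\tknjiga\t_\t_\t_\t_\t_\t_\t_\n"
    "4\t.\t.\t_\t_\t_\t_\t_\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
lemma_counts = Counter(tok["lemma"] for sent in sentences for tok in sent)
```

To process a real document, the same function can be given an open file handle, for example `read_conllu(open("ucb1.conllu", encoding="utf-8"))`, assuming the archive has been unpacked into the working directory.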
dc.language.iso	slv
dc.publisher	Centre for Language Resources and Technologies, University of Ljubljana
dc.rights	Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights.label	PUB
dc.source.uri	https://www.cjvt.si/prop/
dc.subject	textbook corpus
dc.subject	student reading
dc.subject	language didactics
dc.title	Corpus of Slovenian textbooks ccUčbeniki 1.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Špela Arhar Holdt Spela.ArharHoldt@ff.uni-lj.si Centre for Language Resources and Technologies, University of Ljubljana
sponsor	ARRS J7-3159 Empirical foundations for digitally-supported development of writing skills nationalFunds
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	University of Ljubljana I0-0022 Network of research and infrastructural centres nationalFunds
sponsor	Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info	32 texts
size.info	2181602 tokens
files.count	2
files.size	47062549

Files in this item

This item is publicly available and licensed under: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

Name: ccucbeniki.conllu.zip
Size: 21.6 MB
Format: application/zip
Description: Corpus in CoNLL-U format with TSV metadata
MD5: d3eda8559a3d584074ec62c584eb6aa0

ccucbeniki.conllu
- ucb6.conllu (4 MB)
- ucb1.conllu (4 MB)
- ucb32.conllu (2 MB)
- ucb30.conllu (3 MB)
- ucb26.conllu (19 MB)
- ucb21.conllu (1 MB)
- ucb29.conllu (2 MB)
- ucb24.conllu (5 MB)
- ucb15.conllu (2 MB)
- ucb10.conllu (5 MB)
- ucb18.conllu (13 MB)
- ucb13.conllu (3 MB)
- ucb9.conllu (2 MB)
- ucb4.conllu (11 MB)
- ucb7.conllu (3 MB)
- ucb2.conllu (3 MB)
- ucb5.conllu (2 MB)
- ucb31.conllu (4 MB)
- ucb27.conllu (3 MB)
- ucb22.conllu (3 MB)
- ucb25.conllu (4 MB)
- ucb20.conllu (3 MB)
- ucb16.conllu (10 MB)
- ucb11.conllu (4 MB)
- ucb28.conllu (5 MB)
- ucb19.conllu (5 MB)
- ucb14.conllu (2 MB)
- ucb23.conllu (5 MB)
- ucb17.conllu (4 MB)
- ucb12.conllu (2 MB)
- ucb8.conllu (2 MB)
- ucb3.conllu (8 MB)
- ccucbeniki-metadata.tsv (5 kB)

Name: ccucbeniki.vert.zip
Size: 23.28 MB
Format: application/zip
Description: Corpus in vertical format with LIST-type registry file
MD5: bd37fd936ad8cd093eadd8e2c9be372d
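
The MD5 values listed for the two archives can be used to check that a download is complete and uncorrupted. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `verify_download` are hypothetical, and the code assumes the archives keep the file names shown in this record.

```python
import hashlib
from pathlib import Path

# Checksums copied from this record, keyed by archive file name.
EXPECTED_MD5 = {
    "ccucbeniki.conllu.zip": "d3eda8559a3d584074ec62c584eb6aa0",
    "ccucbeniki.vert.zip": "bd37fd936ad8cd093eadd8e2c9be372d",
}


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path) -> bool:
    """Compare a downloaded archive against the checksum listed above."""
    path = Path(path)
    expected = EXPECTED_MD5.get(path.name)
    if expected is None:
        raise KeyError(f"no checksum recorded for {path.name}")
    return md5_of_file(path) == expected
```

A call such as `verify_download("ccucbeniki.vert.zip")` returns `True` only when the local file matches the listed checksum; a `False` result suggests the download should be repeated.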

ccucbeniki.vert
- ccucbeniki.vert (186 MB)
- README.txt (1 kB)
- ccucbeniki.regi (135 B)
