| dc.contributor.author | Ljubešić, Nikola |
| dc.contributor.author | Markoski, Filip |
| dc.contributor.author | Markoska, Elena |
| dc.contributor.author | Erjavec, Tomaž |
| dc.date.accessioned | 2021-05-12T13:50:22Z |
| dc.date.available | 2021-05-12T13:50:22Z |
| dc.date.issued | 2021-05-05 |
| dc.identifier.uri | http://hdl.handle.net/11356/1427 |
| dc.description | This comparable corpus collection consists of Wikipedia dumps of the Bosnian, Croatian, Macedonian, Montenegrin, Serbian, Serbo-Croatian and Slovenian Wikipedia, harvested on October 17th 2020. The text was extracted from the dumps with the process documented at https://github.com/clarinsi/classla-wikipedia. Linguistic annotation was performed with the classla package (https://pypi.org/project/classla/) on all levels available for each language; the Bosnian and Serbo-Croatian Wikipedias were processed with the standard Croatian models. |
| dc.language.iso | bos |
| dc.language.iso | hrv |
| dc.language.iso | mkd |
| dc.language.iso | cnr |
| dc.language.iso | srp |
| dc.language.iso | hbs |
| dc.language.iso | slv |
| dc.publisher | Jožef Stefan Institute |
| dc.relation.isreferencedby | https://aclanthology.org/2021.ranlp-1.104.pdf |
| dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
| dc.rights.label | PUB |
| dc.source.uri | https://github.com/clarinsi/classla-wikipedia |
| dc.subject | comparable corpus |
| dc.subject | Wikipedia |
| dc.title | Comparable corpora of South-Slavic Wikipedias CLASSLA-Wikipedia 1.0 |
| dc.type | corpus |
| metashare.ResourceInfo#ContentInfo.mediaType | text |
| has.files | yes |
| branding | CLARIN.SI data & tools |
| contact.person | Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute |
| sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds |
| sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
| sponsor | ARRS (Slovenian Research Agency) N6-0099 LiLaH: Linguistic Landscape of Hate Speech nationalFunds |
| sponsor | European Union’s Rights, Equality and Citizenship Programme 875263 IMSyPP - Innovative Monitoring Systems and Prevention Policies of Online Hate Speech Other |
| size.info | 1928450 articles |
| size.info | 37677016 sentences |
| size.info | 486258862 tokens |
| files.count | 7 |
| files.size | 5409270773 |
| featuredService.kontext | Bulgarian Wikipedia|https://www.clarin.si/kontext/first_form?corpname=classlawiki_bg |
| featuredService.kontext | Bosnian Wikipedia|https://www.clarin.si/kontext/first_form?corpname=classlawiki_bs |
| featuredService.kontext | Croatian Wikipedia|https://www.clarin.si/kontext/first_form?corpname=classlawiki_hr |
| featuredService.kontext | Macedonian Wikipedia|https://www.clarin.si/kontext/first_form?corpname=classlawiki_mk |
| featuredService.kontext | Serbo-Croatian Wikipedia|https://www.clarin.si/kontext/first_form?corpname=classlawiki_sh |
| featuredService.kontext | Slovenian Wikipedia|https://www.clarin.si/kontext/first_form?corpname=classlawiki_sl |
| featuredService.kontext | Serbian Wikipedia|https://www.clarin.si/kontext/first_form?corpname=classlawiki_sr |
| featuredService.noske | Bulgarian Wikipedia|https://www.clarin.si/ske/#dashboard?corpname=classlawiki_bg |
| featuredService.noske | Bosnian Wikipedia|https://www.clarin.si/ske/#dashboard?corpname=classlawiki_bs |
| featuredService.noske | Croatian Wikipedia|https://www.clarin.si/ske/#dashboard?corpname=classlawiki_hr |
| featuredService.noske | Macedonian Wikipedia|https://www.clarin.si/ske/#dashboard?corpname=classlawiki_mk |
| featuredService.noske | Serbo-Croatian Wikipedia|https://www.clarin.si/ske/#dashboard?corpname=classlawiki_sh |
| featuredService.noske | Slovenian Wikipedia|https://www.clarin.si/ske/#dashboard?corpname=classlawiki_sl |
| featuredService.noske | Serbian Wikipedia|https://www.clarin.si/ske/#dashboard?corpname=classlawiki_sr |
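The size.info rows above report articles, sentences and tokens for the whole collection. As a rough illustration of how such counts could be taken from one of the gzipped CoNLL-U files, here is a minimal Python sketch. It rests on assumptions that are not stated in this record: that each article starts with a `# newdoc` comment line (the general CoNLL-U convention), that sentences are separated by blank lines, and that only lines with a plain integer ID count as tokens. The function names `count_conllu` and `count_conllu_gz` are hypothetical, and the counting rules may differ from those used to produce the figures above.

```python
import gzip
import io


def count_conllu(stream):
    """Count (documents, sentences, tokens) in a CoNLL-U text stream.

    Assumptions (not confirmed by the record): documents are marked by
    '# newdoc' comments, sentences end at a blank line, and only lines
    whose ID column is a plain integer are counted as tokens.
    """
    docs = sents = tokens = 0
    in_sentence = False
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence, if any.
            if in_sentence:
                sents += 1
                in_sentence = False
            continue
        if line.startswith("#"):
            if line.startswith("# newdoc"):
                docs += 1
            continue
        in_sentence = True
        token_id = line.split("\t", 1)[0]
        # Skip multiword ranges such as "1-2" and empty nodes such as "1.1".
        if token_id.isdigit():
            tokens += 1
    if in_sentence:
        # The file ended without a trailing blank line.
        sents += 1
    return docs, sents, tokens


def count_conllu_gz(path):
    """Stream a gzipped CoNLL-U file without decompressing it to disk."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_conllu(handle)


if __name__ == "__main__":
    # Tiny in-memory example; real files would go through count_conllu_gz.
    demo = "# newdoc id = demo\n# sent_id = 1\n1\tA\ta\n2\tB\tb\n\n"
    print(count_conllu(io.StringIO(demo)))
```

Reading through `gzip.open` in text mode keeps memory use flat, which matters because the compressed files in this item range from a few hundred megabytes to over a gigabyte.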
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
| Name | Size | Format | Description | MD5 |
|---|---|---|---|---|
| classlawiki-bg.conllu.gz | 1.09 GB | application/gzip | Bulgarian Wikipedia | ed4f11056abc4f265e2ba995961683e0 |
| classlawiki-bs.conllu.gz | 258.13 MB | application/gzip | Bosnian Wikipedia | 3f82c98d4ff85a1c41eade4ba194c585 |
| classlawiki-hr.conllu.gz | 745.75 MB | application/gzip | Croatian Wikipedia | a60256fccf9203845ab27565e0ee7362 |
| classlawiki-mk.conllu.gz | 422.35 MB | application/gzip | Macedonian Wikipedia | 76320c89294557ef8115911ed1cab18f |
| classlawiki-sh.conllu.gz | 753.29 MB | application/gzip | Serbo-Croatian Wikipedia | cb33f6f5f16ac36c97279b429d0c23e3 |
| classlawiki-sl.conllu.gz | 620.25 MB | application/gzip | Slovenian Wikipedia | 43b88ff0d5d6b9b11d6a49e933819243 |
| classlawiki-sr.conllu.gz | 1.22 GB | application/gzip | Serbian Wikipedia | 4229eb9ac932e95498ad9e33f0453c5a |
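The MD5 values listed for each file can be used to check that a download completed intact. The following is a minimal Python sketch of such a check, assuming the files were saved locally under the names shown in the listing. The helper names `md5_of_file` and `verify_download` and the `EXPECTED_MD5` table are illustrative and not part of any CLARIN.SI tooling; the checksums themselves are copied from the listing above.

```python
import hashlib
import os

# Checksums copied from the file listing of this item.
EXPECTED_MD5 = {
    "classlawiki-bg.conllu.gz": "ed4f11056abc4f265e2ba995961683e0",
    "classlawiki-bs.conllu.gz": "3f82c98d4ff85a1c41eade4ba194c585",
    "classlawiki-hr.conllu.gz": "a60256fccf9203845ab27565e0ee7362",
    "classlawiki-mk.conllu.gz": "76320c89294557ef8115911ed1cab18f",
    "classlawiki-sh.conllu.gz": "cb33f6f5f16ac36c97279b429d0c23e3",
    "classlawiki-sl.conllu.gz": "43b88ff0d5d6b9b11d6a49e933819243",
    "classlawiki-sr.conllu.gz": "4229eb9ac932e95498ad9e33f0453c5a",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Compare a local file against the checksum listed for its file name.

    Raises KeyError if the file name is not in the expected table.
    """
    name = os.path.basename(path)
    return md5_of_file(path) == expected[name].lower()
```

For example, `verify_download("classlawiki-bs.conllu.gz")` would return `True` only if the local copy matches the published checksum. Reading in chunks avoids loading a gigabyte-sized archive into memory at once.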