
 
dc.contributor.author Ljubešić, Nikola
dc.contributor.author Klubička, Filip
dc.date.accessioned 2016-05-12T15:32:55Z
dc.date.available 2016-05-12T15:32:55Z
dc.date.issued 2016-05-12
dc.identifier.uri http://hdl.handle.net/11356/1063
dc.description The Serbian web corpus srWaC was built by crawling the .rs top-level domain in 2014. The corpus was near-deduplicated on paragraph level, normalised via diacritic restoration, morphosyntactically annotated and lemmatised. The corpus is shuffled by paragraphs. Each paragraph contains metadata on the URL, domain and language identification (Serbian vs. Croatian). Version 1.0 of this corpus is described in http://www.aclweb.org/anthology/W14-0405. Version 1.1 contains newer and better linguistic annotations.
dc.language.iso srp
dc.publisher Jožef Stefan Institute
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri http://nlp.ffzg.hr/resources/corpora/srwac/
dc.subject web corpus
dc.subject lemmatisation
dc.title Serbian web corpus srWaC 1.1
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Nikola Ljubešić nljubesi@gmail.com Jožef Stefan Institute
sponsor Swiss National Science Foundation 160501 ReLDI Other
size.info 554627647 tokens
size.info 25636542 sentences
size.info 1353238 texts
files.count 6
files.size 3767312759
featuredService.kontext Search|https://www.clarin.si/kontext/first_form?corpname=srwac
featuredService.noske Search|https://www.clarin.si/ske/#dashboard?corpname=srwac
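
The description above mentions paragraph-level metadata, and the files are distributed in an XML (vertical) format. As a rough sketch of how such a file could be streamed, the Python below assumes the common vertical layout: one token per line with tab-separated annotation columns, and structural tags such as `<p ...>` and `<s>` on lines of their own. The tag names, attribute names and column values in the sample are illustrative assumptions, not taken from the corpus, so inspect a few lines of a real file before relying on them.

```python
import gzip
import io
from collections import Counter


def count_vertical(stream):
    """Count token lines and opening structural tags in a vertical-format text stream."""
    counts = Counter()
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("<") and line.endswith(">"):
            if line.startswith("</"):
                continue  # closing tag, already counted when it was opened
            parts = line[1:-1].split()
            if parts:
                # Keep only the tag name, dropping attributes and a trailing "/".
                counts["<" + parts[0].rstrip("/") + ">"] += 1
                continue
        # Anything else is treated as one token line (assumed layout).
        counts["tokens"] += 1
    return counts


def count_file(path):
    """Stream a gzip-compressed vertical file without unpacking it to disk."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_vertical(handle)


# Tiny hypothetical sample; the attributes and columns are made up for illustration.
sample = (
    '<p url="http://example.rs/" lang="sr">\n'
    "<s>\n"
    "Ovo\tX\tovaj\n"
    "je\tX\tbiti\n"
    "</s>\n"
    "</p>\n"
)
sample_counts = count_vertical(io.StringIO(sample))
```

Calling `count_file("srWaC1.1.01.xml.gz")` on a downloaded batch would produce the same kind of counter. Streaming line by line keeps memory use flat, which matters for archives of this size.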


 Files in this item

This item is publicly available and licensed under:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name
srWaC1.1.01.xml.gz
Size
651.1 MB
Format
application/gzip
Description
Batch of 100 million tokens in XML (vertical) format.
MD5
ef12a8606bd33a81cd7afb8f2d9299d9
Name
srWaC1.1.02.xml.gz
Size
648.98 MB
Format
application/gzip
Description
Batch of 100 million tokens in XML (vertical) format.
MD5
0aa1446dd5707f3c08025ccca8463844
Name
srWaC1.1.03.xml.gz
Size
647.66 MB
Format
application/gzip
Description
Batch of 100 million tokens in XML (vertical) format.
MD5
4b7ee8fd81db15e3e3bc904168dd9c84
Name
srWaC1.1.04.xml.gz
Size
646.57 MB
Format
application/gzip
Description
Batch of 100 million tokens in XML (vertical) format.
MD5
88b76975ab0c84add8b69dd2c8b939c2
Name
srWaC1.1.05.xml.gz
Size
645.71 MB
Format
application/gzip
Description
Batch of 100 million tokens in XML (vertical) format.
MD5
3e898abc9f9f86361c6920be564d197d
Name
srWaC1.1.06.xml.gz
Size
352.77 MB
Format
application/gzip
Description
Final batch with the remaining tokens (fewer than 100 million) in XML (vertical) format.
MD5
c2edac0f4f23aeae4366683516878b5c
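
Each file in the list comes with an MD5 digest, so a download can be checked before use. The following is a minimal sketch of such a check, assuming the six archives were saved under their listed names in one directory. The helper names `md5_of` and `verify` are illustrative and not part of any CLARIN.SI tooling.

```python
import hashlib
from pathlib import Path

# File names and MD5 digests copied from the file list of this item.
EXPECTED_MD5 = {
    "srWaC1.1.01.xml.gz": "ef12a8606bd33a81cd7afb8f2d9299d9",
    "srWaC1.1.02.xml.gz": "0aa1446dd5707f3c08025ccca8463844",
    "srWaC1.1.03.xml.gz": "4b7ee8fd81db15e3e3bc904168dd9c84",
    "srWaC1.1.04.xml.gz": "88b76975ab0c84add8b69dd2c8b939c2",
    "srWaC1.1.05.xml.gz": "3e898abc9f9f86361c6920be564d197d",
    "srWaC1.1.06.xml.gz": "c2edac0f4f23aeae4366683516878b5c",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each expected file name to True if it exists and its digest matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of(path) == expected
    return results
```

Calling `verify("path/to/downloads")` returns a dictionary keyed by file name. A `False` entry means the file is missing or its digest differs from the listed one, in which case downloading it again is the usual remedy.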
