
 
dc.contributor.author Ljubešić, Nikola
dc.contributor.author Rupnik, Peter
dc.date.accessioned 2022-01-27T18:55:32Z
dc.date.available 2022-01-27T18:55:32Z
dc.date.issued 2022-01-26
dc.identifier.uri http://hdl.handle.net/11356/1461
dc.description The SETimes.HBS dataset consists of parallel documents written in Bosnian, Croatian and Serbian, harvested from the now-inactive setimes.com website, which published news in the languages of South-Eastern Europe. How the documents were written is not known, but they are quite likely independent translations from English. The dataset is intended mainly for discriminating between closely related languages. It is not a traditional parallel dataset, as there are no explicit links between parallel documents. Special care was taken that each of the training, development and testing bins contains the same documents in all three languages: given how similar the three languages are, leakage of a document between bins could be problematic for benchmarking.
dc.language.iso bos
dc.language.iso hrv
dc.language.iso srp
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby https://aclanthology.org/C12-1160/
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://www.clarin.si/info/k-centre/
dc.subject news corpus
dc.subject language identification
dc.subject closely related languages
dc.title The news dataset for discriminating between Bosnian, Croatian and Serbian SETimes.HBS 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor Connecting Europe Facility (CEF) Telecom INEA/CEF/ICT/A2020/2278341 MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages Other
sponsor ARRS (Slovenian Research Agency) N6-0099 LiLaH: Linguistic Landscape of Hate Speech nationalFunds
size.info 9258 texts
files.count 1
files.size 21132170
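
The leakage concern raised in dc.description can be checked mechanically. The following is a minimal sketch, not the authors' procedure: it assumes each document carries some identifier shared by its Bosnian, Croatian and Serbian versions, and that you have already grouped those identifiers by bin. The function name `find_leaks`, the bin names and the identifiers are all hypothetical; this record does not document the actual field names in the archive.

```python
from itertools import combinations


def find_leaks(splits):
    """Return identifiers that occur in more than one bin.

    `splits` maps a bin name (e.g. "train") to an iterable of
    document identifiers. The identifiers are hypothetical: they
    stand for whatever key ties together the three language
    versions of one news story.
    """
    leaks = {}
    # Compare every pair of bins once, in insertion order.
    for (name_a, ids_a), (name_b, ids_b) in combinations(splits.items(), 2):
        shared = set(ids_a) & set(ids_b)
        if shared:
            leaks[(name_a, name_b)] = shared
    return leaks


# Toy data only; not taken from the dataset.
toy_splits = {
    "train": ["story-1", "story-2"],
    "dev": ["story-3"],
    "test": ["story-4"],
}
print(find_leaks(toy_splits) or "no identifier is shared between bins")
```

An empty result means no story appears in two bins, which is the property the description says the split was built to guarantee.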


 Files in this item

Name SETimes.HBS.zip
Size 20.15 MB
Format application/zip
Description Dataset archive
MD5 f0ef513a161d6120793e9271a7340f6f
File preview
    • SETimes.HBS.json (57 MB)
    • SETimes.HBS.txt (585 B)
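
The MD5 value listed for the archive can be used to confirm that a download is complete and uncorrupted. A minimal sketch using only the Python standard library follows; the local path `SETimes.HBS.zip` is an assumption about where the file was saved, and the helper names `md5_of_file` and `archive_is_intact` are illustrative.

```python
import hashlib
from pathlib import Path

# Checksum as listed in this record for SETimes.HBS.zip.
EXPECTED_MD5 = "f0ef513a161d6120793e9271a7340f6f"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read 1 MiB at a time so the archive never sits fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("SETimes.HBS.zip")  # assumed download location
    if archive.exists():
        print("checksum OK" if archive_is_intact(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

MD5 is adequate here for detecting transfer errors; it is not a guarantee against deliberate tampering.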
