CorefUD conversion of Slovene coreference resolution corpus coref149

Name: CorefUD conversion of Slovene coreference resolution corpus coref149
License: https://creativecommons.org/licenses/by-nc-sa/4.0/

Klemen, Matej; Žitnik, Slavko

dc.contributor.author	Klemen, Matej
dc.contributor.author	Žitnik, Slavko
dc.date.accessioned	2024-11-21T09:26:33Z
dc.date.available	2024-11-21T09:26:33Z
dc.date.issued	2024-11-17
dc.identifier.uri	http://hdl.handle.net/11356/1989
dc.description	This corpus is the CorefUD conversion of the coref149 corpus for coreference resolution in Slovene (http://hdl.handle.net/11356/1182). It contains 149 documents annotated with coreference information. Coreference in Universal Dependencies (CorefUD) is an initiative to collect coreference corpora in various languages and harmonize them to the same scheme and data format (CoNLL-U). The coreference information is stored in the MISC column: the start and end of each coreference mention are marked with the "Entity=" attribute. For example, "Entity=(e0" marks the start of a mention of the entity e0 at the current token, while "Entity=e0)" marks the end of a mention of the entity e0 at the current token. For full details on the format, please see http://hdl.handle.net/11234/1-5478. To ensure compliance with the CoNLL-U format, the corpus was automatically annotated with trankit v1.1.2 to obtain lemmas, part-of-speech tags (UPOS, XPOS - MULTEXT-East V6), features, and dependencies (head, dependency relation). To enable integration into the SloBENCH evaluation framework (https://slobench.cjvt.si/), we release the labeled training set (100 documents) and the unlabeled test set (49 documents) in the CorefUD format. Please note that the test set labels are available in the original coref149 corpus but are omitted here to deter their misuse. Compared to the original coref149 corpus, this release contains the same texts and coreference information in a different (more universal) format.
dc.language.iso	slv
dc.publisher	Faculty of Computer and Information Science, University of Ljubljana
dc.rights	Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights.label	PUB
dc.subject	coreference resolution
dc.subject	corefud
dc.subject	CONLL-U
dc.title	CorefUD conversion of Slovene coreference resolution corpus coref149
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Matej Klemen matej.klemen@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana
sponsor	Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info	149 texts
files.count	3
files.size	1824178

Files in this item

This item is publicly available and licensed under:
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

Name: coref149_corefud_train.conllu
Size: 1.19 MB
Format: Unknown
Description: Labeled coref149 training set in CoNLL-U format.
MD5: 52637114f35442028eb178647c71a872

Name: coref149_corefud_test_unlabeled.conllu
Size: 563.83 KB
Format: Unknown
Description: Unlabeled coref149 test set in CoNLL-U format.
MD5: f72542c02f9149250a19010d0e835428

Name: README.txt
Size: 1.91 KB
Format: Text file
Description: Description of the resource.
MD5: bf4de0b48e5d082dac5e4cb71121b209
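
Each file in this item is listed with an MD5 checksum. As a minimal sketch, downloaded copies could be checked against those values with Python's standard `hashlib` module. The function names `md5_of_file` and `verify_downloads` and the assumption that the files sit in one local directory are illustrative only; the checksums are copied from the listing above.

```python
import hashlib
import os

# Checksums as listed for this item.
EXPECTED_MD5 = {
    "coref149_corefud_train.conllu": "52637114f35442028eb178647c71a872",
    "coref149_corefud_test_unlabeled.conllu": "f72542c02f9149250a19010d0e835428",
    "README.txt": "bf4de0b48e5d082dac5e4cb71121b209",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory=".", expected=EXPECTED_MD5):
    """Map each expected file name to True (match), False (mismatch),
    or None (file not found in the directory)."""
    results = {}
    for name, checksum in expected.items():
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            results[name] = None
        else:
            results[name] = md5_of_file(path) == checksum
    return results
```

For example, `verify_downloads("path/to/downloads")` (a hypothetical directory) returns a dictionary that shows at a glance which files are missing or differ from the published checksums.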

File Preview

CorefUD conversion of Slovene coreference resolution corpus coref149
v1.0
http://hdl.handle.net/11356/1989
CC BY-NC-SA 4.0

This corpus is the CorefUD conversion of the coref149 corpus for coreference resolution in Slovene (http://hdl.handle.net/11356/1182). It contains 149 documents annotated with coreference information: 100 training and 49 test documents. The test documents were selected according to the underlying cluster distribution: most documents contain a small to medium number of clusters, while a few contain a large number of clusters.

Coreference in Universal Dependencies (CorefUD) is an initiative to collect coreference corpora in various languages and harmonize them to the same scheme and data format (CoNLL-U).
The coreference information is stored in the MISC column. More concretely, the start and end of each coreference mention are marked with the "Entity=" attribute. For example, "Entity=(e0" marks the start of the entity e0 at the current token while "Entity=e0)" marks . . .
