dc.contributor.author Donaj, Gregor
dc.contributor.author Antloga, Špela
dc.date.accessioned 2022-11-19T09:28:44Z
dc.date.available 2022-11-19T09:28:44Z
dc.date.issued 2022-11-15
dc.identifier.uri http://hdl.handle.net/11356/1714
dc.description ParaDiom is a parallel corpus with sentences sampled from existing corpora. The corpus contains 1,000 Slovene sentences with their English translations and 1,000 English sentences with their Slovene translations. The sampled sentences contain idioms, similes, and proverbs, which are annotated in the corpus. Sentences were sampled based on a selection of 100 Slovene and 92 English idioms and similes by searching through sentences in the corpora ccGigafida (http://hdl.handle.net/11356/1035), ParlaMint (http://hdl.handle.net/11356/1431), and The Corpus of Late Modern English Texts (http://fedora.clarin-d.uni-saarland.de/clmet/clmet.html). All sampled sentences were tagged with MULTEXT-East MSD tags, Universal Dependencies morphological features, and lemmas, using Stanza (https://github.com/stanfordnlp/stanza) for the English sentences and CLASSLA (https://github.com/clarinsi/classla) for the Slovene sentences. Some idioms were found as part of proverbs, which were also annotated. Half of the sampled sentences were translated by hand, and the other half were translated using machine translation and post-editing. We used the Q-CAT annotation tool (http://hdl.handle.net/11356/1262) to annotate the idiomatic expressions. The annotated noun, adjective and adverbial idioms were given the label MWE ID (‘idiomatic multiword expression’), verb idioms MWE VID (‘verbal idiomatic multiword expression’), similes MWE SIM (‘simile’), and proverbs MWE P (‘proverb’).
dc.language.iso slv
dc.language.iso eng
dc.publisher Faculty of Electrical Engineering and Computer Science, University of Maribor
dc.rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights.label PUB
dc.subject parallel corpus
dc.subject TEI
dc.subject idiomatic expressions
dc.title Parallel corpus of idiomatic text ParaDiom 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Gregor Donaj gregor.donaj@um.si Faculty of Electrical Engineering and Computer Science, University of Maribor
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info 2933 idiomaticExpressions
size.info 66413 words
size.info 2000 translationUnits
files.count 1
files.size 1173167


 Files in this item

Name ParaDiom.TEI.zip
Size 1.12 MB
Format application/zip
Description Corpus in TEI format
MD5 f9e07bb9d0e8ae6eae3bb23456fc448d
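
The record lists a byte count (files.size) and an MD5 checksum for ParaDiom.TEI.zip, so a downloaded copy can be checked against both. The following is a minimal sketch, not part of the original record: the local file name, the helper names md5_of_file and verify_download, and the reading of files.size as a size in bytes are all assumptions.

```python
import hashlib
from pathlib import Path

# Values copied from this record; files.size is assumed to be in bytes.
EXPECTED_MD5 = "f9e07bb9d0e8ae6eae3bb23456fc448d"
EXPECTED_SIZE = 1173167


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5=EXPECTED_MD5, expected_size=EXPECTED_SIZE):
    """Return True only if both the file size and the MD5 digest match."""
    path = Path(path)
    if path.stat().st_size != expected_size:
        return False
    return md5_of_file(path) == expected_md5


if __name__ == "__main__":
    # Hypothetical download location; adjust to wherever the archive was saved.
    archive = Path("ParaDiom.TEI.zip")
    if archive.exists():
        print("OK" if verify_download(archive) else "MISMATCH")
    else:
        print(f"{archive} not found")
```

MD5 is used here only because it is the checksum the record publishes; it detects corrupted or truncated downloads but is not a safeguard against deliberate tampering.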
 File Preview  
  • ParaDiom.TEI
    • ParaDiom-sl-2.xml (1 MB)
    • ParaDiom-sl-1.xml (1 MB)
    • schema
      • tei_clarin_schema.xml (70 kB)
      • tei_clarin_example.xml (48 kB)
      • tei_clarin.rnc (311 kB)
      • README.md (525 B)
      • tei_clarin.rng (654 kB)
    • ParaDiom-en-4.xml (1 MB)
    • ParaDiom-en-3.xml (1 MB)
    • ParaDiom-en-2.xml (1 MB)
    • mapping.tsv (91 kB)
    • ParaDiom-en-1.xml (1 MB)
    • 00README.txt (1 kB)
    • ParaDiom-sl-4.xml (1 MB)
    • ParaDiom-sl-3.xml (1 MB)
    • ParaDiom.xml (14 kB)