dc.contributor.author Erjavec, Tomaž
dc.date.accessioned 2022-12-05T10:48:27Z
dc.date.available 2022-12-05T10:48:27Z
dc.date.issued 2021-11-27
dc.identifier.uri http://hdl.handle.net/11356/1746
dc.description Slovenia has a large number of diverse corpora available for online analysis via the CLARIN.SI concordancers. However, users who want to run the same queries across different corpora have to search each corpus separately and then combine the results manually, which is time-consuming and error-prone. An additional problem is that corpora typically have different metadata and may be annotated at different linguistic levels, which further complicates identical searches across corpora. For these reasons we combined a number of existing Slovenian corpora available through the CLARIN.SI concordancers into the MetaFida corpus. This first required unifying the metadata, harmonizing the linguistic and structural annotations between the corpora, and converting the individual corpora from their vertical formats, which the CLARIN.SI concordancers use as input, into the MetaFida vertical format. As the source corpora are not completely distinct, MetaFida is deduplicated at the paragraph level. The MetaFida corpus keeps only the information that is common to most of the selected corpora. The structure is nested very shallowly, which makes it easier to create subcorpora or to limit a search to individual text types. All MetaFida positional attributes are treated as multi-valued, with the values separated by a space. Multiple values are needed because some corpora have normalized words (older Slovenian, user-generated content), where one original word can map to several normalized ones or vice versa.
There are 34 corpora included in this version of MetaFida:
* classlawiki_sl, CLASSLAWiki-sl (Slovenian Wikipedia), 54,608,642 tokens
* dgt15_sl, EU DGT 2015: Slovene, 62,303,744 tokens
* dsi, DSI (informatics), 5,245,073 tokens
* eltec_slv, ELTeC-slv (100 novels), 6,901,534 tokens
* filmi, FILMI (film reviews), 936,446 tokens
* gfida20_dedup, Gigafida v2.0 (reference, deduplicated), 1,333,360,653 tokens
* gos_vl42, GosVL 4.2 (spoken, VideoLectures), 179,063 tokens
* gos11, Gos 1.1.1 (reference, speech), 1,063,861 tokens
* imp, IMP (older texts), 17,723,874 tokens
* ispac_sl, ISPAC: Slovenian, 1,432,798 tokens
* janes_blog, Janes Blog (blogs with comments), 34,534,431 tokens
* janes_forum, Janes Forum (web forums), 47,066,575 tokens
* janes_news, Janes News (news comments), 14,838,074 tokens
* janes_tweet, Janes Tweet (tweets 2013-2017), 151,457,091 tokens
* janes_wiki, Janes Wiki (Wikipedia comments), 5,008,067 tokens
* jaslo_sl, jaSlo: Slovenian, 532,395 tokens
* kas_dipl, KAS Dipl (diplomas), 1,101,796,659 tokens
* kas_dr, KAS Dr (PhD theses), 101,473,395 tokens
* kas_mag, KAS Mag (master theses), 495,827,656 tokens
* konji, Konji (equestrianism), 469,894 tokens
* korp, KoRP (public relations), 2,194,130 tokens
* lemonde_sl, LeMonde: Slovenian, 615,617 tokens
* maj68, Maj68 (May 1968 in literature), 794,382 tokens
* maks, MAKS (youth literature), 12,072,273 tokens
* prilit, PriLit (older narrative prose), 1,275,209 tokens
* rsdo5, RSDO5 (term-annotated texts), 310,588 tokens
* sbsj, SBSJ (school texts), 1,836,810 tokens
* siparl20, siParl 2.0 (parliament 1990-2018), 239,749,733 tokens
* slwac, slWaC (Slovene Web), 895,903,321 tokens
* solar, Šolar v2 Clear (school essays), 1,907,731 tokens
* suss, ŠUSS (FAQ on Slovenian language), 365,371 tokens
* trans5_sl, TRANS5: Slovenian, 1,594,120 tokens
* tweet_sl, Tweet-sl (older tweets), 6,291,820 tokens
* vayna, VAYNA (attacks on the YNA), 300,666 tokens
Σ 34 corpora, 4,601,971,696 tokens
dc.language.iso slv
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby https://www.clarin.si/info/wp-content/uploads/2022/10/MetaFida_MDS_2022.pdf
dc.relation.isreplacedby http://hdl.handle.net/11356/1775
dc.source.uri https://rsdo.slovenscina.eu/en/language-resources
dc.subject reference corpus
dc.title Corpus of combined Slovenian corpora MetaFida 0.1
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files no
branding CLARIN.SI data & tools
contact.person Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute
sponsor Ministry of Culture C3340-20-278001 Development of Slovene in a Digital Environment Other
size.info 4601971696 tokens
size.info 3646106563 words
size.info 15338754 texts
files.count 0
files.size 0
featuredService.kontext search|https://www.clarin.si/kontext/query?corpname=mfida01
featuredService.noske search|https://www.clarin.si/ske/#dashboard?corpname=mfida01&struct_attr_stats=1

