Corpus of combined Slovenian corpora MetaFida 0.1

Erjavec, Tomaž

dc.contributor.author	Erjavec, Tomaž
dc.date.accessioned	2022-12-05T10:48:27Z
dc.date.available	2022-12-05T10:48:27Z
dc.date.issued	2021-11-27
dc.identifier.uri	http://hdl.handle.net/11356/1746
dc.description	Slovenia has a large number of diverse corpora available for online analysis via the CLARIN.SI concordancers. However, users who want to run the same query across different corpora have to search each corpus separately and then combine the results manually, which is time-consuming and error-prone. An additional problem is that corpora typically have different metadata and may be annotated at different linguistic levels, which further complicates identical searches across corpora. For these reasons we combined a number of existing corpora of Slovenian available through the CLARIN.SI concordancers into the MetaFida corpus. This first required unifying the metadata, harmonizing the linguistic and structural annotations between the corpora, and converting the individual corpora from their vertical formats, which the CLARIN.SI concordancers use as input, into the MetaFida vertical format. As the source corpora are not completely distinct, MetaFida is deduplicated at the level of paragraphs. In the MetaFida corpus, we kept only the information that is common to most of the selected corpora. The structure is nested very shallowly, which makes it easier to create subcorpora or limit the search to individual text types. All MetaFida positional attributes are considered to have multiple values, separated by a space. Multiple values are needed because some corpora have normalized words (older Slovenian, user-generated content), where one original word can be mapped to several normalized ones or vice versa.
There are 34 corpora included in this version of MetaFida:
* classlawiki_sl, CLASSLAWiki-sl (Slovenian Wikipedia), 54,608,642 tokens
* dgt15_sl, EU DGT 2015: Slovene, 62,303,744 tokens
* dsi, DSI (informatics), 5,245,073 tokens
* eltec_slv, ELTeC-slv (100 novels), 6,901,534 tokens
* filmi, FILMI (film reviews), 936,446 tokens
* gfida20_dedup, Gigafida v2.0 (reference, deduplicated), 1,333,360,653 tokens
* gos_vl42, GosVL 4.2 (spoken, VideoLectures), 179,063 tokens
* gos11, Gos 1.1.1 (reference, speech), 1,063,861 tokens
* imp, IMP (older texts), 17,723,874 tokens
* ispac_sl, ISPAC: Slovenian, 1,432,798 tokens
* janes_blog, Janes Blog (blogs with comments), 34,534,431 tokens
* janes_forum, Janes Forum (web forums), 47,066,575 tokens
* janes_news, Janes News (news comments), 14,838,074 tokens
* janes_tweet, Janes Tweet (tweets 2013-2017), 151,457,091 tokens
* janes_wiki, Janes Wiki (Wikipedia comments), 5,008,067 tokens
* jaslo_sl, jaSlo: Slovenian, 532,395 tokens
* kas_dipl, KAS Dipl (diplomas), 1,101,796,659 tokens
* kas_dr, KAS Dr (PhD theses), 101,473,395 tokens
* kas_mag, KAS Mag (master's theses), 495,827,656 tokens
* konji, Konji (equestrianism), 469,894 tokens
* korp, KoRP (public relations), 2,194,130 tokens
* lemonde_sl, LeMonde: Slovenian, 615,617 tokens
* maj68, Maj68 (May 1968 in literature), 794,382 tokens
* maks, MAKS (youth literature), 12,072,273 tokens
* prilit, PriLit (older narrative prose), 1,275,209 tokens
* rsdo5, RSDO5 (term-annotated texts), 310,588 tokens
* sbsj, SBSJ (school texts), 1,836,810 tokens
* siparl20, siParl 2.0 (parliament 1990-2018), 239,749,733 tokens
* slwac, slWaC (Slovene Web), 895,903,321 tokens
* solar, Šolar v2 Clear (school essays), 1,907,731 tokens
* suss, ŠUSS (FAQ on Slovenian language), 365,371 tokens
* trans5_sl, TRANS5: Slovenian, 1,594,120 tokens
* tweet_sl, Tweet-sl (older tweets), 6,291,820 tokens
* vayna, VAYNA (attacks on the YNA), 300,666 tokens
Σ 34 corpora, 4,601,971,696 tokens
dc.language.iso	slv
dc.publisher	Jožef Stefan Institute
dc.relation.isreferencedby	https://www.clarin.si/info/wp-content/uploads/2022/10/MetaFida_MDS_2022.pdf
dc.relation.isreplacedby	http://hdl.handle.net/11356/1775
dc.source.uri	https://rsdo.slovenscina.eu/en/language-resources
dc.subject	reference corpus
dc.title	Corpus of combined Slovenian corpora MetaFida 0.1
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	no
branding	CLARIN.SI data & tools
contact.person	Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute
sponsor	Ministry of Culture C3340-20-278001 Development of Slovene in a Digital Environment Other
size.info	4601971696 tokens
size.info	3646106563 words
size.info	15338754 texts
files.count	0
files.size	0
featuredService.kontext	search|https://www.clarin.si/kontext/query?corpname=mfida01
featuredService.noske	search|https://www.clarin.si/ske/#dashboard?corpname=mfida01
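
The description states that every MetaFida positional attribute may hold several space-separated values and that the corpus is deduplicated at the paragraph level. The following is a minimal sketch of both ideas, not the actual MetaFida tooling. The attribute names in `ATTRS`, the tab-separated column layout, the sample tokens, and the use of exact-match hashing for deduplication are all assumptions made for illustration; the record does not specify them.

```python
import hashlib

# Hypothetical column order; the record does not list the real attributes.
ATTRS = ("word", "norm", "lemma")


def parse_token_line(line):
    """Split one vertical-format token line into multi-valued attributes."""
    columns = line.rstrip("\n").split("\t")
    # Each positional attribute may hold several space-separated values.
    return {name: col.split(" ") for name, col in zip(ATTRS, columns)}


def dedup_paragraphs(paragraphs):
    """Keep the first occurrence of each paragraph (exact text match assumed)."""
    seen = set()
    unique = []
    for text in paragraphs:
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


# One original word mapped to two normalized words (made-up example).
token = parse_token_line("nebom\tne bom\tne biti\n")
paragraphs = dedup_paragraphs(["First.", "Second.", "First."])
```

Here `token["norm"]` holds two values for a single original word, which is the situation the description mentions for older Slovenian and user-generated content. A real pipeline might compare paragraphs after normalization rather than by exact text.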
