Slovenian language resource repository CLARIN.SI

Slovenian language resource repository CLARIN.SI The CLARIN.SI digital repository system captures, stores, indexes, preserves, and distributes digital research material. https://www.clarin.si:443/repository/xmlui 2024-07-23T16:30:01Z 2024-07-23T16:30:01Z The Sarajevo Corpus of SMS Messages in Bosnian 1.1 Wasserscheidt, Philipp Bulić, Halid Durmišević, Elma Hodžić-Čavkić, Azra Bajraktarević, Enisa Ahmetspahić-Peljto, Azra Šabić, Belmin http://hdl.handle.net/11356/1956 2024-07-17T07:56:03Z 2024-07-16T00:00:00Z

The Sarajevo Corpus of SMS Messages in Bosnian 1.1 Wasserscheidt, Philipp; Bulić, Halid; Durmišević, Elma; Hodžić-Čavkić, Azra; Bajraktarević, Enisa; Ahmetspahić-Peljto, Azra; Šabić, Belmin This corpus is specialized, static (i.e., no future growth is planned) and diachronic, covering the period from 2002 to 2022. The SMS messages included in this corpus were obtained from voluntary donors (informants). Both senders and recipients of the messages are Bosnian speakers, diverse in age, education, occupation, place of origin and countries of long-term residence. The Sarajevo Corpus of SMS Messages in Bosnian was originally published by the University of Sarajevo – Faculty of Philosophy as an electronic book. The second phase of the work involved compiling the SMS messages into a corpus and annotating it linguistically, which was done using the CLASSLA package (https://github.com/clarinsi/classla), version 2.1, with language = Serbian and type = nonstandard for tokenization, lemmatization and morpho-syntactic tagging (both MULTEXT-East and Universal Dependencies). As opposed to the previous version, this version corrects a number of mistakes in the metadata.

2024-07-16T00:00:00Z Albanian Spoken Corpus in Kosovo 1.0 Wasserscheidt, Philipp Rugova, Bardh Baftiu, Adelajda http://hdl.handle.net/11356/1955 2024-07-09T11:17:35Z 2024-07-08T00:00:00Z

Albanian Spoken Corpus in Kosovo 1.0 Wasserscheidt, Philipp; Rugova, Bardh; Baftiu, Adelajda This is the third version of a spoken corpus of Albanian in Kosovo. The corpus is based on short life stories of 212 informants out of a sample of 1,800 speakers balanced across all regions of Kosovo and the categories of gender, age and education. In addition, metadata such as place of birth, place of residence, L1, L2, age group and occupation were collected. The audio data was recorded in 2019 by students from the University of Prishtina. The speech files can be made available on request from one of the authors and will be made publicly available after the finalisation of the transcription in the next version. The transcription was carried out partly at Humboldt-Universität zu Berlin and partly at the University of Prishtina. The transcription is diplomatic (using the standard alphabet but transcribing relevant phonological realisation). It partly follows the typical rendering of Gheg dialectal words and uses the HIAT system. The data was annotated using Timofey Arkhangelsky's Uniparser-albanian-grammar (https://bitbucket.org/timarkh/uniparser-albanian-grammar), keeping only non-ambiguous values. A list of tags used in the parser can be found here: http://albanian.web-corpora.net. The data are in CoNLL-U format. This version of the corpus contains the data of 212 speakers aged between 11 and 80, mainly from the regions of Ferizaj, Gjilan, Kaçanik, Mitrovicë, Podujevë, Rahovec and Shtërpcë. As opposed to the previous version, this corpus corrects several errors in the metadata.
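Several entries in this feed distribute their data in CoNLL-U format. As a rough illustration of what reading such a file involves, the following is a minimal sketch in Python using only the standard library. The ten column names follow the CoNLL-U specification; the function name read_conllu and the two-token sample are hypothetical and are not taken from any corpus listed here. It is not the tooling used by the corpus authors.

```python
from typing import Dict, Iterable, Iterator, List

# The ten tab-separated columns defined by the CoNLL-U specification, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comment lines (e.g. sent_id) are skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        # Handle a file that does not end with a blank line.
        yield sentence


# Hypothetical sample, invented for illustration only.
SAMPLE = (
    "# sent_id = demo-1\n"
    "1\tHello\thello\tINTJ\t_\t_\t_\t_\t_\t_\n"
    "2\tworld\tworld\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
```

In this sketch unspecified values stay as the underscore placeholder, which matters for a corpus such as this one where only non-ambiguous annotation values were kept.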

2024-07-08T00:00:00Z Monitor corpus of Slovene Trendi 2024-06 Kosem, Iztok Čibej, Jaka Dobrovoljc, Kaja Erjavec, Tomaž Ljubešić, Nikola Ponikvar, Primož Šinkec, Mihael Krek, Simon http://hdl.handle.net/11356/1953 2024-07-04T12:29:08Z 2024-07-03T00:00:00Z

Monitor corpus of Slovene Trendi 2024-06 Kosem, Iztok; Čibej, Jaka; Dobrovoljc, Kaja; Erjavec, Tomaž; Ljubešić, Nikola; Ponikvar, Primož; Šinkec, Mihael; Krek, Simon The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 74 publishers. Trendi 2024-06 covers the period from January 2019 to June 2024, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to the Universal Dependencies (https://universaldependencies.org/sl/) and Named Entities (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem (iztok.kosem@ijs.si). This version adds texts from June 2024.

2024-07-03T00:00:00Z Monitor corpus of Slovene Trendi 2024-05 Kosem, Iztok Čibej, Jaka Dobrovoljc, Kaja Erjavec, Tomaž Ljubešić, Nikola Ponikvar, Primož Šinkec, Mihael Krek, Simon http://hdl.handle.net/11356/1950 2024-07-04T12:28:46Z 2024-05-07T00:00:00Z

Monitor corpus of Slovene Trendi 2024-05 Kosem, Iztok; Čibej, Jaka; Dobrovoljc, Kaja; Erjavec, Tomaž; Ljubešić, Nikola; Ponikvar, Primož; Šinkec, Mihael; Krek, Simon The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 73 publishers. Trendi 2024-05 covers the period from January 2019 to May 2024, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to the Universal Dependencies (https://universaldependencies.org/sl/) and Named Entities (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem (iztok.kosem@ijs.si). This version adds texts from May 2024.

2024-05-07T00:00:00Z Slovenian parliamentary corpus (1990-2022) siParl 4.0 Pančur, Andrej Meden, Katja Erjavec, Tomaž Ojsteršek, Mihael Šorn, Mojca Blaj Hribar, Neja http://hdl.handle.net/11356/1936 2024-07-23T09:15:59Z 2024-06-05T00:00:00Z

Slovenian parliamentary corpus (1990-2022) siParl 4.0 Pančur, Andrej; Meden, Katja; Erjavec, Tomaž; Ojsteršek, Mihael; Šorn, Mojca; Blaj Hribar, Neja The siParl 4.0 corpus contains minutes of the Assembly of the Republic of Slovenia for the 11th legislative period 1990-1992, minutes of the National Assembly of the Republic of Slovenia from the 1st to the 8th legislative period 1992-2022, minutes of the working bodies of the National Assembly of the Republic of Slovenia from the 2nd to the 8th legislative period 1996-2022, and minutes of the Council of the President of the National Assembly from the 2nd to the 8th legislative period 1996-2022. The corpus comprises over 13 thousand sessions, one million speeches and 230 million words. The corpus is encoded according to the Parla-CLARIN schema (https://github.com/clarin-eric/parla-clarin). Each mandate is in one directory, and each session in one file. As opposed to the previous version 3.0, this version adds new data (minutes of the National Assembly of the Republic of Slovenia of the 8th legislative period) and corrects many errors. This item comprises the following datasets: 1. source DARAH-SI Parla-CLARIN encoded corpus in TEI format; 2. linguistically annotated Parla-CLARIN encoded corpus: tokenisation, MSD tagging, lemmatisation, Universal Dependencies features and syntactic parses, named entities; 3. automatically derived corpus in plain text with metadata on speeches; 4. automatically derived linguistically annotated corpus in CoNLL-U (Universal Dependencies) format with metadata on speeches; 5. automatically derived linguistically annotated corpus in vertical format used by CWB and Sketch Engine concordancers, together with registry file as used on the CLARIN.SI concordancers.

2024-06-05T00:00:00Z Linguistically annotated multilingual comparable corpora of parliamentary debates in English ParlaMint-en.ana 4.1 Kuzman, Taja Ljubešić, Nikola Erjavec, Tomaž Kopp, Matyáš Ogrodniczuk, Maciej Osenova, Petya Rayson, Paul Vidler, John Agerri, Rodrigo Agirrezabal, Manex Agnoloni, Tommaso Aires, José Albini, Monica Alkorta, Jon Antiba-Cartazo, Iván Arrieta, Ekain Barcala, Mario Bardanca, Daniel Barkarson, Starkaður Bartolini, Roberto Battistoni, Roberto Bel, Nuria Bonet Ramos, Maria del Mar Calzada Pérez, María Cardoso, Aida Çöltekin, Çağrı Coole, Matthew Darģis, Roberts de Does, Jesse de Libano, Ruben Depoorter, Griet Depuydt, Katrien Diwersy, Sascha Dodé, Réka Fernandez, Kike Fernández Rei, Elisa Frontini, Francesca Garcia, Marcos García Díaz, Noelia García Louzao, Pedro Gavriilidou, Maria Gkoumas, Dimitris Grigorov, Ilko Grigorova, Vladislava Haltrup Hansen, Dorte Iruskieta, Mikel Jarlbrink, Johan Jelencsik-Mátyus, Kinga Jongejan, Bart Kahusk, Neeme Kirnbauer, Martin Kryvenko, Anna Ligeti-Nagy, Noémi Luxardo, Giancarlo Magariños, Carmen Magnusson, Måns Marchetti, Carlo Marx, Maarten Meden, Katja Mendes, Amália Mochtak, Michal Mölder, Martin Montemagni, Simonetta Navarretta, Costanza Nitoń, Bartłomiej Norén, Fredrik Mohammadi Nwadukwe, Amanda Ojsteršek, Mihael Pančur, Andrej Papavassiliou, Vassilis Pereira, Rui Pérez Lago, María Piperidis, Stelios Pirker, Hannes Pisani, Marilina Pol, Henk van der Prokopidis, Prokopis Quochi, Valeria Regueira, Xosé Luís Rii, Andriana Rudolf, Michał Ruisi, Manuela Rupnik, Peter Schopper, Daniel Simov, Kiril Sinikallio, Laura Skubic, Jure Tamper, Minna Tungland, Lars Magne Tuominen, Jouni van Heusden, Ruben Varga, Zsófia Vázquez Abuín, Marta Venturi, Giulia Vidal Miguéns, Adrián Vider, Kadri Vivel Couso, Ainhoa Vladu, Adina Ioana Wissik, Tanja Yrjänäinen, Väinö Zevallos, Rodolfo Fišer, Darja http://hdl.handle.net/11356/1910 2024-06-04T18:51:37Z 2024-06-03T00:00:00Z

Linguistically annotated multilingual comparable corpora of parliamentary debates in English ParlaMint-en.ana 4.1 Kuzman, Taja; Ljubešić, Nikola; Erjavec, Tomaž; Kopp, Matyáš; Ogrodniczuk, Maciej; Osenova, Petya; Rayson, Paul; Vidler, John; Agerri, Rodrigo; Agirrezabal, Manex; Agnoloni, Tommaso; Aires, José; Albini, Monica; Alkorta, Jon; Antiba-Cartazo, Iván; Arrieta, Ekain; Barcala, Mario; Bardanca, Daniel; Barkarson, Starkaður; Bartolini, Roberto; Battistoni, Roberto; Bel, Nuria; Bonet Ramos, Maria del Mar; Calzada Pérez, María; Cardoso, Aida; Çöltekin, Çağrı; Coole, Matthew; Darģis, Roberts; de Does, Jesse; de Libano, Ruben; Depoorter, Griet; Depuydt, Katrien; Diwersy, Sascha; Dodé, Réka; Fernandez, Kike; Fernández Rei, Elisa; Frontini, Francesca; Garcia, Marcos; García Díaz, Noelia; García Louzao, Pedro; Gavriilidou, Maria; Gkoumas, Dimitris; Grigorov, Ilko; Grigorova, Vladislava; Haltrup Hansen, Dorte; Iruskieta, Mikel; Jarlbrink, Johan; Jelencsik-Mátyus, Kinga; Jongejan, Bart; Kahusk, Neeme; Kirnbauer, Martin; Kryvenko, Anna; Ligeti-Nagy, Noémi; Luxardo, Giancarlo; Magariños, Carmen; Magnusson, Måns; Marchetti, Carlo; Marx, Maarten; Meden, Katja; Mendes, Amália; Mochtak, Michal; Mölder, Martin; Montemagni, Simonetta; Navarretta, Costanza; Nitoń, Bartłomiej; Norén, Fredrik Mohammadi; Nwadukwe, Amanda; Ojsteršek, Mihael; Pančur, Andrej; Papavassiliou, Vassilis; Pereira, Rui; Pérez Lago, María; Piperidis, Stelios; Pirker, Hannes; Pisani, Marilina; Pol, Henk van der; Prokopidis, Prokopis; Quochi, Valeria; Regueira, Xosé Luís; Rii, Andriana; Rudolf, Michał; Ruisi, Manuela; Rupnik, Peter; Schopper, Daniel; Simov, Kiril; Sinikallio, Laura; Skubic, Jure; Tamper, Minna; Tungland, Lars Magne; Tuominen, Jouni; van Heusden, Ruben; Varga, Zsófia; Vázquez Abuín, Marta; Venturi, Giulia; Vidal Miguéns, Adrián; Vider, Kadri; Vivel Couso, Ainhoa; Vladu, Adina Ioana; Wissik, Tanja; Yrjänäinen, Väinö; Zevallos, Rodolfo; Fišer, Darja ParlaMint-en.ana 4.1 is the English machine 
translation of the ParlaMint.ana 4.1 (http://hdl.handle.net/11356/1911) set of corpora of parliamentary debates across Europe. The translation is linguistically annotated similarly to the original language corpora (but without UD syntax), and with the addition of USAS semantic tags (https://ucrel.lancs.ac.uk/usas/). Because of the addition of semantic tags the UK corpus (ParlaMint-GB) is also included. The translation to English was done with EasyNMT (https://github.com/UKPLab/EasyNMT) using OPUS-MT models (https://github.com/Helsinki-NLP/Opus-MT). Machine translation was done on the sentence level, and includes both speeches and transcriber notes, including headings. Note that corpus metadata is mostly available both in the source language and in English. The linguistic annotation of the speeches, i.e. tokenisation, tagging with UD PoS and morphological features, lemmatisation, and NER annotation was done with Stanza (https://stanfordnlp.github.io/stanza/) using the conll03 model (4 classes). The annotation of MWEs (phrases) and tokens with USAS tags was done with pyMusas (https://github.com/ucrel/pymusas). Note that the English in the corpora contains typical NMT errors, including factual errors even when high fluency is achieved, and any use of this corpus should take the machine translation limitations into account. The files associated with this entry include the machine translated and linguistically annotated corpora in several formats: the corpora in the canonical ParlaMint TEI XML encoding; the corpora in the derived vertical format (for use with CQP-based concordancers, such as CWB, noSketch Engine or KonText); and the corpora in the CoNLL-U format with TSV speech metadata. The CoNLL-U files include pyMusas USAS tags. 
Also included is the 4.1 release of the sample data and scripts available at the GitHub repository of the ParlaMint project at https://github.com/clarin-eric/ParlaMint and the log files produced in the process of building the corpora for this release. The log files show e.g. known errors in the corpora, while more information about known problems is available in the (open) issues at the GitHub repository of the project. As opposed to the previous version 4.0, this version fixes a number of bugs and restructures the ParlaMint GitHub repository. The DK corpus now also has its speeches marked with topics. The PT corpus has been extended to 2024-03 and the UA corpus to 2023-11, where UA also has improved language marking (uk vs. ru) on segments.

2024-06-03T00:00:00Z Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 4.1 Erjavec, Tomaž Kopp, Matyáš Ogrodniczuk, Maciej Osenova, Petya Agerri, Rodrigo Agirrezabal, Manex Agnoloni, Tommaso Aires, José Albini, Monica Alkorta, Jon Antiba-Cartazo, Iván Arrieta, Ekain Barcala, Mario Bardanca, Daniel Barkarson, Starkaður Bartolini, Roberto Battistoni, Roberto Bel, Nuria Bonet Ramos, Maria del Mar Calzada Pérez, María Cardoso, Aida Çöltekin, Çağrı Coole, Matthew Darģis, Roberts de Does, Jesse de Libano, Ruben Depoorter, Griet Depuydt, Katrien Diwersy, Sascha Dodé, Réka Fernandez, Kike Fernández Rei, Elisa Frontini, Francesca Garcia, Marcos García Díaz, Noelia García Louzao, Pedro Gavriilidou, Maria Gkoumas, Dimitris Grigorov, Ilko Grigorova, Vladislava Haltrup Hansen, Dorte Iruskieta, Mikel Jarlbrink, Johan Jelencsik-Mátyus, Kinga Jongejan, Bart Kahusk, Neeme Kirnbauer, Martin Kryvenko, Anna Ligeti-Nagy, Noémi Ljubešić, Nikola Luxardo, Giancarlo Magariños, Carmen Magnusson, Måns Marchetti, Carlo Marx, Maarten Meden, Katja Mendes, Amália Mochtak, Michal Mölder, Martin Montemagni, Simonetta Navarretta, Costanza Nitoń, Bartłomiej Norén, Fredrik Mohammadi Nwadukwe, Amanda Ojsteršek, Mihael Pančur, Andrej Papavassiliou, Vassilis Pereira, Rui Pérez Lago, María Piperidis, Stelios Pirker, Hannes Pisani, Marilina Pol, Henk van der Prokopidis, Prokopis Quochi, Valeria Rayson, Paul Regueira, Xosé Luís Rii, Andriana Rudolf, Michał Ruisi, Manuela Rupnik, Peter Schopper, Daniel Simov, Kiril Sinikallio, Laura Skubic, Jure Tamper, Minna Tungland, Lars Magne Tuominen, Jouni van Heusden, Ruben Varga, Zsófia Vázquez Abuín, Marta Venturi, Giulia Vidal Miguéns, Adrián Vider, Kadri Vivel Couso, Ainhoa Vladu, Adina Ioana Wissik, Tanja Yrjänäinen, Väinö Zevallos, Rodolfo Fišer, Darja http://hdl.handle.net/11356/1911 2024-07-17T08:08:30Z 2024-06-03T00:00:00Z

Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 4.1 Erjavec, Tomaž; Kopp, Matyáš; Ogrodniczuk, Maciej; Osenova, Petya; Agerri, Rodrigo; Agirrezabal, Manex; Agnoloni, Tommaso; Aires, José; Albini, Monica; Alkorta, Jon; Antiba-Cartazo, Iván; Arrieta, Ekain; Barcala, Mario; Bardanca, Daniel; Barkarson, Starkaður; Bartolini, Roberto; Battistoni, Roberto; Bel, Nuria; Bonet Ramos, Maria del Mar; Calzada Pérez, María; Cardoso, Aida; Çöltekin, Çağrı; Coole, Matthew; Darģis, Roberts; de Does, Jesse; de Libano, Ruben; Depoorter, Griet; Depuydt, Katrien; Diwersy, Sascha; Dodé, Réka; Fernandez, Kike; Fernández Rei, Elisa; Frontini, Francesca; Garcia, Marcos; García Díaz, Noelia; García Louzao, Pedro; Gavriilidou, Maria; Gkoumas, Dimitris; Grigorov, Ilko; Grigorova, Vladislava; Haltrup Hansen, Dorte; Iruskieta, Mikel; Jarlbrink, Johan; Jelencsik-Mátyus, Kinga; Jongejan, Bart; Kahusk, Neeme; Kirnbauer, Martin; Kryvenko, Anna; Ligeti-Nagy, Noémi; Ljubešić, Nikola; Luxardo, Giancarlo; Magariños, Carmen; Magnusson, Måns; Marchetti, Carlo; Marx, Maarten; Meden, Katja; Mendes, Amália; Mochtak, Michal; Mölder, Martin; Montemagni, Simonetta; Navarretta, Costanza; Nitoń, Bartłomiej; Norén, Fredrik Mohammadi; Nwadukwe, Amanda; Ojsteršek, Mihael; Pančur, Andrej; Papavassiliou, Vassilis; Pereira, Rui; Pérez Lago, María; Piperidis, Stelios; Pirker, Hannes; Pisani, Marilina; Pol, Henk van der; Prokopidis, Prokopis; Quochi, Valeria; Rayson, Paul; Regueira, Xosé Luís; Rii, Andriana; Rudolf, Michał; Ruisi, Manuela; Rupnik, Peter; Schopper, Daniel; Simov, Kiril; Sinikallio, Laura; Skubic, Jure; Tamper, Minna; Tungland, Lars Magne; Tuominen, Jouni; van Heusden, Ruben; Varga, Zsófia; Vázquez Abuín, Marta; Venturi, Giulia; Vidal Miguéns, Adrián; Vider, Kadri; Vivel Couso, Ainhoa; Vladu, Adina Ioana; Wissik, Tanja; Yrjänäinen, Väinö; Zevallos, Rodolfo; Fišer, Darja ParlaMint 4.1 is a set of comparable corpora containing transcriptions of parliamentary 
debates of 29 European countries and autonomous regions, mostly starting in 2015 and extending to mid-2022. The individual corpora comprise between 9 and 126 million words and the complete set contains over 1.2 billion words. The transcriptions are divided by days with information on the term, session and meeting, and contain speeches marked by the speaker and their role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. The corpora have extensive metadata, most importantly on speakers (name, gender, MP and minister status, party affiliation), on their political parties and parliamentary groups (name, coalition/opposition status, Wikipedia-sourced left-to-right political orientation, and CHES variables, https://www.chesdata.eu/). Note that some corpora have further metadata, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The transcriptions are also marked with the subcorpora they belong to ("reference", until 2020-01-30, "covid", from 2020-01-31, and "war", from 2022-02-24). An overview of the statistics of the corpora is available on GitHub in the folder Build/Metadata, in particular for the release 4.1 at https://github.com/clarin-eric/ParlaMint/tree/v4.1/Build/Metadata. The corpora are encoded according to the ParlaMint encoding guidelines (https://clarin-eric.github.io/ParlaMint/) and schemas (included in the distribution). The ParlaMint.ana linguistic annotation includes tokenization; sentence segmentation; lemmatisation; Universal Dependencies part-of-speech, morphological features, and syntactic dependencies; and the 4-class CoNLL-2003 named entities. Some corpora also have further linguistic annotations, in particular PoS tagging according to a language-specific scheme, with their corpus TEI headers giving further details on the annotation vocabularies and tools used. 
This entry contains the ParlaMint.ana TEI-encoded linguistically annotated corpora; the derived CoNLL-U files along with TSV metadata of the speeches; and the derived vertical files (with their registry file), suitable for use with CQP-based concordancers, such as CWB, noSketch Engine or KonText. Also included is the 4.1 release of the sample data and scripts available at the GitHub repository of the ParlaMint project at https://github.com/clarin-eric/ParlaMint and the log files produced in the process of building the corpora for this release. The log files show e.g. known errors in the corpora, while more information about known problems is available in the open issues at the GitHub repository of the project. This entry contains the linguistically marked-up version of the corpus, while the text version, i.e. without the linguistic annotation, is also available at http://hdl.handle.net/11356/1912. Another related resource, namely the ParlaMint corpora machine translated to English (ParlaMint-en.ana 4.1), can be found at http://hdl.handle.net/11356/1910. As opposed to the previous version 4.0, this version fixes a number of bugs and restructures the ParlaMint GitHub repository. The DK corpus has been linguistically re-annotated to remove bugs, while its speeches are now also marked with topics. The PT corpus has been extended to 2024-03 and the UA corpus to 2023-11, where UA also has improved language marking (uk vs. ru) on segments.
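The subcorpus boundaries quoted in the ParlaMint description ("reference" until 2020-01-30, "covid" from 2020-01-31, "war" from 2022-02-24) can be expressed as a small date lookup. The sketch below is an illustration only: the function name subcorpora is hypothetical, and it assumes a literal reading in which "covid" has no end date, so a sitting on or after 2022-02-24 carries both "covid" and "war". The description does not state whether the two labels overlap, so check the ParlaMint encoding guidelines before relying on this.

```python
from datetime import date
from typing import List

# Boundary dates as quoted in the ParlaMint 4.1 description.
COVID_START = date(2020, 1, 31)
WAR_START = date(2022, 2, 24)


def subcorpora(day: date) -> List[str]:
    """Return the subcorpus labels for a sitting held on the given day.

    Assumption (not confirmed by the description): "covid" is open-ended,
    so it continues to apply once "war" starts.
    """
    if day < COVID_START:
        return ["reference"]
    labels = ["covid"]
    if day >= WAR_START:
        labels.append("war")
    return labels


# Usage: classify three illustrative sitting dates.
examples = {d.isoformat(): subcorpora(d)
            for d in (date(2019, 6, 1), date(2021, 3, 15), date(2022, 3, 1))}
```

If the labels turn out to be mutually exclusive in a given corpus, only the final branch needs to change so that it returns "war" alone.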

2024-06-03T00:00:00Z Multilingual comparable corpora of parliamentary debates ParlaMint 4.1 Erjavec, Tomaž Kopp, Matyáš Ogrodniczuk, Maciej Osenova, Petya Agirrezabal, Manex Agnoloni, Tommaso Aires, José Albini, Monica Alkorta, Jon Antiba-Cartazo, Iván Arrieta, Ekain Barcala, Mario Bardanca, Daniel Barkarson, Starkaður Bartolini, Roberto Battistoni, Roberto Bel, Nuria Bonet Ramos, Maria del Mar Calzada Pérez, María Cardoso, Aida Çöltekin, Çağrı Coole, Matthew Darģis, Roberts de Libano, Ruben Depoorter, Griet Diwersy, Sascha Dodé, Réka Fernandez, Kike Fernández Rei, Elisa Frontini, Francesca Garcia, Marcos García Díaz, Noelia García Louzao, Pedro Gavriilidou, Maria Gkoumas, Dimitris Grigorov, Ilko Grigorova, Vladislava Haltrup Hansen, Dorte Iruskieta, Mikel Jarlbrink, Johan Jelencsik-Mátyus, Kinga Jongejan, Bart Kahusk, Neeme Kirnbauer, Martin Kryvenko, Anna Ligeti-Nagy, Noémi Ljubešić, Nikola Luxardo, Giancarlo Magariños, Carmen Magnusson, Måns Marchetti, Carlo Marx, Maarten Meden, Katja Mendes, Amália Mochtak, Michal Mölder, Martin Montemagni, Simonetta Navarretta, Costanza Nitoń, Bartłomiej Norén, Fredrik Mohammadi Nwadukwe, Amanda Ojsteršek, Mihael Pančur, Andrej Papavassiliou, Vassilis Pereira, Rui Pérez Lago, María Piperidis, Stelios Pirker, Hannes Pisani, Marilina Pol, Henk van der Prokopidis, Prokopis Quochi, Valeria Rayson, Paul Regueira, Xosé Luís Rii, Andriana Rudolf, Michał Ruisi, Manuela Rupnik, Peter Schopper, Daniel Simov, Kiril Sinikallio, Laura Skubic, Jure Tungland, Lars Magne Tuominen, Jouni van Heusden, Ruben Varga, Zsófia Vázquez Abuín, Marta Venturi, Giulia Vidal Miguéns, Adrián Vider, Kadri Vivel Couso, Ainhoa Vladu, Adina Ioana Wissik, Tanja Yrjänäinen, Väinö Zevallos, Rodolfo Fišer, Darja http://hdl.handle.net/11356/1912 2024-06-04T18:45:24Z 2024-06-03T00:00:00Z

Multilingual comparable corpora of parliamentary debates ParlaMint 4.1 Erjavec, Tomaž; Kopp, Matyáš; Ogrodniczuk, Maciej; Osenova, Petya; Agirrezabal, Manex; Agnoloni, Tommaso; Aires, José; Albini, Monica; Alkorta, Jon; Antiba-Cartazo, Iván; Arrieta, Ekain; Barcala, Mario; Bardanca, Daniel; Barkarson, Starkaður; Bartolini, Roberto; Battistoni, Roberto; Bel, Nuria; Bonet Ramos, Maria del Mar; Calzada Pérez, María; Cardoso, Aida; Çöltekin, Çağrı; Coole, Matthew; Darģis, Roberts; de Libano, Ruben; Depoorter, Griet; Diwersy, Sascha; Dodé, Réka; Fernandez, Kike; Fernández Rei, Elisa; Frontini, Francesca; Garcia, Marcos; García Díaz, Noelia; García Louzao, Pedro; Gavriilidou, Maria; Gkoumas, Dimitris; Grigorov, Ilko; Grigorova, Vladislava; Haltrup Hansen, Dorte; Iruskieta, Mikel; Jarlbrink, Johan; Jelencsik-Mátyus, Kinga; Jongejan, Bart; Kahusk, Neeme; Kirnbauer, Martin; Kryvenko, Anna; Ligeti-Nagy, Noémi; Ljubešić, Nikola; Luxardo, Giancarlo; Magariños, Carmen; Magnusson, Måns; Marchetti, Carlo; Marx, Maarten; Meden, Katja; Mendes, Amália; Mochtak, Michal; Mölder, Martin; Montemagni, Simonetta; Navarretta, Costanza; Nitoń, Bartłomiej; Norén, Fredrik Mohammadi; Nwadukwe, Amanda; Ojsteršek, Mihael; Pančur, Andrej; Papavassiliou, Vassilis; Pereira, Rui; Pérez Lago, María; Piperidis, Stelios; Pirker, Hannes; Pisani, Marilina; Pol, Henk van der; Prokopidis, Prokopis; Quochi, Valeria; Rayson, Paul; Regueira, Xosé Luís; Rii, Andriana; Rudolf, Michał; Ruisi, Manuela; Rupnik, Peter; Schopper, Daniel; Simov, Kiril; Sinikallio, Laura; Skubic, Jure; Tungland, Lars Magne; Tuominen, Jouni; van Heusden, Ruben; Varga, Zsófia; Vázquez Abuín, Marta; Venturi, Giulia; Vidal Miguéns, Adrián; Vider, Kadri; Vivel Couso, Ainhoa; Vladu, Adina Ioana; Wissik, Tanja; Yrjänäinen, Väinö; Zevallos, Rodolfo; Fišer, Darja ParlaMint 4.1 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and extending 
to mid-2022. The individual corpora comprise between 9 and 126 million words and the complete set contains over 1.2 billion words. The transcriptions are divided by days with information on the term, session and meeting, and contain speeches marked by the speaker and their role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. The corpora have extensive metadata, most importantly on speakers (name, gender, MP and minister status, party affiliation), on their political parties and parliamentary groups (name, coalition/opposition status, Wikipedia-sourced left-to-right political orientation, and CHES variables, https://www.chesdata.eu/). Note that some corpora have further metadata, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The transcriptions are also marked with the subcorpora they belong to ("reference", until 2020-01-30, "covid", from 2020-01-31, and "war", from 2022-02-24). An overview of the statistics of the corpora is available on GitHub in the folder Build/Metadata, in particular for the release 4.1 at https://github.com/clarin-eric/ParlaMint/tree/v4.1/Build/Metadata. The corpora are encoded according to the ParlaMint encoding guidelines (https://clarin-eric.github.io/ParlaMint/) and schemas (included in the distribution). This entry contains the ParlaMint TEI-encoded corpora and their derived plain text versions along with TSV metadata of the speeches. Also included is the 4.1 release of the sample data and scripts available at the GitHub repository of the ParlaMint project at https://github.com/clarin-eric/ParlaMint. Note that there also exists the linguistically marked-up version of the 4.1 ParlaMint corpus (http://hdl.handle.net/11356/1911) as well as a version machine translated to English (http://hdl.handle.net/11356/1910). 
Both are linked with CLARIN.SI concordancers for on-line analysis. As opposed to the previous version 4.0, this version fixes a number of bugs and restructures the ParlaMint GitHub repository. The DK corpus now also has its speeches marked with topics. The PT corpus has been extended to 2024-03 and the UA corpus to 2023-11, where UA also has improved language marking (uk vs. ru) on segments.

2024-06-03T00:00:00Z Comprehensive Slovenian-Hungarian Dictionary 2.0 Kosem, Iztok Bálint Čeh, Júlia Ponikvar, Primož Zaranšek, Petra Kamenšek, Urška Koša, Peter Gróf, Annamária Böröcz, Nándor Harmat Császár, Jolanda Szíjártó, Imre Šantak, Borut Gantar, Polona Krek, Simon Roblek, Rebeka Zgaga, Karolina Logar, Urban Pori, Eva Arhar Holdt, Špela Gorjanc, Vojko Šešet, Jure Potoczky, Klára Laskowski, Cyprian Bombek, Miha Dragar, Luka http://hdl.handle.net/11356/1946 2024-06-03T18:24:00Z 2024-04-04T00:00:00Z

Comprehensive Slovenian-Hungarian Dictionary 2.0 Kosem, Iztok; Bálint Čeh, Júlia; Ponikvar, Primož; Zaranšek, Petra; Kamenšek, Urška; Koša, Peter; Gróf, Annamária; Böröcz, Nándor; Harmat Császár, Jolanda; Szíjártó, Imre; Šantak, Borut; Gantar, Polona; Krek, Simon; Roblek, Rebeka; Zgaga, Karolina; Logar, Urban; Pori, Eva; Arhar Holdt, Špela; Gorjanc, Vojko; Šešet, Jure; Potoczky, Klára; Laskowski, Cyprian; Bombek, Miha; Dragar, Luka The Comprehensive Slovenian-Hungarian dictionary is a general bilingual dictionary that is being compiled at the Centre for Language Resources and Technologies of the University of Ljubljana (CJVT UL). Version 2.0 contains 15,362 headwords, 61,190 translations, 28,748 collocations and other word combinations, and 7,741 examples. The file also contains links between synonymous entries or entry senses, and links between single-word headwords and compounds/phrases. The Comprehensive Slovenian-Hungarian dictionary is a growing dictionary, which means that new headwords will be added at regular intervals. The Comprehensive Slovenian-Hungarian dictionary is based on a concept (Kosem et al. 2018) that was prepared in the targeted research project KOMASS (the Concept of Hungarian-Slovenian dictionary: from a language resource to its user), funded by the Slovenian Research Agency and the Ministry of Education, Science and Sport of the Republic of Slovenia. The dictionary concept follows state-of-the-art international lexicographic practice, e.g. bilingual dictionaries compiled at established international publishers and institutes. In the second version, nearly 5,000 entries have been added, and some corrections to the old ones were also made. Moreover, additional metadata has been included, e.g. lemma and tags for headwords and collocations, and statistical and syntactic structure information on collocations. The contact person for dictionary-related questions is Iztok Kosem (iztok.kosem@ff.uni-lj.si).

2024-04-04T00:00:00Z Offensive language dataset of French comments FRENK-fr 1.0 Pahor de Maiti Tekavčič, Kristina Ljubešić, Nikola Fišer, Darja http://hdl.handle.net/11356/1947 2024-05-28T12:02:54Z 2024-05-27T00:00:00Z

Offensive language dataset of French comments FRENK-fr 1.0 Pahor de Maiti Tekavčič, Kristina; Ljubešić, Nikola; Fišer, Darja The FRENK-fr dataset contains French socially unacceptable and acceptable comments posted in response to news articles that cover the topics of LGBT and migrants, and which were posted on Facebook by prominent French media outlets (20 minutes, Le Figaro and Le Monde). The original thread order of comments based on the time of publishing is preserved in the dataset. These comments were manually annotated for the type and target of socially unacceptable comments. The creation process, including data collection, filtering, annotation schema and annotation procedure, was adopted from the FRENK 1.1 dataset (http://hdl.handle.net/11356/1462), which makes FRENK-fr fully comparable to the datasets of Croatian, English and Slovenian comments included in FRENK 1.1. Apart from manual annotation of the type and target of socially unacceptable discourse, the comments are accompanied by metadata, namely the topic of the news item (LGBT or migrants) that triggered the comment, the news item itself and the media outlet authoring it, an anonymised user ID, and information about the reply level in the thread. The dataset consists of 10,239 Facebook comments posted under 66 news items. It includes 3,071 comments that were labelled as socially unacceptable, and 7,168 that were labelled as socially acceptable.

2024-05-27T00:00:00Z