The Trankit model for linguistic processing of written and spoken Slovenian 1.2

Name: The Trankit model for linguistic processing of written and spoken Slovenian 1.2
License: https://opensource.org/licenses/Apache-2.0

Krsnik, Luka; Dobrovoljc, Kaja; Terčon, Luka

dc.contributor.author	Krsnik, Luka
dc.contributor.author	Dobrovoljc, Kaja
dc.contributor.author	Terčon, Luka
dc.date.accessioned	2024-12-06T13:47:25Z
dc.date.available	2024-12-06T13:47:25Z
dc.date.issued	2024-12-06
dc.identifier.uri	http://hdl.handle.net/11356/1997
dc.description	This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the concatenation of the SSJ UD treebank of written Slovenian (featuring fiction, non-fiction, periodicals and Wikipedia texts) and the SST UD treebank of spoken Slovenian (featuring transcriptions of spontaneous speech in various settings). It predicts sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tags, morphological features, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). In comparison to its counterpart models trained on the SSJ (http://hdl.handle.net/11356/1963) or SST datasets only, this model yields significantly better performance on spoken transcripts and identical state-of-the-art performance on written texts. The model can therefore be recommended as the default, 'universal' Trankit model for processing Slovenian, regardless of the data type. To use this model, please follow the instructions provided in our GitHub repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). This ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base. In comparison to the previous version, this version was trained on a newer, slightly improved version of the SSJ UD treebank (UD v2.14, https://github.com/UniversalDependencies/UD_Slovenian-SSJ/tree/r2.14) and a substantially extended and improved version of the SST UD treebank (UD v2.15, https://github.com/UniversalDependencies/UD_Slovenian-SST/tree/r2.15), thus producing significantly better results for spoken data.
In contrast to the previous versions of this model (1.0, 1.1), version 1.2 was trained on the new SST train-dev-test split introduced in UD v2.15.
dc.language.iso	slv
dc.publisher	Centre for Language Resources and Technologies, University of Ljubljana
dc.relation.isreferencedby	https://arxiv.org/pdf/2101.03289.pdf
dc.relation.replaces	http://hdl.handle.net/11356/1965
dc.rights	Apache License 2.0
dc.rights.uri	https://opensource.org/licenses/Apache-2.0
dc.rights.label	PUB
dc.source.uri	https://github.com/clarinsi/trankit-train
dc.subject	language model
dc.subject	lemmatisation
dc.subject	tokenisation
dc.subject	sentence segmentation
dc.subject	part-of-speech tagging
dc.subject	feature prediction
dc.subject	parsing
dc.subject	dependency parsing
dc.subject	corpus annotation
dc.title	The Trankit model for linguistic processing of written and spoken Slovenian 1.2
dc.type	toolService
metashare.ResourceInfo#ContentInfo.detailedType	tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent	true
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Luka Krsnik krsnik.luka92@gmail.com Luka Krsnik
contact.person	Kaja Dobrovoljc kaja.dobrovoljc@ff.uni-lj.si Faculty of Arts, University of Ljubljana
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	ARRS (Slovenian Research Agency) Z6-4617 Treebank-Driven Approach to the Study of Spoken Slovenian nationalFunds
files.count	1
files.size	152575109

Files in this item

This item is publicly available and licensed under: Apache License 2.0

Name: trankit-sl-ssj+sst.zip
Size: 145.51 MB
Format: application/zip
Description: Language Model
MD5: 0ddfac8d7445f8fa300f59dde1a00352
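
The size and MD5 checksum listed for trankit-sl-ssj+sst.zip can be used to check a download before unpacking it. The following is a minimal Python sketch using only the standard library. The function names and the local file path are illustrative assumptions and are not part of the repository's own tooling.

```python
import hashlib
import os

# Values copied from this record (files.size and MD5).
EXPECTED_SIZE_BYTES = 152575109
EXPECTED_MD5 = "0ddfac8d7445f8fa300f59dde1a00352"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_size=EXPECTED_SIZE_BYTES, expected_md5=EXPECTED_MD5):
    """Return True only if both the byte size and the MD5 digest match."""
    if os.path.getsize(path) != expected_size:
        return False
    return md5_of_file(path) == expected_md5


def size_in_mib(size_bytes):
    """Convert a byte count to mebibytes, rounded to two decimals."""
    return round(size_bytes / 2**20, 2)


# Hypothetical usage, assuming the ZIP was saved to the current directory:
# verify_download("trankit-sl-ssj+sst.zip")
```

The "145.51 MB" shown in the record corresponds to the byte count divided by 2^20, which is what `size_in_mib` computes.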

File Preview

save_dir_ssj+sst
- xlm-roberta-base
  - customized
    - customized_lemmatizer.pt
    - customized.downloaded
    - customized.vocabs.json
    - customized.tokenizer.mdl
    - customized.tagger.mdl
- xlm-roberta-large
  - customized
    - customized_lemmatizer.pt
    - customized.downloaded
    - customized.vocabs.json
    - customized.tokenizer.mdl
    - customized.tagger.mdl
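
Given the directory layout above, loading one of the two models might look like the following Python sketch. It assumes that Trankit v1.1.1 is installed and that the ZIP has been extracted into the current working directory. `model_dir` and `load_pipeline` are hypothetical helper names introduced for illustration. The `Pipeline` call with `lang="customized"`, `cache_dir` and `embedding` follows the Trankit documentation for customized pipelines. The authoritative instructions remain those in the clarinsi/trankit-train repository.

```python
from pathlib import Path

# Embedding names taken from the two top-level folders in the archive.
EMBEDDINGS = ("xlm-roberta-base", "xlm-roberta-large")


def model_dir(save_dir, embedding):
    """Return the folder expected to hold the 'customized' model files."""
    if embedding not in EMBEDDINGS:
        raise ValueError(f"embedding must be one of {EMBEDDINGS}, got {embedding!r}")
    return Path(save_dir) / embedding / "customized"


def load_pipeline(save_dir="save_dir_ssj+sst", embedding="xlm-roberta-large", gpu=False):
    """Load the retrained Slovenian model as a customized Trankit pipeline.

    Assumption: the archive was extracted so that `save_dir` contains the
    xlm-roberta-base and xlm-roberta-large folders shown in the file preview.
    """
    folder = model_dir(save_dir, embedding)
    if not folder.is_dir():
        raise FileNotFoundError(f"model folder not found: {folder}")
    # Imported lazily so the path helper works without Trankit installed.
    from trankit import Pipeline
    return Pipeline(lang="customized", cache_dir=save_dir, embedding=embedding, gpu=gpu)


# Hypothetical usage (requires Trankit and the extracted model files):
# nlp = load_pipeline(embedding="xlm-roberta-base")
# doc = nlp("Danes je lep dan.")
```

Choosing `xlm-roberta-base` here trades some accuracy for lower hardware requirements, as noted in the description.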
