The Trankit model for linguistic processing of standard written Slovenian 1.1

Name: The Trankit model for linguistic processing of standard written Slovenian 1.1
License: https://opensource.org/licenses/Apache-2.0

Krsnik, Luka; Dobrovoljc, Kaja; Terčon, Luka

dc.contributor.author	Krsnik, Luka
dc.contributor.author	Dobrovoljc, Kaja
dc.contributor.author	Terčon, Luka
dc.date.accessioned	2024-08-29T10:21:06Z
dc.date.available	2024-08-29T10:21:06Z
dc.date.issued	2024-08-29
dc.identifier.uri	http://hdl.handle.net/11356/1963
dc.description	This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the reference SSJ UD treebank, which features fiction, non-fiction, periodical and Wikipedia texts in standard modern Slovenian. It predicts sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tags, morphological features, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). The model was trained on the dataset published in Universal Dependencies release 2.14 (https://github.com/UniversalDependencies/UD_Slovenian-SSJ/tree/r2.14). To use this model, follow the instructions in our GitHub repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). The ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base. Compared with the previous version of the model, this version was trained on a newer, slightly improved version of the SSJ UD treebank (UD v2.14) and produces similar results.
dc.language.iso	slv
dc.publisher	Centre for Language Resources and Technologies, University of Ljubljana
dc.relation.isreferencedby	https://arxiv.org/pdf/2101.03289.pdf
dc.relation.replaces	http://hdl.handle.net/11356/1870
dc.rights	Apache License 2.0
dc.rights.uri	https://opensource.org/licenses/Apache-2.0
dc.rights.label	PUB
dc.source.uri	https://github.com/clarinsi/trankit-train
dc.subject	language model
dc.subject	lemmatisation
dc.subject	tokenisation
dc.subject	sentence segmentation
dc.subject	part-of-speech tagging
dc.subject	feature prediction
dc.subject	parsing
dc.subject	dependency parsing
dc.subject	corpus annotation
dc.title	The Trankit model for linguistic processing of standard written Slovenian 1.1
dc.type	toolService
metashare.ResourceInfo#ContentInfo.detailedType	tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent	true
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Luka Krsnik krsnik.luka92@gmail.com Luka Krsnik
contact.person	Kaja Dobrovoljc kaja.dobrovoljc@ff.uni-lj.si Faculty of Arts, University of Ljubljana
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	ARRS (Slovenian Research Agency) Z6-4617 A Treebank-Driven Approach to the Study of Spoken Slovenian nationalFunds
files.count	1
files.size	150298328
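
The dc.description field above points to the Trankit documentation for loading a customized pipeline. A minimal sketch of that step in Python follows, under these assumptions: the ZIP has been extracted to a local directory named save_dir_ssj2.14, the trankit package is installed, the pipeline category is "customized" (inferred from the directory names in the file preview), and trankit.Pipeline accepts the lang, cache_dir and embedding arguments. The helper names pipeline_kwargs, load_pipeline and print_annotations are hypothetical and are not part of Trankit or of the clarinsi/trankit-train repository.

```python
def pipeline_kwargs(cache_dir, use_large=False):
    """Build keyword arguments for trankit.Pipeline.

    Assumption: the category is "customized" and the embedding name
    matches one of the two sub-directories shipped in the ZIP file.
    """
    return {
        "lang": "customized",
        "cache_dir": str(cache_dir),
        "embedding": "xlm-roberta-large" if use_large else "xlm-roberta-base",
    }


def load_pipeline(cache_dir, use_large=False):
    """Load the Slovenian pipeline from the extracted model directory."""
    # Imported lazily so that this module can be imported without trankit.
    from trankit import Pipeline

    return Pipeline(**pipeline_kwargs(cache_dir, use_large))


def print_annotations(nlp, text):
    """Run the full pipeline on raw text and print one line per token."""
    doc = nlp(text)
    for sentence in doc["sentences"]:
        for token in sentence["tokens"]:
            # .get() is used because not every token type carries every field.
            print(
                token.get("text"),
                token.get("lemma"),
                token.get("upos"),
                token.get("xpos"),
                token.get("head"),
                token.get("deprel"),
                sep="\t",
            )
```

A typical call would be print_annotations(load_pipeline("save_dir_ssj2.14"), "France Prešeren je bil slovenski pesnik."). Passing use_large=True selects the xlm-roberta-large model, which the description says performs better but needs more hardware resources.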

Files in this item

This item is Publicly Available and licensed under: Apache License 2.0

Name: save_dir_ssj2.14.zip
Size: 143.34 MB
Format: application/zip
Description: Language model
MD5: d125a3f46187e533f45f68d9b8177812
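
The MD5 value listed above can be used to check that the download is intact. The following is a minimal sketch using only the Python standard library; the function names md5_of_file and verify_download are hypothetical, and the local file path is whatever location the ZIP was saved to.

```python
import hashlib

# MD5 checksum as listed in this record for save_dir_ssj2.14.zip.
EXPECTED_MD5 = "d125a3f46187e533f45f68d9b8177812"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so the whole archive is never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

Calling verify_download("save_dir_ssj2.14.zip") should return True for an uncorrupted copy. MD5 is suitable here as an integrity check against transfer errors, not as protection against deliberate tampering.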

File Preview

save_dir_ssj2.14
- xlm-roberta-base
  - customized
    - customized_lemmatizer.pt-1 B
    - customized.downloaded-1 B
    - customized.vocabs.json-1 B
    - customized.tokenizer.mdl-1 B
    - customized.tagger.mdl-1 B
- xlm-roberta-large
  - customized
    - customized_lemmatizer.pt-1 B
    - customized.downloaded-1 B
    - customized.vocabs.json-1 B
    - customized.tokenizer.mdl-1 B
    - customized.tagger.mdl-1 B
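
After extraction, the directory layout shown in the preview can be checked before loading the model. The sketch below assumes that the trailing size annotations in the preview are not part of the file names and that the archive extracts to a root directory containing the two embedding sub-directories. The function name missing_files is hypothetical.

```python
from pathlib import Path

# Sub-directories and file names as listed in the file preview.
EMBEDDINGS = ("xlm-roberta-base", "xlm-roberta-large")
MODEL_FILES = (
    "customized_lemmatizer.pt",
    "customized.downloaded",
    "customized.vocabs.json",
    "customized.tokenizer.mdl",
    "customized.tagger.mdl",
)


def missing_files(root):
    """Return relative paths of expected model files not found under root."""
    root = Path(root)
    return [
        (Path(embedding) / "customized" / name).as_posix()
        for embedding in EMBEDDINGS
        for name in MODEL_FILES
        if not (root / embedding / "customized" / name).is_file()
    ]
```

An empty list from missing_files("save_dir_ssj2.14") indicates that both model variants are present. A non-empty list names the files to look for, which usually points to an incomplete extraction.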