Trankit model for linguistic processing of spoken Slovenian

Name: Trankit model for linguistic processing of spoken Slovenian
License: https://opensource.org/licenses/Apache-2.0

Krsnik, Luka; Dobrovoljc, Kaja

dc.contributor.author	Krsnik, Luka
dc.contributor.author	Dobrovoljc, Kaja
dc.date.accessioned	2024-01-18T17:52:16Z
dc.date.available	2024-01-18T17:52:16Z
dc.date.issued	2024-01-17
dc.identifier.uri	http://hdl.handle.net/11356/1909
dc.description	This is a retrained Slovenian spoken-language model for the Trankit v1.1.1 library (https://pypi.org/project/trankit/). It predicts sentence segmentation, tokenization, lemmatization, and language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tags, morphological features, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). The model was trained on a combination of two treebanks published in Universal Dependencies release 2.12: the spoken SST treebank (https://github.com/UniversalDependencies/UD_Slovenian-SST/tree/r2.12) and the written SSJ treebank (https://github.com/UniversalDependencies/UD_Slovenian-SSJ/tree/r2.12). Evaluated on the spoken SST test set, it achieves F1 scores of 97.78 for lemmas, 97.19 for UPOS, 95.05 for XPOS, and 81.26 for LAS, a significant improvement over the counterpart model trained on written SSJ data only (http://hdl.handle.net/11356/1870). To use this model, follow the instructions in our GitHub repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). The ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base.
dc.language.iso	slv
dc.publisher	Centre for Language Resources and Technologies, University of Ljubljana
dc.relation.isreferencedby	https://arxiv.org/pdf/2101.03289.pdf
dc.relation.isreplacedby	http://hdl.handle.net/11356/1965
dc.rights	Apache License 2.0
dc.rights.uri	https://opensource.org/licenses/Apache-2.0
dc.rights.label	PUB
dc.source.uri	https://github.com/clarinsi/trankit-train
dc.subject	language model
dc.subject	lemmatisation
dc.subject	tokenisation
dc.subject	sentence segmentation
dc.subject	part-of-speech tagging
dc.subject	feature prediction
dc.subject	parsing
dc.title	Trankit model for linguistic processing of spoken Slovenian
dc.type	toolService
metashare.ResourceInfo#ContentInfo.detailedType	tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent	true
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Luka Krsnik krsnik.luka92@gmail.com Luka Krsnik
contact.person	Kaja Dobrovoljc kaja.dobrovoljc@ff.uni-lj.si Faculty of Arts, University of Ljubljana
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	ARRS (Slovenian Research Agency) Z6-4617 Treebank-Driven Approach to the Study of Spoken Slovenian nationalFunds
files.count	1
files.size	152176350
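
The dc.description field above refers readers to the trankit-train repository and the Trankit documentation for loading instructions. As a rough illustration only, the sketch below shows how a customized Trankit pipeline is typically loaded. It assumes that the ZIP was extracted into a local directory named save_dir_ssj+sst and that Trankit 1.1.1 is installed; the helper names pipeline_kwargs and load_pipeline are hypothetical and are not part of Trankit or of this record. The authors' own instructions take precedence if they differ.

```python
from pathlib import Path

# Assumption: the ZIP was extracted here, so this directory contains the
# xlm-roberta-base/ and xlm-roberta-large/ subdirectories.
SAVE_DIR = Path("save_dir_ssj+sst")

SUPPORTED_EMBEDDINGS = ("xlm-roberta-base", "xlm-roberta-large")


def pipeline_kwargs(save_dir, embedding="xlm-roberta-base"):
    """Build keyword arguments for trankit.Pipeline (hypothetical helper)."""
    if embedding not in SUPPORTED_EMBEDDINGS:
        raise ValueError(f"unsupported embedding: {embedding}")
    return {
        "lang": "customized",        # category name used by the files in the ZIP
        "cache_dir": str(save_dir),  # directory holding the extracted models
        "embedding": embedding,      # selects the base or large model
    }


def load_pipeline(save_dir=SAVE_DIR, embedding="xlm-roberta-base"):
    """Load the pipeline; Trankit is imported lazily, only when this is called."""
    from trankit import Pipeline

    return Pipeline(**pipeline_kwargs(save_dir, embedding))
```

Calling load_pipeline(embedding="xlm-roberta-large") and passing raw text to the returned object should, under these assumptions, produce a document-level annotation with sentences, tokens, lemmas, tags, and dependency relations.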

Files in this item

This item is publicly available and licensed under: Apache License 2.0

Name: save_dir_ssj+sst.zip
Size: 145.13 MB
Format: application/zip
Description: Language model
MD5: 0cae56ef990e02ff265b3a94ae3747b4
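
The MD5 value listed above can be used to confirm that the download is intact. A minimal sketch using only the Python standard library follows; the function names md5_of_file and matches_expected are illustrative, and the local file path is whatever the ZIP was saved as.

```python
import hashlib

# Checksum copied from the record above.
EXPECTED_MD5 = "0cae56ef990e02ff265b3a94ae3747b4"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Compare a file's MD5 digest with the published checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, matches_expected("save_dir_ssj+sst.zip") returns True only if the local copy has the published checksum.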

File Preview

save_dir_ssj+sst
- xlm-roberta-base
  - customized
    - customized_lemmatizer.pt
    - customized.downloaded
    - customized.vocabs.json
    - customized.tokenizer.mdl
    - customized.tagger.mdl
- xlm-roberta-large
  - customized
    - customized_lemmatizer.pt
    - customized.downloaded
    - customized.vocabs.json
    - customized.tokenizer.mdl
    - customized.tagger.mdl
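
Before extracting, it may help to check that the archive contains the files shown in the preview above. The sketch below does this with the standard zipfile module. It assumes that member paths inside the ZIP begin with the top-level folder save_dir_ssj+sst, as the preview suggests; the names expected_members and missing_members are illustrative.

```python
import zipfile

EMBEDDINGS = ("xlm-roberta-base", "xlm-roberta-large")
MODEL_FILES = (
    "customized_lemmatizer.pt",
    "customized.downloaded",
    "customized.vocabs.json",
    "customized.tokenizer.mdl",
    "customized.tagger.mdl",
)


def expected_members(root="save_dir_ssj+sst"):
    """List the archive paths implied by the file preview."""
    return [
        f"{root}/{embedding}/customized/{name}"
        for embedding in EMBEDDINGS
        for name in MODEL_FILES
    ]


def missing_members(zip_path, root="save_dir_ssj+sst"):
    """Return the expected paths that are absent from the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        present = set(archive.namelist())
    return [member for member in expected_members(root) if member not in present]
```

An empty list from missing_members("save_dir_ssj+sst.zip") indicates that both the base and large models are present under the assumed layout.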