dc.contributor.author Krsnik, Luka
dc.contributor.author Dobrovoljc, Kaja
dc.contributor.author Terčon, Luka
dc.date.accessioned 2024-12-06T13:47:53Z
dc.date.available 2024-12-06T13:47:53Z
dc.date.issued 2024-12-06
dc.identifier.uri http://hdl.handle.net/11356/1996
dc.description This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the SST treebank of spoken Slovenian (UD v2.15, https://github.com/UniversalDependencies/UD_Slovenian-SST/tree/r2.15), which features transcriptions of spontaneous speech in various everyday settings. The model predicts sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tags, morphological features, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). Please note that this model has been published for archiving purposes only. For production use, we recommend the state-of-the-art Trankit model available here: http://hdl.handle.net/11356/1965 (v1.2 or newer). That model was trained on both spoken (SST) and written (SSJ) data and performs significantly better than the model featured in this submission. Unlike version 1.0, this model was trained on the new train-dev-test split of the SST treebank introduced in release UD v2.15.
dc.language.iso slv
dc.publisher Centre for Language Resources and Technologies, University of Ljubljana
dc.relation.isreferencedby https://arxiv.org/pdf/2101.03289.pdf
dc.relation.replaces http://hdl.handle.net/11356/1966
dc.rights Apache License 2.0
dc.rights.uri https://opensource.org/licenses/Apache-2.0
dc.rights.label PUB
dc.source.uri https://github.com/clarinsi/trankit-train
dc.subject language model
dc.subject lemmatisation
dc.subject tokenisation
dc.subject sentence segmentation
dc.subject part-of-speech tagging
dc.subject feature prediction
dc.subject parsing
dc.subject dependency parsing
dc.subject corpus annotation
dc.title Trankit model for SST 2.15 1.1
dc.type toolService
metashare.ResourceInfo#ContentInfo.detailedType tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent true
has.files yes
branding CLARIN.SI data & tools
contact.person Luka Krsnik krsnik.luka92@gmail.com Luka Krsnik
contact.person Kaja Dobrovoljc kaja.dobrovoljc@ff.uni-lj.si Faculty of Arts, University of Ljubljana
sponsor ARRS (Slovenian Research Agency) Z6-4617 Treebank-Driven Approach to the Study of Spoken Slovenian nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
files.count 1
files.size 145552575


 Files in this item

This item is Publicly Available and licensed under: Apache License 2.0
Name tranikt-sl-sst.zip
Size 138.81 MB
Format application/zip
Description Language model
MD5 10117255a4be5b99299c1983ab448ca1
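The MD5 checksum and the byte size listed in this record (files.size 145552575) can be used to confirm that a downloaded copy of the archive is intact. The following is a minimal sketch using only the Python standard library; the function names `md5_of` and `verify_download` are illustrative and not part of any published tool, and the archive is assumed to sit in the current working directory under the file name given above.

```python
import hashlib
from pathlib import Path

# Values copied from this record.
EXPECTED_MD5 = "10117255a4be5b99299c1983ab448ca1"
EXPECTED_SIZE = 145552575  # bytes (files.size)


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5=EXPECTED_MD5, expected_size=EXPECTED_SIZE):
    """Check the file size first (cheap), then the MD5 digest."""
    path = Path(path)
    if path.stat().st_size != expected_size:
        return False
    return md5_of(path) == expected_md5.lower()


if __name__ == "__main__":
    # Hypothetical location: the archive saved next to this script.
    archive = Path("tranikt-sl-sst.zip")
    if archive.exists():
        print("OK" if verify_download(archive) else "MISMATCH")
    else:
        print(f"{archive} not found; download it first.")
```

MD5 is adequate here as an integrity check against transfer errors; it is not a safeguard against deliberate tampering.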
 File Preview  
  • save_dir_sst
    • xlm-roberta-base
      • customized
        • customized_lemmatizer.pt-1 B
        • customized.downloaded-1 B
        • customized.vocabs.json-1 B
        • customized.tokenizer.mdl-1 B
        • customized.tagger.mdl-1 B
    • xlm-roberta-large
      • customized
        • customized_lemmatizer.pt-1 B
        • customized.downloaded-1 B
        • customized.vocabs.json-1 B
        • customized.tokenizer.mdl-1 B
        • customized.tagger.mdl-1 B
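
The directory layout above resembles what Trankit expects of a cache directory holding a customized pipeline (`<cache_dir>/<embedding>/customized/`). The following is a hedged sketch of how the extracted archive might be loaded, not an official usage instruction: it assumes the archive has been unpacked so that `save_dir_sst` sits in the working directory, that `trankit==1.1.1` is installed, and that the `xlm-roberta-large` subdirectory holds the trained weights (the preview does not say which of the two subdirectories does, so the embedding is left as a parameter). The helpers `missing_model_files` and `load_pipeline` are hypothetical names introduced for this example.

```python
from pathlib import Path

# File names taken from the file preview in this record.
REQUIRED_FILES = (
    "customized_lemmatizer.pt",
    "customized.downloaded",
    "customized.vocabs.json",
    "customized.tokenizer.mdl",
    "customized.tagger.mdl",
)


def missing_model_files(cache_dir, embedding="xlm-roberta-large"):
    """List the expected model files that are absent from the extracted archive."""
    model_dir = Path(cache_dir) / embedding / "customized"
    return [name for name in REQUIRED_FILES if not (model_dir / name).is_file()]


def load_pipeline(cache_dir="save_dir_sst", embedding="xlm-roberta-large"):
    """Load the customized Trankit pipeline after checking the files are present."""
    missing = missing_model_files(cache_dir, embedding)
    if missing:
        raise FileNotFoundError(
            f"Missing in {cache_dir}/{embedding}/customized: {', '.join(missing)}"
        )
    # Third-party import kept local so the file check works without Trankit.
    from trankit import Pipeline

    return Pipeline(lang="customized", cache_dir=str(cache_dir), embedding=embedding)


if __name__ == "__main__":
    if Path("save_dir_sst").is_dir():
        nlp = load_pipeline()
        # Illustrative input; the output is assumed to be a dict with a
        # "sentences" list, as described in the Trankit documentation.
        doc = nlp("Dober dan, kako ste?")
        for sentence in doc["sentences"]:
            print([(tok["text"], tok.get("upos")) for tok in sentence["tokens"]])
    else:
        print("Extract the archive so that save_dir_sst/ is in the working directory.")
```

Checking for the files before importing Trankit gives a clearer error than a failed model load if the archive was extracted to a different location.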
