| dc.contributor.author | Čibej, Jaka |
| dc.contributor.author | Arhar Holdt, Špela |
| dc.contributor.author | Dobrovoljc, Kaja |
| dc.contributor.author | Krek, Simon |
| dc.date.accessioned | 2020-11-02T12:35:03Z |
| dc.date.available | 2020-11-02T12:35:03Z |
| dc.date.issued | 2020-10-28 |
| dc.identifier.uri | http://hdl.handle.net/11356/1363 |
| dc.description | Frequency lists of character-level n-grams were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain character n-grams of length 1 to 5 occurring in the corpus, along with their absolute and relative frequencies, percentages, and distribution across the text types included in the corpus taxonomy. Character-level n-grams were extracted from lemmas (5 files), lower-case word forms (5 files), and standardized word forms (5 files). Compared to the previous version (http://hdl.handle.net/11356/1268), this one fixes several typos and replaces all instances of "normalized forms" with the more appropriate term "standardized forms" (as used in the SSJ project). |
| dc.language.iso | slv |
| dc.publisher | Centre for Language Resources and Technologies, University of Ljubljana |
| dc.publisher | Jožef Stefan Institute |
| dc.relation.replaces | http://hdl.handle.net/11356/1268 |
| dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
| dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
| dc.rights.label | PUB |
| dc.source.uri | http://slovnica.ijs.si/ |
| dc.subject | spoken corpus |
| dc.subject | frequency list |
| dc.subject | n-grams |
| dc.subject | characters |
| dc.title | Frequency lists of character-level n-grams from the GOS 1.0 corpus 1.1 |
| dc.type | lexicalConceptualResource |
| metashare.ResourceInfo#ContentInfo.detailedType | wordList |
| metashare.ResourceInfo#ContentInfo.mediaType | text |
| has.files | yes |
| branding | CLARIN.SI data & tools |
| contact.person | Jaka Čibej jaka.cibej@cjvt.si Centre for Language Resources and Technologies, University of Ljubljana |
| sponsor | ARRS (Slovenian Research Agency) J6-8256 New grammar of contemporary standard Slovene: sources and methods nationalFunds |
| sponsor | Ministry of Education, Science and Sport 3311-08-986003 Communication in Slovene Other |
| size.info | 15 files |
| files.count | 1 |
| files.size | 2686389 |
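
The description above mentions absolute and relative frequencies of character-level n-grams. As a rough illustration of what such a list contains, the following minimal sketch counts character n-grams within individual tokens. It is not the LIST tool's implementation: the function names and the sample tokens are hypothetical, and relative frequency is assumed here to be the share of all extracted n-grams, which may differ from the definition used in the published files.

```python
from collections import Counter


def char_ngrams(token, n):
    """Return all contiguous character n-grams of a single token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def ngram_frequencies(tokens, n):
    """Map each n-gram to (absolute count, share of all n-grams).

    Assumption: n-grams do not cross token boundaries and tokens are
    lower-cased first, mirroring the "lower-case word forms" variant.
    """
    counts = Counter()
    for token in tokens:
        counts.update(char_ngrams(token.lower(), n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: (count, count / total) for gram, count in counts.items()}


# Hypothetical sample tokens, not taken from the GOS corpus.
sample_tokens = ["in", "ni", "in"]
bigram_freqs = ngram_frequencies(sample_tokens, 2)
for gram, (absolute, relative) in sorted(bigram_freqs.items()):
    print(gram, absolute, round(relative, 3))
```

Tokens shorter than `n` simply contribute no n-grams, so the same function can be reused for every length from 1 to 5.
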
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
- Name: GOS1.0-characters.zip
- Size: 2.56 MB
- Format: application/zip
- Description: Frequency lists of character-level n-grams from GOS1.0
- MD5: 37c3c093d4c8582eb6505c5ce06ab3b8
- GOS1.0-characters-lemmas
  - GOS1.0-characters-lemmas-5grams-taxonomy-entire.tsv (4 MB)
  - GOS1.0-characters-lemmas-2grams-taxonomy-entire.tsv (145 kB)
  - GOS1.0-characters-lemmas-4grams-taxonomy-entire.tsv (2 MB)
  - GOS1.0-characters-lemmas-1grams-taxonomy-entire.tsv (12 kB)
  - GOS1.0-characters-lemmas-3grams-taxonomy-entire.tsv (890 kB)
- GOS1.0-characters-lowercase_forms
  - GOS1.0-characters-lowercase_forms-2grams-taxonomy-entire.tsv (77 kB)
  - GOS1.0-characters-lowercase_forms-4grams-taxonomy-entire.tsv (3 MB)
  - GOS1.0-characters-lowercase_forms-1grams-taxonomy-entire.tsv (7 kB)
  - GOS1.0-characters-lowercase_forms-3grams-taxonomy-entire.tsv (753 kB)
  - GOS1.0-characters-lowercase_forms-5grams-taxonomy-entire.tsv (6 MB)
- GOS1.0-characters-standardized_forms
  - GOS1.0-characters-standardized_forms-2grams-taxonomy-entire.tsv (141 kB)
  - GOS1.0-characters-standardized_forms-4grams-taxonomy-entire.tsv (3 MB)
  - GOS1.0-characters-standardized_forms-1grams-taxonomy-entire.tsv (11 kB)
  - GOS1.0-characters-standardized_forms-3grams-taxonomy-entire.tsv (889 kB)
  - GOS1.0-characters-standardized_forms-5grams-taxonomy-entire.tsv (5 MB)
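
The record lists an MD5 checksum for GOS1.0-characters.zip. A minimal sketch of how a downloaded copy could be checked against it is shown below. The helper names and the local file path in the usage comment are hypothetical, and MD5 is used only because it is the checksum the record provides.

```python
import hashlib

# Checksum as listed in this record for GOS1.0-characters.zip.
EXPECTED_MD5 = "37c3c093d4c8582eb6505c5ce06ab3b8"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches the expected digest."""
    return md5_of_file(path) == expected.lower()


# Usage, assuming the archive was saved to the current directory:
#     verify_archive("GOS1.0-characters.zip")
```

Reading in chunks keeps memory use constant regardless of archive size.
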