dc.contributor.author Erjavec, Tomaž
dc.contributor.author Fišer, Darja
dc.contributor.author Ljubešić, Nikola
dc.contributor.author Ferme, Marko
dc.contributor.author Borovič, Mladen
dc.contributor.author Boškovič, Borko
dc.contributor.author Ojsteršek, Milan
dc.contributor.author Hrovat, Goran
dc.date.accessioned 2021-03-30T12:40:31Z
dc.date.available 2021-03-30T12:40:31Z
dc.date.issued 2021-03-31
dc.identifier.uri http://hdl.handle.net/11356/1420
dc.description The KAS-abs corpus contains 108,254 automatically identified Slovenian and/or English abstracts (30 million words) from 62,000 BSc/BA, MSc/MA, and PhD theses included in the KAS Corpus of Academic Slovene. This corpus is made available because the public version of KAS (http://hdl.handle.net/11356/1244) does not contain the front matter, and hence lacks the abstracts. The abstracts were identified on a per-page basis and are either in Slovenian (*-abs-sl.txt, 47,273 files), in English (*-abs-en.txt, 49,261 files) or, where the abstracts in both languages appeared on the same page, in both languages (*-abs-slen.txt, 11,720 files). The files contain the plain text of the abstracts, one paragraph per line. Note that because the cleaning of the source PDF files and the identification of the abstracts were done automatically, this corpus contains various types of errors. The files are stored in the same manner as in the complete KAS corpus, i.e. in 1,000 directories with the same filename prefix as in KAS. The file with the metadata for the corpus texts is also included. The abstracts can be useful for research in e.g. machine translation and terminology extraction and, together with the full texts from the KAS corpus, for studies in automatic summarisation.
dc.language.iso slv
dc.language.iso eng
dc.publisher Jožef Stefan Institute
dc.publisher Faculty of Electrical Engineering and Computer Science, University of Maribor
dc.relation.isreferencedby https://rdcu.be/b7GrB
dc.relation.isreplacedby http://hdl.handle.net/11356/1449
dc.rights CLARIN.SI Licence ACA ID-BY-NC-INF-NORED 1.0
dc.rights.uri https://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0
dc.rights.label ACA
dc.source.uri http://nl.ijs.si/kas/
dc.subject abstracts
dc.subject PhD theses
dc.subject MSc/MA theses
dc.subject BSc/BA theses
dc.subject academic writing
dc.title Abstracts from the KAS corpus KAS-Abs 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
hidden hidden
has.files yes
branding CLARIN.SI data & tools
contact.person Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute
sponsor ARRS (Slovenian Research Agency) J6-7094 Slovene scientific texts: resources and description nationalFunds
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info 108254 texts
size.info 30728838 words
files.count 1
files.size 187689506
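
The file naming scheme given in dc.description can be sketched in code as follows. This is a minimal sketch, not part of the corpus distribution: it assumes the archive has been unpacked to a local directory, that the files are UTF-8 encoded, and that the three suffixes are exactly as listed in the description. The function names and the language labels ("sl", "en", "sl+en") are hypothetical and chosen only for illustration.

```python
from collections import Counter
from pathlib import Path

# Suffixes as listed in dc.description; the labels on the right are our own.
SUFFIX_TO_LANG = {
    "-abs-sl.txt": "sl",       # Slovenian abstract
    "-abs-en.txt": "en",       # English abstract
    "-abs-slen.txt": "sl+en",  # both languages on the same page
}


def abstract_language(filename):
    """Return the language label for an abstract file name, or None."""
    for suffix, lang in SUFFIX_TO_LANG.items():
        if filename.endswith(suffix):
            return lang
    return None


def read_paragraphs(path):
    """Read one abstract file (one paragraph per line), skipping blank lines.

    UTF-8 is an assumption; the record does not state the encoding.
    """
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def count_by_language(root):
    """Count abstract files per language label under an unpacked directory."""
    counts = Counter()
    for path in Path(root).rglob("*-abs-*.txt"):
        lang = abstract_language(path.name)
        if lang is not None:
            counts[lang] += 1
    return counts
```

If the unpacked corpus matches the description, count_by_language should report 47,273 Slovenian, 49,261 English and 11,720 bilingual files, which add up to the 108,254 texts given in size.info.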


 Files in this item

This item is Academic Use and licensed under: CLARIN.SI Licence ACA ID-BY-NC-INF-NORED 1.0
Conditions: Inform Before Use, Attribution Required, Noncommercial
Name kas-abs.txt.tgz
Size 178.99 MB
Format Unknown
Description Corpus in plain text format
MD5 fff250557ec4bc25c3f8cac378296a14
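
The MD5 value listed for the file can be used to check a download before unpacking it. The following is a minimal sketch using Python's standard hashlib module; the function names and the local path passed to them are hypothetical, and the expected digest is copied from this record.

```python
import hashlib

# MD5 digest as listed in this record for kas-abs.txt.tgz.
EXPECTED_MD5 = "fff250557ec4bc25c3f8cac378296a14"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

Calling verify_download("kas-abs.txt.tgz") on the downloaded archive should return True if the file arrived intact. MD5 guards against transfer corruption only; it is not a safeguard against deliberate tampering.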
