dc.contributor.author Knez, Timotej
dc.contributor.author Prezelj, Tim
dc.contributor.author Žitnik, Slavko
dc.date.accessioned 2023-11-12T13:55:37Z
dc.date.available 2023-11-12T13:55:37Z
dc.date.issued 2023-11-11
dc.identifier.uri http://hdl.handle.net/11356/1895
dc.description The SemSex corpus is designed to facilitate the automated recognition of sexual education concepts within curriculum description documents. The corpus contains two components: PDF documents detailing Slovene school subjects and a structured JSON file named curriculums.json.

The first part of the corpus contains the PDF documents describing various school subjects. Annotations within these documents mark specific phrases that pertain to one or more sexual education concepts and aid in extracting the relevant information.

The second part of the corpus, curriculums.json, presents the text extracted from the PDF documents in JSON format, together with annotations for the sections that describe sexual education concepts. Each annotation in curriculums.json comprises a list of concepts that the annotated description references. The documents were annotated manually using the concepts in the SemSex ontology. The ontology is included in the submission as an additional file in the Turtle format and is described in detail at https://github.com/TimotejK/SemSex.

The JSON structure of the corpus is as follows:
- At the first level, individual objects represent each PDF document.
- 'pages': a list of the pages of each document.
  - 'content': a string containing all the text extracted from the corresponding PDF page.
  - 'page_number': the original page number from the PDF document.
  - 'annotations': a list of the annotations on the page.
    - 'start_index': the starting index of the annotated phrase within the page content.
    - 'end_index': the ending index of the annotated phrase within the page content.
    - 'labels': the list of concepts to which the annotated phrase refers.

The models that have been trained on this corpus are available at http://hdl.handle.net/11356/1894.
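The structure described above can be walked with a few lines of Python. The following is a minimal sketch, not code shipped with the corpus: the function names are hypothetical, the top level is assumed to be either a list of document objects or a dict keyed by document name (the description does not say which), and 'end_index' is assumed to be exclusive, as in a Python slice. Both assumptions should be checked against the actual file.

```python
import json
from typing import Any, Dict, Iterator, Tuple


def iter_documents(corpus: Any) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (document id, document object) pairs.

    Assumption: the top level is either a dict keyed by document name
    or a list of document objects; the record does not specify which.
    """
    if isinstance(corpus, dict):
        for name, doc in corpus.items():
            yield str(name), doc
    else:
        for position, doc in enumerate(corpus):
            yield str(position), doc


def iter_annotations(corpus: Any) -> Iterator[Dict[str, Any]]:
    """Yield one record per annotation, with the annotated phrase resolved."""
    for doc_id, doc in iter_documents(corpus):
        for page in doc.get("pages", []):
            text = page.get("content", "")
            for ann in page.get("annotations", []):
                start, end = ann["start_index"], ann["end_index"]
                yield {
                    "document": doc_id,
                    "page_number": page.get("page_number"),
                    # Assumption: end_index is exclusive (Python slice style).
                    "phrase": text[start:end],
                    "labels": ann.get("labels", []),
                }


def load_corpus(path: str) -> Any:
    """Read the JSON file; the corpus text is Slovene, so decode as UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `for row in iter_annotations(load_corpus("curriculums.json")): print(row)` would list each annotated phrase with its page number and concept labels. If the printed phrases appear cut off by one character, 'end_index' is probably inclusive and the slice should be `text[start:end + 1]`.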
dc.language.iso slv
dc.publisher CLARIN.SI
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.source.uri https://github.com/TimotejK/SemSex
dc.subject education
dc.subject natural language processing
dc.subject sex ed
dc.subject knowledge extraction
dc.title Corpus for identifying sex education concepts SemSex 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Timotej Knez timotej.knez@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info 8.04 MB
files.count 2
files.size 8630042


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name SemSex_corpus.zip
Size 8.04 MB
Format application/zip
Description SemSex corpus
MD5 424bbc4b8b42b4f665af11e234ca0758
Contents of SemSex_corpus.zip:
  • SemSex_corpus
    • Curriculum_pdf
      • 3 UN_fizika.pdf (355 kB)
      • 8 UN_naravoslovje_in_tehnika.pdf (377 kB)
      • 9 UN_likovna_vzgoja.pdf (513 kB)
      • 5 UN_geografija.pdf (384 kB)
      • 14 UN_DDE_OS.pdf (336 kB)
      • 12 UN_tehnika_tehnologija.pdf (347 kB)
      • 17 UN_sportna_vzgoja.pdf (535 kB)
      • 13 UN_gospodinjstvo.pdf (348 kB)
      • 6 UN_matematika.pdf (618 kB)
      • 11 UN_Biologija.pdf (493 kB)
      • 7 UN_slovenscina.pdf (1 MB)
      • 1 UN_spoznavanje_okolja_pop-2.pdf (600 kB)
      • 4 UN_kemija.pdf (335 kB)
      • 2 UN_zgodovina.pdf (473 kB)
      • 10 UN_glasbena_vzgoja.pdf (373 kB)
      • 16 UN_naravoslovje.pdf (898 kB)
      • 15 UN_druzba_OS.pdf (305 kB)
    • curriculums.json (2 MB)
Name SemSEX.ttl
Size 191.22 KB
Format Unknown
Description SemSex ontology
MD5 86af6728344f1434d537a9aacfabc22c
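The MD5 values listed for the two files can be used to check that a download is intact. The following is a minimal sketch using Python's standard hashlib module; the helper names and the local file paths are hypothetical, and the expected digests are copied from this record.

```python
import hashlib

# MD5 digests as listed in this record.
EXPECTED_MD5 = {
    "SemSex_corpus.zip": "424bbc4b8b42b4f665af11e234ca0758",
    "SemSEX.ttl": "86af6728344f1434d537a9aacfabc22c",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path: str, name: str) -> bool:
    """Compare a local file against the digest listed for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, `matches_record("SemSex_corpus.zip", "SemSex_corpus.zip")` returns True only if the local copy has the digest listed above. MD5 is suitable here for detecting corrupted downloads, not as protection against deliberate tampering.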