dc.contributor.author | Knez, Timotej |
dc.contributor.author | Prezelj, Tim |
dc.contributor.author | Žitnik, Slavko |
dc.date.accessioned | 2023-11-12T13:55:37Z |
dc.date.available | 2023-11-12T13:55:37Z |
dc.date.issued | 2023-11-11 |
dc.identifier.uri | http://hdl.handle.net/11356/1895 |
dc.description | The SemSex corpus is designed to facilitate the automated recognition of sexual education concepts within curriculum description documents. The corpus contains two components: PDF documents detailing Slovene school subjects and a structured JSON file named curriculums.json.

The first part of the corpus contains the PDF documents describing various school subjects. Annotations within these documents mark specific phrases that pertain to one or more sexual education concepts and aid in the extraction of relevant information.

The second part of the corpus, curriculums.json, presents the text extracted from the PDF documents in JSON format, together with annotations for the sections that describe sexual education concepts. Each annotation in curriculums.json comprises a list of concepts that the particular description references.

The documents were annotated manually using the concepts in the SemSex ontology. The ontology is included in the submission as an additional file in the Turtle format and is described in detail at https://github.com/TimotejK/SemSex.

The JSON structure of the corpus is as follows:
- At the first level, individual objects represent each PDF document.
- 'pages': A list of the pages of each document.
  - 'content': A string containing all the text extracted from the corresponding PDF page.
  - 'page_number': The original page number from the PDF document.
  - 'annotations': A list of the annotations on the page.
    - 'start_index': The starting index of the annotated phrase within the page content.
    - 'end_index': The ending index of the annotated phrase within the page content.
    - 'labels': A list of the concepts to which the annotated phrase refers.

The models that have been trained on this corpus are available at http://hdl.handle.net/11356/1894. |
dc.language.iso | slv |
dc.publisher | CLARIN.SI |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://github.com/TimotejK/SemSex |
dc.subject | education |
dc.subject | natural language processing |
dc.subject | sex ed |
dc.subject | knowledge extraction |
dc.title | Corpus for identifying sex education concepts SemSex 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Timotej Knez, timotej.knez@fri.uni-lj.si, Faculty of Computer and Information Science, University of Ljubljana |
sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds |
size.info | 8.04 MB |
files.count | 2 |
files.size | 8630042 |
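The structure of curriculums.json given in dc.description can be walked with a few lines of Python. The sketch below is illustrative only and is not part of the corpus distribution. It assumes that the top level is either a list of document objects or an object whose values are document objects, and that 'end_index' is exclusive, as in a Python slice; neither detail is stated in the record, so both should be checked against the actual file. The function names are hypothetical.

```python
import json
from typing import Any, Iterator, List, Tuple


def iter_documents(corpus: Any) -> Iterator[dict]:
    """Yield document objects from the top level of the corpus.

    Assumption: the top level is a list of documents, or a dict whose
    values are documents. The record does not say which.
    """
    if isinstance(corpus, dict):
        yield from corpus.values()
    else:
        yield from corpus


def annotated_phrases(corpus: Any) -> Iterator[Tuple[int, str, List[str]]]:
    """Yield (page_number, phrase, labels) for every annotation."""
    for document in iter_documents(corpus):
        for page in document["pages"]:
            content = page["content"]
            for annotation in page.get("annotations", []):
                # Assumption: end_index is exclusive, like a Python slice.
                start = annotation["start_index"]
                end = annotation["end_index"]
                yield page["page_number"], content[start:end], annotation["labels"]


def load_corpus(path: str = "curriculums.json") -> Any:
    """Read the JSON file extracted from SemSex_corpus.zip."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

After extracting the archive, `for page, phrase, labels in annotated_phrases(load_corpus()): print(page, labels, phrase)` would list each annotated phrase with its concepts. If the printed phrases appear to be cut one character short, 'end_index' is probably inclusive and the slice should end at `end + 1`.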
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)



- Name: SemSex_corpus.zip
- Size: 8.04 MB
- Format: application/zip
- Description: SemSex corpus
- MD5: 424bbc4b8b42b4f665af11e234ca0758
- SemSex_corpus
- Curriculum_pdf
- 3 UN_fizika.pdf (355 kB)
- 8 UN_naravoslovje_in_tehnika.pdf (377 kB)
- 9 UN_likovna_vzgoja.pdf (513 kB)
- 5 UN_geografija.pdf (384 kB)
- 14 UN_DDE_OS.pdf (336 kB)
- 12 UN_tehnika_tehnologija.pdf (347 kB)
- 17 UN_sportna_vzgoja.pdf (535 kB)
- 13 UN_gospodinjstvo.pdf (348 kB)
- 6 UN_matematika.pdf (618 kB)
- 11 UN_Biologija.pdf (493 kB)
- 7 UN_slovenscina.pdf (1 MB)
- 1 UN_spoznavanje_okolja_pop-2.pdf (600 kB)
- 4 UN_kemija.pdf (335 kB)
- 2 UN_zgodovina.pdf (473 kB)
- 10 UN_glasbena_vzgoja.pdf (373 kB)
- 16 UN_naravoslovje.pdf (898 kB)
- 15 UN_druzba_OS.pdf (305 kB)
- curriculums.json (2 MB)

- Name: SemSEX.ttl
- Size: 191.22 KB
- Format: Unknown
- Description: SemSex ontology
- MD5: 86af6728344f1434d537a9aacfabc22c