Show simple item record

 
dc.contributor.author Martinc, Matej
dc.date.accessioned 2025-09-23T15:08:25Z
dc.date.available 2025-09-23T15:08:25Z
dc.date.issued 2025-09-18
dc.identifier.uri http://hdl.handle.net/11356/2050
dc.description This entry contains the SLO-VLM-IT-Dataset, a comprehensive dataset designed for instruction-tuning vision-language models in the Slovenian language. It is composed of five main .json files, which together provide a rich and diverse set of examples for training and fine-tuning models to understand and process both visual and textual information in Slovenian.

1. llava_v1_5_mix665k_translated_gemini_1_5_pro_all.json
This file contains a machine-translated version of the popular Llava_v1_5_mix665k dataset. The translation from English to Slovenian was performed using the proprietary Gemini 1.5 Pro model.

2. wiki_14_march_2024_latest.json
This file consists of conversational examples generated from Slovenian Wikipedia articles. The proprietary Gemini 1.5 Pro model was utilized for the data curation process, transforming the articles into an instruction-tuning format.

3. rtv.json
This file consists of conversational examples generated on the basis of images from the news portal https://www.rtvslo.si. The proprietary Gemini 1.5 Pro model was utilized for the data generation.

4. siol.json
This file consists of conversational examples generated on the basis of images from the news portal https://siol.net. The proprietary Gemini 1.5 Pro model was utilized for the data generation.

5. 24ur.json
This file consists of conversational examples generated on the basis of images from the news portal https://www.24ur.com. The proprietary Gemini 1.5 Pro model was utilized for the data generation.

The combined dataset includes a total of 1,128,228 examples, categorized as follows:

  • 21,838 textvqa examples: Instructions for vision question answering based on specific Optical Character Recognition (OCR) tokens.
  • 349,369 coco examples: A mix of instructions corresponding to 118,000 images from the COCO 2017 Object Detection Dataset. These include tasks such as generating long image descriptions, providing single-word answers, and answering multiple-choice questions.
  • 81,309 vg examples: Instructions to either provide bounding box coordinates for a specified region in an image or describe a region defined by given coordinates.
  • 66,227 gqa examples: Instructions requiring a one-word or one-phrase response to a question about the corresponding image.
  • 78,976 ocr_vqa examples: Instructions focused on performing OCR to extract text from an image.
  • 139,433 wiki examples: Instruction-tuning examples generated from Slovenian Wikipedia articles. The original Wikipedia articles were obtained from a Wikipedia database dump from March 14th 2025.
  • 100,000 rtv examples: Instruction-tuning examples generated on the basis of images from the news portal https://www.rtvslo.si. Image scraping was completed on February 7th 2025.
  • 100,000 siol examples: Instruction-tuning examples generated on the basis of images from the news portal https://siol.net. Image scraping was completed on March 22nd 2025.
  • 100,000 24ur examples: Instruction-tuning examples generated on the basis of images from the news portal https://www.24ur.com. Image scraping was completed on February 7th 2025.

Accessing the Corresponding Images

News portal Images
The images corresponding to the 'rtv', 'siol' and '24ur' examples need to be downloaded from the appropriate news portal. Each example in the json file contains an 'image' key with a URL of the corresponding image.

Wiki Images
The images corresponding to the 'wiki' examples are available for download at the following link: https://kt-cloud.ijs.si/index.php/s/nbLmWkaJEXHMMwe

Llava_v1_5_mix665k Images
To facilitate the download of images for the translated Llava_v1_5_mix665k dataset, we provide the necessary Python script get_llava_images.py and its dependency overwatch.py.
dc.language.iso slv
dc.publisher Jožef Stefan Institute
dc.rights Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-nc/4.0/
dc.rights.label PUB
dc.source.uri https://www.cjvt.si/llm4dh/
dc.subject large language models
dc.subject multimodal
dc.subject vision-language models
dc.subject instruction following dataset
dc.title Slovenian Dataset for Vision-Language Model Instruction-Tuning SLO-VLM-IT-Dataset 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Matej Martinc matej.martinc@ijs.si Jožef Stefan Institute
sponsor Public Agency for Scientific Research and Innovation of the Republic of Slovenia GC-0002 Large Language Models for Digital Humanities (LLM4DH) nationalFunds
size.info 1128228 texts
size.info 2.0 gb
files.count 1
files.size 485155716
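
The description states that each news-portal example carries an 'image' key holding the URL of its image. The following is a minimal sketch of how those URLs might be gathered before downloading. It assumes each file is a top-level JSON array of objects, which the record does not confirm; the helper names load_examples and collect_image_urls are hypothetical and are not part of the distributed scripts.

```python
import json
from urllib.parse import urlparse


def load_examples(json_path):
    """Load one dataset file; assumes (unverified) a top-level JSON array."""
    with open(json_path, encoding="utf-8") as handle:
        return json.load(handle)


def collect_image_urls(examples):
    """Return the unique http(s) URLs found under the 'image' key, in order."""
    seen = set()
    urls = []
    for example in examples:
        url = example.get("image")
        # Skip examples without an image or with a non-URL value
        # (for instance a relative path).
        if not isinstance(url, str):
            continue
        if urlparse(url).scheme not in ("http", "https"):
            continue
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    # Tiny in-memory stand-in for a news-portal file; the URLs are made up.
    sample = [
        {"image": "https://example.org/a.jpg"},
        {"image": "https://example.org/a.jpg"},
        {"image": "https://example.org/b.jpg"},
    ]
    for url in collect_image_urls(sample):
        print(url)
```

With the archive unpacked, collect_image_urls(load_examples("rtv_100k.json")) would give the list of images to fetch from the portal, provided the file has the assumed layout. Deduplicating first avoids requesting the same image more than once.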


Files in this item

Name: SLO-VLM-IT-Dataset-1.0.zip
Size: 462.68 MB
Format: application/zip
Description: Unknown
MD5: ca84c9cee474b22439808529f467a46f

File preview
  • SLO-VLM-IT-Dataset-1.0
    • siol_100k.json (234 MB)
    • overwatch.py (4 kB)
    • rtv_100k.json (205 MB)
    • readme.txt (4 kB)
    • wiki_14_march_2024_latest.json (390 MB)
    • get_llava_images.py (7 kB)
    • 24ur_100k.json (223 MB)
    • llava_v1_5_mix665k_translated_gemini_1_5_pro_all.json (885 MB)
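
Since the record publishes both a byte size (files.size) and an MD5 checksum for the archive, a download can be checked before unpacking. The sketch below is one way to do that with the Python standard library. The function names md5_of_file and verify_archive are hypothetical, and the local file path is an assumption about where the archive was saved.

```python
import hashlib
from pathlib import Path

# Values copied from this record (MD5 and files.size).
EXPECTED_MD5 = "ca84c9cee474b22439808529f467a46f"
EXPECTED_SIZE = 485155716


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected_md5=EXPECTED_MD5, expected_size=EXPECTED_SIZE):
    """Return True only if both the byte size and the MD5 digest match."""
    path = Path(path)
    if path.stat().st_size != expected_size:
        return False
    return md5_of_file(path) == expected_md5
```

For example, verify_archive("SLO-VLM-IT-Dataset-1.0.zip") should return True for an intact download saved under that name in the current directory. The size is compared first because it is cheap and catches truncated downloads without hashing the whole file.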
