Slovenian Dataset for Vision-Language Model Instruction-Tuning SLO-VLM-IT-Dataset 1.0

Name: Slovenian Dataset for Vision-Language Model Instruction-Tuning SLO-VLM-IT-Dataset 1.0
License: https://creativecommons.org/licenses/by-nc/4.0/

Martinc, Matej

dc.contributor.author	Martinc, Matej
dc.date.accessioned	2025-09-23T15:08:25Z
dc.date.available	2025-09-23T15:08:25Z
dc.date.issued	2025-09-18
dc.identifier.uri	http://hdl.handle.net/11356/2050
dc.description	This entry contains the SLO-VLM-IT-Dataset, a comprehensive dataset designed for instruction-tuning vision-language models in the Slovenian language. It is composed of five main .json files, which together provide a rich and diverse set of examples for training and fine-tuning models to understand and process both visual and textual information in Slovenian.

1. llava_v1_5_mix665k_translated_gemini_1_5_pro_all.json: This file contains a machine-translated version of the popular Llava_v1_5_mix665k dataset. The translation from English to Slovenian was performed using the proprietary Gemini 1.5 Pro model.
2. wiki_14_march_2024_latest.json: This file consists of conversational examples generated from Slovenian Wikipedia articles. The proprietary Gemini 1.5 Pro model was utilized for the data curation process, transforming the articles into an instruction-tuning format.
3. rtv.json: This file consists of conversational examples generated on the basis of images from the news portal https://www.rtvslo.si. The proprietary Gemini 1.5 Pro model was utilized for the data generation.
4. siol.json: This file consists of conversational examples generated on the basis of images from the news portal https://siol.net. The proprietary Gemini 1.5 Pro model was utilized for the data generation.
5. 24ur.json: This file consists of conversational examples generated on the basis of images from the news portal https://www.24ur.com. The proprietary Gemini 1.5 Pro model was utilized for the data generation.

The combined dataset includes a total of 1,128,228 examples, categorized as follows:

- 21,838 textvqa examples: Instructions for vision question answering based on specific Optical Character Recognition (OCR) tokens.
- 349,369 coco examples: A mix of instructions corresponding to 118,000 images from the COCO 2017 Object Detection Dataset. These include tasks such as generating long image descriptions, providing single-word answers, and answering multiple-choice questions.
- 81,309 vg examples: Instructions to either provide bounding box coordinates for a specified region in an image or describe a region defined by given coordinates.
- 66,227 gqa examples: Instructions requiring a one-word or one-phrase response to a question about the corresponding image.
- 78,976 ocr_vqa examples: Instructions focused on performing OCR to extract text from an image.
- 139,433 wiki examples: Instruction-tuning examples generated from Slovenian Wikipedia articles. The original Wikipedia articles were obtained from a Wikipedia database dump from March 14th 2025.
- 100,000 rtv examples: Instruction-tuning examples generated on the basis of images from the news portal https://www.rtvslo.si. Image scraping was completed on February 7th 2025.
- 100,000 siol examples: Instruction-tuning examples generated on the basis of images from the news portal https://siol.net. Image scraping was completed on March 22nd 2025.
- 100,000 24ur examples: Instruction-tuning examples generated on the basis of images from the news portal https://www.24ur.com. Image scraping was completed on February 7th 2025.

Accessing the Corresponding Images

News portal Images: The images corresponding to the 'rtv', 'siol' and '24ur' examples need to be downloaded from the appropriate news portal. Each example in the json file contains an 'image' key with a URL of the corresponding image.

Wiki Images: The images corresponding to the 'wiki' examples are available for download at the following link: https://kt-cloud.ijs.si/index.php/s/nbLmWkaJEXHMMwe

Llava_v1_5_mix665k Images: To facilitate the download of images for the translated Llava_v1_5_mix665k dataset, we provide the necessary Python script get_llava_images.py and its dependency overwatch.py.
dc.language.iso	slv
dc.publisher	Jožef Stefan Institute
dc.rights	Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-nc/4.0/
dc.rights.label	PUB
dc.source.uri	https://www.cjvt.si/llm4dh/
dc.subject	large language models
dc.subject	multimodal
dc.subject	vision-language models
dc.subject	instruction following dataset
dc.title	Slovenian Dataset for Vision-Language Model Instruction-Tuning SLO-VLM-IT-Dataset 1.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Matej Martinc matej.martinc@ijs.si Jožef Stefan Institute
sponsor	Public Agency for Scientific Research and Innovation of the Republic of Slovenia GC-0002 Large Language Models for Digital Humanities (LLM4DH) nationalFunds
size.info	1128228 texts
size.info	2.0 gb
files.count	1
files.size	485155716
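
The per-category counts given in dc.description can be checked against the stated total with a few lines of arithmetic. The sketch below hard-codes the figures from the description; the variable names are illustrative only. With these figures, the listed categories account for 1,037,152 of the 1,128,228 examples, and the record does not say how the remaining 91,076 are categorized.

```python
# Per-category example counts copied from the dc.description field of this record.
CATEGORY_COUNTS = {
    "textvqa": 21_838,
    "coco": 349_369,
    "vg": 81_309,
    "gqa": 66_227,
    "ocr_vqa": 78_976,
    "wiki": 139_433,
    "rtv": 100_000,
    "siol": 100_000,
    "24ur": 100_000,
}

# Total stated in the description and in size.info.
STATED_TOTAL = 1_128_228

listed_total = sum(CATEGORY_COUNTS.values())
unlisted = STATED_TOTAL - listed_total

print(f"listed categories: {listed_total:,}")
print(f"not covered by the listed categories: {unlisted:,}")
```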

Files in this item

This item is publicly available and licensed under:
Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)

Name: SLO-VLM-IT-Dataset-1.0.zip
Size: 462.68 MB
Format: application/zip
Description: Unknown
MD5: ca84c9cee474b22439808529f467a46f
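
The MD5 value above can be used to confirm that the archive downloaded intact. The following is a minimal sketch using Python's standard hashlib module; the helper names md5_of_file and archive_is_intact are hypothetical, and the local path in the usage note is an assumption about where the file was saved.

```python
import hashlib

# MD5 checksum as listed in this record for SLO-VLM-IT-Dataset-1.0.zip.
EXPECTED_MD5 = "ca84c9cee474b22439808529f467a46f"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Compare the file's digest with the expected one, ignoring letter case."""
    return md5_of_file(path) == expected.lower()
```

For example, archive_is_intact("SLO-VLM-IT-Dataset-1.0.zip") should return True for an uncorrupted download saved under that name in the current directory.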

File Preview

SLO-VLM-IT-Dataset-1.0
- siol_100k.json (234 MB)
- overwatch.py (4 kB)
- rtv_100k.json (205 MB)
- readme.txt (4 kB)
- wiki_14_march_2024_latest.json (390 MB)
- get_llava_images.py (7 kB)
- 24ur_100k.json (223 MB)
- llava_v1_5_mix665k_translated_gemini_1_5_pro_all.json (885 MB)
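
Since the description states that each news portal example carries an 'image' key holding a URL, the URLs to fetch can be gathered as sketched below. This assumes that each file is a JSON array of objects at the top level, which the record does not confirm; the function names are hypothetical, and the actual download step is left out.

```python
import json


def load_examples(path):
    """Load one dataset file; assumes a top-level JSON array of example objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def unique_image_urls(examples):
    """Collect distinct http(s) URLs from the 'image' key, in first-seen order."""
    seen = set()
    urls = []
    for example in examples:
        url = example.get("image")
        if not isinstance(url, str):
            continue  # example without an image reference
        if not url.startswith(("http://", "https://")):
            continue  # not a URL, e.g. a relative path
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

A call such as unique_image_urls(load_examples("rtv_100k.json")) would then yield the list of images to request from the portal. Deduplicating first avoids fetching the same image twice if several examples share it.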
