| dc.contributor.author | Martinc, Matej |
| dc.date.accessioned | 2025-09-23T15:08:25Z |
| dc.date.available | 2025-09-23T15:08:25Z |
| dc.date.issued | 2025-09-18 |
| dc.identifier.uri | http://hdl.handle.net/11356/2050 |
| dc.description | This entry contains the SLO-VLM-IT-Dataset, a dataset for instruction-tuning vision-language models in the Slovenian language. It consists of five main .json files, which together provide a diverse set of examples for training and fine-tuning models to process both visual and textual information in Slovenian. 1. llava_v1_5_mix665k_translated_gemini_1_5_pro_all.json: a machine-translated version of the Llava_v1_5_mix665k dataset. The translation from English to Slovenian was performed with the proprietary Gemini 1.5 Pro model. 2. wiki_14_march_2024_latest.json: conversational examples generated from Slovenian Wikipedia articles. The proprietary Gemini 1.5 Pro model was used to transform the articles into an instruction-tuning format. 3. rtv_100k.json: conversational examples generated from images from the news portal https://www.rtvslo.si using the proprietary Gemini 1.5 Pro model. 4. siol_100k.json: conversational examples generated from images from the news portal https://siol.net using the proprietary Gemini 1.5 Pro model. 5. 24ur_100k.json: conversational examples generated from images from the news portal https://www.24ur.com using the proprietary Gemini 1.5 Pro model. The combined dataset includes a total of 1,128,228 examples, categorized as follows. 21,838 textvqa examples: instructions for visual question answering based on specific Optical Character Recognition (OCR) tokens. 349,369 coco examples: a mix of instructions corresponding to 118,000 images from the COCO 2017 Object Detection Dataset, including tasks such as generating long image descriptions, providing single-word answers, and answering multiple-choice questions. 81,309 vg examples: instructions to either provide bounding box coordinates for a specified region in an image or describe a region defined by given coordinates. 66,227 gqa examples: instructions requiring a one-word or one-phrase response to a question about the corresponding image. 78,976 ocr_vqa examples: instructions focused on performing OCR to extract text from an image. 139,433 wiki examples: instruction-tuning examples generated from Slovenian Wikipedia articles. The original Wikipedia articles were obtained from a Wikipedia database dump from March 14th 2025. 100,000 rtv examples: instruction-tuning examples generated from images from the news portal https://www.rtvslo.si. Image scraping was completed on February 7th 2025. 100,000 siol examples: instruction-tuning examples generated from images from the news portal https://siol.net. Image scraping was completed on March 22nd 2025. 100,000 24ur examples: instruction-tuning examples generated from images from the news portal https://www.24ur.com. Image scraping was completed on February 7th 2025. Accessing the corresponding images. News portal images: the images corresponding to the 'rtv', 'siol' and '24ur' examples need to be downloaded from the respective news portal. Each example in the JSON file contains an 'image' key with the URL of the corresponding image. Wiki images: the images corresponding to the 'wiki' examples are available for download at the following link: https://kt-cloud.ijs.si/index.php/s/nbLmWkaJEXHMMwe Llava_v1_5_mix665k images: to facilitate the download of images for the translated Llava_v1_5_mix665k dataset, we provide the Python script get_llava_images.py and its dependency overwatch.py. |
| dc.language.iso | slv |
| dc.publisher | Jožef Stefan Institute |
| dc.rights | Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) |
| dc.rights.uri | https://creativecommons.org/licenses/by-nc/4.0/ |
| dc.rights.label | PUB |
| dc.source.uri | https://www.cjvt.si/llm4dh/ |
| dc.subject | large language models |
| dc.subject | multimodal |
| dc.subject | vision-language models |
| dc.subject | instruction following dataset |
| dc.title | Slovenian Dataset for Vision-Language Model Instruction-Tuning SLO-VLM-IT-Dataset 1.0 |
| dc.type | corpus |
| metashare.ResourceInfo#ContentInfo.mediaType | text |
| has.files | yes |
| branding | CLARIN.SI data & tools |
| contact.person | Matej Martinc, matej.martinc@ijs.si, Jožef Stefan Institute |
| sponsor | Public Agency for Scientific Research and Innovation of the Republic of Slovenia; GC-0002; Large Language Models for Digital Humanities (LLM4DH); nationalFunds |
| size.info | 1128228 texts |
| size.info | 2.0 GB |
| files.count | 1 |
| files.size | 485155716 |
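
The description states that each 'rtv', 'siol' and '24ur' example carries an 'image' key holding the URL of its image. One possible way to collect and fetch those images is sketched below. It assumes that each .json file is a top-level JSON list of objects, which the record does not confirm; the helper names (`load_examples`, `image_urls`, `local_path`, `download_all`) and the file-naming scheme are hypothetical and not part of the dataset's own tooling.

```python
import json
import os
import urllib.request
from urllib.parse import urlparse


def load_examples(path):
    """Load one dataset file (assumption: a top-level JSON list of dicts)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def image_urls(examples):
    """Return unique http(s) 'image' URLs in first-seen order.

    Examples without an 'image' key, or with a non-URL value, are skipped.
    """
    seen = {}
    for ex in examples:
        url = ex.get("image")
        if isinstance(url, str) and url.startswith(("http://", "https://")):
            seen.setdefault(url, None)
    return list(seen)


def local_path(url, out_dir):
    """Map a URL to a local path using its last path segment.

    Hypothetical naming scheme: two URLs that end in the same file name
    would map to the same local path.
    """
    name = os.path.basename(urlparse(url).path) or "image"
    return os.path.join(out_dir, name)


def download_all(urls, out_dir):
    """Download each URL once; failed downloads are reported and skipped."""
    os.makedirs(out_dir, exist_ok=True)
    for url in urls:
        target = local_path(url, out_dir)
        if os.path.exists(target):
            continue  # already fetched in an earlier run
        try:
            urllib.request.urlretrieve(url, target)
        except OSError as err:  # urllib's URLError and HTTPError derive from OSError
            print(f"skipped {url}: {err}")


# Example use (file name taken from the archive listing):
#   urls = image_urls(load_examples("rtv_100k.json"))
#   download_all(urls, "images/rtv")
```

Deduplicating the URLs first avoids fetching the same image repeatedly when several examples refer to it.
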
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
- Name: SLO-VLM-IT-Dataset-1.0.zip
- Size: 462.68 MB
- Format: application/zip
- Description: Unknown
- MD5: ca84c9cee474b22439808529f467a46f
- SLO-VLM-IT-Dataset-1.0
  - siol_100k.json (234 MB)
  - overwatch.py (4 kB)
  - rtv_100k.json (205 MB)
  - readme.txt (4 kB)
  - wiki_14_march_2024_latest.json (390 MB)
  - get_llava_images.py (7 kB)
  - 24ur_100k.json (223 MB)
  - llava_v1_5_mix665k_translated_gemini_1_5_pro_all.json (885 MB)
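
The MD5 value listed for the archive can be used to check a download before unpacking it. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `verify` are hypothetical, and the local file path is assumed to be wherever the archive was saved.

```python
import hashlib

# MD5 checksum as listed in this record for SLO-VLM-IT-Dataset-1.0.zip
EXPECTED_MD5 = "ca84c9cee474b22439808529f467a46f"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


# Example use, assuming the archive sits in the current directory:
#   verify("SLO-VLM-IT-Dataset-1.0.zip")
```

Reading in chunks keeps memory use constant, which matters for an archive of several hundred megabytes.
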