This study presents a benchmark for evaluating the Visual Word Sense Disambiguation (Visual-WSD) task in Ukrainian. The goal of the Visual-WSD task is to identify, given minimal contextual information, the image that best represents the intended sense of an ambiguous word from a set of ten candidate images. To construct this benchmark, we followed a methodology similar to that proposed by Raganato et al. (2023), who previously introduced benchmarks for the Visual-WSD task in English, Italian, and Farsi. This approach allows us to incorporate the Ukrainian benchmark into a broader framework for cross-language comparison of model performance. We collected the benchmark data semi-automatically and refined it with input from domain experts. We then assessed eight multilingual and multimodal language models on this benchmark. All tested models performed worse than the zero-shot CLIP-based baseline model (Radford et al., 2021) used by Raganato et al. (2023) for the English Visual-WSD task. Our analysis revealed a significant performance gap in the Visual-WSD task between Ukrainian and English.