Enhancing Endoscopic Image Retrieval via Self-Supervised Learning and Large VLM-Based Re-ranking

✅ Accepted at ACM MM Workshop • 🏆 Prize at Grand Challenge
🏆 Top-2 in Track 3 (Text-to-Image Retrieval) 🏅 Top-5 in Track 2 (Image-to-Image Retrieval)
Overview

Illustration of the proposed LDSF re-ranking results: The first row shows the original retrieval similarity scores from the baseline retrieval method. The second row presents the re-ranked results using our LDSF approach with Gemini 2.5 Pro, along with their updated similarity scores. The re-ranking is applied to the top-k=10 retrieved items, and the figure displays the top-5 highest-ranked results after re-ranking. Green borders indicate the ground-truth image(s) for the given query, while red borders denote incorrect or irrelevant retrievals.

Abstract

Medical image retrieval is essential for clinical diagnosis and medical education, yet remains highly challenging in endoscopic imaging due to limited annotated data, the lack of domain-specific pretrained models, and subtle visual similarities across anatomical regions. In this work, we utilize self-supervised contrastive learning to pretrain a strong image encoder tailored for endoscopic data, which serves as the backbone for downstream retrieval tasks. For text-to-image retrieval, we adopt a multi-modal contrastive learning approach that aligns textual and visual representations based on this pretrained backbone. To further enhance retrieval performance, we propose a novel re-ranking module that leverages the reasoning capabilities of large vision-language models (LVLMs), such as GPT-4o and Gemini. We also provide a comparative analysis of various retrieval strategies, offering insights into their effectiveness in clinical scenarios. Our method achieves top-2 in text-to-image and top-5 in image-to-image retrieval at the ENTRep Challenge 2025, demonstrating its potential value for endoscopic image retrieval.

Experimental Results


Table 1: Comparison of team performance on the image-to-image (Track 2) and text-to-image (Track 3) retrieval tasks in the final round (private test) of the ENTRep Challenge. Our team, ELO, is highlighted in bold.


Table 2: Supplemental study on a held-out test split of the ENTRep Challenge training set. We report HitRate@top-k ↑ and MRR@top-k ↑ for the Image-to-Image and Text-to-Image Retrieval tasks under three settings: Without Fine-tuning, SSL Fine-tuning, and Multi-modal Fine-tuning. Each cell reports HitRate / MRR.
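
For clarity, the two metrics can be computed as in the minimal sketch below; the ranked_ids / gt_ids names are illustrative rather than taken from our code.

import numpy as np

def hitrate_at_k(ranked_ids, gt_ids, k):
    # Fraction of queries whose top-k list contains at least one ground-truth item.
    hits = [any(r in gt for r in ranked[:k]) for ranked, gt in zip(ranked_ids, gt_ids)]
    return float(np.mean(hits))

def mrr_at_k(ranked_ids, gt_ids, k):
    # Mean reciprocal rank of the first relevant item within the top-k list (0 if absent).
    rr = []
    for ranked, gt in zip(ranked_ids, gt_ids):
        rank = next((i + 1 for i, r in enumerate(ranked[:k]) if r in gt), None)
        rr.append(1.0 / rank if rank is not None else 0.0)
    return float(np.mean(rr))

# Toy example with two queries: the relevant item is ranked 2nd and 3rd, respectively.
ranked_ids = [["img3", "img7", "img1"], ["img5", "img2", "img9"]]
gt_ids = [{"img7"}, {"img9"}]
print(hitrate_at_k(ranked_ids, gt_ids, k=3))  # 1.0
print(mrr_at_k(ranked_ids, gt_ids, k=3))      # (1/2 + 1/3) / 2 ≈ 0.417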


Figure 1: Visualization of feature representations using a PCA projection with three components on four images from three classes. The first PCA component is used to filter out background variations. Each group compares the original endoscopic images (left), the PCA-projected features extracted from a pretrained DINOv2 backbone without domain-specific training (middle), and the features from the same backbone after fine-tuning on the medical dataset (right).
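
The PCA visualization can be reproduced roughly as sketched below; this is a minimal sketch under our own assumptions (per-image PCA on patch tokens, a zero threshold on the first component), not necessarily the exact procedure used to generate the figure.

import numpy as np
from sklearn.decomposition import PCA

def pca_feature_map(patch_features, grid_h, grid_w, fg_threshold=0.0):
    # patch_features: (num_patches, dim) patch-token embeddings from the image encoder
    # for one image, with num_patches == grid_h * grid_w; names and shapes are illustrative.
    pca = PCA(n_components=3)
    proj = pca.fit_transform(patch_features)                          # (num_patches, 3)
    foreground = proj[:, 0] > fg_threshold                            # 1st component separates background
    rgb = (proj - proj.min(0)) / (proj.max(0) - proj.min(0) + 1e-8)   # scale to [0, 1] for display
    rgb[~foreground] = 0.0                                            # black out background patches
    return rgb.reshape(grid_h, grid_w, 3)                             # pseudo-RGB feature map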


Figure 2: Comparison between standard inference (Original) and LDSF (Reranked) across three DINO-based models on a 100-sample subset from the text-to-image retrieval test set. The reranking approach consistently improves both Hit Rate and Mean Reciprocal Rank (MRR) metrics, with the most significant gains observed at higher Top-k values.

Method Overview

Our approach combines self-supervised learning, multimodal contrastive learning, and LVLM-based re-ranking to boost image-to-image and text-to-image retrieval in the ENTRep Challenge.

1. Self-Supervised Image Encoder

We train an image encoder \(f_\phi\) for endoscopic images using a SimCLR-inspired self-supervised contrastive learning strategy. Each image is augmented twice to form a positive pair \((x, x^+)\), while all other augmented samples in the batch serve as negatives \(x^-\). With temperature parameter \(\tau\), the InfoNCE loss is:

\[ \mathcal{L}_x = - \log \frac{\exp( sim(z_x, z_{x^+}) / \tau )} {\exp( sim(z_x, z_{x^+}) / \tau ) + \sum_{z_{x^-}} \exp( sim(z_x, z_{x^-}) / \tau )} \] \[ \mathcal{L} = \frac{1}{2B} \sum_{x} \mathcal{L}_x \]
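
A minimal PyTorch sketch of this NT-Xent-style objective is shown below; the projection head and augmentation pipeline are omitted, and the implementation details are illustrative rather than our exact training code.

import torch
import torch.nn.functional as F

def simclr_info_nce(z1, z2, tau=0.1):
    # z1, z2: (B, D) projected embeddings of the two augmented views of the same B images.
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)       # (2B, D), unit-norm rows
    sim = z @ z.t() / tau                                     # pairwise cosine similarities / tau
    sim.fill_diagonal_(float("-inf"))                         # exclude self-similarity from the softmax
    B = z1.size(0)
    # The positive for anchor i is its other augmented view: index i + B (first half) or i - B (second half).
    targets = torch.cat([torch.arange(B) + B, torch.arange(B)]).to(sim.device)
    return F.cross_entropy(sim, targets)                      # averages L_x over all 2B anchors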

2. Multimodal Contrastive Learning

The pretrained image encoder \(f_\phi\) is fine-tuned jointly with a text encoder \(f_\theta\) for cross-modal alignment. Given matched image–text pairs \((x, q)\), with the remaining images \(x^-\) and texts \(q^-\) in the batch serving as negatives, the two directional losses are:

\[ \mathcal{D}_{x,q} = - \log \frac{\exp( sim(z_q, z_x)/\tau )} {\exp( sim(z_q, z_x)/\tau ) + \sum_{z_{x^-}} \exp( sim(z_q, z_{x^-})/\tau )} \] \[ \mathcal{H}_{x,q} = - \log \frac{\exp( sim(z_x, z_q)/\tau )} {\exp( sim(z_x, z_q)/\tau ) + \sum_{z_{q^-}} \exp( sim(z_x, z_{q^-})/\tau )} \] \[ \mathcal{L} = \frac{1}{2B} \sum_{(x,q)} \left( \mathcal{D}_{x,q} + \mathcal{H}_{x,q} \right) \]
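
The symmetric objective above reduces to a CLIP-style loss with in-batch negatives; a compact sketch (with the encoders and projection heads abstracted away) is:

import torch
import torch.nn.functional as F

def multimodal_contrastive_loss(z_img, z_txt, tau=0.07):
    # z_img, z_txt: (B, D) embeddings of matched image-text pairs; the other
    # items in the batch act as negatives for both retrieval directions.
    z_img = F.normalize(z_img, dim=1)
    z_txt = F.normalize(z_txt, dim=1)
    logits = z_txt @ z_img.t() / tau                  # (B, B): row i = text query i vs. all images
    targets = torch.arange(z_img.size(0), device=z_img.device)
    loss_t2i = F.cross_entropy(logits, targets)       # D_{x,q}: text-to-image direction
    loss_i2t = F.cross_entropy(logits.t(), targets)   # H_{x,q}: image-to-text direction
    return 0.5 * (loss_t2i + loss_i2t)                # equals (1/2B) * sum(D + H) over the batch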

3. LVLM-based Dual-Score Fusion (LDSF) Re-ranking

To disambiguate highly similar clinical descriptions and improve text-to-image retrieval, we introduce LDSF re-ranking. The system first retrieves the top-\(k\) candidates using cosine similarity \(s_{\text{init}}\); an LVLM (e.g., Gemini 2.5 Pro) then estimates a semantic relevance score \(s_{\text{LVLM}}\) for each candidate. The final score is:

\[ s_{\text{final}}(z_q, z_x) = \frac{1}{2} \left[ s_{\text{init}}(z_q, z_x) + s_{\text{LVLM}}(z_q, z_x) \right] \]
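
A minimal sketch of the fusion step is given below; query_lvlm_score stands in for the LVLM call (e.g., issuing the prompt in Figure 4), and the assumption that it returns a relevance score on the same [0, 1] scale as the cosine similarity is ours.

def ldsf_rerank(query, candidates, init_scores, query_lvlm_score, top_k=10):
    # Re-rank the top-k candidates by averaging the initial retrieval similarity
    # with the LVLM relevance score: s_final = (s_init + s_LVLM) / 2.
    order = sorted(range(len(init_scores)), key=lambda i: init_scores[i], reverse=True)[:top_k]
    fused = []
    for i in order:
        s_lvlm = query_lvlm_score(query, candidates[i])   # LVLM-estimated relevance (assumed in [0, 1])
        fused.append((candidates[i], 0.5 * (init_scores[i] + s_lvlm)))
    return sorted(fused, key=lambda pair: pair[1], reverse=True)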


Figure 3: Illustration of the capability of LVLMs (e.g., Gemini) to understand and interpret endoscopic images.


Figure 4: Instruction prompt given to the large VLM to score the relevance of the retrieved results.

BibTeX

@inproceedings{khoalinhduy-entrep2025,
  author       = {Khoa Tran and Linh Ly and Ngoc Hoang Luong},
  title        = {{Enhancing Endoscopic Image Retrieval via Self-Supervised Learning and Large VLM-Based Re-ranking}},
  year         = {2025}
}