Our approach combines self-supervised learning, multimodal contrastive learning, and LVLM-based re-ranking to boost image-to-image and text-to-image retrieval in the ENTRep Challenge.
1. Self-Supervised Image Encoder
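We train an image encoder \(f_\phi\) for endoscopic images using a SimCLR-inspired self-supervised contrastive learning strategy. Each of the \(B\) original images in a batch is augmented twice to create a positive pair \((x, x^+)\), while all other augmented samples in the batch act as negatives \(x^-\). With \(z\) denoting the embedding of a sample, \(sim(\cdot, \cdot)\) a similarity function, and \(\tau\) a temperature parameter, the InfoNCE loss for one sample and its average over the \(2B\) augmented views are: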
\[ \mathcal{L}_x = - \log \frac{\exp( sim(z_x, z_{x^+}) / \tau )} {\exp( sim(z_x, z_{x^+}) / \tau ) + \sum_{z_{x^-}} \exp( sim(z_x, z_{x^-}) / \tau )} \]
\[ \mathcal{L} = \frac{1}{2B} \sum_{x} \mathcal{L}_x \]
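As a rough illustration of the loss above, the following NumPy sketch computes the batch-averaged InfoNCE value from two sets of view embeddings. It is not the authors' training code: the function name `info_nce_loss`, the default `tau=0.1`, and the choice of cosine similarity for \(sim\) are assumptions made for this example.

```python
import numpy as np

def info_nce_loss(z_a, z_b, tau=0.1):
    """Mean InfoNCE loss over 2B augmented views (illustrative sketch).

    z_a, z_b: arrays of shape (B, d); row i of each holds the embedding of
    one of the two augmentations of image i. Cosine similarity is assumed.
    """
    B = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)             # (2B, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit vectors -> dot product = cosine
    logits = z @ z.T / tau                             # (2B, 2B) scaled similarities
    np.fill_diagonal(logits, -np.inf)                  # a view is never compared with itself

    # Index of the positive partner for every view: i <-> i + B.
    pos = np.concatenate([np.arange(B, 2 * B), np.arange(0, B)])

    # Numerically stable log of the denominator (positive + all negatives).
    row_max = logits.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))

    loss_per_view = log_denom - logits[np.arange(2 * B), pos]
    return loss_per_view.mean()                        # the 1/(2B) average
```

Masking the diagonal keeps each view out of its own denominator, so every row contains exactly one positive and \(2B - 2\) negatives, matching the per-sample formula.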
2. Multimodal Contrastive Learning
The pretrained image encoder \(f_\phi\) is fine-tuned jointly with a text encoder \(f_\theta\) for cross-modal alignment. Given matched image-text pairs \((x, q)\) and their negatives \((x^-, q^-)\), the losses are:
\[ \mathcal{D}_{x,q} = - \log \frac{\exp( sim(z_q, z_x)/\tau )} {\exp( sim(z_q, z_x)/\tau ) + \sum_{z_{x^-}} \exp( sim(z_q, z_{x^-})/\tau )} \]
\[ \mathcal{H}_{x,q} = - \log \frac{\exp( sim(z_x, z_q)/\tau )} {\exp( sim(z_x, z_q)/\tau ) + \sum_{z_{q^-}} \exp( sim(z_x, z_{q^-})/\tau )} \]
\[ \mathcal{L} = \frac{1}{2B} \sum_{(x,q)} \left( \mathcal{D}_{x,q} + \mathcal{H}_{x,q} \right) \]
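The two directional terms can be read as a text-to-image loss \(\mathcal{D}\) and an image-to-text loss \(\mathcal{H}\) over the same similarity matrix. The NumPy sketch below shows one way to compute them. It assumes in-batch negatives (every non-matching image or text in the batch), cosine similarity, and a default `tau=0.07`; none of these are specified in the text above, and the function names are hypothetical.

```python
import numpy as np

def _log_softmax_rows(logits):
    """Row-wise log-softmax, shifted by the row maximum for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def symmetric_contrastive_loss(z_x, z_q, tau=0.07):
    """Symmetric image-text contrastive loss (illustrative sketch).

    z_x: (B, d) image embeddings; z_q: (B, d) text embeddings.
    Row i of z_x and row i of z_q form a matched pair; all other rows in
    the batch are treated as negatives (an assumption of this example).
    """
    B = z_x.shape[0]
    z_x = z_x / np.linalg.norm(z_x, axis=1, keepdims=True)
    z_q = z_q / np.linalg.norm(z_q, axis=1, keepdims=True)

    logits = z_q @ z_x.T / tau                    # row i: text i against every image
    d = -np.diag(_log_softmax_rows(logits))       # D_{x,q}: text -> image
    h = -np.diag(_log_softmax_rows(logits.T))     # H_{x,q}: image -> text
    return (d + h).sum() / (2 * B)
```

Both directions reuse a single \(B \times B\) matrix: its rows give the text-to-image terms and its columns give the image-to-text terms.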
3. LVLM-based Dual-Score Fusion (LDSF) Re-ranking
Because clinical descriptions are often highly similar to one another, we introduce LDSF re-ranking to improve text-to-image retrieval. The system first retrieves the top-\(k\) candidates using the cosine similarity \(s_{\text{init}}\); an LVLM (e.g., Gemini 2.5 Pro) then estimates the semantic relevance \(s_{\text{LVLM}}\) of each candidate to the query. The final score is the average of the two:
\[ s_{\text{final}}(z_q, z_x) = \frac{1}{2} \left[ s_{\text{init}}(z_q, z_x) + s_{\text{LVLM}}(z_q, z_x) \right] \]
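The fusion step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ldsf_rerank` and `lvlm_score` are hypothetical names, the LVLM call is replaced by a stub lookup, and the LVLM relevance is assumed to be reported on a scale comparable to the cosine similarity.

```python
from typing import Callable, Dict, List, Tuple

def ldsf_rerank(
    init_scores: Dict[str, float],
    lvlm_score: Callable[[str], float],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Re-rank the top-k candidates by averaging the two scores.

    init_scores: image id -> cosine similarity to the query (s_init).
    lvlm_score:  hypothetical callable returning the LVLM relevance
                 (s_LVLM) of an image id for the same query.
    """
    # Stage 1: keep the k candidates with the highest initial similarity.
    top_k = sorted(init_scores, key=init_scores.get, reverse=True)[:k]
    # Stage 2: s_final = 0.5 * (s_init + s_LVLM), computed only for those k.
    fused = [(img, 0.5 * (init_scores[img] + lvlm_score(img))) for img in top_k]
    return sorted(fused, key=lambda pair: pair[1], reverse=True)

# Toy data: the stub dictionary stands in for a real LVLM call.
init = {"img_a": 0.9, "img_b": 0.8, "img_c": 0.7, "img_d": 0.1}
stub_lvlm = {"img_a": 0.2, "img_b": 0.9, "img_c": 0.6, "img_d": 1.0}

ranked = ldsf_rerank(init, stub_lvlm.get, k=3)
print([img for img, _ in ranked])  # order after fusion: img_b, img_c, img_a
```

In this toy run `img_d` never reaches the LVLM because it falls outside the initial top-\(k\), which shows that the re-ranker can only reorder candidates the first stage already retrieved.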

Figure 3: Illustrating the capability of LVLMs (e.g., Gemini) to understand and interpret endoscopic images.

Figure 4: Instruction prompt for large VLMs to evaluate the relevance score of the retrieval results.