VLM Explainer: From Patches to Phrases

Wed, 01 Jan 2025 00:00:00 +0000

BLIP generates fluent captions but gives no account of why. The question this project answers: do the image regions the model highlights actually cause the caption, or do they just correlate with it?

Built Grad-CAM token attribution to identify which image regions influenced specific caption words. Validated through perturbation testing: masking the dog region changed “a family walking with their dog” to “a family walking on the beach.” Causal influence, not correlation, is the standard.
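The attribution step can be sketched as follows. This is a minimal, dependency-free illustration of the standard Grad-CAM computation, not the project's own code. It assumes the activations and gradients of one BLIP vision layer have already been extracted for a single caption token and reshaped to a channels x height x width patch grid. The names `grad_cam` and `region_mask` and the 0.5 threshold are hypothetical.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap from per-channel activations and gradients.

    Both inputs are nested lists shaped [channels][height][width]. The
    gradients are those of one caption token's score with respect to the
    activations (assumed to be precomputed elsewhere).
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight = spatial mean of that channel's gradient.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]
    # Weighted sum over channels, then ReLU to keep positive evidence only.
    cam = [
        [max(0.0, sum(w * a[i][j] for w, a in zip(weights, activations)))
         for j in range(width)]
        for i in range(height)
    ]
    peak = max(max(row) for row in cam)
    if peak == 0:
        return cam  # no positive evidence anywhere
    # Scale to [0, 1] so a fixed threshold is meaningful.
    return [[v / peak for v in row] for row in cam]


def region_mask(cam, threshold=0.5):
    """Mark the patches to occlude in a perturbation test (threshold is an assumption)."""
    return [[v >= threshold for v in row] for row in cam]
```

Occluding the marked patches and re-running the captioner is the perturbation test: if the attributed word disappears from the caption, the region is causally implicated rather than merely correlated.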

Added CLIP alignment scoring to cross-verify image-text grounding. The interactive Streamlit interface offers two-click region masking and a layer-evolution visualization across shallow-to-deep BLIP representations.
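CLIP compares an image and a text by the cosine similarity of their embeddings, so the cross-check can be sketched roughly as below. The sketch assumes the embeddings have already been produced by a CLIP model and are passed in as plain lists of floats. `alignment_drop` is a hypothetical helper for comparing the score before and after masking, not a function from the project or from any library.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def alignment_drop(image_emb, masked_image_emb, caption_emb):
    """How much image-caption alignment falls once a region is masked.

    A large positive value suggests the masked region carried the
    evidence for the caption.
    """
    before = cosine_similarity(image_emb, caption_emb)
    after = cosine_similarity(masked_image_emb, caption_emb)
    return before - after
```

A drop near zero after masking would suggest the highlighted region contributed little to the image-text match, which is a useful second opinion alongside the caption change itself.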

VLM-Based Hazard Reasoning

Wed, 01 Jan 2025 00:00:00 +0000

YOLO can flag a red zone but cannot explain why the scene is dangerous. This project tests whether general-purpose VLMs can fill that explainability gap using structured domain-aware prompting.

Prompts give models physical context about melt shop environments: what pot haulers are, what molten metal implies, and what worker corridors mean for safety. The reasoning pipeline has two stages: first describe the scene neutrally, then evaluate it against safety conditions.
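A minimal sketch of the two-stage flow follows. The prompt wording in `DOMAIN_CONTEXT` is illustrative only and is not the project's actual prompt text. `ask_vlm` is a hypothetical callable taking `(prompt, image)` and returning a string, standing in for whichever VLM client is used.

```python
# Illustrative domain context; the wording is an assumption, not the project's prompts.
DOMAIN_CONTEXT = (
    "You are reviewing images from a melt shop. "
    "Pot haulers are heavy vehicles that carry pots of molten material. "
    "Molten metal implies extreme heat and splash risk. "
    "Worker corridors are walkways that should stay clear of vehicles and hot loads."
)


def assess_scene(image, ask_vlm):
    """Run the describe-then-evaluate pipeline with any VLM wrapper.

    ask_vlm(prompt, image) -> str is a hypothetical interface.
    """
    # Stage 1: neutral description, with no safety judgment requested.
    describe_prompt = (
        DOMAIN_CONTEXT
        + "\n\nDescribe the scene neutrally. List the entities you see and "
        "where they are. Do not judge safety yet."
    )
    description = ask_vlm(describe_prompt, image)

    # Stage 2: evaluate the stage-1 description against safety conditions.
    evaluate_prompt = (
        DOMAIN_CONTEXT
        + "\n\nScene description:\n"
        + description
        + "\n\nEvaluate this scene against the safety conditions above. "
        "State the hazard, a risk level, and a recommended action."
    )
    assessment = ask_vlm(evaluate_prompt, image)
    return {"description": description, "assessment": assessment}
```

Keeping the description separate from the evaluation means the second prompt judges a stated scene, which makes it easier to see whether an error came from perception or from reasoning.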

Output covers scene description, detected entities, spatial relationships, hazard assessment, risk level, and recommended action.
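One way to hold that output is a small typed record, sketched here under the assumption that the model is asked to reply in JSON. The field names mirror the six items listed above; the class name `HazardReport`, the function `parse_report`, and the choice of lists for entities and relationships are hypothetical.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class HazardReport:
    scene_description: str
    detected_entities: list
    spatial_relationships: list
    hazard_assessment: str
    risk_level: str  # kept as free text; the allowed levels are not specified here
    recommended_action: str


def parse_report(raw_json):
    """Parse a JSON reply into a HazardReport, rejecting incomplete replies."""
    data = json.loads(raw_json)
    expected = [f.name for f in fields(HazardReport)]
    missing = [name for name in expected if name not in data]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    # Ignore any extra keys the model may have added.
    return HazardReport(**{name: data[name] for name in expected})
```

Failing loudly on a missing field keeps a partial model reply from being mistaken for a complete assessment.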

Explainable AI | Jay Polra
