Geometry-Grounded Novel-View Acoustic Synthesis

Thu, 01 Jan 2026 00:00:00 +0000

Prior audio-visual methods depend on Structure-from-Motion (SfM) to build scene geometry, which is expensive, brittle with sparse frames, and unavailable at inference time in many real scenes.

Built a feed-forward pipeline using VGGT geometry encoding and a cross-attention decoder that routes on geometry but retrieves learned acoustic priors. No COLMAP, no target-view image required.

Key architectural decision: separate keys and values in cross-attention so visual geometry decides which reference views are relevant, but the actual retrieved information is the learned acoustic fingerprint of those views, not the visual features themselves.
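The key/value split can be illustrated with a minimal single-head sketch in NumPy. This is not the project's actual decoder: the function name, the projection matrices, the tensor shapes, and the assumption that the query comes from a target-view geometry embedding are all hypothetical, chosen only to show that attention weights depend on geometry while the retrieved content comes from a separate acoustic table.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def geometry_routed_attention(target_geom, ref_geom, ref_acoustic, w_q, w_k, w_v):
    """Single-head cross-attention with keys and values from different sources.

    target_geom:  (T, Dg) geometry embedding of the target view (assumed input)
    ref_geom:     (R, Dg) geometry features of the reference views -> keys
    ref_acoustic: (R, Da) learned acoustic embeddings of those views -> values
    """
    q = target_geom @ w_q            # (T, D) queries from geometry
    k = ref_geom @ w_k               # (R, D) keys from geometry
    v = ref_acoustic @ w_v           # (R, D) values from acoustic embeddings
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)   # (T, R): which reference views matter
    return weights @ v, weights          # retrieved acoustic content, routing

# Toy shapes and random tensors, for illustration only.
rng = np.random.default_rng(0)
T, R, Dg, Da, D = 1, 4, 8, 6, 16
target_geom = rng.normal(size=(T, Dg))
ref_geom = rng.normal(size=(R, Dg))
ref_acoustic = rng.normal(size=(R, Da))
w_q = rng.normal(size=(Dg, D))
w_k = rng.normal(size=(Dg, D))
w_v = rng.normal(size=(Da, D))

out, weights = geometry_routed_attention(
    target_geom, ref_geom, ref_acoustic, w_q, w_k, w_v
)
```

In this sketch, changing `ref_acoustic` changes the output but leaves `weights` untouched, which is the separation the design relies on: geometry alone decides the routing.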

Outperformed prior SOTA (AV-Cloud) on all four metrics (MAG, ENV, LRE, DPAM) with fewer parameters (3.24M vs 3.91M) and preprocessing 10x faster than COLMAP.

VLM Explainer: From Patches to Phrases

Wed, 01 Jan 2025 00:00:00 +0000

BLIP generates fluent captions but gives no account of why. The question this project answers: do the image regions the model highlights actually cause the caption, or do they just correlate with it?

Built Grad-CAM token attribution to identify which image regions influenced specific caption words. Validated through perturbation testing: masking the dog region changed “a family walking with their dog” to “a family walking on the beach.” Causal influence, not correlation, is the standard.
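The perturbation check can be sketched as a small harness: caption the image, mask the attributed region, caption again, and see whether the token disappears. Everything below is illustrative. `mask_region`, `perturbation_test`, and the toy captioner are hypothetical stand-ins (a real run would pass a function that wraps BLIP), and the box coordinates are made up.

```python
import numpy as np

def mask_region(image, box, fill=0.0):
    """Return a copy of `image` with box = (top, left, bottom, right) filled."""
    top, left, bottom, right = box
    masked = image.copy()
    masked[top:bottom, left:right] = fill
    return masked

def perturbation_test(caption_fn, image, box, token):
    """Caption before and after masking; report whether `token` was removed."""
    before = caption_fn(image)
    after = caption_fn(mask_region(image, box))
    removed = token in before.split() and token not in after.split()
    return {"before": before, "after": after, "token_removed": removed}

def toy_captioner(image):
    # Stand-in for a real captioning model: it "sees" a dog only when the
    # centre patch is bright. Purely illustrative.
    has_dog = image[2:4, 2:4].mean() > 0.5
    if has_dog:
        return "a family walking with their dog"
    return "a family walking on the beach"

image = np.zeros((6, 6))
image[2:4, 2:4] = 1.0  # the "dog" region in this toy image

result = perturbation_test(toy_captioner, image, (2, 2, 4, 4), "dog")
control = perturbation_test(toy_captioner, image, (0, 0, 2, 2), "dog")
print(result["token_removed"])   # True: masking the region removes the word
print(control["token_removed"])  # False: masking elsewhere leaves it intact
```

The control mask matters: if masking an unrelated region also removed the word, the caption change would not be evidence that the highlighted region was the cause.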

Added CLIP alignment scoring to cross-verify image-text grounding. Interactive Streamlit interface with 2-click region masking and layer-evolution visualization across shallow-to-deep BLIP representations.
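CLIP-style alignment is commonly scored as the cosine similarity between an image embedding and a text embedding. The sketch below shows only that arithmetic, with hand-written toy vectors in place of real CLIP outputs. The function name is hypothetical, and the exact scoring used in this project is not stated above.

```python
import numpy as np

def alignment_score(image_emb, text_emb):
    """Cosine similarity between two embedding vectors."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return float(image_emb @ text_emb)

# Toy embeddings; a real pipeline would take these from CLIP's encoders.
image_emb = np.array([0.9, 0.1, 0.4])
caption_a = np.array([0.8, 0.2, 0.5])    # points in a similar direction
caption_b = np.array([-0.3, 0.9, -0.1])  # points in a different direction

score_a = alignment_score(image_emb, caption_a)
score_b = alignment_score(image_emb, caption_b)
```

A higher score for one caption than another suggests stronger image-text grounding, which gives a second signal to compare against the attribution maps.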

PyTorch | Jay Polra
