Projects | Jay Polra

Geometry-Grounded Novel-View Acoustic Synthesis

Thu, 01 Jan 2026 00:00:00 +0000

Prior audio-visual methods depend on Structure-from-Motion to build geometry: expensive, brittle with sparse frames, and unavailable at inference in many real scenes.

Built a feed-forward pipeline using VGGT geometry encoding and a cross-attention decoder that routes on geometry but retrieves learned acoustic priors. No COLMAP, no target-view image required.

Key architectural decision: separate keys and values in cross-attention so visual geometry decides which reference views are relevant, but the actual retrieved information is the learned acoustic fingerprint of those views, not the visual features themselves.
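The key/value split can be illustrated with a minimal single-query attention sketch. This is not the project's decoder: the function names, the toy vectors, and the scaled dot-product scoring are assumptions chosen for illustration, and it is written in plain Python so it runs without dependencies. Keys stand in for geometry features of the reference views, and values stand in for their learned acoustic embeddings.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def geometry_routed_attention(query_geom, ref_geom_keys, ref_acoustic_values):
    """Score reference views by geometry, then mix their acoustic embeddings.

    query_geom:          geometry feature of the target pose (hypothetical)
    ref_geom_keys:       one geometry feature per reference view (keys)
    ref_acoustic_values: one acoustic embedding per reference view (values)
    """
    scale = math.sqrt(len(query_geom))
    scores = [
        sum(q * k for q, k in zip(query_geom, key)) / scale
        for key in ref_geom_keys
    ]
    weights = softmax(scores)  # routing is decided by geometry only
    value_dim = len(ref_acoustic_values[0])
    retrieved = [
        sum(w * value[j] for w, value in zip(weights, ref_acoustic_values))
        for j in range(value_dim)
    ]
    return weights, retrieved


# Toy example: three reference views, 2-D geometry keys, 3-D acoustic values.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
weights, retrieved = geometry_routed_attention(query, keys, values)
```

The first reference view has the geometry closest to the query, so it gets the largest weight, and the output is dominated by that view's acoustic embedding rather than by any visual feature.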

Outperformed prior SOTA (AV-Cloud) on all four metrics (MAG, ENV, LRE, DPAM) with fewer parameters (3.24M vs 3.91M) and 10x faster preprocessing than COLMAP.

AI Hazard Recognition System

Wed, 01 Jan 2025 00:00:00 +0000

The core research problem: how does a detection-based safety system stay reliable when part of the physical environment is hidden from the camera?

Detection alone fails here. When a vehicle disappears into a blind spot, the absence of a detection does not mean the zone is safe. Built entry-exit state tracking as a conservative safety heuristic: assume worst case until exit is confirmed by a boundary camera.
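The entry-exit heuristic can be sketched as a small state holder. The class and method names below are hypothetical, and the sketch assumes the tracker supplies stable track IDs; it only shows the conservative rule, not the deployed system.

```python
from dataclasses import dataclass, field


@dataclass
class BlindZone:
    """Tracks which objects entered a blind zone and have not been seen leaving."""

    occupants: set = field(default_factory=set)

    def on_entry(self, track_id):
        # A boundary camera saw this track cross into the blind zone.
        self.occupants.add(track_id)

    def on_exit_confirmed(self, track_id):
        # Only a confirmed sighting at an exit boundary clears the track.
        self.occupants.discard(track_id)

    @property
    def is_safe(self):
        # Conservative rule: safe only when every entry has a matching exit.
        return not self.occupants


zone = BlindZone()
zone.on_entry("truck-7")
safe_while_hidden = zone.is_safe   # vehicle is out of sight, zone stays unsafe
zone.on_exit_confirmed("truck-7")
safe_after_exit = zone.is_safe
```

Note that a missing detection never changes the state: only an explicit exit event does.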

Built blocker-aware zone isolation across 4 sequential camera feeds. Real-time pipeline runs YOLO detection and DeepSORT tracking at 8 FPS with under 50 ms zone state updates. Full React dashboard for safety officer monitoring.

VLM Explainer: From Patches to Phrases

Wed, 01 Jan 2025 00:00:00 +0000

BLIP generates fluent captions but gives no account of why. The question this project answers: do the image regions the model highlights actually cause the caption, or do they just correlate with it?

Built Grad-CAM token attribution to identify which image regions influenced specific caption words. Validated through perturbation testing: masking the dog region changed “a family walking with their dog” to “a family walking on the beach.” Causal influence, not correlation, is the standard.
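For reference, the Grad-CAM map and the masking step can be sketched on toy numbers. In a real run the activations and gradients would come from a BLIP layer and a specific caption token's score; here they are hand-written lists, the function names are hypothetical, and the image is assumed to share the map's resolution (real pipelines upsample the map first).

```python
def grad_cam(activations, gradients):
    """Grad-CAM on nested lists shaped [channel][row][col].

    Each channel weight is the spatial mean of its gradients; the map is the
    ReLU of the weighted sum of activations across channels.
    """
    rows, cols = len(activations[0]), len(activations[0][0])
    alphas = [
        sum(sum(row) for row in grad) / (rows * cols) for grad in gradients
    ]
    return [
        [
            max(0.0, sum(a * act[y][x] for a, act in zip(alphas, activations)))
            for x in range(cols)
        ]
        for y in range(rows)
    ]


def mask_region(image, cam, threshold, fill=0):
    """Replace pixels whose attribution reaches the threshold with `fill`."""
    return [
        [fill if cam[y][x] >= threshold else pixel for x, pixel in enumerate(row)]
        for y, row in enumerate(image)
    ]


# Toy inputs: 2 channels over a 2x2 feature map.
activations = [[[1, 0], [0, 0]], [[0, 0], [0, 2]]]
gradients = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
cam = grad_cam(activations, gradients)

image = [[5, 6], [7, 8]]
masked = mask_region(image, cam, threshold=0.5)
```

The perturbation test then re-captions the masked image and checks whether the attributed word changes, which is what separates causal influence from correlation.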

Added CLIP alignment scoring to cross-verify image-text grounding. Interactive Streamlit interface with 2-click region masking and layer-evolution visualization across shallow-to-deep BLIP representations.

VLM-Based Hazard Reasoning

Wed, 01 Jan 2025 00:00:00 +0000

YOLO can flag a red zone but cannot explain why the scene is dangerous. This project tests whether general-purpose VLMs can fill that explainability gap using structured domain-aware prompting.

Prompts give models physical context about melt shop environments: what pot haulers are, what molten metal implies, what worker corridors mean for safety. Two-stage reasoning pipeline: first describe the scene neutrally, then evaluate against safety conditions.
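A minimal sketch of the two-stage flow, assuming a generic `ask_model(prompt, image_ref)` callable. That callable, the prompt wording, and the choice to give domain context in stage one and safety conditions in stage two are all assumptions for illustration, not the project's actual prompts or any specific VLM API.

```python
def two_stage_assessment(image_ref, ask_model, domain_context, safety_conditions):
    """Describe the scene first, then evaluate that description for hazards."""
    # Stage 1: neutral description, no safety judgement requested.
    describe_prompt = (
        f"{domain_context}\n\n"
        "Describe the scene neutrally: list the entities you see and their "
        "spatial relationships. Do not judge safety yet."
    )
    description = ask_model(describe_prompt, image_ref)

    # Stage 2: evaluate the stage-1 description against explicit conditions.
    evaluate_prompt = (
        f"Scene description:\n{description}\n\n"
        f"Safety conditions:\n{safety_conditions}\n\n"
        "State the hazard assessment, a risk level, and a recommended action."
    )
    assessment = ask_model(evaluate_prompt, image_ref)
    return {"scene_description": description, "hazard_assessment": assessment}


# Stand-in model so the sketch runs offline; it just records the prompts.
prompts_seen = []


def fake_model(prompt, image_ref):
    prompts_seen.append(prompt)
    return f"reply {len(prompts_seen)}"


result = two_stage_assessment(
    "frame_001.jpg",  # hypothetical image reference
    fake_model,
    domain_context="[domain notes: pot haulers, molten metal, worker corridors]",
    safety_conditions="[site safety conditions]",
)
```

Keeping the stages separate means the description is produced before the model is asked to judge anything, so the hazard call can be traced back to what was actually observed.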

Output covers scene description, detected entities, spatial relationships, hazard assessment, risk level, and recommended action.

Candidate Matcher

Mon, 01 Jan 2024 00:00:00 +0000

Keyword overlap misses semantic alignment. A resume can match a job description on surface terms and still be irrelevant.

Built ATS-style ranking using MiniLM embeddings and cosine similarity. LLM-generated summaries explain the match rather than just scoring it. Robust multi-format parsing pipeline for PDF, DOCX, and TXT files with fallback extraction logic. Deployed on Streamlit Cloud with Gemma 2B for summarization and Flan-T5 as a CPU-compatible fallback.
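The ranking step reduces to cosine similarity between one job-description vector and each resume vector. The sketch below uses short hand-written vectors in place of real MiniLM embeddings, and the function and candidate names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an empty embedding as no match
    return dot / (norm_a * norm_b)


def rank_candidates(job_vec, resume_vecs):
    """Return (name, score) pairs sorted from best to worst match."""
    scored = [
        (name, cosine_similarity(job_vec, vec))
        for name, vec in resume_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for sentence embeddings.
job = [1.0, 0.0, 1.0]
resumes = {
    "candidate_a": [2.0, 0.0, 2.0],  # same direction as the job vector
    "candidate_b": [0.0, 1.0, 0.0],  # orthogonal to the job vector
    "candidate_c": [1.0, 1.0, 0.0],  # partial overlap
}
ranking = rank_candidates(job, resumes)
```

Because cosine similarity ignores vector length, `candidate_a` ranks first even though its vector is twice as long as the job vector.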