
From Image to Language, From Language to Reasoning: Boosting VLM Performance with Camera Context

· 7 min read
Minjun Son
POSTECH
Jisu Kang
AI Specialist

This work is a collaborative research effort with Minjun Son (advised by Prof. Youngmyung Ko) as part of the "Campus: Beyond Safety to Intelligence – Postech Living Lab Project with EVA".


📝 Introduction: Making User Queries Smarter by Enhancing Language with Image Context

EVA is a system that detects anomalies using hundreds to thousands of smart cameras. We used VLMs/LLMs to automatically infer each camera's context and embedded it into the prompt, creating a camera-context-aware anomaly detection pipeline that reflects the actual scene of the target image. By providing the camera context extracted from a single frame as prior knowledge to the VLLM, we confirmed a meaningful improvement in accuracy and deeper interpretability compared to the existing baseline.




1. Why "Context-Blind" is a Problem

The camera environments handled by EVA are diverse, including offices, entrances, parking lots, construction sites, and hallways, with each camera serving a unique surveillance purpose. Despite this variety, most conventional Video Anomaly Detection (VAD) systems suffer from a fundamental limitation: they ignore these environmental and purpose-related differences.

  • Limitations of Motion-only Judgment: "A person running in a hallway" and "a person running on a sports field" involve identical movements, but their meaning from a security perspective is completely different. Without context, motion data alone can lead to both being mistaken as 'anomalous', or both as 'normal'.

  • Lack of Domain Knowledge: General-purpose VLM/VLLMs possess abundant general knowledge but lack specialized expertise in the security and surveillance domain. This makes them prone to generating ambiguous or inaccurate explanations, or even hallucinations, in critical situations.

These issues ultimately lead to High False Alarms and contribute to Alert Fatigue for system operators.






2. EVA’s Core Question and Approach

Our research began with the following central question:

"For thousands of cameras lacking any context information, could we automatically infer 'where, what, and for what purpose' the camera is pointed, simply by analyzing the video?"

To realize this, we defined three main objectives and designed the pipeline accordingly:

  1. Automatic Camera Context Inference: Developing a methodology to automatically extract high-level information from video frames, such as viewpoint (Context), purpose, main objects (Object), and activities (Activity).
  2. Finding Optimal Representation for LLM/VLM Prompts: Experimenting and analyzing which format—JSON, summarized sentences, etc.—is most efficient for providing context to the AI model in terms of performance, cost, and response time.
  3. Context-Utilizing Pipeline Design: Focusing on designing an integrated pipeline based on Detection + Exception Logic, which utilizes the image context to determine anomaly status, going beyond simple object detection.



3. What is Camera Context?

Through initial industry-academia meetings, we concretized Camera Context not merely as simple metadata, but as a form of knowledge that corrects and strengthens the detection logic.

Key Contextual Elements:

  • Location: Type of scene, such as parking lot, entrance, construction site, hallway, or outdoor field.
  • Camera Orientation/Angle: Viewpoint information, such as whether it's facing an entrance head-on or looking down from above.
  • Lighting/Environmental Factors: Indoor/outdoor classification, lighting conditions, shadow direction, and other physical elements.
  • Assumptions on Main Objects/Activities: Example: In a construction site context, the assumption that "workers are primarily wearing helmets" must be reflected in the anomaly detection stage.

This context is used to construct the Exception Logic, going beyond the Detection itself. Exception Logic refers to the rules that combine the detection results with the context to make the final judgment on whether a situation is anomalous.
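
For illustration, below is a minimal sketch of how such Exception Logic could combine a raw detection with the camera context; the rule table, data classes, and function names are hypothetical and not EVA's actual implementation.

# Hypothetical sketch of context-driven Exception Logic (not EVA's actual code).
# A raw detection is only escalated to an anomaly if the camera context
# says that behavior is unexpected for this particular scene.
from dataclasses import dataclass

@dataclass
class CameraContext:
    location_type: str   # e.g., "indoor"
    view_focus: str      # e.g., "hallway", "entrance_gate"
    purpose: str         # e.g., "access control"

@dataclass
class Detection:
    label: str           # e.g., "person_running"
    duration_s: float

# Illustrative rule table: (view_focus, detection label) -> anomalous?
EXCEPTION_RULES = {
    ("hallway", "person_running"): True,           # running indoors is suspicious
    ("sports_field", "person_running"): False,     # running on a field is normal
    ("construction_site", "person_no_helmet"): True,
}

def is_anomalous(det: Detection, ctx: CameraContext) -> bool:
    """Final judgment = detection result corrected by the camera context."""
    rule = EXCEPTION_RULES.get((ctx.view_focus, det.label))
    if rule is not None:
        return rule
    return False  # conservative default when no rule covers the case

ctx = CameraContext(location_type="indoor", view_focus="hallway", purpose="corridor monitoring")
det = Detection(label="person_running", duration_s=4.2)
print(is_anomalous(det, ctx))  # True: the same motion would be normal on a sports field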

Initially, we structured the context in JSON format for the LLM/VLM. Through various experiments, we confirmed that providing the context in a structured JSON format (as shown below) resulted in better anomaly detection performance than providing it as a summarized string.

{
  "camera_id": "C1234",
  "location_type": "indoor",
  "view_focus": "entrance_gate",
  "main_object": ["person", "bag"],
  "purpose": "access control",
  "activity": "entry/exit monitoring"
}
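
As an illustration of how this structured context might be embedded into the model prompt, the following is a minimal sketch; the prompt wording and the build_prompt helper are assumptions, not EVA's actual template.

# Sketch of injecting the JSON camera context into a VLLM prompt (illustrative only).
import json

camera_context = {
    "camera_id": "C1234",
    "location_type": "indoor",
    "view_focus": "entrance_gate",
    "main_object": ["person", "bag"],
    "purpose": "access control",
    "activity": "entry/exit monitoring",
}

def build_prompt(user_query: str, context: dict) -> str:
    # Serializing the context as JSON keeps the same fields in a consistent,
    # machine-readable layout across thousands of cameras.
    return (
        "You are an anomaly detection assistant for a surveillance camera.\n"
        f"Camera context (JSON):\n{json.dumps(context, indent=2)}\n\n"
        f"User query: {user_query}\n"
        "Decide whether the clip shows an anomaly and explain why."
    )

print(build_prompt("Alert me if someone loiters near the gate.", camera_context))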



4. Single-Frame Context: Summarizing the Entire Video in One Image

Inputting the entire video clip into the VLM every time is impractical in terms of cost and latency. Therefore, we focused on extracting the core context from a single, highly representative frame.

Three-Stage Pipeline Design

To extract higher-quality, in-depth context and utilize it for anomaly detection, we designed the overall process in three distinct stages:

  1. Offline Context Extraction (Static Context Extraction):

    • Select a representative frame that clearly shows the background.
    • Use VLM (e.g., Qwen3-VL-8B-Instruct) to analyze the scene type (Factory floor, Hallway, etc.), potential Risk Scenarios, main objects, and layout.
    • Generate this information as structured text (JSON) that the AI can understand, saving it as Static Context (prior knowledge).
  2. Online Criteria Formulation (Online Criteria Generation):

    • Combine the user-defined Anomaly Detection Query with the pre-extracted Static Context.
    • The VLLM automatically generates specific Rule-Based Criteria defining what behavior should be considered anomalous. Example: "In this factory environment, it is considered risky if a worker stays near machinery without a helmet for more than 10 seconds."
  3. Integrated Video Reasoning (Integrated Inference):

    • Input the actual video clip, the User Query, the Static Context, and the Generated Criteria into the VLLM.
    • The VLLM uses all this contextual information to determine the anomaly status and outputs an Explainable result that describes "why this action is considered risky in the current frame."

Thanks to this structure, instead of analyzing the entire video every time, we can reuse the extracted context while performing in-depth, context-aware reasoning.
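
For illustration, a minimal sketch of how the three stages could be wired together is shown below; the vlm_call placeholder, prompts, and function signatures are assumptions rather than EVA's actual code.

# Hypothetical orchestration of the three-stage pipeline (illustrative only).

def vlm_call(prompt: str, images=None) -> str:
    """Placeholder for a call to whichever VLM/VLLM backend is actually deployed."""
    raise NotImplementedError

# Stage 1 (offline): extract static context from one representative frame.
def extract_static_context(frame_path: str) -> str:
    prompt = (
        "Describe this surveillance scene as JSON with keys: "
        "location_type, view_focus, main_object, purpose, activity, risk_scenarios."
    )
    return vlm_call(prompt, images=[frame_path])

# Stage 2 (online): turn the user query + static context into explicit criteria.
def generate_criteria(user_query: str, static_context: str) -> str:
    prompt = (
        f"Camera context:\n{static_context}\n\n"
        f"User query: {user_query}\n"
        "List concrete, rule-based criteria for what counts as anomalous here."
    )
    return vlm_call(prompt)

# Stage 3: reason over the actual clip with all context attached.
def detect_anomaly(clip_frames: list, user_query: str, static_context: str, criteria: str) -> str:
    prompt = (
        f"Camera context:\n{static_context}\n\n"
        f"Criteria:\n{criteria}\n\n"
        f"User query: {user_query}\n"
        "Decide if the clip is anomalous and explain why, citing the criteria."
    )
    return vlm_call(prompt, images=clip_frames)

The static context and criteria can be cached per camera, so only Stage 3 needs to run on each incoming clip.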




5. What Tangible Improvements Were Achieved?

Experiments and analysis utilizing real-world datasets (UCF Crime, AI Hub, etc.) clearly demonstrated the effectiveness of the context-aware approach.

  • Significant Accuracy Improvement: When static context was provided to the VLLM as prior knowledge, anomaly detection accuracy saw a meaningful improvement. Performance was particularly enhanced for behaviors that are only meaningful in specific environments (e.g., not wearing a safety helmet in a construction site, loitering near an entrance).

  • Greatly Enhanced Interpretability: The model moved beyond simple anomaly detection to explain "why it deems this scene dangerous," based on camera location/purpose, potential risk factors, and user-defined detection scenarios. This critically helps operators quickly judge the reliability of an alert.

  • Potential for False Alarm Reduction: The system can accurately differentiate between the same "running" behavior, classifying it as normal in a sports field context but anomalous in an indoor hallway context. This directly contributes to reducing false alarm rates and operator fatigue.

  • Confirmation of Pipeline Design Importance: We confirmed that designing a context-aware detection pipeline that is flexible across diverse environments is far more crucial for practical service operation than simply building a classification model optimized for a single environment.




6. Future Roadmap

This research marks the starting point for a roadmap that expands from context extraction toward Knowledge Internalization.

Future research and development will proceed in the following directions:

  1. Context Element and Representation Advancement: Deeply analyzing which context elements are most significant for performance across different camera types, installation purposes, and detection scenarios. We plan to experiment with hybrid context representations combining JSON and natural language summaries.

  2. Adaptive VLLM Framework Development: Designing an Adaptive framework, based on LangGraph, that automatically selects different VLM/LLM combinations and prompt templates depending on the situation and query complexity. This will be provided as an EVA Agent API.

  3. Large-Scale Operational Validation: Applying this pipeline in a real EVA service environment across multiple camera clusters. We will validate comprehensive operational metrics, including alert reduction rate, False Positive/Negative rates, and operator feedback, with and without context utilization.

  4. Security Domain Knowledge Internalization: Gradually accumulating “EVA-specific security domain knowledge” based on recurring anomaly scenarios and operational feedback. We will explore reflecting this knowledge in the VLLM's initial prompts or adapters to enhance the model's intrinsic intelligence.




7. Conclusion

This industry-academia project sought a practical answer to the question: "How can we make the VLM not just say 'what is visible,' but 'why this scene is important for this specific camera right now'?"

We proposed a Camera Context-Aware VAD Pipeline built upon Single-Frame Scene Knowledge and confirmed its potential for meaningful performance and interpretability gains in real-world environments.

Video analysis without context is akin to "reading words without grammar." We showed that context extraction, starting from a single representative frame, can clearly guide the VLLM's focus toward "what the user truly needs to know."