
Teaching VLMs to Multitask: Enhancing Situation Awareness through Scenario Decomposition

· 8 min read
Hyunchan Moon
AI Specialist

At the core of EVA lies the ability to truly understand critical situations that occur simultaneously within a single scene—such as fires, people falling, or traffic accidents—without missing any of them. However, no matter how capable a Vision-Language Model (VLM) is, asking it to reason about too many things at once leads to a sharp degradation in cognitive performance.[2,3]

In this post, inspired by the recent text-to-video retrieval research Q₂E (Query-to-Event Decomposition)[1], we introduce Scenario Decomposition, a technique that enables VLMs to deeply understand complex, multi-scenario situations within a single frame.




1. 🚀 The Problem: Cognitive Bottlenecks in VLMs

EVA Agents are required to monitor complex urban roads and multi-use facilities 24/7. In the early stages, we trusted the general-purpose capabilities of VLMs and expected them—like humans—to grasp all situations at once.


[Initial Approach: Unified Query] "Look at this CCTV footage and check whether there is a fire, a fallen person, a traffic accident, or an emergency vehicle entering."


In practice, however, the model revealed the limits of selective perception. For example, it would clearly recognize a large bus in the center of the frame while treating the fire indicators or a fallen person behind it as mere background noise.

This occurs because the model’s attention mechanism is spread across multiple targets, resulting in shallow contextual understanding of critical risk signals.




2. 💡 The Solution: Extending Cognition with Q₂E

To address this issue, we adopted the core philosophy of Q₂E: “Complex events become easier to understand when they are decomposed.”


2.1. Key Insight from Q₂E (Query-to-Event)

The Q₂E paper demonstrates that instead of searching with a single term like “wildfire”, decomposing it into an event flow—precursor → progression → outcome—allows the model to retrieve and understand the event far more effectively.


2.2. EVA Agent Adoption: Scenario Decomposition

We extended this idea into Multi-Scenario Understanding. Rather than asking a VLM to observe everything broadly at once, we assign it distinct perspectives, each focused on a specific scenario.


  • Before (Single Perspective): “Notify me if there is a fire, a fallen person, or a traffic accident.” → (Ambiguous, attention dispersed)

  • After (Decomposed Perspectives):

    1. [Fire Perspective]: “Focus on pixel changes, colors, and smoke textures to identify signs of fire.”
    2. [Safety Monitoring Perspective]: “Analyze human pose and interactions with surrounding objects to detect falls.”
    3. [Traffic Monitoring Perspective]: “Examine vehicle collisions or abnormal stopping behavior to determine accidents.”

By decomposing the query this way, the VLM’s visual encoder activates strong, scenario-specific attention. Even when viewing the same image, the model gains clarity on what it should perceive, enabling it to detect previously overlooked situations.
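To make the contrast concrete, the sketch below represents the decomposed perspectives as independent instructions over the same frame. The dictionary structure and key names are illustrative assumptions; the prompt texts are taken from the list above.

```python
# Illustrative structure only: one focused instruction per scenario, each sent
# to the VLM as a separate query over the same frame.
PERSPECTIVE_PROMPTS = {
    "fire": "Focus on pixel changes, colors, and smoke textures to identify signs of fire.",
    "safety_monitoring": "Analyze human pose and interactions with surrounding objects to detect falls.",
    "traffic_monitoring": "Examine vehicle collisions or abnormal stopping behavior to determine accidents.",
}
```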




3. 🏛️ System Design: A Specialized Multi-Agent Architecture

Overall Pipeline

The EVA Agent processes complex scenarios through the following five stages:

  1. Classifier (Single or Multi): Determines the number of scenarios in the input
  2. Scenario Decomposition: Splits multi-scenario inputs into individual scenarios
  3. Scenario Enrichment: Injects rich detection conditions into each scenario [4]
  4. Scenario Clustering: Groups similar scenarios when the count exceeds four
  5. Parallel Inference: Executes VLM inference in parallel and aggregates results

(Conceptual illustration: a single image processed through four separate prompt paths)
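As a reading aid, here is a minimal, self-contained skeleton of this five-stage flow in Python. All function names and bodies are placeholders chosen for illustration rather than EVA's internal APIs; each stage is sketched in more detail in the sections below.

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder bodies so the control flow runs end to end; fuller sketches of
# each stage appear in the corresponding sections below.

def classify_scenarios(user_query: str) -> dict:
    # Stage 1: an LLM decides single vs. multi and lists the scenarios.
    return {"type": "multi", "count": 3, "scenarios": ["fire", "fall", "traffic accident"]}

def decompose_scenarios(classification: dict) -> list[str]:
    # Stage 2: one focused detection directive per scenario.
    return [f"{name} detection" for name in classification["scenarios"]]

def enrich_scenario(scenario: str) -> str:
    # Stage 3: inject detection criteria and exception conditions.
    return f"Detection task: {scenario}. Report visual evidence and applicable exceptions."

def cluster_scenarios(prompts: list[str], max_groups: int = 4) -> list[str]:
    # Stage 4: reduce to at most four prompts (a real system would group by
    # visual similarity via the LLM; this placeholder simply truncates).
    return prompts if len(prompts) <= max_groups else prompts[:max_groups]

def vlm_infer(frame: bytes, instruction: str) -> str:
    # Stage 5: placeholder for a single VLM call on one enriched prompt.
    return f"[VLM verdict for: {instruction}]"

def process_frame(frame: bytes, user_query: str) -> list[str]:
    scenarios = decompose_scenarios(classify_scenarios(user_query))
    prompts = cluster_scenarios([enrich_scenario(s) for s in scenarios])
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        return list(pool.map(lambda p: vlm_infer(frame, p), prompts))
```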

Stage 1: Scenario Classification

The LLM analyzes the user-provided scenario description to determine whether it represents a single or multiple scenarios.

[Example input]
"If a fire occurs, or someone falls, or a traffic accident is detected"

[LLM Prompt]
Analyze the following scenario description and determine whether it is
a single scenario or multiple scenarios:
"{user_input}"

[LLM Output]
{
  "type": "multi",
  "count": 3,
  "scenarios": ["fire", "fall", "traffic accident"]
}
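A minimal sketch of this classification step, assuming a generic `call_llm(prompt)` helper that stands in for whatever LLM client is actually used (here it returns a canned response so the example runs):

```python
import json

def call_llm(prompt: str) -> str:
    # Stand-in for a real LLM client; returns a canned response so the sketch runs.
    return '{"type": "multi", "count": 3, "scenarios": ["fire", "fall", "traffic accident"]}'

CLASSIFY_TEMPLATE = (
    "Analyze the following scenario description and determine whether it is "
    "a single scenario or multiple scenarios. Respond in JSON with the keys "
    '"type", "count", and "scenarios":\n"{user_input}"'
)

def classify_scenarios(user_input: str) -> dict:
    raw = call_llm(CLASSIFY_TEMPLATE.format(user_input=user_input))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # If the LLM response is not valid JSON, fall back to a single scenario.
        return {"type": "single", "count": 1, "scenarios": [user_input]}

print(classify_scenarios("If a fire occurs, or someone falls, or a traffic accident is detected"))
```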

Comparison with Q₂E:

  • Q₂E: Performs event decomposition from a single query like “2025 LA Fire”
  • EVA: Performs scenario decomposition from compound queries like “fire or fall or traffic accident”

Stage 2: Scenario Decomposition

Just as Q₂E decomposes a complex query into Prequel / Current / Sequel, we decompose complex scenes into domain-specific directives.

[Q₂E Decomposition]
Query: "2025 LA Fire"
→ Prequel: "What could happen before?"
→ Current: "What happens during?"
→ Sequel: "What could be the outcome?"

[EVA Decomposition]
Query: "fire or fall or traffic accident"
→ Scenario 1: "Fire indicator detection"
→ Scenario 2: "Fall accident detection"
→ Scenario 3: "Traffic accident detection"
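One possible implementation of this mapping, assuming the classifier output from Stage 1 is available. The directive strings mirror the decomposition above; the template table is an illustrative assumption (in practice the directives can also be generated by the LLM):

```python
# Map each classified scenario to a focused detection directive.
DIRECTIVE_TEMPLATES = {
    "fire": "Fire indicator detection",
    "fall": "Fall accident detection",
    "traffic accident": "Traffic accident detection",
}

def decompose_scenarios(classification: dict) -> list[str]:
    scenarios = classification.get("scenarios", [])
    # Unknown scenario names fall back to a generic "<name> detection" directive.
    return [DIRECTIVE_TEMPLATES.get(s, f"{s} detection") for s in scenarios]

classification = {"type": "multi", "count": 3,
                  "scenarios": ["fire", "fall", "traffic accident"]}
print(decompose_scenarios(classification))
# -> ['Fire indicator detection', 'Fall accident detection', 'Traffic accident detection']
```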

Stage 3: Scenario Enrichment

Comparison with Q₂E Refinement:

| Component | Q₂E | EVA |
|---|---|---|
| Input | Decomposed event ("Building on Fire") | Decomposed scenario ("Fire indicator detection") |
| Added Information | Temporal (2025) + Spatial (LA) + Event (Fire) | Domain knowledge (visual features, detection criteria) |
| Output | "Building on Fire during 2025 LA Fire" | "Fire detection via refined detection criteria" |
| Purpose | Leverage LLM event knowledge | Focus VLM visual attention |


Enrichment Example (See also: Turning Simple User Requests into AI-Understandable Instructions)

| Aspect | Before (Simple) | After (Enriched) |
|---|---|---|
| Prompt | "Detect fire." | "Detection: Fire is present with visible smoke.<br>Exception: Fire or smoke cannot be clearly identified due to camera angle." |
| VLM Behavior | Shallow scan of entire frame | Focused attention on color and texture features |
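A sketch of the enrichment step, again using a stand-in `call_llm` helper. The instruction template is an assumption written to match the spirit of the table above, not EVA's production prompt:

```python
def call_llm(prompt: str) -> str:
    # Stand-in for a real LLM client; returns a canned response so the sketch runs.
    return ("Detection: Fire is present with visible smoke.\n"
            "Exception: Fire or smoke cannot be clearly identified due to camera angle.")

ENRICH_TEMPLATE = (
    "Rewrite the following detection scenario as a precise instruction for a "
    "vision-language model. Include the visual features to look for, concrete "
    "detection criteria, and exception conditions under which detection should "
    "not be reported.\nScenario: {scenario}"
)

def enrich_scenario(scenario: str) -> str:
    return call_llm(ENRICH_TEMPLATE.format(scenario=scenario))

print(enrich_scenario("Fire indicator detection"))
```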


Stage 4: Scenario Clustering (When N > 4)

When the number of enriched prompts exceeds four, semantically similar scenarios are grouped to improve inference efficiency.

Clustering Rule:

IF N > 4:
    Ask LLM to group scenarios by visual similarity
    into a maximum of four groups
ELSE:
    Process each scenario independently

[Example with many scenarios]
Input: ["fire", "smoke", "explosion", "fall", "collapsed person",
        "traffic accident", "collision", "emergency vehicle"]

[LLM-based Semantic Grouping]
Group 1 (Fire-related): ["fire", "smoke", "explosion"]
Group 2 (Human safety): ["fall", "collapsed person"]
Group 3 (Traffic): ["traffic accident", "collision"]
Group 4 (Emergency response): ["emergency vehicle"]

→ Reduce 8 scenarios to 4 groups
→ Optimize VLM latency and memory usage
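The clustering rule can be sketched as follows; the grouping prompt and the canned `call_llm` response are illustrative assumptions rather than EVA's actual implementation:

```python
import json

def call_llm(prompt: str) -> str:
    # Stand-in for a real LLM client; returns a canned grouping so the sketch runs.
    return json.dumps([
        ["fire", "smoke", "explosion"],
        ["fall", "collapsed person"],
        ["traffic accident", "collision"],
        ["emergency vehicle"],
    ])

def cluster_scenarios(scenarios: list[str], max_groups: int = 4) -> list[list[str]]:
    if len(scenarios) <= max_groups:
        # N <= 4: process each scenario independently.
        return [[s] for s in scenarios]
    prompt = (
        f"Group the following scenarios by visual similarity into at most "
        f"{max_groups} groups. Respond as a JSON list of lists: {scenarios}"
    )
    return json.loads(call_llm(prompt))[:max_groups]

scenarios = ["fire", "smoke", "explosion", "fall", "collapsed person",
             "traffic accident", "collision", "emergency vehicle"]
print(cluster_scenarios(scenarios))  # 8 scenarios reduced to 4 groups
```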

Stage 5: Parallel VLM Inference

Each enriched prompt is sent to the VLM in parallel.

[Example]
enriched_prompts = [Group1_prompt, Group2_prompt, Group3_prompt, Group4_prompt]

PARALLEL_EXECUTE:
    FOR each prompt IN enriched_prompts:
        result = VLM(frame=cctv_image, instruction=prompt)
        results.append(result)

WAIT_ALL_COMPLETE
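In Python, the fan-out above can be sketched with a thread pool. The `vlm_infer` function is a placeholder for the actual VLM call (for example, a request to a model server); threads are a reasonable choice here because each call is I/O-bound:

```python
from concurrent.futures import ThreadPoolExecutor

def vlm_infer(frame: bytes, instruction: str) -> str:
    # Placeholder for a real VLM call (e.g., a request to a model server).
    return f"[VLM result for: {instruction}]"

def run_parallel_inference(frame: bytes, enriched_prompts: list[str]) -> list[str]:
    # Fan out one VLM request per enriched prompt, then wait for all results.
    with ThreadPoolExecutor(max_workers=len(enriched_prompts)) as pool:
        futures = [pool.submit(vlm_infer, frame, p) for p in enriched_prompts]
        return [f.result() for f in futures]

enriched_prompts = ["Group1_prompt", "Group2_prompt", "Group3_prompt", "Group4_prompt"]
results = run_parallel_inference(b"<cctv_frame_bytes>", enriched_prompts)
```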

Key Differences:

  • Q₂E analyzes a single video across multiple modalities (video, audio, text)
  • EVA analyzes a single frame across multiple scenarios (fire, fall, traffic)



4. 📊 Case Study: Unified vs. Decomposed Scenarios

This is not about asking more questions—it is about how accurately the model understands the situation.

Using real EVA evaluation footage, we compared unified scenarios and decomposed multi-scenarios on the same complex video.


4.1. Test Scenario

A video in which three critical incidents occur one after another was evaluated with both approaches.


Unified Scenario Prompt

"If at least one emergency vehicle exists,
or a fire occurs,
or at least one person has fallen"

Decomposed Multi-Scenario Prompts

Each scenario includes enriched detection criteria and exceptions.

| Scenario | Detection Steps | Exceptions |
|---|---|---|
| Emergency Vehicle | • At least one vehicle present<br>• At least one emergency vehicle identified | • No emergency vehicles<br>• All vehicles are taxis<br>• Police lights appear blue only<br>• Lights not mounted on roof<br>• Only headlights or brake lights visible |
| Fall / Traffic Accident | • At least one person present<br>• Person is fallen or traffic accident occurs | • No fallen person<br>• Only upper body visible<br>• Lower body not visible |
| Fire / Smoke | • Fire present<br>• Smoke detected | • No fire or smoke detected |

4.2. Experimental Results

| Scenario | Unified Prompt | Decomposed Prompts |
|---|---|---|
| Fallen Person | ✅ Detected | ✅ Detected |
| Emergency Vehicle | ✅ Detected | ✅ Detected |
| Fire | ❌ Missed | ✅ Detected |
| Overall Accuracy | 66.7% (2/3) | 100% (3/3) |

4.3. Detailed Analysis

1) Limitations of Unified Scenarios (Attention Interference)

VLM Output

Analysis:
✅ One fallen person detected (Scene 1)
✅ One emergency vehicle (police car) detected (Scene 2)
❌ No fire indicators detected

Result: One critical incident missed

Issues:

  • Attention bias toward early scenes (people, vehicles)
  • Subtle fire indicators (smoke, flame textures) ignored in favor of salient objects

2) Decomposed Scenario Detection Result

[Fire Detection Result]
⚠️ Fire detected

Detected visual features:
✓ Orange/red flames (Scene 3, bottom-right)
✓ Vertical gray smoke patterns
✓ Irregular bright region boundaries

Location: Bottom-right of frame
Confidence: 0.91
Decision: Fire confirmed (immediate dispatch required)

Why it worked:

  • Specialization: Fire Agent focuses exclusively on fire-related features
  • Interference Reduction: Other scenarios do not introduce perceptual noise



5. Conclusion: “The Depth of the Question Determines the Depth of Understanding”

This work leads to a clear conclusion. What limits VLM performance is not the number of parameters, but how we instruct the model to perceive the world.

Just as Q₂E improves retrieval accuracy by decomposing queries, we demonstrated that Scenario Decomposition enables VLMs to simultaneously and deeply understand multiple situations in complex CCTV monitoring environments.

Moving forward, we plan to extend this approach beyond static image analysis toward Temporal Event Decomposition, enabling models to reason over the temporal context of video streams.

(This post presents a creative adaptation of the Q₂E: Query-to-Event Decomposition methodology to solve Multi-Scenario Cognition challenges in vision-based tasks.)




References

[1] Shubhashis Roy Dipta, Francis Ferraro. "Q₂E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval." arXiv:2506.10202v2, 2025.

[2] Standley, T., Zamir, A., Chen, D., Guibas, L., Malik, J., & Savarese, S. "Which Tasks Should Be Learned Together in Multi-task Learning?" ICML 2020.

[3] Shtedritski, A., Rupprecht, C., & Vedaldi, A. "What Does CLIP Know About a Red Circle? Visual Prompt Engineering for VLMs." ICCV 2023.

[4] EVA Tech Blog: Turning Simple User Requests into AI-Understandable Instructions