Turning Simple User Requests into AI-Understandable Instructions
Expanding User Queries So AI Can Clearly Understand Intent
EVA is a system that operates based on user-issued commands. For EVA to make stable and accurate decisions, it is crucial that user requests are delivered in a form that AI can clearly understand.
However, even if the natural language expressions we use daily seem simple and clear to humans, they can be ambiguous from an AI model’s perspective, or they may require excessive implicit reasoning. This gap is exactly what often leads to AI system malfunctions or inaccurate decisions.
To fundamentally address this, EVA uses a Few-Shot prompting technique to automatically expand simple user requests into a structured query representation.
In this post, we focus on:
- Why simple natural-language requests are difficult for AI
- How query expansion can improve AI’s understanding
- How much performance improved in actual field deployments
and share practical methods and their impact for helping AI understand user intent more clearly.
1. “Notify me if someone collapses” — Simple on the surface, hard for AI
Let’s imagine a safety manager gives EVA the following request:
“Notify me if someone collapses.”
This sentence seems sufficiently clear for a human. However, in reality, it embeds various situations and decision criteria, and can be quite confusing for AI.
Diverse real-world scenarios
In the camera view, we frequently encounter scenes that are hard to interpret:
- A person lying down next to someone sitting: We need to decide whether the person lying down has actually collapsed, or is simply resting.
- Workers partially occluded by equipment or structures, making posture hard to determine: How do we decide collapse if only part of the body is visible?
- Someone stretching or bending on the floor, appearing similar to collapse: It’s a normal activity — should this trigger an alert?
- A person lying down on a sofa or in a rest area: Clearly a resting situation, but how should AI distinguish this from an emergency?
While analyzing these complex situations, we gained an important insight:
For AI to make accurate judgments, it needs clearly structured information.
For example, we observed cases where the system determined “a collapsed person is present” and sent an alert to the manager, but the explanation output said “the person is lying down, but it does not appear to be an emergency.”
This observation confirmed that a single sentence is not enough for AI to fully understand the context of complex real-world situations and produce consistent decisions — additional explicit information is needed.
2. Why AI misinterprets human requests
So why is such a simple request difficult for AI to interpret correctly?
The complex tasks a VLM must handle simultaneously
The reason is that EVA’s VLM (Vision-Language Model) must, in effect, perform multiple complex reasoning steps at once when it processes a single user sentence.
Based only on a single image frame and a single-line user request, AI must sequentially and consistently process the following steps:
1. Understand the scene and human states in the image
- Determine how many people are present
- Analyze what posture each person is in
- Understand the surrounding environment and situation holistically
2. Interpret what specific situation the user wants to detect
- Infer what the term “collapse” concretely means in this context
- Identify the characteristics of situations the user considers dangerous
3. Decide whether the current scene matches an “alert-worthy” situation
- Combine the scene characteristics and user’s request
- Decide whether this is truly a dangerous situation
4. Check whether the situation falls under normal exceptions (resting, working postures, etc.)
- Determine whether the person is simply resting, stretching, or in a normal work posture
- Consider special work postures or environmental factors
5. Make the final decision whether to send an alert and explain why
- Integrate all previous judgments into a final conclusion
- Provide a logical explanation of why that decision was made
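To make the burden concrete, the five steps above can be sketched as a single function that has to do everything at once. This is a conceptual illustration only: the `Scene` structure and all names are hypothetical, and in EVA this reasoning happens inside the VLM, not in Python code.

```python
# Conceptual sketch (not EVA's actual code) of the reasoning chain a VLM
# is implicitly asked to perform from one frame and one sentence.
# The Scene structure and all function logic here are hypothetical.

from dataclasses import dataclass

@dataclass
class Scene:
    people: list   # per-person posture labels, e.g. "lying", "sitting"
    context: str   # e.g. "factory floor", "rest area"

def decide_alert(scene: Scene, request: str) -> tuple:
    # 1. Scene understanding: how many people, in what postures?
    lying = [p for p in scene.people if p == "lying"]

    # 2. Intent interpretation: what does the request actually ask for?
    wants_collapse = "collapse" in request.lower()

    # 3. Match the scene against the requested situation.
    candidate = wants_collapse and len(lying) > 0

    # 4. Exception handling: normal situations that merely look similar.
    is_rest_area = scene.context == "rest area"

    # 5. Final decision plus a human-readable explanation.
    if candidate and not is_rest_area:
        return True, f"{len(lying)} person(s) lying down outside a rest area"
    return False, "no alert-worthy situation detected"
```

Collapsing five distinct judgments into one opaque call is exactly the structural weakness the post describes: none of the intermediate criteria are visible or adjustable.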
The limits of a single natural-language sentence
In EVA, the VLM was expected to perform this complex chain of scene understanding → intent interpretation → rule evaluation → exception handling → alert decision all at once.
The core problem is that the specific criteria and conditions needed to do this correctly are not fully specified in the user’s single natural-language sentence.
Humans can rely on prior experience and common sense to fill in these gaps implicitly. But for AI, if the rules and criteria are not explicitly spelled out, it is forced to infer everything from a small amount of vague information — a structurally difficult problem.
Two critical types of errors
Because of these limitations, two types of critical errors may occur in real operation:
False Positives – Normal situations mistaken as dangerous
- Without clear criteria, normal scenes are easily misclassified as dangerous.
- Repeated unnecessary alerts cause alert fatigue; eventually operators may ignore even genuinely important alarms.
False Negatives – Dangerous situations missed
- Without explicit guidance on exceptions, the system may ignore scenes that are actually dangerous.
- Failing to detect real incidents essentially nullifies the value of a safety monitoring system.
- This is not just a technical problem — it can lead directly to real-world accidents.
From this analysis, EVA recognized that it is necessary to transform natural-language requests into a form that AI can understand more precisely. To achieve this, we designed and developed an Enriched Input system that presents clear, structured decision criteria so AI can understand complex real-world situations more accurately and consistently.
3. Translating user requests into AI-understandable language: Enriched Input
We confirmed that a single natural-language sentence does not provide enough detail for AI to make accurate decisions. To solve this, EVA developed a system that automatically converts a user request into structured conditions.
We call this approach Enriched Input (query expansion).
Fundamental limitations of the previous approach
Previously, user requests were passed to the AI as-is.
User input:
“Notify me if someone collapses.”
Content passed to AI:
“Notify me if someone collapses.” (unchanged)
This sentence is sufficiently understandable for humans. However, as we have seen, it compresses too much meaning and relies heavily on implicit assumptions, making it difficult for AI to carry out precise reasoning.
Enriched Input: Splitting the request into two explicit axes
When EVA’s Enriched Input system receives a simple user request, it automatically analyzes and expands it into two structured axes:
Detection Conditions
“If these conditions are met, trigger an alert.”
Detection conditions explicitly specify the core factors that the AI must check. They replace vague expressions with concrete criteria so that AI knows exactly what to look for.
Example: Collapse detection
- A human is present in the image
- At least one person is completely lying on the floor
- The posture appears more like a collapse than intentional resting
In this way, detection conditions provide clear rules such as “if this kind of situation is observed, you should pay attention.”
Exception Conditions
“In these cases, do not trigger an alert.”
Exception conditions describe cases that match the detection patterns but are still normal, which significantly reduces false positives.
Example: Exceptions for collapse detection
- The person is simply sitting or squatting on the floor performing normal work
- The worker’s body is mostly occluded by machines or structures, making posture hard to verify
- The person is clearly resting on a sofa or in a designated rest area
Exception conditions act as a safety filter that says: “These situations are not dangerous.”
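Logically, the two axes combine into one simple rule: alert only when every detection condition holds and no exception condition applies. The sketch below shows that evaluation with illustrative predicate conditions; in EVA the conditions are judged by the VLM from the image, not by Python lambdas over a dict.

```python
# Minimal sketch of combining detection and exception conditions.
# All field names and predicates are illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class EnrichedQuery:
    # Each condition is a predicate over an observation dict.
    detection: list = field(default_factory=list)
    exceptions: list = field(default_factory=list)

    def should_alert(self, obs: dict) -> bool:
        # Alert only when every detection condition is met
        # and no exception condition applies.
        return (all(cond(obs) for cond in self.detection)
                and not any(exc(obs) for exc in self.exceptions))

# Collapse detection, mirroring the conditions listed above.
collapse_query = EnrichedQuery(
    detection=[
        lambda o: o["person_count"] >= 1,
        lambda o: o["lying_on_floor"],
    ],
    exceptions=[
        lambda o: o["in_rest_area"],     # clearly resting
        lambda o: o["mostly_occluded"],  # posture cannot be verified
    ],
)
```

The useful property is that each condition is a separate, inspectable item, so a false positive can be traced to (and fixed by) a single missing exception rather than a rewrite of the whole request.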
Before vs. After Enriched Input: A concrete example
Let’s look at how a simple user query is transformed before and after applying Enriched Input:
| User Request | Enriched Input Result |
|---|---|
| Find people who are not wearing masks | Detection Steps:<br>* At least one person is present<br>* At least one person is not wearing a mask<br>Exceptions:<br>* Everyone is wearing a mask<br>* It is not possible to determine mask usage due to camera angle or similar limitations |
| Find people who have collapsed | Detection Steps:<br>* At least one person is present<br>* At least one person appears to have collapsed<br>Exceptions:<br>* All collapsed people are only partially visible from the upper body (head/waist)<br>* All collapsed people have their lower body (legs/feet) out of view<br>* All collapsed people are simply looking at their phones<br>* All collapsed people are lying while leaning on a chair<br>* All collapsed people are too indistinct to identify their shape clearly<br>* There are no collapsed people |
| Among people who are seated and working, find those who are not wearing masks | Detection Steps:<br>* At least one person is present<br>* At least one person is seated<br>* At least one person is working<br>* At least one person is seated and working, and no mask shape is visible on their face<br>Exceptions:<br>* All seated and working people are wearing masks<br>* It is not possible to determine mask usage for seated and working people due to camera angle or similar limitations |
By explicitly separating detection conditions and exception conditions like this, AI no longer has to interpret and infer everything from a vague sentence. Instead, it can make systematic and consistent decisions based on clearly defined criteria.
It’s similar to the difference between saying to a human, “Just handle it,” and saying, “Check these conditions, ignore these exceptions, and then decide.”
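As noted at the start, the expansion itself is driven by Few-Shot prompting: the model is shown a few (request → structured conditions) examples and asked to expand a new request in the same format. Below is a minimal sketch of how such a prompt could be assembled; the template and example wording are illustrative, not EVA's actual prompt.

```python
# Minimal sketch of few-shot query expansion: show the model a few
# (user request -> structured conditions) examples, then append the new
# request. The prompt format here is an illustrative assumption.

FEW_SHOT_EXAMPLES = [
    (
        "Find people who are not wearing masks",
        "Detection: at least one person is present; "
        "at least one person is not wearing a mask. "
        "Exceptions: everyone is wearing a mask; "
        "mask usage cannot be determined from the camera angle.",
    ),
]

def build_expansion_prompt(user_request: str) -> str:
    # Instruction, then worked examples, then the request to expand.
    parts = ["Expand the user request into detection and exception conditions.\n"]
    for req, expansion in FEW_SHOT_EXAMPLES:
        parts.append(f"Request: {req}\nExpanded: {expansion}\n")
    parts.append(f"Request: {user_request}\nExpanded:")
    return "\n".join(parts)

prompt = build_expansion_prompt("Notify me if someone collapses.")
```

The completion the model writes after the final `Expanded:` becomes the structured query; because the examples fix the output shape, the expansion stays in a parseable, user-editable form.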
How real-world user requests become more specific over time
In actual operations, we observed that user requests tend to become more concrete and refined over time.
Initial request:
"Detect people not wearing masks."
First refinement (considering work context):
"Detect people not wearing masks among those who are seated and working."
(Exclude people who are standing or walking.)
Second refinement (adding more exceptions):
"Detect people not wearing masks among those who are seated and working,
but exclude people working with laptops."
(Office staff may not be subject to mask requirements.)
The strength of the Enriched Input approach is that it can easily reflect this progressive refinement. Users can review the automatically generated structured criteria and intuitively add or modify detection and exception conditions to match field requirements — without needing programming skills or deep technical knowledge.
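Because the criteria are plain structured lists, each refinement amounts to appending or editing a condition entry rather than rewording the original sentence. A minimal sketch of the two refinements above (the condition strings are illustrative):

```python
# Sketch of progressive refinement: users edit the generated conditions
# directly instead of rewriting the original request. Strings are
# illustrative, not EVA's actual condition format.

query = {
    "detection": [
        "at least one person is present",
        "at least one person is not wearing a mask",
    ],
    "exceptions": [
        "everyone is wearing a mask",
    ],
}

# First refinement: restrict to seated, working people.
query["detection"].append("the person is seated and working")

# Second refinement: office staff with laptops are exempt.
query["exceptions"].append("the person is working with a laptop")
```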
4. Enriched Input Experiment Results: More Accurate, More Trustworthy Detection
We ran comparative experiments between the original “raw natural language” approach and Enriched Input under the same model settings.
Two representative scenarios are summarized below.
Collapse detection performance
- About 10% reduction in false positive rate
- Overall detection accuracy maintained at 99%
Mask non-compliance detection performance
This task detects workers who are not wearing masks, and it becomes more difficult as camera angles and the number of people vary.
- Detection accuracy improved from 52% → 90%
- More than 80% reduction in missed detections (false negatives)
5. Conclusion: When the criteria are clear, AI’s decisions become clear
From the development and testing of Enriched Input, we learned one key lesson:
For AI, what matters is not only what you show it, but how clearly you explain it.
Simply passing a natural-language user request to the model is often not enough for AI to accurately understand and reason about complex real-world situations. But when we systematically structure that request, and explicitly separate detection conditions from exception conditions, AI can make far more accurate and reliable decisions.
EVA’s Enriched Input approach is not just a better way to process input. It is a core technique for systematically bridging the gap between user intent and AI understanding.
Through this:
- Users can continue to express requests comfortably in natural language
- The system automatically converts them into AI-friendly structured forms
- Users can easily adjust those rules to match field realities
- AI can make clear, consistent decisions based on explicit criteria