
11 posts tagged with "Research"

Covering AI research, paper-based technical insights, and real-world application cases.


Real-Time Streaming Rendering Optimization - Improving EVA Architecture Based on Canvas, Web Worker, OffscreenCanvas

· 7 min read
junhyung yoo
Product Developer

Hello, I’m Junhyung Yoo, a frontend engineer on the EVA team.

One of the core features of the EVA service is real-time streaming, which allows users to monitor video feeds from dozens of cameras simultaneously. As usage expanded from brief checks to long-term on-site monitoring, unexpected performance bottlenecks began to surface.

"When I leave the screen on for a long time, the browser gradually slows down and eventually the tab crashes."

To address this issue, we’d like to share our journey of improving the rendering architecture using Canvas, Web Worker, and OffscreenCanvas.


1. Background: Why Did Problems Appear Over Time?

In its early days, EVA adopted a very common streaming approach using the <img> tag combined with Blob (Object URL).

Previous Approach (Blob-Based Rendering)

  1. MJPEG stream data is received from the server in Blob form.
  2. A temporary URL is generated using URL.createObjectURL(blob).
  3. The URL is assigned to the src of an <img> tag, allowing the browser to render the image.
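For reference, this approach boiled down to a few lines per frame. The sketch below is a simplified reconstruction with a hypothetical helper name, not the original EVA component:

// Simplified sketch of the Blob-based approach: every frame becomes a new Object URL.
function renderBlobFrame(img: HTMLImageElement, frameBlob: Blob): void {
  const previousUrl = img.src;
  img.onload = () => {
    // Revoke the previous URL, but the browser's image cache and GC may lag behind.
    if (previousUrl.startsWith('blob:')) URL.revokeObjectURL(previousUrl);
  };
  img.src = URL.createObjectURL(frameBlob);
}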

While this implementation was simple, two critical issues emerged in the specialized environment of long-term monitoring.

  • Memory Overhead: A unique URL string is generated for every frame (around 30 frames per second). Even when calling revokeObjectURL, delays in the browser’s internal image cache and garbage collection (GC) caused memory usage to continuously increase, eventually leading to Out of Memory (OOM) errors.
  • Main Thread Blocking: Image decoding occurs on the main (UI) thread. When processing high-resolution video, the event loop is delayed, resulting in UI lag such as slow clicks or scrolling—commonly known as jank.

2. Network Tab Analysis: Understanding MJPEG

The first step in improving performance was analyzing the network layer. MJPEG streaming behaves differently from typical HTTP requests.

multipart/x-mixed-replace

MJPEG uses the Content-Type: multipart/x-mixed-replace; boundary=... header, which allows the server to continuously push image frames over a single HTTP connection.

  • Network Tab Characteristics: The request never completes and remains in a 'Pending' state. The browser keeps the connection open and continuously receives binary data.
  • Binary Data Structure: Each frame consists of JPEG binary data (0xFF 0xD8 ... 0xFF 0xD9) separated by a specific boundary string.
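To make the structure concrete, a minimal reader could look like the sketch below: it accumulates chunks from the response body and cuts frames at the JPEG start (0xFF 0xD8) and end (0xFF 0xD9) markers. This is an illustrative simplification, not EVA's production parser:

// Illustrative MJPEG reader: accumulate chunks and slice out complete JPEG frames.
async function readMjpegStream(url: string, onFrame: (jpeg: Uint8Array) => void): Promise<void> {
  const response = await fetch(url);
  const reader = response.body!.getReader();
  let buffer = new Uint8Array(0);

  while (true) {
    const { done, value } = await reader.read();
    if (done || !value) break;

    // Append the new chunk to the working buffer (a real parser reuses buffers; see section 6).
    const merged = new Uint8Array(buffer.length + value.length);
    merged.set(buffer);
    merged.set(value, buffer.length);
    buffer = merged;

    const start = findJpegMarker(buffer, 0xd8); // SOI
    const end = findJpegMarker(buffer, 0xd9);   // EOI
    if (start !== -1 && end !== -1 && end > start) {
      onFrame(buffer.slice(start, end + 2));
      buffer = buffer.slice(end + 2);
    }
  }
}

function findJpegMarker(data: Uint8Array, second: number): number {
  for (let i = 0; i < data.length - 1; i++) {
    if (data[i] === 0xff && data[i + 1] === second) return i;
  }
  return -1;
}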

Because the previous approach converted this massive stream of binary data into Blobs and parsed it on the main thread, browser load increased exponentially as more data accumulated.


3. First Optimization: Canvas and createImageBitmap

To move away from memory management that relied heavily on the browser’s garbage collector, we introduced the Canvas API and adopted an explicit memory management approach.

Asynchronous Bitmap Rendering

The createImageBitmap API allows images to be decoded asynchronously in the background before being rendered to the screen.

// @src/entities/devices/components/stream/MJPEGStream.tsx
// Immediately release memory after drawing the bitmap on the canvas
const bitmap = await createImageBitmap(blob);
ctx.drawImage(bitmap, 0, 0);
bitmap.close(); // Explicitly release memory

The key point of this approach is bitmap.close(). By explicitly destroying bitmap resources after use, we were able to keep memory usage stable. In addition, by eliminating reflow caused by changing the src of an <img> tag and switching to GPU-accelerated canvas drawing, overall rendering efficiency was significantly improved.


4. Second Optimization: Separating Computation with Web Workers

While rendering became lighter, the task of receiving stream data and extracting JPEG frames from binary data (boundary parsing) was still handled by the main thread. Performing real-time string searches on millions of bytes per second places a heavy burden on the CPU.

To solve this, we introduced Web Workers and applied a clear division of responsibilities:
"Data processing in the background, rendering on the main thread."

Optimizing Data Transfer (Transferable Objects)

When sending large images from a worker to the main thread, copying data results in severe performance degradation. We leveraged Transferable Objects to transfer ownership of memory without copying, enabling a zero-copy data flow.
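In practice this means listing the buffer (or an ImageBitmap) in the transfer list of postMessage, so ownership moves instead of the bytes being structured-cloned. The snippet below sketches the pattern with assumed variable names; it is not the exact EVA worker code:

// Worker side: hand the parsed JPEG bytes to the main thread without copying.
declare const jpegBytes: Uint8Array; // assumed to come from the boundary parser above
const frameBuffer = jpegBytes.buffer as ArrayBuffer;
self.postMessage({ type: 'frame', buffer: frameBuffer }, [frameBuffer]);
// After postMessage, frameBuffer is detached in the worker (zero-copy transfer).

// Main thread side: the buffer arrives with ownership already transferred.
declare const worker: Worker; // the worker instance created by the main thread
worker.onmessage = (event: MessageEvent<{ type: string; buffer: ArrayBuffer }>) => {
  const blob = new Blob([event.data.buffer], { type: 'image/jpeg' });
  // ...decode with createImageBitmap and draw to the canvas as shown in section 3
};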


5. Final Optimization: Introducing OffscreenCanvas

Despite these improvements, the final drawing step still occurred on the main thread. The final piece of the puzzle was OffscreenCanvas, which allows control of the canvas itself to be transferred to a worker.

Even when the main thread is blocked (left), image processing running in the worker continues to update in real time without interruption. (Source: Kakao Tech Blog)

Toward 0% Rendering Load on the Main Thread

After transferring control using transferControlToOffscreen(), rendering is performed entirely inside the worker.
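On the main thread, the canvas hands over its drawing surface to the worker once during setup, roughly as in the sketch below (the worker URL and message shape are assumptions, not the exact EVA code):

// Main thread: transfer the canvas surface to the worker a single time at setup.
const canvasEl = document.querySelector('canvas')!;
const offscreen = canvasEl.transferControlToOffscreen();
const worker = new Worker(new URL('./mjpeg.worker.ts', import.meta.url), { type: 'module' });
// The OffscreenCanvas is itself Transferable, so it moves to the worker without copying.
worker.postMessage({ type: 'init', canvas: offscreen }, [offscreen]);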

// @src/entities/devices/components/stream/mjpeg.worker.ts
const bitmap = await createImageBitmap(blob);
if (ctx && canvas) {
  // Worker directly draws on the canvas (0% main-thread interference)
  ctx.drawImage(bitmap, 0, 0);

  if (config.showArea && config.area) {
    drawPolygonArea(ctx, config.area); // Area overlay logic also runs in the worker
  }
}
bitmap.close();

With this architecture, no matter how heavy the workload on the main thread becomes, streaming video continues to play smoothly and independently on a separate thread.

🌐 Browser Compatibility and Automatic Fallback

While OffscreenCanvas offers powerful capabilities, browser support varies. In the EVA service, browser features are automatically detected and conditionally handled based on the user’s environment.

| Browser | Supported Version | Notes |
| --- | --- | --- |
| Chrome | 69+ | Primary support |
| Edge | 79+ | Supported from Chromium-based versions |
| Firefox | 105+ | Enabled by default starting from v105 |
| Safari | 16.4+ | Latest macOS/iOS recommended |
| Opera | 56+ | - |

EVA’s Adaptive Rendering Strategy:

  • Modern browsers: Enable OffscreenCanvas to keep main-thread load at 0%.
  • Older browsers (e.g., Safari 15 or below): Detect feature availability and automatically fall back to the first optimization approach—main-thread Canvas rendering.

This ensures a seamless streaming experience across all browser environments.
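Feature detection for this strategy can be as simple as checking whether the relevant APIs exist. The check below is a sketch of the idea, not the exact EVA implementation:

// Detect whether the fully worker-side rendering path is available in this browser.
const supportsOffscreenRendering =
  typeof OffscreenCanvas !== 'undefined' &&
  typeof HTMLCanvasElement.prototype.transferControlToOffscreen === 'function';

if (supportsOffscreenRendering) {
  // Modern path: transfer the canvas to the worker and render there (0% main-thread load).
} else {
  // Fallback path: decode with createImageBitmap and draw on the main-thread canvas.
}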


6. Additional Optimizations: Buffer Reuse and Faster Parsing

Performance is determined by details. We applied several additional optimizations within the worker logic.

  1. Fixed Buffer Reuse: Instead of creating new Uint8Array instances each time, we reused fixed-size buffers and managed data using copyWithin. This significantly reduced the frequency of garbage collection (GC).
  2. High-Speed Parsing with indexOf: Rather than using simple loops to find matching bytes in binary data, we leveraged the built-in indexOf method to skip unnecessary byte comparisons. Even this simple optimization dramatically reduced frame drops.
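The sketch below illustrates both ideas under simplifying assumptions (a single fixed-capacity buffer and a one-byte search target):

// (1) Fixed buffer reuse: append new chunks into one preallocated buffer and shift
//     consumed bytes forward with copyWithin instead of allocating per chunk.
const receiveBuffer = new Uint8Array(4 * 1024 * 1024); // assumed maximum buffered size
let filled = 0;

function appendChunk(chunk: Uint8Array): void {
  receiveBuffer.set(chunk, filled);
  filled += chunk.length;
}

function consume(bytes: number): void {
  receiveBuffer.copyWithin(0, bytes, filled); // shift the tail to the front, no new allocation
  filled -= bytes;
}

// (2) High-speed parsing: use the typed array's built-in indexOf to jump to candidate
//     marker bytes instead of comparing every byte in a hand-written JS loop.
function findMarkerByte(markerByte: number): number {
  return receiveBuffer.subarray(0, filled).indexOf(markerByte);
}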

7. Conclusion: A More Robust EVA Monitoring Environment

Through this optimization effort, the EVA service achieved the following results:

  • Memory Stability: Memory usage remains stable even during long-term operation, eliminating OOM errors.
  • UI Responsiveness: UI interactions such as menu navigation and button clicks remain smooth—even during high-resolution streaming—at a near-native app level.
  • Stable Frame Rates: By separating threads, consistent frame rates are maintained regardless of network latency or main-thread load.

This project reinforced a key principle of frontend performance optimization:
"How free you keep the browser’s main thread makes all the difference."

Thank you for reading!


Technology Summary

  • Web Workers API: Execute computations on background threads
  • OffscreenCanvas: Rendering independent of the main thread
  • createImageBitmap: Asynchronous image decoding with explicit memory management
  • Transferable Objects: High-speed data transfer without copy overhead


Multi-Frame Based VLM Detection: Moving Beyond Single Image Limits to Temporal Context

· 7 min read
Gyulim Gu
Tech Leader
Seongwoo Kong
AI Specialist
Taehoon Park
AI Specialist
Jisu Kang
AI Specialist

Is a Single Frame Enough?

Recently, Vision-Language Models (VLMs) have demonstrated exceptional performance in understanding individual images. Large-scale multimodal models have theoretically expanded the possibilities of multi-frame reasoning by introducing architectures that process multiple images alongside text prompts.

However, real-world industrial detection scenarios are far more complex than controlled research environments. Problems that seem straightforward with a single frame often lead to various false positives and edge cases in production.

Consider a scene where a person is lying on the floor. Looking at that single moment, it is easy to categorize it as a "collapse." But what if the previous frame showed them stretching, or simply changing posture while working?

In nighttime environments, lens flares, light reflections, or glare can mimic the color patterns of fire, leading to false fire detections when based on a single image. When even humans find it difficult to be certain from a single snapshot, providing a model with only one frame inevitably creates structural limitations.



These cases all share a common problem: a "lack of context."




Time is the Most Powerful Context

Many detection scenarios inherently rely on a temporal flow.

For instance, "loitering" can only be defined by observing a pattern of staying in the same space for a certain period. Similarly, "long-term abandonment" requires the condition that an object remains unchanged for a specific duration after being placed.

Attempting to solve these problems with a single frame is structurally difficult because the focus must be on "change," not just "state."

We have categorized this into three levels of context:

  • Single Image-based Judgment
  • Short-term Multi-image Contextual Judgment (Momentary context)
  • Temporal Judgment (Involving long-term flow)

In actual operating environments, these three levels coexist. Some scenarios are sufficient with a single frame, some require consecutive frames at intervals of a few seconds, and others require tracking a flow over tens of seconds.




EVA's Multi Frame Manager

In EVA, user-defined scenarios are not treated as simple text conditions. The system analyzes the "level of context" required by each scenario and determines an appropriate frame collection strategy.

For example, "fainting detection" requires multi-images covering a few seconds before and after the event, rather than a single frame. In contrast, "long-term abandonment" requires continuous frame collection over a specific duration based on a sliding window.

The module responsible for this process is the Multi Frame Manager. This module dynamically determines the following based on the scenario characteristics:

  • Number of frames required
  • Collection intervals
  • Retention time
  • Event trigger expansion
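Conceptually, each scenario maps to a small frame-collection policy like the sketch below (field names and values are illustrative assumptions, not EVA's actual schema):

// Illustrative shape of a per-scenario frame collection policy (assumed field names).
interface FrameCollectionPolicy {
  frameCount: number;       // number of frames required by the scenario
  intervalMs: number;       // collection interval between frames
  retentionMs: number;      // how long collected frames are kept
  expandOnTrigger: boolean; // whether to widen the window when an event is triggered
}

// e.g., fainting detection: a few seconds of frames before and after the event
const faintingPolicy: FrameCollectionPolicy = {
  frameCount: 6,
  intervalMs: 1_000,
  retentionMs: 10_000,
  expandOnTrigger: true,
};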

Collected images are not simply listed. They are delivered to the VLM in a clearly sorted chronological order, accompanied by system prompts that guide the model to compare changes between frames.




Multi-Image Based VLM Inference Strategy

When multi-frame input is received, the VLM does more than just return independent detection results. In EVA, we designed the inference structure to interpret multi-images as a continuous temporal context rather than an independent set of images.

To achieve this, frames are delivered to the model using the following strategies:

  • Chronological Frame Alignment: Constructs time-series data from past to present to understand causality.
  • Comparative System Prompts: Uses instructions like "Identify changes compared to the previous frame" to analyze inter-frame correlations.
  • Temporal Reasoning: Derives logical conclusions based on state changes over time rather than fragmented snapshot judgments.
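As a rough illustration, the ordered frames and the comparative instruction can be assembled into a single multimodal request along these lines (assuming an OpenAI-compatible chat format; this is not EVA's exact prompt):

// Sketch: one request built from chronologically ordered frames plus a comparative system prompt.
interface CapturedFrame { timestamp: number; dataUrl: string; }

function buildMultiFrameRequest(frames: CapturedFrame[], scenario: string) {
  const ordered = [...frames].sort((a, b) => a.timestamp - b.timestamp);
  return {
    messages: [
      {
        role: 'system',
        content:
          'The images are given in chronological order. ' +
          'Identify changes compared to the previous frame before deciding: ' + scenario,
      },
      {
        role: 'user',
        content: ordered.map((f) => ({ type: 'image_url', image_url: { url: f.dataUrl } })),
      },
    ],
  };
}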

Case Study: The Power of Temporal Context in Reducing False Positives

The following case demonstrates how fragmented information from a single frame is accurately corrected through the "context" of multiple frames.



  • Single Image: A person is stationary in a low, prone position. A VLM looking only at this moment is highly likely to misinterpret the situation as "Collapse."
  • Multi-Image: In the subsequent frames, subtle movements are captured—the person moves their arms to operate a phone and tilts their head to look at the screen.
  • Result: Through Temporal Reasoning, EVA correctly concludes this is "Sitting and using a phone detected".

The core idea is to guide the model to understand the situation by comparing differences between frames, rather than judging each frame individually.

For high-risk detections like fainting, the model undergoes a process of Progressive Situation Refinement:

  1. Initial State Identification: Identifying the target object and initial visual features (e.g., prone posture).
  2. Dynamic Change Detection: Tracking meaningful changes in body angles or voluntary movements compared to previous frames.
  3. Consistency Verification: Determining if the posture is a forced freeze due to impact or involves intentional actions.
  4. Final Context Determination: Distinguishing between visual noise with similar patterns and actual events.

This Temporal Reasoning structure significantly reduces false positives in edge cases that plague single-image systems, providing much more stable results in real-world operations.


| Category | Single Image (Accuracy / Precision / Recall) | Multi Image (Accuracy / Precision / Recall) |
| --- | --- | --- |
| No PPE | 0.66 / 0.87 / 0.68 | 0.76 / 0.87 / 0.82 |
| No Mask (Working) | 0.94 / 0.69 / 0.54 | 0.93 / 0.76 / 0.52 |
| Loitering | 0.49 / 0.92 / 0.33 | 0.63 / 0.85 / 0.64 |
| Fainting | 0.87 / 1.0 / 0.36 | 0.96 / 1.0 / 0.82 |

Ultimately, EVA’s multi-frame inference structure is not just about increasing the number of input images—it is an approach that directly integrates temporal change into the model's reasoning process.




The Cost of Multi-Frame: Computational Overload

Improvements in accuracy come with a price.

While multi-frame reasoning allows for more visual information, it also leads to increased computational costs. In multimodal models, image inputs are generally converted into embeddings via a Vision Encoder before being passed to the LLM, a process that is relatively resource-intensive.

Specifically, multi-frame analysis often encounters the following:

  • Identical or very similar images repeating in a sequence.
  • Multiple requests referencing the same camera frame.
  • Multiple queries performed on the same set of images.

In these cases, if the Vision Encoder processes the same image repeatedly, it creates unnecessary overhead.

In EVA, we developed a structure that maximizes the Encoder Cache feature provided by vLLM to solve this. vLLM offers an Encoder Cache Manager that allows the system to cache and reuse Vision Encoder results during multimodal processing.

By leveraging this, we can reuse previously generated encoder embeddings for identical image inputs, eliminating the need to repeat Vision Encoder operations. EVA applies a request management structure at the Agent Layer to effectively utilize this caching.


The Agent coordinates requests in the following ways:

  • Organizing requests so that identical image inputs can be reused.
  • Managing requests based on image units to enable cache hits.
  • Optimizing request flow to prevent redundant encoding.

This allows us to minimize Vision Encoder operations and utilize GPU resources more efficiently, even in a multi-frame analysis environment.
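One way to picture the agent-layer coordination: requests that reference the same frame are grouped by an image key, so the serving engine sees each image once and its cached encoder embedding can be reused for every query. The sketch below uses assumed names (hashImage is a hypothetical digest helper) and is not EVA's actual implementation:

// Group queries by image so identical frames are dispatched once and cache hits become possible.
declare function hashImage(bytes: Uint8Array): string; // hypothetical digest helper

const pendingByImage = new Map<string, string[]>();

function enqueue(imageBytes: Uint8Array, query: string): void {
  const key = hashImage(imageBytes);
  const queries = pendingByImage.get(key) ?? [];
  queries.push(query);
  pendingByImage.set(key, queries);
}

function flush(): Array<{ imageKey: string; queries: string[] }> {
  // Each image is sent once; all of its queries reuse the same encoder result.
  const batches = [...pendingByImage].map(([imageKey, queries]) => ({ imageKey, queries }));
  pendingByImage.clear();
  return batches;
}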




Conclusion

Multi-frame based VLM inference is an approach that significantly improves situational understanding and detection accuracy compared to single-image analysis.

However, as the number of frames increases, the computational load on the Vision Encoder grows significantly. Therefore, it is crucial to design a system that balances performance gains with computational efficiency and infrastructure costs.

EVA addresses this by actively utilizing vLLM's Encoder Cache and managing requests through the Agent Layer. Through this architecture, we maintain high inference performance while reducing unnecessary computations, continuously improving GPU efficiency and infrastructure operating costs.

This feature is available starting from EVA v2.6.0.

Teaching VLMs to Multitask: Enhancing Situation Awareness through Scenario Decomposition

· 8 min read
Hyunchan Moon
AI Specialist

At the core of EVA lies the ability to truly understand critical situations that occur simultaneously within a single scene—such as fires, people falling, or traffic accidents—without missing any of them. However, no matter how capable a Vision-Language Model (VLM) is, asking it to reason about too many things at once leads to a sharp degradation in cognitive performance.[2,3]

In this post, inspired by the recent text-to-video retrieval research Q₂E (Query-to-Event Decomposition)[1], we introduce Scenario Decomposition, a technique that enables VLMs to deeply understand complex, multi-scenario situations within a single frame.

Performance Enhancement through Instruction Tuning Based on User Feedback Data

· 12 min read
Jaechan Lee
POSTECH
Yura Shin
AI Specialist

This work is a collaborative research effort with Minjoon Son (advised by Prof. Youngmyung Ko) as part of the "Campus: Beyond Safety to Intelligence – Postech Living Lab Project with EVA".


🎯 Introduction: Shifting Feedback from 'Retrospective Correction' to 'Cognitive Enhancement'

When EVA makes judgments based on images, operators often provide specific feedback like: "This is indeed a safety vest. Why did it get confused?" or "Shouldn't there be an alert here?" This feedback contains not just the right or wrong answer, but also the human reasoning and context behind the judgment.

Previously, EVA utilized this feedback by storing it in a separate Vector DB and using it to adjust the Alert status when similar situations occurred. While this approach offered the advantage of quick application, it had a structural limitation: it did not improve the model's intrinsic reasoning capability and merely retrospectively filtered errors.

To fundamentally address this issue, we completely changed our approach. We reconstructed user feedback not as simple error reports, but as Instruction Data that the model can directly use in its inference process to strengthen its Visual Reasoning capability.

This article will focus on how VLM-based Instruction Tuning utilizing user feedback data overcomes the limitations of the previous Vector DB-centric approach and improves the model's visual reasoning performance.

From Image to Language, From Language to Reasoning: Boosting VLM Performance with Camera Context

· 7 min read
Minjun Son
POSTECH
Jisu Kang
AI Specialist

This work is a collaborative research effort with Minjoon Son (advised by Prof. Youngmyung Ko) as part of the "Campus: Beyond Safety to Intelligence – Postech Living Lab Project with EVA".


📝 Introduction: Making User Queries Smarter: Enhancing Language with Image Context

EVA is a system that detects anomalies using hundreds to thousands of smart cameras. We utilized VLM/LLM to automatically infer the camera context and embedded this into the prompt, creating a camera-context aware anomaly detection pipeline that reflects the situation of the target image. By leveraging the camera context extracted from a single frame as prior knowledge for the VLLM, we confirmed a meaningful improvement in accuracy and deeper interpretability compared to the existing baseline.

Improving Performance of Intent-Based Chat Command Execution

· 4 min read
Yura Shin
AI Specialist

Introduction

Users simply send a sentence to the Chat Agent: “Please start monitoring,” “Set the threshold for people to 0.6,” or “Add ‘tree’ to the target list.”

While the interaction appears simple, the internal processing required by the LLM is highly complex.

Before taking any action, the LLM must determine the intent:

“Is this a target-setting task? Scenario editing? Or just querying information?”

Then it must:

  • extract required parameters
  • validate values
  • handle errors gracefully and explain what's wrong

Previously, the system attempted to perform all of these steps in a single LLM call.

Although this looked clean on the surface, it repeatedly caused unpredictable and hard-to-debug problems:

  • Wrong task classification → wrong actions executed
  • Rule conflicts between different tasks
  • Incorrect parameter extraction without validation
  • Exponential growth in maintenance due to entangled rules

To solve these core issues, the Chat Agent was redesigned using a LangGraph-based Multi-Node Routing architecture.




1. Even simple requests are “multi-stage decision-making” for LLMs

The previous Chat Agent tried to interpret everything in one LLM call.

For example, the request:

“Change the threshold for ‘tree’ to 0.3”

Internally required the LLM to:

  1. Identify the type of task
  2. Extract parameters (“tree: 0.3”)
  3. Validate the threshold value
  4. Check configuration conflicts
  5. Judge whether modification is allowed
  6. Respond in natural language

Trying to combine all logic into a single prompt and a single set of rules resulted in:

  • Rules for one task affecting others
  • Parameter parsing failures
  • Small changes requiring full prompt rewrites
  • Hard-coded and exploding error handling logic

At peak, the prompt length reached 3,700 tokens, continuously growing and becoming fragile.




2. Fundamental issues in the original architecture

The original LLM call served five roles at once:

  • Task classification
  • Parameter parsing
  • Value validation
  • Error handling
  • Natural language generation

This caused multiple structural issues:


2.1 Task rule conflicts

Target labels must be in English for video detection. But this rule incorrectly applied to scenario descriptions too — forcing English output even for Korean text.

Result: rules interfering across unrelated tasks.


2.2 Unreliable parameter parsing

Even simple numeric interpretation often failed:

  • “one point five” → interpreted as 0.15
  • Word-form numbers or locale-dependent formats → parsing failures

More edge cases → more instability.


2.3 Every error case required manual rule definitions

The LLM handled all error evaluation. Meaning:

  • Every possible error had to be pre-defined
  • Any new parameter → new rules → high maintenance



3. Introducing a Routing-Based Architecture

We rebuilt the system using a 3-Stage LangGraph Routing Pipeline.

Core principle:

One purpose per LLM call. Never ask the LLM to do multiple jobs at once.


3.1 Task Routing Node

“Classify the request — and only that”

No parsing. No validation. No rule application.

Minimal responsibility → maximal reliability.

Uses:

  • Current request text
  • Available task list
  • Existing system state → to pick the correct task.
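A plain TypeScript sketch of that single responsibility might look like this (the task names and the callLLM helper are assumptions; the production version is a LangGraph node):

// Routing node sketch: the LLM is asked only to pick a task, nothing else.
declare function callLLM(prompt: string): Promise<string>; // hypothetical LLM call helper

type Task = 'set_target' | 'start_monitoring' | 'edit_scenario' | 'query_info';

async function routeTask(userText: string, systemState: string): Promise<Task> {
  const answer = await callLLM(
    'Classify the request into exactly one task: set_target, start_monitoring, ' +
    'edit_scenario, query_info.\n' +
    `Current state: ${systemState}\nRequest: ${userText}`,
  );
  return answer.trim() as Task;
}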

3.2 Task-Specific Parameter Parser

“Each task has isolated prompts, parsers, and rules”

Previously:

  • All tasks shared the same prompt → rule entanglement

Now:

  • Each task has its own prompt + parser + rules
  • Fully isolated LLM call

Examples:

  • Set-Target Task → dedicated logic only for targets
  • Start-Monitoring Task → independent logic only for monitoring

No more rule collisions or cross-contamination 🎯


3.3 Error Handling Node

“System validates. LLM explains.”

Process:

  • LLM extracts values
  • System Validator confirms correctness
  • If invalid → Error Node generates user-friendly explanation

Example (threshold 1.5):

  • Parser: threshold: 1.5
  • Validator: Out of allowed range
  • Error Node:

    “Threshold must be between 0.0 and 1.0. Please try again.”

LLM no longer decides errors — it only communicates them.
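The division of labor in that example can be sketched as follows (assumed names; validation lives in plain code, and the LLM is only asked to phrase the message):

// Sketch: the system validates, and the Error Node LLM only explains the failure.
declare function callLLM(prompt: string): Promise<string>; // hypothetical LLM call helper

function validateThreshold(value: number): string | null {
  return value >= 0 && value <= 1 ? null : 'Threshold must be between 0.0 and 1.0.';
}

async function applyThreshold(parsed: { target: string; threshold: number }): Promise<string> {
  const error = validateThreshold(parsed.threshold);
  if (error) {
    // Error Node: reword the system-detected error into a user-friendly explanation.
    return callLLM(`Explain this input error to the user politely: ${error}`);
  }
  return `Threshold for "${parsed.target}" updated to ${parsed.threshold}.`;
}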




4. Performance Evaluation

Routing-based design didn’t only improve accuracy — it boosted maintainability, stability, and speed.


4.1 Task & Parameter Accuracy

| Metric | Before | After |
| --- | --- | --- |
| Task Routing Accuracy | 82.3% | 95.0% |
| Parameter Parsing Accuracy | 69.6% | 95.0% |

Huge gain thanks to isolating classification and parsing 🎉


4.2 Prompt Length Reduction

| Case | Before | After |
| --- | --- | --- |
| Min | 1,603 tokens | 1,106 tokens |
| Max | 3,783 tokens | 1,793 tokens |

Shorter → more deterministic & reliable LLM reasoning


4.3 Latency Improvement

| Case | Before | After |
| --- | --- | --- |
| Min | 1.19 s | 1.50 s |
| Max | 2.98 s | 2.03 s |

Even though the pipeline now makes more LLM calls, worst-case latency at peak load improved, at the cost of a slightly higher minimum.




5. Conclusion

Key insight:

The problem wasn’t the LLM — it was how we were using the LLM.

One call doing all tasks → confusion and instability.
Proper division of roles → stable and predictable performance.

Each component now focuses only on its job:

| Role | Owner |
| --- | --- |
| Task Classification | Router |
| Parameter Parsing | Task-specific Parser |
| Validation | System Rules |
| Error Communication | LLM (Error Node) |

This restructure marks a major milestone — transforming EVA Chat Agent into a trustworthy AI control interface.

A more robust foundation means:

  • Easier expansion
  • More accurate automation
  • Better user experience
  • Lower maintenance cost

From One-Shot Decisions to Two-Stage Reasoning

· 7 min read
Seongwoo Kong
AI Specialist
Jisu Kang
AI Specialist
Keewon Jeong
Solution Architect

Instead of Making a Single Decision, Be Cautious Step-by-Step

The process of AI making a decision from a single camera image is more complex than most people think. Users may simply ask: “Notify me if someone falls down” or “Alert me when a worker isn’t wearing a mask.” But the AI has to analyze the image, check the requested conditions, consider exceptions, make the final decision, and explain the reasoning, all in a single pass.

In EVA, we introduced an Enriched Input structure that separates the user’s requirements into Detection conditions and Exception conditions, which significantly improved performance. However, even with structured input, the AI still made contradictory judgments in multi-condition scenarios.

The issue was not only about structuring the conditions — but also about forcing the AI to perform multiple judgments all at once. So EVA moved beyond the limitations of the existing one-shot approach and introduced a new Two-Stage Reasoning process.

In this post, we cover:

  • Why structured input alone could not solve the problem
  • The fundamental limits of one-shot decision-making
  • Why AI works better when decisions are split into two stages
  • Performance improvements validated by real experiments

Turning Simple User Requests into AI-Understandable Instructions

· 11 min read
Seongwoo Kong
AI Specialist
Jisu Kang
AI Specialist
Keewon Jeong
Solution Architect

Expanding User Queries So AI Can Clearly Understand Intent

EVA is a system that operates based on user-issued commands. For EVA to make stable and accurate decisions, it is crucial that user requests are delivered in a form that AI can clearly understand.

However, even if the natural language expressions we use daily seem simple and clear to humans, they can be ambiguous from an AI model’s perspective, or they may require excessive implicit reasoning. This gap is exactly what often leads to AI system malfunctions or inaccurate decisions.

To fundamentally address this, EVA uses a Few-Shot prompting technique to automatically expand simple user requests into a structured query representation.

In this post, we focus on:

  • Why simple natural-language requests are difficult for AI
  • How query expansion can improve AI’s understanding
  • How much performance improved in actual field deployments

and share practical methods and their impact for helping AI understand user intent more clearly.

Complete Mastery of vLLM: Optimization for EVA

· 17 min read
Taehoon Park
AI Specialist

In this article, we will explore how we optimized LLM service in EVA. We will walk through the adoption of vLLM to serve LLMs tailored for EVA, along with explanations of the core serving techniques.




1. Why Efficient GPU Resource Utilization is Necessary

Most people initially interact with cloud-based LLMs such as GPT / Gemini / Claude. They deliver the best performance available without worrying about model operations — you simply need a URL and an API key. But API usage incurs continuous cost and data must be transmitted externally, introducing security risks for personal or internal corporate data. When usage scales up, a natural question arises:

“Wouldn’t it be better to just deploy the model on our own servers…?”

There are many local LLMs available such as Alibaba’s Qwen and Meta’s LLaMA. As the open-source landscape expands, newer high-performance models are being released at a rapid pace, and the choices are diverse. However, applying them to real services introduces several challenges.

Running an LLM as-is results in very slow inference. This is due to the autoregressive nature of modern LLMs. There are optimizations like KV Cache and Paged Attention that dramatically reduce inference time. Several open-source serving engines implement these ideas — EVA uses vLLM. Each engine differs in model support and ease of use. Let’s explore why EVA chose vLLM.

Eliminating False Positives in Human Detection Using Pose Estimation

· 6 min read
Euisuk Chung
AI Specialist

Introduction

“There’s a person over there!” Our AI vision system confidently reported. Yet all we saw on the screen was an empty chair with a coat draped over it.

Human detection technology has advanced rapidly, but the real world is far more chaotic than polished demo videos. In the environments we focus on, the problem becomes even more noticeable:

  • 🏢 Office: empty chairs with jackets
  • 🔬 Laboratory: lab coats and protective clothing hanging on chairs
  • 💼 Work areas: vacant meeting rooms and lounges

Such false positives aren’t just “slightly wrong” results. They directly degrade system trust and efficiency.

For example:

  • Energy-saving systems may misjudge how many people are present and waste power.
  • Security systems may focus on “phantom personnel” and waste monitoring resources.

Example: an empty chair mistakenly detected as a seated human