Eliminating False Positives in Human Detection Using Pose Estimation

November 27, 2025 · 6 min read

Euisuk Chung

AI Specialist

Introduction

“There’s a person over there!” our AI vision system confidently reported. Yet all we saw on the screen was an empty chair with a coat draped over it.

Human detection technology has advanced rapidly, but the real world is far more chaotic than polished demo videos. In the environments we focus on, the problem becomes even more noticeable:

🏢 Office: empty chairs with jackets
🔬 Laboratory: lab coats and protective clothing hanging on chairs
💼 Work areas: vacant meeting rooms and lounges

Such false positives aren’t just “slightly wrong” results. They directly degrade system trust and efficiency.

For example:

Energy-saving systems may misjudge how many people are present and waste power.
Security systems may focus on “phantom personnel” and waste monitoring resources.

Example: an empty chair mistakenly detected as a seated human

False Positive Challenges in VLM-based Vision Systems

In intelligent video analytics powered by Vision-Language Models (VLMs), false human detection is a frequent and critical issue.

Here are the most common patterns:

Pattern 1: Empty chairs detected as people

The detector misinterprets chair backrests as human torsos
Bias learned from data: “people usually sit on chairs”
The false detection is passed on to the VLM → “A person is working at the desk”

Pattern 2: Coats or clothing mistaken for humans

Thick coats or lab coats hanging on chairs appear human-shaped
The model overfits to the pattern: “humans wear clothes”
VLM → “A staff member is standing here” (but it’s just a coat!)

Pattern 3: Mannequins or posters

Similar silhouettes or textures to actual people
No context to distinguish static models from humans

We needed a fundamental solution.

To address this, EVA introduces a verification stage between detection and the VLM: Pose Estimation. Below, we share how we tackled the problem and what impact it had.

From Detection to Understanding

Let’s start by looking at the entire pipeline.

Intelligent Vision Analysis Pipeline

EVA’s vision solution operates in three core stages:

[Stage 1: Object Detection]
Image/Video → OmDet-Turbo → Detect object classes & locations
↓
[Stage 2: Verification & Filtering] ← 🎯 Focus of this article!
Only detected “person” objects → Verified with Pose Estimation
↓
[Stage 3: High-level Understanding]
Verified detections → VLM analysis → Semantic insight & response
↓
"Two customers are browsing the products."

Why Intermediate Filtering Matters

VLMs are powerful but computationally expensive. More importantly: garbage in → garbage out.

Even the best VLM will produce flawed results when fed incorrect detections.

Thus, before anything reaches the VLM, we must ensure:

Goal	Description
Accuracy	Remove objects that are not real humans
Efficiency	Reduce unnecessary VLM calls to save GPU resources

Pose Estimation allows us to accomplish both simultaneously.

Without filtering?

Scenario: retail store monitoring

❌ Without Pose Verification:
Detected → "3 humans"
→ VLM: "3 customers in the store"
→ Actual: 1 customer + 2 mannequins
→ Wrong analytics + wasted compute

✅ With Pose Verification:
Pose check → Only 1 object has valid human joints
→ VLM analyzes only the real human
→ Correct analytics + reduced GPU cost

Key insight: Verify cheaply before processing expensively.
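The three-stage flow described above can be sketched as a small function that gates the expensive VLM call behind the cheaper pose check. This is only an illustration: `Detection`, `run_pipeline`, and the three callables are hypothetical names, not EVA's actual API, and passing non-person objects through unchanged is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class Detection:
    label: str
    box: Box
    score: float


def run_pipeline(
    image: object,
    detect: Callable[[object], List[Detection]],
    verify_person: Callable[[object, Box], bool],
    describe: Callable[[object, Sequence[Detection]], str],
) -> str:
    """Detect objects, keep only pose-verified persons, then call the VLM.

    All three callables are placeholders for the real models
    (object detector, pose estimator, VLM).
    """
    # Stage 1: object detection
    detections = detect(image)

    # Stage 2: only "person" detections go through the pose check;
    # other classes pass through unchanged (an assumption of this sketch).
    verified = [
        d for d in detections
        if d.label != "person" or verify_person(image, d.box)
    ]

    # Stage 3: the VLM sees verified detections only
    return describe(image, verified)
```

Because the stages are injected as callables, each model can be swapped or stubbed out in tests without touching the control flow.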

What Is Pose Estimation?

Pose Estimation identifies human joints (keypoints) in images or videos.

It’s like seeing the skeletal structure behind the person.

Source: https://supervisely.com/blog/human-pose-estimation/

Human image → AI → Detect 17 keypoints:
Nose, plus left & right Eyes, Ears, Shoulders, Elbows,
Wrists, Hips, Knees, Ankles (1 + 8 × 2 = 17)

These 17 keypoints form the structure of a real human body.
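For reference, the 17 keypoints listed above match the widely used COCO keypoint convention. The sketch below assumes that ordering (the article does not name its keypoint format), and `label_keypoint_scores` is a hypothetical helper that attaches names to a pose model's per-keypoint confidence scores.

```python
from typing import Dict, Sequence

# COCO-style ordering (assumed; not confirmed by the article).
COCO_KEYPOINTS = (
    "nose",
    "left_eye", "right_eye",
    "left_ear", "right_ear",
    "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow",
    "left_wrist", "right_wrist",
    "left_hip", "right_hip",
    "left_knee", "right_knee",
    "left_ankle", "right_ankle",
)


def label_keypoint_scores(scores: Sequence[float]) -> Dict[str, float]:
    """Map a flat list of 17 confidence scores to keypoint names."""
    if len(scores) != len(COCO_KEYPOINTS):
        raise ValueError(
            f"expected {len(COCO_KEYPOINTS)} scores, got {len(scores)}"
        )
    return dict(zip(COCO_KEYPOINTS, scores))
```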

The difference between humans and false positives

Target	Visual Similarity	Keypoints	Result
🟢 Real human	High	Strong & complete	Valid
🔴 Chair	Medium	None or fragmented	Invalid
🔴 Chair + coat	High	Unnatural pattern	Invalid
🔴 Lab coat	High	Misaligned & low quality	Invalid

Humans have joints. False positives don’t.
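One simple way to turn the comparison above into a rule is to count how many keypoints the pose model reports with high confidence. The sketch below is a minimal version of that idea, not the exact production logic: the minimum of 4 keypoints comes from the Challenges table later in this article, while the 0.5 confidence cutoff and the function names are assumptions.

```python
from typing import Sequence


def count_confident_keypoints(
    scores: Sequence[float], min_confidence: float = 0.5
) -> int:
    """Count keypoints whose confidence meets the cutoff (0.5 is assumed)."""
    return sum(1 for s in scores if s >= min_confidence)


def is_valid_human(
    scores: Sequence[float],
    min_keypoints: int = 4,       # threshold cited in the Challenges table
    min_confidence: float = 0.5,  # hypothetical value for illustration
) -> bool:
    """Accept a detection only if enough keypoints are confidently found."""
    return count_confident_keypoints(scores, min_confidence) >= min_keypoints


# Illustrative scores only, not measured data:
person_scores = [0.9] * 17              # strong, complete skeleton
chair_scores = [0.6, 0.7] + [0.1] * 15  # a few fragmented hits

print(is_valid_human(person_scores))  # True
print(is_valid_human(chair_scores))   # False
```

Keeping both thresholds as parameters makes it easy to tune them per site, for example loosening the keypoint count where partial occlusion is common.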

Making Sure It’s Really a Human

Limitations of the original workflow

Previously:

Image → Object Detection → If “person” → VLM

The model classified anything that looked human-shaped as a person.

Real failing examples

Example: Chair backrest ≈ torso

Chair backrest → misread as head and torso
Chair legs → mistaken for human legs
→ Detected as sitting person ❌

Example: Coat on chair

Coat torso shape → detected as human body
Sleeves → detected as arms
→ VLM: “A staff member is working”
→ Reality: Just a coat on a chair ❌

Our revised approach: verify before you trust

[1] OD → Find potential humans
[2] Pose Estimation → Confirm real humans only
[3] VLM → High-level scene understanding

ViTPose++ was selected for its:

Strong accuracy across diverse postures
Lightweight small variant that enables real-time inference
Reliability in unpredictable on-site environments

Real-world Results

Before & After Comparison

Before: (OD only)
Detected: 28 persons
Actual: 15 persons
False Positive Rate: 46% (13 of 28 detections)
→ VLM: "28 people working"

After: (OD + Pose Estimation)
Detected = Actual: 15 persons
False Positive Rate: 0%
→ VLM: "15 people working"

No more chairs counted as humans. No more coats pretending to be employees.
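The before-and-after figures above can be reproduced with simple arithmetic. The helper below is a small sketch (the function name is hypothetical) that computes the share of detections that were not real people, assuming every real person was detected, which is what the reported counts imply.

```python
def false_positive_share(detected: int, actual: int) -> float:
    """Fraction of detections that were not real people.

    Assumes no real person was missed, so false positives = detected - actual.
    """
    if detected <= 0:
        raise ValueError("detected must be positive")
    if not 0 <= actual <= detected:
        raise ValueError("actual must be between 0 and detected")
    return (detected - actual) / detected


before = false_positive_share(detected=28, actual=15)  # 13 / 28
after = false_positive_share(detected=15, actual=15)   # 0 / 15

print(f"before: {before:.0%}, after: {after:.0%}")  # before: 46%, after: 0%
```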

Accuracy improved without performance loss

Detection + verification together → ~25 ms/frame
Real-time at about 40 FPS (1000 ms ÷ 25 ms)
Pose skipped when no humans detected
Batch processing keeps latency low
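The points above can be illustrated with a short sketch: converting per-frame latency to throughput, skipping the pose model when there is nothing to verify, and sending person crops to the model in batches. `verify_in_batches`, `pose_batch`, and the batch size of 8 are assumptions made for illustration, not details of the deployed system.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")  # a person crop (image region)
R = TypeVar("R")  # a per-crop verification result


def fps_from_latency(ms_per_frame: float) -> float:
    """Convert per-frame latency in milliseconds to frames per second."""
    return 1000.0 / ms_per_frame


def verify_in_batches(
    crops: Sequence[T],
    pose_batch: Callable[[Sequence[T]], List[R]],
    batch_size: int = 8,  # hypothetical value
) -> List[R]:
    """Run a batched pose check, skipping the model when no crops exist."""
    if not crops:
        return []  # no person detected: the pose model is never called
    results: List[R] = []
    for start in range(0, len(crops), batch_size):
        results.extend(pose_batch(crops[start:start + batch_size]))
    return results


print(fps_from_latency(25))  # 40.0
```

Batching amortizes per-call overhead across several crops, which is one plausible reason latency stays low even in crowded frames.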

Service Improvements

Use Case	Problem Before	After
Office occupancy	Coats & chairs counted as humans	Accuracy rose from 46% to 94%
Lab safety	“Ghost researchers” detected	Emergency response accuracy increased
Meeting room usage	Empty seats counted as attendees	True utilization tracked

Accuracy up. GPU cost down. Trust level up.

Technical Highlights

A VLM’s performance is only as good as its input quality. By introducing a lightweight verification layer, we enabled:

+35% VLM answer accuracy
-20% processing time
Massive drop in noisy detections

Small change, huge impact: A lightweight filter supercharges a heavyweight model.

Challenges & How We Overcame Them

Challenge	Issue	Solution
Occlusion	Partial humans filtered out	Optimal threshold: ≥ 4 keypoints
Data bias	Clothes/chairs mistaken for humans	Keypoint confidence check
Performance	More models = slower?	ViTPose-small + batch processing

We turned real-world complexity into a reliable configuration.

Lessons Learned & What’s Next

“The real world is far simpler—and far more complex—than we expect.”

We learned that:

Testing in real environments always matters more than theory
The natural posture of humans is the strongest clue
Real AI engineering is about designing the right pipeline, not stacking models

We’ll continue refining our system based on field insights and share more improvements along the way.

Conclusion

"An AI pipeline is only as strong as its weakest link."

Even top-tier VLMs fail when fed incorrect detections. By validating potential humans before VLM processing, we improved:

Reliability
Efficiency
User trust
Operational value

Not by adding more models, but by ensuring each component does the right job.
