EVA on Rebellions NPU: An Optimization Journey for Physical AI Services

June 15, 2026 · 8 min read

Gyulim Gu

Tech Leader

EVA is a Physical AI platform that detects dangerous situations, security events, and worker safety conditions in real time from camera streams. For EVA to operate reliably at real-world sites, it is not enough for the AI model to simply run. The system must process simultaneous requests from multiple cameras and, when an event occurs, deliver results quickly enough that users experience the response as real time.

EVA has operated Vision Model, Vision Language Model (VLM), and Agent pipelines in GPU-based environments. To secure higher power efficiency and more cost-efficient scalability, we validated whether EVA could also run at commercial-service quality in a Rebellions NPU environment, and optimized the pipeline accordingly.

Running EVA on NPU was not just a model-porting task. To operate a GPU-centric AI service pipeline stably on NPU, we had to optimize model compilation, input resolution, parallel processing, and CPU-NPU resource placement together.

Key Optimization Points Validated for NPU Productionization

Category	Validation Point	EVA's Optimization Direction
Model compatibility	GPU-based models need to be transformed for NPU execution structure	Split model structure, fix input shapes, separate preprocessing/postprocessing
Input resolution	A balance is required between inference speed and fine-grained visual accuracy	Instead of shrinking the full image, crop required regions and reconstruct at target resolution
Detection quality	Confirm whether quality is maintained after NPU porting, quantization, and resolution tuning	Scenario-based performance validation using real operational data
Inference structure	As VLM requests increase, Vision Encoder and Decoder load management becomes critical	Decompose detection into smaller tasks and run only required inference
Concurrency	A stable execution architecture is needed for many simultaneous camera requests	Per-core worker placement on NPU and distributed multi-vLLM instances
Server resource placement	Alignment across CPU, memory, and NPU is critical in multi-instance operation	Apply CPU Pinning and NUMA Alignment

1. Optimizing Models for NPU Execution Structure

The first question in NPU environments was whether each model could be transformed into a structure suitable for NPU execution. In GPU environments, PyTorch-based models can be run relatively flexibly. In NPU environments, however, model structure and input formats must be explicitly defined through a dedicated compiler.

Because EVA's Vision models handle diverse camera inputs and scenarios, we had to optimize not only model execution itself, but also preprocessing, postprocessing, input shape policy, and coordinate restoration.

To address this, EVA reorganized model structures into NPU-friendly forms, fixed input shapes, and rebuilt image preprocessing and output-to-original-coordinate restoration in line with the EVA pipeline.
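As a rough illustration of what a fixed input shape and coordinate restoration can involve, the sketch below fits an arbitrary camera frame into one fixed model input (letterboxing) and maps a detected box back to original pixels. It is not EVA's actual code: the names `LetterboxParams`, `letterbox_params`, and `restore_box`, the 640x640 input size, and the centered-padding policy are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LetterboxParams:
    scale: float  # factor applied to the original frame
    pad_x: float  # left padding, in model-input pixels
    pad_y: float  # top padding, in model-input pixels


def letterbox_params(src_w: int, src_h: int, dst_w: int, dst_h: int) -> LetterboxParams:
    """Scale and padding that fit a src frame into a fixed dst shape, aspect preserved."""
    scale = min(dst_w / src_w, dst_h / src_h)
    pad_x = (dst_w - src_w * scale) / 2
    pad_y = (dst_h - src_h * scale) / 2
    return LetterboxParams(scale, pad_x, pad_y)


def restore_box(box, p: LetterboxParams):
    """Map an (x1, y1, x2, y2) box from model-input space back to original pixels."""
    x1, y1, x2, y2 = box
    return (
        (x1 - p.pad_x) / p.scale,
        (y1 - p.pad_y) / p.scale,
        (x2 - p.pad_x) / p.scale,
        (y2 - p.pad_y) / p.scale,
    )


# Hypothetical example: a 1920x1080 camera frame and a 640x640 compiled input.
params = letterbox_params(1920, 1080, 640, 640)
original_box = restore_box((0, 140, 640, 500), params)
```

Because the compiled model only ever sees one shape, the per-frame variation lives entirely in `params`, which is why the restoration step has to be carried alongside each request.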

This was not merely about placing a model on NPU. It was an optimization effort to provide stable response times and consistent inference outputs in real production environments.

2. Balancing Input Resolution and Fine-Grained Decision Quality

Raising inference speed on NPU requires input-resolution optimization. But if full images are uniformly downscaled across all scenarios, performance can degrade in cases that require fine-grained decisions.

This trade-off is especially important in PPE compliance scenarios that require precise recognition of small regions. Helmets, masks, goggles, and gloves occupy small portions of an image, so downscaling the full frame can weaken critical visual details.

EVA did not solve this by simply raising full-image resolution. Instead, EVA first identifies the region required for decision-making, then crops that region and reconstructs it at the model's required input resolution.

For example, when judging PPE compliance, EVA does not pass the full frame directly to the VLM. It crops around the person region and scales that crop to the model input size before inference.

This approach offers several advantages.

It reduces inference cost by avoiding high-resolution processing of the full image.
It preserves sufficient resolution in the region the model actually needs to inspect.
It mitigates accuracy loss in small-object or fine-state judgment.

In other words, EVA does not simply reduce resolution in NPU environments. It distinguishes between regions the model must inspect and regions it does not.
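The crop-then-rescale idea can be sketched in a few lines. This is a minimal illustration rather than EVA's implementation: the helper names, the 10% margin around the person box, and the 448-pixel target size are assumptions chosen only to make the arithmetic concrete.

```python
def expand_and_clamp(box, frame_w, frame_h, margin=0.1):
    """Grow a person box by a relative margin, then clamp it to the frame bounds."""
    x1, y1, x2, y2 = box
    mx = (x2 - x1) * margin
    my = (y2 - y1) * margin
    return (
        max(0, round(x1 - mx)),
        max(0, round(y1 - my)),
        min(frame_w, round(x2 + mx)),
        min(frame_h, round(y2 + my)),
    )


def detail_scale(region_w, region_h, target=448):
    """Model pixels per source pixel when a region is fit into target x target."""
    return min(target / region_w, target / region_h)


# Hypothetical 1920x1080 frame with one person box from the detector.
frame_w, frame_h = 1920, 1080
person = (900, 300, 1050, 700)

crop = expand_and_clamp(person, frame_w, frame_h)
crop_w, crop_h = crop[2] - crop[0], crop[3] - crop[1]

full_frame_scale = detail_scale(frame_w, frame_h)  # whole frame squeezed into the input
crop_scale = detail_scale(crop_w, crop_h)          # only the person region
```

With these assumed numbers, the cropped region keeps roughly four times as many model pixels per source pixel as the downscaled full frame, which is the detail a helmet or glove decision depends on.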

3. Detection Quality Validation with Real Operational Data

For NPU productionization, speed alone is not enough. Detection quality must be validated as well. Even if a model runs stably on NPU, deployment requires confirming that decision quality is maintained relative to GPU environments.

EVA built its own validation dataset from real detection data collected across operational environments. Rather than relying on public benchmarks, we validated performance against the scenarios EVA actually needs to judge.

The validation set included field scenarios such as the following.

Scenario Type	Examples
Safety	No helmet, no mask, no gloves, fall detection
Security	Loitering, fence crossing, access/presence verification
Equipment/Operations	Workers around forklifts, equipment contact, load detection
Disaster/Environment	Fire, smoke/flame, oil spills on floors
Vehicle	Emergency vehicle detection, police vehicle detection

Using this data, we validated identical scenarios in both GPU and NPU environments. As a result, we confirmed that NPU could maintain quality at a level similar to GPU across major scenarios.
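One way to structure such a like-for-like comparison is to score both backends on the same labeled samples and report results per scenario. The sketch below is an assumption about how that bookkeeping could look, not a description of EVA's validation tooling; the record layout and metric names are hypothetical, and the sample values are placeholders rather than EVA measurements.

```python
from collections import defaultdict


def per_scenario_report(records):
    """Summarize GPU vs NPU decisions per scenario.

    records: iterable of (scenario, label, gpu_pred, npu_pred), all bool except scenario.
    """
    stats = defaultdict(lambda: {"n": 0, "gpu_ok": 0, "npu_ok": 0, "agree": 0})
    for scenario, label, gpu_pred, npu_pred in records:
        s = stats[scenario]
        s["n"] += 1
        s["gpu_ok"] += gpu_pred == label      # GPU decision matches ground truth
        s["npu_ok"] += npu_pred == label      # NPU decision matches ground truth
        s["agree"] += gpu_pred == npu_pred    # both backends made the same decision
    return {
        name: {
            "gpu_acc": s["gpu_ok"] / s["n"],
            "npu_acc": s["npu_ok"] / s["n"],
            "agreement": s["agree"] / s["n"],
        }
        for name, s in stats.items()
    }


# Placeholder records for illustration only.
sample = [
    ("no_helmet", True, True, True),
    ("no_helmet", False, False, True),
    ("fire", True, True, True),
]
report = per_scenario_report(sample)
```

Tracking agreement separately from accuracy helps distinguish cases where the NPU port changed a decision from cases where both backends were already wrong.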

That said, scenarios requiring fine visual detail for small objects remain sensitive to resolution changes, so combining this with the crop-based input optimization described above is important.

4. Task Decomposition to Improve VLM Call Efficiency

In NPU environments, it is often more effective to pass on only the requests that truly require high-cost inference than to process everything as one large inference task. Because a VLM combines a Vision Encoder and an LLM Decoder, request volume and execution stages must be managed systematically when many cameras generate requests at once.

EVA does not run the VLM on every frame. The Vision Model (VM) first checks object existence and baseline conditions, and the VLM runs only when needed.

For example, in PPE non-compliance detection, frames without a person are filtered out early. Follow-up reasoning runs only when a person region is confirmed. Complex scenarios are also split into smaller tasks such as detection-stage decisions, exception checks, and alert-message generation, rather than handled as one large request.

This lets EVA manage VLM calls efficiently on NPU and concentrate compute resources on requests that actually require semantic reasoning.
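The gating pattern described in this section can be sketched as a two-stage pipeline in which the cheap detector decides whether the expensive model is called at all. The class and callable names below are hypothetical, and the stages are injected as plain functions so the sketch stays independent of any specific model or serving API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]


@dataclass
class GatedPipeline:
    detect_persons: Callable[[object], List[Box]]  # cheap Vision Model stage
    ask_vlm: Callable[[object, Box], str]          # expensive VLM stage
    vlm_calls: int = 0                             # counter to observe the saving

    def process(self, frame) -> List[str]:
        boxes = self.detect_persons(frame)
        if not boxes:
            return []  # no person in the frame: filtered out before any VLM call
        answers = []
        for box in boxes:
            self.vlm_calls += 1
            answers.append(self.ask_vlm(frame, box))
        return answers


# Stub stages standing in for real models (assumptions for illustration).
pipeline = GatedPipeline(
    detect_persons=lambda frame: frame.get("persons", []),
    ask_vlm=lambda frame, box: "helmet check requested",
)
pipeline.process({"persons": []})                    # skipped early
pipeline.process({"persons": [(10, 20, 110, 320)]})  # one VLM call
```

The same shape extends to the smaller follow-up tasks mentioned above: each stage receives only the requests that survived the previous one.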

5. Distributed Multi-NPU Workers and vLLM Instances

To run EVA as a commercial service on NPU, the system must stably handle simultaneous Vision and VLM requests from many cameras. EVA therefore separated Vision and Agent domains, and distributed NPU cores and vLLM instances by role.

Domain	Workload	NPU Optimization Direction
Vision Worker	Object-detection requests	Worker placement by NPU core
vLLM Instance	VLM-based situation understanding	Distributed multi-instance configuration
Vision Encoder	Image input processing	Load mitigation through request distribution
Agent Pipeline	VLM inference request control	Forward only required requests to VLM

In the Vision domain, object-detection workers are mapped per NPU core so many camera requests can run in parallel. Worker counts are tuned by camera volume and per-model request load so NPU cores are used stably.

In the VLM domain, requests are not concentrated into a single vLLM instance. Multiple vLLM instances are distributed to secure concurrent throughput. This is especially important for balancing load in the Vision Encoder path.

In short, EVA's NPU optimization is not just about adding more NPU hardware. It is about placing Vision workers and vLLM instances according to NPU architecture to improve end-to-end inference pipeline stability.
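To make the idea of spreading requests over several vLLM instances concrete, the sketch below routes each request to the instance with the fewest requests in flight. The article does not state which routing policy EVA uses, so least-loaded selection is an assumption here, as are the class name and the endpoint strings.

```python
import threading
from contextlib import contextmanager


class LeastLoadedRouter:
    """Pick the endpoint with the fewest in-flight requests (ties: first listed)."""

    def __init__(self, endpoints):
        self._inflight = {endpoint: 0 for endpoint in endpoints}
        self._lock = threading.Lock()

    @contextmanager
    def acquire(self):
        # Reserve a slot on the least-loaded endpoint for the duration of one request.
        with self._lock:
            endpoint = min(self._inflight, key=self._inflight.get)
            self._inflight[endpoint] += 1
        try:
            yield endpoint
        finally:
            with self._lock:
                self._inflight[endpoint] -= 1

    def snapshot(self):
        """Copy of the current in-flight counts, for monitoring."""
        with self._lock:
            return dict(self._inflight)


# Hypothetical endpoints; real hosts and ports depend on the deployment.
router = LeastLoadedRouter(["http://vllm-0:8000", "http://vllm-1:8000"])
with router.acquire() as endpoint:
    pass  # send the VLM request to `endpoint` here
```

Counting in-flight requests rather than rotating blindly matters when image-heavy requests take much longer than others, since a slow request keeps its instance marked as busy.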

6. NUMA Alignment and CPU Pinning

NUMA stands for Non-Uniform Memory Access, a memory architecture where access latency varies by CPU socket and memory locality.

In NPU optimization, server resource placement is as important as models and pipelines. In real production environments, multiple vLLM instances run on a single server. Which CPU cores each process uses, which NUMA node those cores belong to, and how physically close they are to the NPU can all affect end-to-end response time.

In multi-instance environments, it is critical to keep the data movement path between CPU, memory, and NPU short and well defined. For this reason, EVA validated a structure that applies CPU Pinning and NUMA Alignment per vLLM instance.

Item	Description
CPU Pinning	Fix CPU cores used by each process
NUMA Alignment	Align CPU/memory usage to the NUMA node nearest the NPU
Container isolation	Isolate vLLM instances by Docker container or Kubernetes Pod
Expected effect	Reduced cross-instance interference and lower data movement cost

In final production environments, each vLLM should run in an isolated container or pod, with CPU Pinning and NUMA Alignment applied per deployment.
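A minimal sketch of per-instance core assignment on Linux is shown below. It assumes the cores of a NUMA node are read from `/sys/devices/system/node/node<N>/cpulist` and that each vLLM instance gets an equal, disjoint slice of that node; the helper names and the equal-split policy are illustrative assumptions, not EVA's deployment scripts.

```python
import os


def parse_cpulist(text):
    """Parse Linux cpulist syntax such as '0-3,8-11' into a sorted list of core ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def split_cores(cpus, n_instances):
    """Give each instance a disjoint, equally sized slice; leftover cores stay unused."""
    per_instance = len(cpus) // n_instances
    if per_instance == 0:
        raise ValueError("more instances than cores on this NUMA node")
    return [cpus[i * per_instance:(i + 1) * per_instance] for i in range(n_instances)]


def pin_current_process(cpus):
    """Pin the calling process to the given cores (no-op where unsupported)."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, set(cpus))


# Hypothetical node with 16 cores shared by two vLLM instances.
node_cpus = parse_cpulist("0-7,32-39")
assignments = split_cores(node_cpus, 2)
```

Note that `os.sched_setaffinity` constrains CPU placement only. Keeping memory on the same node needs a separate mechanism, for example `numactl --membind` or Docker's `--cpuset-mems` alongside `--cpuset-cpus` when each instance runs in its own container.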

This is not simple server tuning. It is an infrastructure-level optimization required for stable NPU-based AI service operation.

Closing

Throughout this optimization journey, EVA did not treat NPU as a simple GPU replacement. To run a GPU-centric AI service pipeline stably on NPU, we had to jointly optimize model structure, input pipelines, inference stages, worker placement, vLLM instance configuration, and server resource alignment.

The key elements optimized together for stable EVA operation on NPU were:

NPU-aligned model structures
Fixed input pipelines
Crop-based input optimization to offset resolution-loss effects
Detection-quality validation based on real operational data
Distributed processing architecture across Vision and VLM
Resource alignment based on CPU Pinning and NUMA Alignment

Through this process, EVA secured a structure that maintains performance in major detection scenarios while handling many concurrent camera requests stably in NPU environments.

Going forward, EVA will continue advancing Vision Encoder parallelization, model quantization, multi-NPU scheduling, and container-based resource isolation in Rebellions NPU environments, expanding into a Physical AI platform that runs reliably across diverse hardware infrastructures.
