EVA on Rebellions NPU: An Optimization Journey for Physical AI Services
EVA is a Physical AI platform that detects dangerous situations, security events, and worker safety conditions in real time from camera streams. For EVA to operate reliably at real-world sites, it is not enough for the AI model to simply run. The system must process simultaneous requests from multiple cameras and, when an event occurs, deliver results quickly enough that users experience the response as real time.
EVA has operated Vision Model, Vision Language Model (VLM), and Agent pipelines in GPU-based environments. To secure higher power efficiency and more cost-efficient scalability, we validated whether EVA could also run at commercial-service quality in a Rebellions NPU environment, and optimized the pipeline to get there.
Running EVA on NPU was not just a model-porting task. To operate a GPU-centric AI service pipeline stably on NPU, we had to optimize model compilation, input resolution, parallel processing, and CPU-NPU resource placement together.
Key Optimization Points Validated for NPU Productionization
| Category | Validation Point | EVA's Optimization Direction |
|---|---|---|
| Model compatibility | GPU-based models need to be transformed for NPU execution structure | Split model structure, fix input shapes, separate preprocessing/postprocessing |
| Input resolution | A balance is required between inference speed and fine-grained visual accuracy | Instead of shrinking the full image, crop required regions and reconstruct at target resolution |
| Detection quality | Confirm whether quality is maintained after NPU porting, quantization, and resolution tuning | Scenario-based performance validation using real operational data |
| Inference structure | As VLM requests increase, Vision Encoder and Decoder load management becomes critical | Decompose detection into smaller tasks and run only required inference |
| Concurrency | A stable execution architecture is needed for many simultaneous camera requests | Per-core worker placement on NPU and distributed multi-vLLM instances |
| Server resource placement | Alignment across CPU, memory, and NPU is critical in multi-instance operation | Apply CPU Pinning and NUMA Alignment |
1. Optimizing Models for NPU Execution Structure
The first question in NPU environments was whether each model could be transformed into a structure suitable for NPU execution. In GPU environments, PyTorch-based models can be run relatively flexibly. In NPU environments, however, model structure and input formats must be explicitly defined through a dedicated compiler.
Because EVA's Vision models handle diverse camera inputs and scenarios, we had to optimize not only model execution itself, but also preprocessing, postprocessing, input shape policy, and coordinate restoration.
To address this, EVA reorganized model structures into NPU-friendly forms, fixed input shapes, and rebuilt image preprocessing and output-to-original-coordinate restoration in line with the EVA pipeline.
This was not merely about placing a model on NPU. It was an optimization effort to provide stable response times and consistent inference outputs in real production environments.
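The fixed-input-shape preprocessing and the output-to-original-coordinate restoration described above can be sketched as follows. This is a minimal illustration, not EVA's actual implementation: it assumes aspect-preserving "letterbox" resizing with centered padding, boxes in `(x1, y1, x2, y2)` pixel format, and hypothetical names (`LetterboxMeta`, `letterbox_meta`, `restore_box`).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LetterboxMeta:
    scale: float  # resize factor applied to the original frame
    pad_x: float  # padding added on the left, in model-input pixels
    pad_y: float  # padding added on the top, in model-input pixels


def letterbox_meta(src_w: int, src_h: int, dst_w: int, dst_h: int) -> LetterboxMeta:
    """Compute how a frame is fitted into a fixed model input shape."""
    scale = min(dst_w / src_w, dst_h / src_h)  # keep the aspect ratio
    pad_x = (dst_w - src_w * scale) / 2        # center horizontally
    pad_y = (dst_h - src_h * scale) / 2        # center vertically
    return LetterboxMeta(scale, pad_x, pad_y)


def restore_box(
    box: Tuple[float, float, float, float],
    meta: LetterboxMeta,
    src_w: int,
    src_h: int,
) -> Tuple[float, float, float, float]:
    """Map a box from model-input coordinates back to the original frame."""
    def unmap(value: float, pad: float, limit: int) -> float:
        # Undo the padding, undo the scaling, then clamp to the frame.
        return min(max((value - pad) / meta.scale, 0.0), float(limit))

    x1, y1, x2, y2 = box
    return (
        unmap(x1, meta.pad_x, src_w),
        unmap(y1, meta.pad_y, src_h),
        unmap(x2, meta.pad_x, src_w),
        unmap(y2, meta.pad_y, src_h),
    )
```

Because the compiled model always sees the same input shape, the metadata returned by `letterbox_meta` is the only per-frame state needed to restore detections to camera coordinates.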
2. Balancing Input Resolution and Fine-Grained Decision Quality
Raising inference speed on NPU requires input-resolution optimization. But if full images are uniformly downscaled across all scenarios, performance can degrade in cases that require fine-grained decisions.
This trade-off is especially important in PPE compliance scenarios that require precise recognition of small regions. Helmets, masks, goggles, and gloves occupy small portions of an image, so downscaling the full frame can weaken critical visual details.
EVA did not solve this by simply raising full-image resolution. Instead, EVA first identifies the region required for decision-making, then crops that region and reconstructs it at the model's required input resolution.
For example, when judging PPE compliance, EVA does not pass the full frame directly to VLM. It crops around the person region and scales that crop to the model input size before inference.
This approach offers several advantages.
- It reduces inference cost by avoiding high-resolution processing of the full image.
- It preserves sufficient resolution in the region the model actually needs to inspect.
- It mitigates accuracy loss in small-object or fine-state judgment.
In other words, EVA does not simply reduce resolution in NPU environments. It distinguishes between regions the model must inspect and regions it does not.
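A rough sketch of the crop-then-rescale idea follows. The margin value, the model input size, and the function names (`expand_and_clamp`, `fit_scale`, `detail_gain`) are assumptions made for illustration; the source does not specify them.

```python
import math
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in frame pixels


def expand_and_clamp(box: Box, frame_w: int, frame_h: int, margin: float = 0.1) -> Box:
    """Grow a person box by a relative margin and keep it inside the frame."""
    x1, y1, x2, y2 = box
    mx = (x2 - x1) * margin
    my = (y2 - y1) * margin
    return (
        max(0, math.floor(x1 - mx)),
        max(0, math.floor(y1 - my)),
        min(frame_w, math.ceil(x2 + mx)),
        min(frame_h, math.ceil(y2 + my)),
    )


def fit_scale(src_w: int, src_h: int, dst_w: int, dst_h: int) -> float:
    """Aspect-preserving scale factor that fits src inside dst."""
    return min(dst_w / src_w, dst_h / src_h)


def detail_gain(box: Box, frame_w: int, frame_h: int,
                dst_w: int, dst_h: int, margin: float = 0.1) -> float:
    """How many times more pixels per side the region keeps when cropped
    first, compared with downscaling the full frame to the same input size."""
    x1, y1, x2, y2 = expand_and_clamp(box, frame_w, frame_h, margin)
    crop_scale = fit_scale(x2 - x1, y2 - y1, dst_w, dst_h)
    full_scale = fit_scale(frame_w, frame_h, dst_w, dst_h)
    return crop_scale / full_scale
```

As a worked example under these assumptions, a 200x400 pixel person in a 1920x1080 frame, cropped with a 10% margin and scaled to a 448x448 input, keeps about four times more pixels per side than the same person in a downscaled full frame.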
3. Detection Quality Validation with Real Operational Data
For NPU productionization, speed alone is not enough. Detection quality must be validated as well. Even if a model runs stably on NPU, deployment requires confirming that decision quality is maintained relative to GPU environments.
EVA built its own validation dataset from real detection data collected across operational environments. Rather than relying on public benchmarks, we validated performance against the scenarios EVA actually needs to judge.
The validation set included field scenarios such as the following.
| Scenario Type | Examples |
|---|---|
| Safety | No helmet, no mask, no gloves, fall detection |
| Security | Loitering, fence crossing, access/presence verification |
| Equipment/Operations | Workers around forklifts, equipment contact, load detection |
| Disaster/Environment | Fire, smoke/flame, oil spills on floors |
| Vehicle | Emergency vehicle detection, police vehicle detection |
Using this data, we validated identical scenarios in both GPU and NPU environments. As a result, we confirmed that NPU could maintain quality at a level similar to GPU across major scenarios.
That said, scenarios requiring fine visual detail for small objects remain sensitive to resolution changes, so combining this with the crop-based input optimization described above is important.
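One way to compare the two environments scenario by scenario is sketched below. The record layout (`scenario`, `label`, `gpu`, `npu` keys), the accuracy metric, and the 0.02 tolerance are all assumptions for illustration; the source does not state which metrics or thresholds EVA used.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def per_scenario_accuracy(records: Iterable[Mapping], backend: str) -> Dict[str, float]:
    """Fraction of samples per scenario where the backend matches the label."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for record in records:
        scenario = record["scenario"]
        totals[scenario] += 1
        hits[scenario] += int(record[backend] == record["label"])
    return {scenario: hits[scenario] / totals[scenario] for scenario in totals}


def npu_regressions(records: Iterable[Mapping], tolerance: float = 0.02) -> Dict[str, float]:
    """Scenarios where NPU accuracy trails GPU accuracy by more than tolerance."""
    records = list(records)  # allow two passes over a generator
    gpu = per_scenario_accuracy(records, "gpu")
    npu = per_scenario_accuracy(records, "npu")
    return {
        scenario: gpu[scenario] - npu[scenario]
        for scenario in gpu
        if gpu[scenario] - npu[scenario] > tolerance
    }
```

Reporting the gap per scenario, rather than as one aggregate number, makes resolution-sensitive cases such as small PPE items visible instead of letting them be averaged away.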
4. Task Decomposition to Improve VLM Call Efficiency
In NPU environments, it is often more effective to select only the requests that truly require high-cost inference than to process everything as one large inference task. Since VLM combines a Vision Encoder and an LLM Decoder, request volume and execution stages must be managed systematically when many cameras generate requests at once.
EVA does not run VLM on every frame. The Vision Model first checks whether the relevant objects are present and baseline conditions are met, and VLM runs only when needed.
For example, in PPE non-compliance detection, frames without a person are filtered out early. Follow-up reasoning runs only when a person region is confirmed. Complex scenarios are also split into smaller tasks such as detection-stage decisions, exception checks, and alert-message generation, rather than handled as one large request.
This lets EVA manage VLM calls efficiently on NPU and concentrate compute resources on requests that actually require semantic reasoning.
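The gating flow can be sketched as below. The three callables (`detect_persons`, `vlm_judge`, `vlm_alert`) are hypothetical stand-ins for the Vision Model and the two VLM stages; their names and signatures are assumptions, not EVA's API.

```python
from typing import Any, Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]


def process_frame(
    frame: Any,
    detect_persons: Callable[[Any], Sequence[Box]],
    vlm_judge: Callable[[Any, Box], bool],
    vlm_alert: Callable[[Any, Box], str],
) -> List[str]:
    """Run the cheap detector first and call the VLM only when needed."""
    person_boxes = detect_persons(frame)
    if not person_boxes:
        return []  # no person: the frame never reaches the VLM

    alerts: List[str] = []
    for box in person_boxes:
        # Stage 1: a narrow yes/no judgment on one person region.
        if vlm_judge(frame, box):
            # Stage 2: generate an alert message only for confirmed violations.
            alerts.append(vlm_alert(frame, box))
    return alerts
```

With this structure, the number of VLM calls scales with the number of detected persons and confirmed violations rather than with the raw frame rate.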
5. Distributed Multi-NPU Workers and vLLM Instances
To run EVA as a commercial service on NPU, the system must stably handle simultaneous Vision and VLM requests from many cameras. EVA therefore separated Vision and Agent domains, and distributed NPU cores and vLLM instances by role.
| Domain | Workload | NPU Optimization Direction |
|---|---|---|
| Vision Worker | Object-detection requests | Worker placement by NPU core |
| vLLM Instance | VLM-based situation understanding | Distributed multi-instance configuration |
| Vision Encoder | Image input processing | Load mitigation through request distribution |
| Agent Pipeline | VLM inference request control | Forward only required requests to VLM |
In the Vision domain, object-detection workers are mapped per NPU core so many camera requests can run in parallel. Worker counts are tuned by camera volume and per-model request load so NPU cores are used stably.
In the VLM domain, requests are not concentrated into a single vLLM instance. Multiple vLLM instances are distributed to secure concurrent throughput. This is especially important for balancing load in the Vision Encoder path.
In short, EVA's NPU optimization is not just about adding more NPU hardware. It is about placing Vision workers and vLLM instances according to NPU architecture to improve end-to-end inference pipeline stability.
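A minimal sketch of the two placement ideas follows. The source does not say which distribution policy EVA uses, so plain round-robin is assumed here, and `assign_workers_to_cores` and `RoundRobinRouter` are hypothetical names.

```python
import threading
from itertools import cycle
from typing import Dict, Sequence


def assign_workers_to_cores(num_workers: int, core_ids: Sequence[int]) -> Dict[int, int]:
    """Map each Vision worker index to an NPU core, wrapping around if needed."""
    if not core_ids:
        raise ValueError("at least one NPU core is required")
    return {worker: core_ids[worker % len(core_ids)] for worker in range(num_workers)}


class RoundRobinRouter:
    """Spread VLM requests evenly across several vLLM instances."""

    def __init__(self, endpoints: Sequence[str]) -> None:
        if not endpoints:
            raise ValueError("at least one vLLM endpoint is required")
        self._endpoints = cycle(list(endpoints))
        self._lock = threading.Lock()  # request handlers may run concurrently

    def next_endpoint(self) -> str:
        with self._lock:
            return next(self._endpoints)
```

A caller would construct the router once with its instance addresses (for example, hypothetical URLs such as `http://vllm-0:8000`) and ask it for an endpoint before each VLM request. A load-aware policy could replace round-robin without changing the calling code.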
6. NUMA Alignment and CPU Pinning
NUMA stands for Non-Uniform Memory Access, a memory architecture in which access latency depends on whether the memory is attached to the CPU socket making the access or to a remote one.
In NPU optimization, server resource placement is as important as models and pipelines. In real production environments, multiple vLLM instances run on a single server. Which CPU cores each process uses, which NUMA node those cores belong to, and how physically close they are to the NPU can all affect end-to-end response time.
In multi-instance environments, it is critical to keep the data path between CPU, memory, and NPU explicit and local for each instance. For this reason, EVA validated a structure that applies CPU Pinning and NUMA Alignment per vLLM instance.
| Item | Description |
|---|---|
| CPU Pinning | Fix CPU cores used by each process |
| NUMA Alignment | Align CPU/memory usage to the NUMA node nearest the NPU |
| Container isolation | Isolate vLLM instances by Docker container or Kubernetes Pod |
| Expected effect | Reduced cross-instance interference and lower data movement cost |
In final production environments, each vLLM should run in an isolated container or pod, with CPU Pinning and NUMA Alignment applied per deployment.
This is not simple server tuning. It is an infrastructure-level optimization required for stable NPU-based AI service operation.
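At the process level, CPU pinning to one NUMA node can be sketched as follows on Linux. This is an illustration under stated assumptions: the NUMA node nearest the NPU is taken as already known, the helper names are hypothetical, and only CPU affinity is set here (memory binding would need a separate mechanism).

```python
import os
from typing import Set


def parse_cpulist(text: str) -> Set[int]:
    """Parse a Linux cpulist string such as '0-3,8,10-11' into CPU ids."""
    cpus: Set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def numa_node_cpus(node: int) -> Set[int]:
    """Read the CPUs that belong to a NUMA node from Linux sysfs."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as handle:
        return parse_cpulist(handle.read())


def pin_current_process(cpus: Set[int]) -> None:
    """Restrict the calling process to the given CPUs (Linux only)."""
    os.sched_setaffinity(0, cpus)  # pid 0 means the current process
```

In a containerized deployment, the same effect is usually expressed in the container configuration rather than in code, for example with Docker's `--cpuset-cpus` and `--cpuset-mems` options, so that each vLLM instance gets its own cores and local memory node.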
Closing
Throughout this optimization journey, EVA did not treat NPU as a simple GPU replacement. To run a GPU-centric AI service pipeline stably on NPU, we had to jointly optimize model structure, input pipelines, inference stages, worker placement, vLLM instance configuration, and server resource alignment.
The key elements optimized together for stable EVA operation on NPU were:
- NPU-aligned model structures
- Input pipelines with fixed input shapes
- Crop-based input optimization to offset resolution-loss effects
- Detection-quality validation based on real operational data
- Distributed processing architecture across Vision and VLM
- Resource alignment based on CPU Pinning and NUMA Alignment
Through this process, EVA secured a structure that maintains performance in major detection scenarios while handling many concurrent camera requests stably in NPU environments.
Going forward, EVA will continue advancing Vision Encoder parallelization, model quantization, multi-NPU scheduling, and container-based resource isolation in Rebellions NPU environments, expanding into a Physical AI platform that runs reliably across diverse hardware infrastructures.
