EVA x Rebellions: Journey of EVA on NPU
The integration and optimization journey between Mellerikat EVA and Rebellions NPU clearly demonstrates the future direction of next-generation AI infrastructure. Through this project, we verified that NPU-based architectures can address the high cost and power consumption challenges of traditional GPU-centric infrastructures. In particular, in Physical AI environments—where real-time perception and reasoning are critical—we confirmed the potential to achieve both significant TCO (Total Cost of Ownership) reduction and high performance simultaneously.
Today, we would like to share the process of porting GPU-based models to NPUs, along with the technical challenges behind it, a topic many readers have asked about.
1. The NPU Porting Process for GPU Models
Since NPUs are designed to accelerate specific types of computations, newly released models cannot be executed immediately without adaptation. To fully utilize the hardware’s capabilities, several essential steps are required.
- Model Conversion: The original models developed in PyTorch or TensorFlow must be converted into an executable format that the NPU can understand. Using the ATOM Compiler from Rebellions, the model's computational graph is analyzed and converted into the .rbln executable format optimized for the NPU architecture.
- NPU-Optimized Compilation: The model is compiled into a hardware-optimized executable using the compiler in the Rebellions SDK (RBLN SDK).
  - Graph Optimization: Removes redundant operations and reorganizes the data flow.
  - Operator Fusion: Combines multiple small operations into a single large kernel to reduce memory access and execution overhead.
  - Data Layout Optimization: Adjusts tensor layouts to match the NPU memory architecture, improving data access efficiency.
- Quantization: Computational precision is adjusted to match the NPU architecture, improving both performance and memory efficiency. In the case of EVA, we optimized the model to ensure stable performance under an FP16-based inference environment.
- vLLM Integration and Validation: The optimized model is deployed within the vLLM-RBLN serving framework. Key metrics such as TTFT (Time To First Token) and throughput are measured and validated against GPU-based environments.
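To make the validation step above concrete, here is a minimal, stand-alone sketch of how TTFT and throughput can be measured over a streamed generation call. The `fake_generate` stub is a hypothetical stand-in for a real streaming call into vLLM-RBLN (which is not shown here); only the measurement logic is the point.

```python
import time

def fake_generate(prompt, n_tokens=32, step_s=0.0):
    """Hypothetical stand-in for a streaming LLM call; yields tokens one by one."""
    for i in range(n_tokens):
        if step_s:
            time.sleep(step_s)
        yield f"tok{i}"

def measure(stream):
    """Return (ttft_seconds, tokens_per_second) for one streamed request.

    TTFT is the delay until the first token arrives; throughput is the
    total token count divided by total wall-clock time.
    """
    t0 = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - t0
        count += 1
    total = time.perf_counter() - t0
    return ttft, (count / total if total > 0 else 0.0)

ttft, tps = measure(fake_generate("hello", n_tokens=8))
print(f"TTFT={ttft * 1e3:.2f} ms, throughput={tps:.0f} tok/s")
```

In a real comparison, the same harness would wrap a streamed request against the GPU baseline and the NPU deployment, so both report the same two numbers.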
2. EVA Application Optimization and Technical Challenges
After porting the foundation model, the next step is deploying the actual service layer—the EVA Application. During this stage, we have been implementing the following optimization roadmap.
- EVA Vision Optimization (1:1 Mapping & Batching): We mapped NPU cores and Vision Workers in a 1:1 configuration, eliminating context-switching overhead. In addition, by applying continuous batching techniques, we are building a foundation capable of processing data from hundreds of cameras in real time without latency.
- EVA Agent Optimization (Reducing VLM Load): The input resolution of the Vision-Language Model (VLM) was standardized to 1280×720, and a two-stage reasoning architecture was applied to minimize unnecessary VLM calls. This immediately reduces the computational load on the Vision Encoder, which is one of the most expensive components in the pipeline.
- System Memory Management and KV Cache Optimization: In collaboration with Rebellions, we analyzed the memory usage patterns of vLLM-RBLN instances and improved resource utilization using a page-based memory management structure. This optimization allows the system to process a larger volume of visual data reliably within the same hardware environment.
- Parallel Processing of the VLM Vision Encoder: We are also improving the parallel execution architecture of the Vision Encoder, which accounts for a large portion of the computation in VLM inference. By optimizing how Vision Encoder operations are distributed across multiple NPU cores, we aim to significantly improve VLM serving throughput.
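The page-based KV-cache management mentioned above follows the same idea popularized by vLLM's PagedAttention: instead of reserving contiguous memory for each sequence's maximum length, fixed-size blocks are handed out on demand through a per-sequence block table. The toy allocator below is an illustrative sketch of that idea only, not the vLLM-RBLN implementation; all names are ours.

```python
class PagedKVCache:
    """Toy page-based KV-cache allocator (illustrative, not vLLM-RBLN code).

    Each sequence owns a block table mapping its logical token positions to
    physical fixed-size blocks, so memory is claimed only as tokens arrive
    and is returned to the free pool when the sequence finishes.
    """

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}                     # seq_id -> [physical block ids]
        self.lengths = {}                    # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve cache space for one more token of this sequence."""
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % self.block_size == 0:  # first token, or current block is full
            if not self.free:
                # In a real server the scheduler would preempt or swap here.
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

# Usage: 5 tokens with block_size=4 occupy exactly 2 blocks.
kv = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(5):
    kv.append_token("cam-01")
print(len(kv.tables["cam-01"]))  # → 2
kv.release("cam-01")
print(len(kv.free))              # → 8
```

Because blocks are only claimed as tokens are generated, many concurrent streams (e.g. per-camera requests) can share the same pool, which is what lets more visual data fit in the same hardware footprint.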
3. Conclusion: Evolving from PoC to a Production-Ready Solution
We are continuously addressing technical challenges discovered during stress testing while refining optimizations that maximize hardware utilization. From parallel processing of the Vision Encoder through close collaboration with Rebellions to the development of an intelligent scheduler within the EVA platform, every step is part of transforming “EVA on NPU” from a simple proof-of-concept (PoC) into a production-ready solution.
Ultimately, the success of AI services depends on meeting three essential conditions: economic efficiency, scalability, and service quality. EVA will continue to actively adopt the latest NPU technologies and present a global standard for Physical AI platforms—delivering the most competitive TCO and outstanding performance for our customers.