EVA GPU MIG/MPS Optimization Guide

June 22, 2026 · 10 min read

Gyulim Gu, Tech Leader
Seongwoo Kong, AI Specialist
Hyunchan Moon, AI Specialist
Jinhong Min, AI Specialist

EVA optimizes not only model inference, but also GPU partitioning and process execution strategies to reliably operate Vision Models and VLMs in large-scale camera environments. What matters in this process is not simply using a more powerful GPU. When Vision Models and VLMs run together, actual service performance can vary significantly depending on how GPU resources are partitioned and how multiple inference processes are executed.

EVA does not rely solely on model optimization or application-level scheduling. It determines the optimal configuration for each deployment environment by considering the number of GPUs installed in the server, GPU memory capacity, MIG partitioning availability, MPS effectiveness, and the placement of Vision Workers and vLLM instances.

In other words, EVA is not just a service that runs AI models. It maximizes system resource efficiency by considering hardware-level GPU configuration and the behavior of the Serving Framework. This allows EVA to reliably process requests from many cameras, even with limited server resources.

In this article, we compare the effects of MIG and MPS from three perspectives, using data from actual EVA experiments:

MIG effectiveness on a multi-GPU server: PRO 5000 x3
MIG effectiveness on a single-GPU server: PRO 6000 x1
MPS effectiveness in an environment with many Vision Workers

Through this analysis, we aim to provide practical criteria for determining which MIG/MPS configuration is most suitable for EVA operation, rather than applying MIG or MPS unconditionally.

MIG (Multi-Instance GPU): A feature that partitions a single physical GPU into multiple independent GPU instances, each with its own dedicated compute and memory resources. It can reduce resource contention by placing Vision and vLLM workloads on separate instances.
MPS (Multi-Process Service): A feature that lets multiple CUDA processes share a GPU by coordinating their requests through a single server process. It can reduce context-switching overhead in multi-process environments.

1. EVA Inference Architecture

An EVA server runs two major inference layers at the same time.

Vision: Processes various object detection models such as RT-DETRV2, Owl-v2, OmDet, and LLMDet in parallel through multiple Worker processes.
vLLM (VLM Serving): Receives Agent requests, decomposes user-defined scenarios into multiple tasks, and determines through multi-step inference whether the scenario condition is met.

The key point in EVA’s inference architecture is that Vision and vLLM use the GPU in different ways.

Category	GPU Usage Pattern
Vision	Many Workers frequently generate short inference requests
vLLM	Processes relatively large inference workloads through continuous batching
Mixed execution	When Vision and vLLM share the same GPU, context-switching overhead can accumulate

For models used by many cameras, EVA assigns more Workers to those models so that inference requests can be processed in parallel by model type. On the other hand, vLLM processes multiple requests together through continuous batching, so simply increasing the number of instances does not always lead to higher throughput.
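As a rough illustration of assigning more Workers to models used by more cameras, the sketch below splits a fixed Worker budget in proportion to per-model camera counts. The function name `allocate_workers`, the camera counts, and the proportional rule itself are assumptions made for this example; the article does not describe EVA's actual allocation logic.

```python
from math import floor


def allocate_workers(cameras_per_model: dict[str, int], total_workers: int) -> dict[str, int]:
    """Split a Worker budget across models in proportion to camera counts.

    Every model gets at least one Worker; the rest is distributed with the
    largest-remainder method. Hypothetical helper, not EVA's real scheduler.
    """
    models = list(cameras_per_model)
    if total_workers < len(models):
        raise ValueError("need at least one Worker per model")

    spare = total_workers - len(models)  # Workers left after the one-per-model floor
    total_cameras = sum(cameras_per_model.values())
    if total_cameras == 0:
        shares = {m: spare / len(models) for m in models}
    else:
        shares = {m: spare * cameras_per_model[m] / total_cameras for m in models}

    alloc = {m: 1 + floor(shares[m]) for m in models}
    leftover = total_workers - sum(alloc.values())
    # Hand the remaining Workers to the models with the largest fractional share.
    by_remainder = sorted(models, key=lambda m: shares[m] - floor(shares[m]), reverse=True)
    for m in by_remainder[:leftover]:
        alloc[m] += 1
    return alloc


# Illustrative camera counts (made up for this example).
demo = allocate_workers({"RT-DETRV2": 10, "Owl-v2": 20, "OmDet": 50, "LLMDet": 20}, 12)
print(demo)
```

The one-Worker floor keeps rarely used models available, while the proportional share sends most of the budget to the model with the most cameras.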

Therefore, EVA determines whether to apply MIG and MPS based on the server configuration and workload characteristics.

2. Experiment Environment

2.1 Server Configuration

The experiment was conducted using the following two GPU server configurations.

Server	GPU Configuration	MIG Configuration
Server A	RTX PRO 5000 48GB x 3	24GB x 6
Server B	RTX PRO 6000 96GB x 1	24GB x 4

2.2 Service Placement

Because Vision and vLLM have different GPU usage patterns, it is important to determine which GPU or MIG instance each workload should be placed on.

In general, Vision handles object detection requests, while vLLM handles the VLM-based reasoning requests issued by the Agent. One GPU or MIG slice is reserved for Vision, and each remaining GPU or MIG slice runs one vLLM instance. When only a single unpartitioned GPU is available, Vision and vLLM share it.

For example, when MIG is applied on the PRO 5000 x3 server, six 24GB MIG slices are created. One slice is assigned to Vision, and one vLLM instance is placed on each of the remaining five slices. The same approach is used on the PRO 6000 x1 server: among four 24GB MIG slices, one is assigned to Vision and the remaining three are assigned to vLLM instances.

Environment	MIG	Placement	Description
PRO 5000 x3	Disabled	Vision / vLLM / vLLM	Among three physical GPUs, one is used for Vision and the other two are used for vLLM instances
PRO 5000 x3	Enabled	Vision / vLLM / vLLM / vLLM / vLLM / vLLM	Among six 24GB MIG slices, one is used for Vision and the remaining five are used for vLLM instances
PRO 6000 x1	Disabled	Vision + vLLM	Vision and vLLM run together on a single 96GB GPU
PRO 6000 x1	Enabled	Vision / vLLM / vLLM / vLLM	Among four 24GB MIG slices, one is used for Vision and the remaining three are used for vLLM instances

This configuration was designed to evaluate whether MIG can improve actual throughput when Vision and vLLM are separated as much as possible and vLLM instances are evenly placed on the remaining GPU resources.
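The placement rule described above (one device for Vision, one vLLM instance on each remaining device) can be sketched as follows. `plan_placement` and `worker_env` are hypothetical helpers and the device identifiers are placeholders. The sketch assumes each process is pinned by setting `CUDA_VISIBLE_DEVICES` to a GPU index or to a MIG device UUID as reported by `nvidia-smi -L`; the article does not state how EVA pins its processes.

```python
def plan_placement(devices: list[str]) -> dict[str, list[str]]:
    """Assign the first device to Vision and one vLLM instance to each remaining device.

    `devices` holds GPU indices or MIG device identifiers. With a single
    device, Vision and vLLM share it (the non-MIG PRO 6000 x1 case).
    """
    if not devices:
        raise ValueError("at least one GPU or MIG slice is required")
    if len(devices) == 1:
        return {"vision": list(devices), "vllm": list(devices)}
    return {"vision": devices[:1], "vllm": devices[1:]}


def worker_env(device: str) -> dict[str, str]:
    """Environment override that restricts a process to one GPU or MIG slice."""
    return {"CUDA_VISIBLE_DEVICES": device}


# Placeholder identifiers; real MIG UUIDs come from `nvidia-smi -L`.
pro5000_mig = [f"MIG-slice-{i}" for i in range(6)]  # 24GB x 6
pro6000_mig = [f"MIG-slice-{i}" for i in range(4)]  # 24GB x 4

scenarios = [
    ("PRO 5000 x3, MIG disabled", ["0", "1", "2"]),
    ("PRO 5000 x3, MIG enabled", pro5000_mig),
    ("PRO 6000 x1, MIG disabled", ["0"]),
    ("PRO 6000 x1, MIG enabled", pro6000_mig),
]
for name, devs in scenarios:
    plan = plan_placement(devs)
    print(f"{name}: Vision on {plan['vision'][0]}, {len(plan['vllm'])} vLLM instance(s)")
```

Each vLLM instance would then be launched with `worker_env(device)` merged into its process environment, so that it only sees its own GPU or slice.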

3. Metric Definitions

In this article, Vision and Agent throughput are compared using the following metrics.

Metric	Unit
Vision throughput	`req/s`
Agent throughput	`req/min`

The conversion formulas are as follows.

Vision req/s = Total Vision requests processed in 1 hour / 3600
Agent req/min = Total VLM responses in 1 hour / 60
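The two conversions can be written as small helpers. This is a minimal sketch: the function names and the `window_s` parameter (which generalizes the one-hour window to other measurement lengths) are assumptions added for illustration.

```python
def vision_req_per_s(total_requests: int, window_s: float = 3600.0) -> float:
    """Vision throughput: requests processed in the window, per second."""
    return total_requests / window_s


def agent_req_per_min(total_responses: int, window_s: float = 3600.0) -> float:
    """Agent throughput: VLM responses in the window, per minute."""
    return total_responses / (window_s / 60.0)


# 1,032 VLM responses in one hour (PRO 6000 x1 with MIG, Section 5.1).
print(f"{agent_req_per_min(1032):.2f} req/min")
```

With the default one-hour window, the helpers reduce to the divisions by 3600 and 60 given above.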

4. MIG Effect on PRO 5000 x3

4.1 Measurement Results

MIG	MPS	VLM responses	VLM Latency	Vision Throughput (`req/s`)	Agent Throughput (`req/min`)
Disabled	Disabled	2,287	10.39 s	29.36	38.11
Enabled	Enabled	2,229	10.84 s	22.63	37.15

4.2 Interpretation

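In the multi-GPU PRO 5000 x3 configuration, using MIG to increase the number of vLLM instances from two to five did not improve vLLM throughput. Agent throughput was essentially unchanged (from 38.11 to 37.15 req/min, about -2.5%), and VLM latency rose slightly from 10.39 s to 10.84 s. Note that the MIG-enabled run also had MPS enabled, so the comparison reflects both settings together.

Because vLLM was already handling concurrent requests efficiently through continuous batching, adding instances did not translate into higher throughput. Vision throughput, meanwhile, fell from 29.36 to 22.63 req/s (about -22.9%). With MIG, Vision moved from a full 48GB GPU to a single 24GB slice, and this reduction in per-instance resources, together with the overall change in workload placement, appears to have contributed to the decrease.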

From an operational perspective, the following factors should also be considered when applying MIG.

MIG partitioning policy management
Instance-level monitoring
Reassignment and recovery procedures in case of failure
Workload-specific instance size adjustment

Therefore, in a multi-GPU environment such as PRO 5000 x3, it is more appropriate to start without MIG by default and consider MIG only when resource contention between Vision and vLLM is clearly identified.

5. MIG Effect on PRO 6000 x1

5.1 Measurement Results

MIG	MPS	VLM responses	VLM Latency	Vision Throughput (`req/s`)	Agent Throughput (`req/min`)
Disabled	Disabled	712	47.20 s	20.33	11.87
Enabled	Enabled	1,032	32.10 s	26.33	17.20

5.2 Interpretation

In the single-GPU PRO 6000 x1 configuration, MIG had a clear positive effect. The MIG-enabled run also had MPS enabled, so the figures in this section reflect both settings together.

Without MIG, Vision Workers and vLLM share a single physical GPU. Many Vision Workers repeatedly issue short inference requests while vLLM processes relatively large inference workloads, so the GPU can switch frequently between the contexts of different processes.

With MIG, the GPU resources used by Vision and vLLM are isolated into independent hardware-level instances. This reduces resource contention between the workloads and allows each inference pipeline to run more stably.

Metric	MIG Disabled	MIG Enabled	Change
VLM responses	712	1,032	+44.9%
VLM Latency	47.20 s	32.10 s	-32.0%
Agent Throughput	11.87 req/min	17.20 req/min	+44.9%
Vision Throughput	20.33 req/s	26.33 req/s	+29.5%
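The Change column is the relative difference between the MIG-disabled and MIG-enabled values. The sketch below recomputes it from the table; `pct_change` is a helper name introduced here for illustration.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (MIG disabled, MIG enabled) pairs copied from the table above.
rows = {
    "VLM responses": (712, 1032),
    "VLM Latency (s)": (47.20, 32.10),
    "Agent Throughput (req/min)": (11.87, 17.20),
    "Vision Throughput (req/s)": (20.33, 26.33),
}
for metric, (disabled, enabled) in rows.items():
    print(f"{metric}: {pct_change(disabled, enabled):+.1f}%")
```

For latency a negative value is an improvement, while for the other three metrics a positive value is.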

These results show that MIG can be an effective option in high-density environments where Vision and vLLM must run together on a single GPU. In particular, the more different the GPU usage patterns of Vision and vLLM are, the greater the benefit of hardware-level isolation through MIG can be.

6. MPS Effect in an Environment with Many Vision Workers

Vision models run through multiple Worker processes that send requests to the GPU at the same time. When the number of Workers increases, GPU ownership can switch frequently between processes.

By applying MPS, multiple CUDA process requests can be coordinated through a single MPS server. This can reduce context-switching overhead and improve GPU utilization in multi-process environments.
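As a minimal sketch of how MPS is typically enabled, the helpers below start the MPS control daemon with `nvidia-cuda-mps-control -d` and build the environment that client processes must share to reach it. The directory paths and function names are assumptions for illustration, and the article does not describe EVA's actual startup procedure. The daemon calls require an NVIDIA driver, so the demo at the end only builds the environment.

```python
import os
import subprocess


def mps_env(pipe_dir: str, log_dir: str, device: str = "0") -> dict[str, str]:
    """Environment shared by the MPS control daemon and its client processes."""
    env = dict(os.environ)
    env.update({
        "CUDA_VISIBLE_DEVICES": device,
        "CUDA_MPS_PIPE_DIRECTORY": pipe_dir,  # clients locate the daemon through this
        "CUDA_MPS_LOG_DIRECTORY": log_dir,
    })
    return env


def start_mps_daemon(env: dict[str, str]) -> None:
    """Start the MPS control daemon in background mode (needs an NVIDIA driver)."""
    os.makedirs(env["CUDA_MPS_PIPE_DIRECTORY"], exist_ok=True)
    os.makedirs(env["CUDA_MPS_LOG_DIRECTORY"], exist_ok=True)
    subprocess.run(["nvidia-cuda-mps-control", "-d"], env=env, check=True)


def stop_mps_daemon(env: dict[str, str]) -> None:
    """Shut the daemon down by sending it the `quit` command on stdin."""
    subprocess.run(["nvidia-cuda-mps-control"], input="quit\n", text=True,
                   env=env, check=True)


if __name__ == "__main__":
    # Hypothetical paths; only the environment is built here, nothing is launched.
    env = mps_env("/tmp/eva-mps/pipe", "/tmp/eva-mps/log")
    print(env["CUDA_MPS_PIPE_DIRECTORY"])
```

Vision Workers launched with the same environment connect to the running daemon automatically; no change to the inference code is needed.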

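In this experiment, we compared the effect of MPS on the PRO 5000 environment without applying MIG. The totals and rates in the table are consistent with a 300-second measurement window (for example, 7,131 requests / 23.770 req/s = 300 s) rather than the one-hour window used in Section 3.

MIG	MPS	Total requests	Total throughput	RT-DETRV2	Owl-v2	OmDet	LLMDet
Disabled	Disabled	7,131	23.770 req/s	4.337 req/s	5.597 req/s	10.953 req/s	2.883 req/s
Disabled	Enabled	7,794	25.980 req/s	4.490 req/s	7.447 req/s	11.120 req/s	2.923 req/s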

With MPS enabled, total Vision throughput increased by approximately 9.3%, from 23.770 req/s to 25.980 req/s.

Among the models, Owl-v2 showed the largest throughput improvement (from 5.597 to 7.447 req/s, about +33.1%), while RT-DETRV2, OmDet, and LLMDet showed slight improvements of roughly 1.4% to 3.5%. This suggests that MPS can help coordinate multi-process requests more stably in environments with many Vision Workers, although the size of the benefit varies by model.
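To see how the total gain is distributed across models, the sketch below computes each model's relative gain and its share of the overall increase from the table values. The variable and function names are introduced here for illustration.

```python
# Per-model Vision throughput in req/s, copied from the table above.
baseline = {"RT-DETRV2": 4.337, "Owl-v2": 5.597, "OmDet": 10.953, "LLMDet": 2.883}
with_mps = {"RT-DETRV2": 4.490, "Owl-v2": 7.447, "OmDet": 11.120, "LLMDet": 2.923}


def gain_breakdown(before: dict[str, float], after: dict[str, float]) -> dict[str, tuple[float, float]]:
    """Return {model: (relative gain %, share of total gain %)}."""
    total_gain = sum(after[m] - before[m] for m in before)
    return {
        m: ((after[m] - before[m]) / before[m] * 100.0,
            (after[m] - before[m]) / total_gain * 100.0)
        for m in before
    }


breakdown = gain_breakdown(baseline, with_mps)
for model, (relative, share) in breakdown.items():
    print(f"{model}: {relative:+.1f}% throughput, {share:.1f}% of the total gain")

total_before = sum(baseline.values())
total_after = sum(with_mps.values())
print(f"Total: {(total_after - total_before) / total_before * 100.0:+.1f}%")
```

In this data set Owl-v2 accounts for most of the overall increase, so the aggregate figure should not be read as a uniform per-model speedup.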

7. Final Conclusion

This experiment confirmed that MIG and MPS are not features that should be applied uniformly in every environment. Instead, they should be applied selectively depending on the GPU configuration and workload characteristics.

7.1 Multi-GPU Server: PRO 5000 x3

In an environment such as PRO 5000 x3, where multiple physical GPUs are available, Vision and vLLM can be separated at the physical GPU level. In this case, additionally applying MIG to increase the number of vLLM instances did not improve throughput in our measurements.

vLLM already handles concurrent requests efficiently through continuous batching
Increasing the number of instances does not directly lead to higher throughput
MIG partitioning can increase operational complexity and resource fragmentation
The default recommendation is to start without MIG

7.2 Single-GPU Server: PRO 6000 x1

In an environment such as PRO 6000 x1, where Vision and vLLM must run together on a single physical GPU, MIG showed a significant effect.

Vision and vLLM have different GPU usage patterns
Sharing a single GPU can increase context-switching overhead
MIG reduces resource contention between workloads
MIG should be considered first for high-density single-GPU configurations

7.3 Environment with Many Vision Workers

MPS can be effective in environments with many Vision Workers.

Many Workers generate GPU requests concurrently
GPU ownership switching between processes can increase overhead
MPS increased total Vision throughput by approximately 9.3%
MPS should be considered as a default option for Vision-heavy servers

8. Operational Recommendations

Operating Environment	Recommended Configuration
Multi-GPU server focused on vLLM	Start without MIG and introduce MIG only when a clear bottleneck is identified
Single-GPU server with mixed Vision/VLM workloads	Consider MIG first
Server with many Vision Workers	Consider applying MPS
Server where Vision and vLLM can be separated by physical GPU	Prioritize physical GPU-level separation
Server with insufficient GPU memory	Prioritize model placement and Worker count adjustment before MIG

In summary, EVA does not treat MIG and MPS as simple on/off features. It selects the optimal configuration for each environment by considering server architecture, Vision/VLM placement, Worker count, vLLM execution behavior, and operational complexity.
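The recommendation table can be read as a small set of rules. The sketch below encodes it as a function; the `ServerProfile` fields, the rule order, and the wording of the returned strings are assumptions for illustration and do not represent EVA's actual configuration logic.

```python
from dataclasses import dataclass


@dataclass
class ServerProfile:
    """Hypothetical description of a deployment target."""
    physical_gpus: int
    runs_vision: bool
    runs_vllm: bool
    many_vision_workers: bool
    gpu_memory_sufficient: bool = True
    contention_observed: bool = False


def recommend(profile: ServerProfile) -> list[str]:
    """Return the recommendations from the table that apply to this server."""
    advice: list[str] = []
    if not profile.gpu_memory_sufficient:
        advice.append("Adjust model placement and Worker count before MIG")

    mixed = profile.runs_vision and profile.runs_vllm
    if mixed and profile.physical_gpus == 1:
        advice.append("Consider MIG first")
    elif mixed and profile.physical_gpus > 1:
        advice.append("Separate Vision and vLLM by physical GPU")
        if profile.contention_observed:
            advice.append("Introduce MIG where the bottleneck was identified")
        else:
            advice.append("Start without MIG")

    if profile.many_vision_workers:
        advice.append("Consider applying MPS")
    return advice


# Roughly the two servers from this article.
print(recommend(ServerProfile(3, True, True, many_vision_workers=True)))
print(recommend(ServerProfile(1, True, True, many_vision_workers=True)))
```

Keeping the rules in one function makes the decision auditable, but the final choice should still be validated with measurements like those in Sections 4 to 6.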

9. References

The following materials were referenced when organizing the analysis framework for this article.

NVIDIA Technical Blog: Getting the Most Out of the NVIDIA A100 GPU with Multi-Instance GPU https://developer.nvidia.com/blog/getting-the-most-out-of-the-a100-gpu-with-multi-instance-gpu/
NVIDIA Technical Blog: Boost GPU Memory Performance with No Code Changes Using NVIDIA CUDA MPS https://developer.nvidia.com/blog/boost-gpu-memory-performance-with-no-code-changes-using-nvidia-cuda-mps/
NVIDIA Documentation: Multi-Process Service (MPS) https://docs.nvidia.com/deploy/mps/latest/index.html
vLLM Official Blog https://vllm.ai/blog
Anyscale Technical Blog: vLLM Throughput Analysis Based on Continuous Batching https://www.anyscale.com/blog/continuous-batching-llm-inference
