EVA Agent Dependencies
This is an installation guide for the dependencies required to run EVA Agent.
Before installing EVA Agent, you must first set up Qdrant (vector DB) for data storage and vLLM for model inference.
This guide explains how to install these foundational dependency packages.
Understanding the Installation Structure
To ensure a successful installation, please check the dependencies and the required order between packages.
- eva-agent-init: Defines the Storage Class. (Must be installed first)
- qdrant / vllm: Use the storage class defined above to store data.
- eva-agent: Installed last, after the above services are fully prepared.
Prerequisites
Please verify that the required CLI tools are installed.
- kubectl: cluster control tool
- helm: package management tool
- kustomize: configuration customization tool (required for post-rendering)
```bash
# Install the kustomize binary (Linux)
curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
chmod +x kustomize
sudo mv kustomize /usr/local/bin/
kustomize version
```
On-premise setup (not required when using cloud services such as AWS or NCP)
- Install k3s
```bash
curl -sfL https://get.k3s.io | sudo sh -s - --docker
mkdir -p $HOME/.kube
sudo cp /etc/rancher/k3s/k3s.yaml $HOME/.kube/config
sudo chown $(id -un):$(id -gn) $HOME/.kube/config
kubectl version
```
- Install the NFS CSI Driver
```bash
helm repo add csi-driver-nfs https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts
helm repo update
helm install csi-driver-nfs csi-driver-nfs/csi-driver-nfs --namespace kube-system --version v4.11.0
```
- Install the NFS server and expose the share directory
```bash
sudo apt update
sudo apt install nfs-kernel-server -y
NFS_SHARE_PATH=/data001/share/eva-agent
sudo mkdir -p ${NFS_SHARE_PATH}

# Create cache directories for EVA Agent / vLLM
# Set ownership/permissions on the NFS server
sudo mkdir -p ${NFS_SHARE_PATH}/agent-cache ${NFS_SHARE_PATH}/vllm-cache
sudo chown -R 10001:10001 ${NFS_SHARE_PATH}
sudo chmod -R 0775 ${NFS_SHARE_PATH}

# Allow the NFS share only for localhost (assumes single-node k3s)
# Change the IP address if you need to share with a different node
echo "${NFS_SHARE_PATH} 127.0.0.1(rw,sync,no_subtree_check,root_squash,anonuid=10001,anongid=10001)" | sudo tee -a /etc/exports
sudo exportfs -ra
sudo systemctl restart nfs-kernel-server

# Test the export
showmount -e localhost
sudo mkdir -p /mnt/tmp && sudo mount -t nfs -o rw,nfsvers=4 localhost:${NFS_SHARE_PATH} /mnt/tmp
sudo umount /mnt/tmp
```
Register and Update Helm Repositories
Register all required open-source repositories and keep them up to date. Also create the namespace and service account for the EVA Agent installation in advance.
```bash
# 1. Add each repository
helm repo add qdrant https://qdrant.github.io/qdrant-helm
helm repo add eva-agent https://mellerikat.github.io/eva-agent

# 2. Update to the latest repository information
helm repo update

# 3. Create the namespace and service account for EVA Agent
kubectl create namespace eva-agent
kubectl create serviceaccount sa-eva-agent -n eva-agent
```
Step 1: Install eva-agent-init
This package pre-defines a common Storage Class so that Qdrant and vLLM (installed later) can store data reliably.
- Package role:
  - It configures a dedicated storage class to ensure `eva-agent-vllm` and `eva-agent-qdrant` can store data safely.
  - Therefore, it must be installed first, before any other packages.
Download Configuration Files (eva-agent-init)
Download the following configuration files from the GitHub repository, choosing the file that matches your environment.
Values templates are organized by EVA Agent version (image tag). Helm charts and images are versioned separately.
```bash
# Example: download k3s values
RELEASE_VERSION="2.6.0"
BASE_URL="https://raw.githubusercontent.com/mellerikat/eva-agent/chartmuseum/release/${RELEASE_VERSION}"
mkdir -p eva-agent-init
curl -L \
  "$BASE_URL/eva-agent-init/values-k3s.yaml" \
  -o eva-agent-init/values-k3s.yaml
```
If you use k3s + NFS, the share directory in values-k3s.yaml must match the export directory on the NFS server (e.g., `${NFS_SHARE_PATH}`).
Model caches can require tens of GB (or more), so use an NFS path on a disk with sufficient capacity (a dedicated data-disk mount is recommended if available).
```bash
# (Optional) k3s + NFS: set the share path to your NFS export directory
NFS_SHARE_PATH="/data001/share/eva-agent"
# Update storageClass.fileSystem.parameters.share
sed -i "s|^[[:space:]]*share:.*|      share: $NFS_SHARE_PATH|" eva-agent-init/values-k3s.yaml

# Example: install for a k3s environment
helm install eva-agent-init eva-agent/eva-agent-init \
  --version=1.0.0 \
  -n eva-agent \
  -f eva-agent-init/values-k3s.yaml
```
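To sanity-check the `sed` substitution, you can run it against a throwaway file first. The YAML nesting below is an assumption derived from the `storageClass.fileSystem.parameters.share` path, not the actual template contents:

```bash
# Create a sample file with the assumed nesting of values-k3s.yaml
cat > /tmp/values-k3s-sample.yaml <<'EOF'
storageClass:
  fileSystem:
    parameters:
      share: /placeholder
EOF

NFS_SHARE_PATH="/data001/share/eva-agent"
# Same substitution as above, applied to the sample file
sed -i "s|^[[:space:]]*share:.*|      share: $NFS_SHARE_PATH|" /tmp/values-k3s-sample.yaml
grep "share:" /tmp/values-k3s-sample.yaml
```

The `grep` line should print the updated `share:` entry with your NFS path.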
Step 2: Configure eva-agent-qdrant values
Install the Qdrant DB to store vector data.
- values templates: Find required templates in the GitHub repository.
- Detailed value descriptions: See the Artifact Hub Qdrant page for full parameters.
💡 Note: PVCs/PVs are created using the Storage Class defined in `eva-agent-init`. Existing PVs are preserved during reinstallation; if you wish to remove them entirely, you must delete them manually with `kubectl delete`.
Update Settings for Your Environment
| Category | Name | Description | Value |
|---|---|---|---|
| General | nameOverride | Chart name override | eva-agent-qdrant |
| General | fullnameOverride | Full resource name override | "" |
| ServiceAccount | serviceAccount.create | Create ServiceAccount | false |
| ServiceAccount | serviceAccount.name | ServiceAccount name | sa-eva-agent |
| Image | image.pullPolicy | Image pull policy | Always |
| Storage | persistence.accessModes | PVC access modes | ["ReadWriteOnce"] |
| Storage | persistence.size | PVC size allocated to Qdrant | 10Gi |
| Storage | persistence.annotations | Annotations for PVC/PV (add if needed) | {} |
| Storage | persistence.storageVolumeName | (Depends on environment) PV/volume identifier name | eva-agent-qdrant-storage |
| Storage | persistence.storageClassName | StorageClass name | eva-agent-sc-bs |
| Snapshot | snapshotPersistence.enabled | Enable dedicated snapshot PVC persistence | true |
| Snapshot | snapshotPersistence.accessModes | Snapshot PVC access modes | ["ReadWriteOnce"] |
| Snapshot | snapshotPersistence.size | Snapshot PVC size | 10Gi |
| Snapshot | snapshotPersistence.annotations | Annotations for snapshot PVC/PV | {} |
| Snapshot | snapshotPersistence.snapshotsVolumeName | PV/volume identifier for snapshot storage | eva-agent-qdrant-snapshots |
| Snapshot | snapshotPersistence.storageClassName | StorageClass name for snapshot PVC | eva-agent-sc-bs |
| Snapshot | snapshotRestoration.enabled | Optional advanced setting to mount an external restore-source PVC (commented by default) | false (commented) |
| Snapshot | snapshotRestoration.pvcName | External restore-source PVC name (optional, commented by default) | qdrant-snapshot-restore-pvc |
| Snapshot | snapshotRestoration.mountPath | Mount path for external restore source (optional, commented by default) | /qdrant/snapshot-restoration |
| Snapshot | snapshotRestoration.snapshots | Snapshot file list to restore (optional, commented by default) | [] |
| Scheduling | nodeSelector | Node label selector (schedule only on specific nodes) | |
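For orientation, the dotted names in the table map onto nested YAML. The fragment below is a sketch assembled from the table values, not the shipped template:

```yaml
nameOverride: eva-agent-qdrant
serviceAccount:
  create: false
  name: sa-eva-agent
image:
  pullPolicy: Always
persistence:
  accessModes: ["ReadWriteOnce"]
  size: 10Gi
  storageClassName: eva-agent-sc-bs
snapshotPersistence:
  enabled: true
  accessModes: ["ReadWriteOnce"]
  size: 10Gi
  storageClassName: eva-agent-sc-bs
```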
Download Configuration Files (eva-agent-qdrant)
Download the values and post-renderer plugin templates for Qdrant.
- Common values template
- AWS values template
- NCP values template
- Post renderer template
- Post renderer plugin metadata
```bash
RELEASE_VERSION="2.6.0"
BASE_URL="https://raw.githubusercontent.com/mellerikat/eva-agent/chartmuseum/release/${RELEASE_VERSION}"

# Prepare directories
mkdir -p eva-agent eva-agent-init eva-agent-qdrant eva-agent-vllm \
  plugin/eva-agent-qdrant

# Qdrant values & post-renderer plugin
curl -L "$BASE_URL/eva-agent-qdrant/values.yaml" -o eva-agent-qdrant/values.yaml
curl -L "$BASE_URL/eva-agent-qdrant/values-aws.yaml" -o eva-agent-qdrant/values-aws.yaml
curl -L "$BASE_URL/eva-agent-qdrant/values-ncp.yaml" -o eva-agent-qdrant/values-ncp.yaml
curl -L "$BASE_URL/plugins/eva-agent-qdrant/post-renderer.sh" -o plugin/eva-agent-qdrant/post-renderer.sh
curl -L "$BASE_URL/plugins/eva-agent-qdrant/plugin.yaml" -o plugin/eva-agent-qdrant/plugin.yaml
```
Qdrant recovery-related values
This installation guide does not include restore procedures. Use the value settings below as operational references:
- `snapshotPersistence.enabled`: keep `true` so `/qdrant/snapshots` is persisted on a PVC.
- `snapshotRestoration.enabled`: optional advanced setting to mount an external PVC as a recovery source path.
- `snapshotRestoration.pvcName`: external PVC name, used only when `snapshotRestoration.enabled=true`.
- `snapshotRestoration.mountPath`: mount path for the external recovery source files.
- `args`: keep the chart default startup command `["./config/initialize.sh"]`.
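Putting those recovery values together, a sketch with restoration left disabled/commented (field nesting assumed from the value names above):

```yaml
snapshotPersistence:
  enabled: true
args: ["./config/initialize.sh"]
# snapshotRestoration:
#   enabled: true
#   pvcName: qdrant-snapshot-restore-pvc
#   mountPath: /qdrant/snapshot-restoration
#   snapshots: []
```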
Qdrant and vLLM installation will be performed together in the script execution step below.
Step 3: Configure eva-agent-vllm values
Install vLLM, the model inference server. (Agent image version 2.2-a2.0 or later is required)
- values templates: available in the GitHub repository.
- Chart source: this guide installs the custom chart `eva-agent/eva-agent-vllm` from the `eva-agent` Helm repository.
- Detailed value descriptions: use the release templates in the `eva-agent-vllm` directory first; the Artifact Hub vLLM-stack page can serve as an optional engine-level reference.
- Key setting: for smooth dynamic allocation of PVCs/PVs, make sure `nodeSelector` is configured correctly.
💡 Note: PVCs/PVs are created using the Storage Class defined in `eva-agent-init`. Existing PVs are preserved during reinstallation; if you wish to remove them entirely, you must delete them manually with `kubectl delete`.
Update Settings for Your Environment
The table below lists the key settings; update them to match your deployment environment.
| Category | Name | Description | Value | Notes |
|---|---|---|---|---|
| Router | routerSpec.enableRouter | Enable Model Router | true | See config guide |
| Router | routerSpec.routingLogic | Routing strategy (roundrobin or session; session sticks by key, otherwise lowest QPS) | "roundrobin" | |
| Router | routerSpec.serviceDiscovery | Backend discovery mode | "static" | See config guide |
| Router | routerSpec.staticBackends | Static backend endpoints | "http://eva-agent-vllm-qwen3-vl-8b-fp8-engine-service.eva-agent.svc.cluster.local, http://external.vllm.ip:port" | See config guide |
| Router | routerSpec.staticModels | Static backend model mapping | "qwen3-vl-8b-instruct-fp8,Exaone4.0" | See config guide |
| Router | routerSpec.serviceType | Router Service type | "ClusterIP" | Default is no external exposure |
| Router | routerSpec.servicePort | Router Service port | 80 | |
| Router | routerSpec.containerPort | Router container port | 8000 | |
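As nested YAML, the router settings from the table look like the fragment below. This is a sketch assembled from the table (the two-backend example mirrors the table's `staticBackends` row), not the shipped template:

```yaml
routerSpec:
  enableRouter: true
  routingLogic: "roundrobin"
  serviceDiscovery: "static"
  staticBackends: "http://eva-agent-vllm-qwen3-vl-8b-fp8-engine-service.eva-agent.svc.cluster.local,http://external.vllm.ip:port"
  staticModels: "qwen3-vl-8b-instruct-fp8,Exaone4.0"
  serviceType: "ClusterIP"
  servicePort: 80
  containerPort: 8000
```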
Value configuration guide
- `routerSpec.enableRouter`
  - The default is enabled. In production, we recommend keeping it `true` unless there is a specific reason to disable it.
  - The Router does not create new engines; it routes requests only to existing vLLM endpoints.
  - These endpoints can be engine pods deployed together in the same Helm release, or a different cluster / a different deployment.
  - They do not have to be Kubernetes pods. Any HTTP-reachable vLLM endpoint works: Docker containers, bare-metal processes, Python servers, etc.
  - If you run multiple models on a single node, or if engines are distributed, the Router is effectively required.
  - In short, the Router is a reverse proxy that groups "vLLM endpoints accessible anywhere" into a single entrypoint and distributes traffic.
- `routerSpec.serviceDiscovery`
  - Supported values: `k8s`, `static`.
  - The recommended value is `static`. Even if you use only the same cluster, `static` works and results in a predictable production setup.
  - Configure vLLM engines to be exposed via a non-headless Service (ClusterIP/NodePort). Since the internal DNS acts as the primary entrypoint, you can track endpoints using `static` even within the same cluster. From an ops perspective, there is no major difference compared to `k8s`; it is mainly a management preference.
  - `k8s` is advantageous in environments where endpoints change frequently (autoscaling, frequent rolling updates); it follows Endpoints automatically.
  - `static` is suitable when routing targets are clearly fixed or when you need to include external endpoints. When targets change, you must update the list manually.
- `routerSpec.routingLogic`
  - `roundrobin`: evenly distributes in list order.
  - `session`: sticky routing by session key; if no key is provided, it selects the engine with the lowest QPS.
  - This is not resource-based routing. If you need weighted distribution, repeat the target URL in `staticBackends` (e.g., 6:1).
- `routerSpec.staticBackends`
  - A single-line string separated by commas (`,`).
  - If you are in the same cluster, do not use a headless Service. Use a ClusterIP/NodePort Service as the entrypoint.
  - Service DNS is constructed from the model name: `http://eva-agent-vllm-<servingEngineSpec.modelSpec[].name>-engine-service.<namespace>.svc.cluster.local`
    - Example (model name `qwen3-vl-8b-fp8`): `http://eva-agent-vllm-qwen3-vl-8b-fp8-engine-service.eva-agent.svc.cluster.local`
  - Always include the namespace. The Router URL validation rejects hostnames without a dot (`.`) as invalid.
  - External endpoint example: `http://203.0.113.10/v1`
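The naming scheme above can be scripted to avoid typos. A minimal sketch (the helper name is ours; the URL pattern is the one described in this guide):

```bash
# Build the Router backend URL for an in-cluster engine Service from the
# modelSpec name and the namespace
engine_service_url() {
  local model_name="$1" namespace="$2"
  echo "http://eva-agent-vllm-${model_name}-engine-service.${namespace}.svc.cluster.local"
}

engine_service_url "qwen3-vl-8b-fp8" "eva-agent"
# http://eva-agent-vllm-qwen3-vl-8b-fp8-engine-service.eva-agent.svc.cluster.local
```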
- `routerSpec.staticModels`
  - List model names separated by commas (`,`) in the same order as `staticBackends`.
  - The model name must exactly match the `--served-model-name` value in `servingEngineSpec.modelSpec[].vllmConfig.extraArgs`.
  - If unsure, check the running endpoint with `curl <endpoint>/v1/models` (pod/container/external server all work). The model name is the `id` field in the response, e.g. `{"object":"list","data":[{"id":"qwen3-vl-8b-instruct-fp8"...}]}`.
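The `id` fields can be pulled out of that response with standard shell tools. A sketch with a sample response inlined in place of the live `curl` call:

```bash
# Sample /v1/models response; in practice: response=$(curl -s <endpoint>/v1/models)
response='{"object":"list","data":[{"id":"qwen3-vl-8b-instruct-fp8","object":"model"}]}'

# Extract the "id" fields, which are the names staticModels must match
model_ids=$(echo "$response" | grep -o '"id":"[^"]*"' | sed 's/.*:"\(.*\)"/\1/')
echo "$model_ids"
# qwen3-vl-8b-instruct-fp8
```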
- `servingEngineSpec.modelSpec[].name`
  - This value is used to generate the service DNS for `routerSpec.staticBackends`.
  - We recommend not changing it for the same model. (If you change the model, the Hugging Face repo changes, so the model name must change as well.)
- `servingEngineSpec.modelSpec[].modelURL`
  - Enter the Hugging Face repo path of the model to serve (example: `Qwen/Qwen3-VL-8B-Instruct-FP8`).
- `servingEngineSpec.modelSpec[].vllmConfig.tensorParallelSize`
  - Meaning: the number of GPUs used by one model (engine pod).
  - Overhead: as TP increases, GPU-to-GPU communication cost increases. Set it to the minimum value that satisfies memory/throughput needs.
  - Constraint: `tensorParallelSize` must divide the model's number of attention heads (`num_attention_heads`).
  - See the server-specific deployment guide for detailed setups.
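The divisibility constraint is easy to pre-check before editing values. A sketch (the head count `32` is a hypothetical example; read the real value from the model's `config.json` on Hugging Face):

```bash
# Hypothetical attention head count for illustration
num_attention_heads=32

# Report whether a candidate tensorParallelSize divides the head count
check_tp() {
  if [ $(( num_attention_heads % $1 )) -eq 0 ]; then
    echo "tensorParallelSize=$1: ok"
  else
    echo "tensorParallelSize=$1: invalid (must divide $num_attention_heads)"
  fi
}

check_tp 4   # tensorParallelSize=4: ok
check_tp 3   # tensorParallelSize=3: invalid (must divide 32)
```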
- `servingEngineSpec.modelSpec[].vllmConfig.gpuMemoryUtilization`
  - Controls the upper bound of GPU memory usage.
  - Do not set it to `1.0`. The KV cache does not include embedding memory for prefix/multimodal inputs, so you need headroom.
  - Use a higher value when the GPU is dedicated (default `0.9`), and a lower value when sharing with EVA Vision.
  - For a 24GB GPU, we recommend starting in the `0.8` to `0.85` range.
  - See the server-specific deployment guide for concrete examples.
- `servingEngineSpec.modelSpec[].vllmConfig.extraArgs`
  - `--served-model-name` must match `staticModels`.
  - `kv-cache-dtype` depends on the GPU architecture; for pre-Ada GPUs, use `auto`.
- `servingEngineSpec.modelSpec[].requestCPU`
  - CPU is independent of GPU. Allocate enough for tokenizer/workers/I/O.
  - If co-located on the same node as EVA Vision, leave CPU headroom for the EVA Vision pod as well.
- `servingEngineSpec.modelSpec[].requestMemory`
  - This is system RAM (not VRAM). If set too low, the pod can OOM even when GPU memory is sufficient.
  - If KV cache pressure is high, keep `maxModelLen` at 12K and increase RAM or scale up GPU/TP.
- `servingEngineSpec.modelSpec[].requestGPU`
  - Important: specifying `requestGPU` makes Kubernetes allocate the GPU exclusively.
  - If you need GPU sharing, comment out `requestGPU` and tune with `gpuMemoryUtilization`/`maxModelLen`.
  - If EVA Vision uses only part of the GPU, you can run multiple engines with fractional GPU only when MIG/time-slicing is enabled. (If disabled, only integer GPU counts are supported.)
  - For dedicated use, keep `requestGPU` as an integer and tune with `gpuMemoryUtilization`. Values like `0.9` cannot be used for `requestGPU` (except with MIG/time-slicing).
  - See the server-specific deployment guide for server-specific configurations.
Server-specific deployment guide
- Organize settings based on the actual servers where EVA is deployed.
- For all cases below, the order of `routerSpec.staticBackends` and `routerSpec.staticModels` must match.
The cases below cover, in order:
- RTX A6000 x1 (48GB)
- L40s x1 (48GB)
- RTX 4090 x3 (24GB)
- RTX PRO 5000 x3 (48GB)
- Additional cases
RTX A6000 x1 (48GB)
- Environment
  - Share the GPU with EVA Vision
  - Connect one additional external vLLM pod (the Router routes Qwen3‑VL 8B & Exaone 4.0)
- Required settings
  - `routerSpec.enableRouter: true`
  - `routerSpec.serviceDiscovery: "static"`
  - `routerSpec.staticBackends`: comma-separated internal engine + external engine URLs
    - Internal engine example: `http://eva-agent-vllm-qwen3-vl-8b-fp8-engine-service.<namespace>.svc.cluster.local`
    - External engine example: `http://203.0.113.10/v1`
  - `routerSpec.staticModels`: `qwen3-vl-8b-instruct-fp8,Exaone4.0`
  - `servingEngineSpec.modelSpec[].name`: `qwen3-vl-8b-fp8`
  - `servingEngineSpec.modelSpec[].vllmConfig.extraArgs`: `--served-model-name qwen3-vl-8b-instruct-fp8`, `--kv-cache-dtype auto`
  - `servingEngineSpec.modelSpec[].vllmConfig.tensorParallelSize: 1`
  - `servingEngineSpec.modelSpec[].requestGPU`: comment out (GPU shared with EVA Vision)
  - `servingEngineSpec.modelSpec[].vllmConfig.gpuMemoryUtilization: 0.7` (assuming EVA Vision uses 10GB)
  - `servingEngineSpec.modelSpec[].vllmConfig.maxModelLen: 12288`
- Checkpoints
  - For the external engine, verify that the `/v1/models` `id` is `Exaone4.0`
  - `routerSpec.staticBackends` only allows URLs that contain a dot (`.`) (service/domain must be explicit)
  - Calculation: 48GB - 10GB (EVA Vision) = 38GB → 38/48 ≈ 0.79; set to `0.7` for operational headroom
  - For Ampere GPUs, use `--kv-cache-dtype auto`
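The "must contain a dot" rule can be pre-checked locally before editing `staticBackends`. The Router's actual validation logic is internal, so treat this only as a convenience sketch:

```bash
# Report whether a backend URL's hostname contains a dot,
# mirroring the Router's hostname rule described above
backend_host_has_dot() {
  host=$(echo "$1" | sed -E 's|^https?://([^/:]+).*|\1|')
  case "$host" in
    *.*) echo "valid" ;;
    *)   echo "invalid" ;;
  esac
}

backend_host_has_dot "http://eva-agent-vllm-qwen3-vl-8b-fp8-engine-service.eva-agent.svc.cluster.local"
# valid
backend_host_has_dot "http://engine-service"
# invalid
```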
L40s x1 (48GB)
- Environment
  - Share the GPU with EVA Vision
  - Connect one additional external vLLM pod (the Router routes Qwen3‑VL 8B & Exaone 4.0)
- Required settings
  - `routerSpec.enableRouter: true`
  - `routerSpec.serviceDiscovery: "static"`
  - `routerSpec.staticBackends`: comma-separated internal engine + external engine URLs
    - Internal engine example: `http://eva-agent-vllm-qwen3-vl-8b-fp8-engine-service.<namespace>.svc.cluster.local`
    - External engine example: `http://203.0.113.10/v1`
  - `routerSpec.staticModels`: `qwen3-vl-8b-instruct-fp8,Exaone4.0`
  - `servingEngineSpec.modelSpec[].name`: `qwen3-vl-8b-fp8`
  - `servingEngineSpec.modelSpec[].vllmConfig.extraArgs`: `--served-model-name qwen3-vl-8b-instruct-fp8`, `--kv-cache-dtype fp8`
  - `servingEngineSpec.modelSpec[].vllmConfig.tensorParallelSize: 1`
  - `servingEngineSpec.modelSpec[].requestGPU`: comment out (GPU shared with EVA Vision)
  - `servingEngineSpec.modelSpec[].vllmConfig.gpuMemoryUtilization: 0.7` (assuming EVA Vision uses 10GB)
  - `servingEngineSpec.modelSpec[].vllmConfig.maxModelLen: 12288`
- Checkpoints
  - For the external engine, verify that the `/v1/models` `id` is `Exaone4.0`
  - `routerSpec.staticBackends` only allows URLs that contain a dot (`.`) (service/domain must be explicit)
  - Calculation: 48GB - 10GB (EVA Vision) = 38GB → 38/48 ≈ 0.79; set to `0.7` for operational headroom
  - For Ada GPUs, you can use `--kv-cache-dtype fp8`
RTX 4090 x3 (24GB)
- Environment
  - EVA Vision uses 1 GPU, dedicated
  - vLLM uses 2 GPUs, dedicated (no external vLLM backend)
  - Keep the Router for operational convenience even with a single model (Qwen3‑VL 8B)
- Required settings
  - `routerSpec.enableRouter: true`
  - `routerSpec.serviceDiscovery: "static"`
  - `routerSpec.staticBackends`: register the internal engine only
    - Example: `http://eva-agent-vllm-qwen3-vl-8b-fp8-engine-service.<namespace>.svc.cluster.local`
  - `routerSpec.staticModels`: `qwen3-vl-8b-instruct-fp8`
  - `servingEngineSpec.modelSpec[].name`: `qwen3-vl-8b-fp8`
  - `servingEngineSpec.modelSpec[].vllmConfig.extraArgs`: `--served-model-name qwen3-vl-8b-instruct-fp8`, `--kv-cache-dtype fp8`
  - `servingEngineSpec.modelSpec[].vllmConfig.tensorParallelSize: 1`
  - `servingEngineSpec.modelSpec[].requestGPU: 1` (dedicated)
  - `servingEngineSpec.modelSpec[].replicaCount: 2` (2 engine pods, 1 GPU each)
  - `servingEngineSpec.modelSpec[].vllmConfig.gpuMemoryUtilization: 0.85` (24GB, dedicated)
  - `servingEngineSpec.modelSpec[].vllmConfig.maxModelLen: 12288`
- Checkpoints
  - `routerSpec.staticBackends` only allows URLs that contain a dot (`.`)
  - Based on vLLM logs, ~7 concurrent 12K requests are possible with an 8-bit KV cache → two 1-GPU engines are better for concurrency
  - You can disable the Router to reduce overhead with a single backend, but we recommend keeping it for scalability
RTX PRO 5000 x3 (48GB)
- Chart capability
  - The vLLM production stack can create multiple engines via `servingEngineSpec.modelSpec`, generating a Deployment per entry, so per-engine patches are possible.
- Setup A — Dedicated 2 + 1
  - EVA Vision uses 1 GPU, dedicated.
  - vLLM uses 2 GPUs, dedicated (no external backend, Qwen3‑VL only).
  - Keep the Router for a stable entrypoint.
  - Required settings (vLLM)
    - `routerSpec.enableRouter: true`
    - `routerSpec.serviceDiscovery: "static"`
    - `routerSpec.staticBackends`: register the internal engine only
      - Example: `http://eva-agent-vllm-qwen3-vl-8b-fp8-engine-service.<namespace>.svc.cluster.local`
    - `routerSpec.staticModels`: `qwen3-vl-8b-instruct-fp8`
    - `servingEngineSpec.modelSpec[].vllmConfig.extraArgs`: `--served-model-name qwen3-vl-8b-instruct-fp8`, `--kv-cache-dtype fp8`
    - `tensorParallelSize: 1`, `requestGPU: 1`, `replicaCount: 2`
    - `servingEngineSpec.modelSpec[].vllmConfig.gpuMemoryUtilization: 0.9` (dedicated)
    - `servingEngineSpec.modelSpec[].vllmConfig.maxModelLen: 12288`
- Setup B — 2 GPU + MIG split (2.5 + 0.5)
  - Use 2 GPUs as normal GPUs for vLLM, and split the remaining GPU with MIG for vLLM + EVA Vision.
  - A single release is recommended: split `modelSpec` into GPU/MIG entries and configure the Router to point to both.
  - Router distribution: to match 6:1 (dedicated:shared), repeat the dedicated engine URL 6 times in `staticBackends`.
    - `routerSpec.routingLogic: "roundrobin"`
    - `routerSpec.staticBackends`: 6x `dedicated` + 1x `shared`
    - `routerSpec.staticModels`: repeat `qwen3-vl-8b-instruct-fp8` in the same order
  - Example values:
```yaml
routerSpec:
  enableRouter: true
  serviceDiscovery: "static"
  staticBackends: "http://eva-agent-vllm-qwen3-vl-8b-fp8-gpu-engine-service.<ns>.svc.cluster.local,http://eva-agent-vllm-qwen3-vl-8b-fp8-gpu-engine-service.<ns>.svc.cluster.local,http://eva-agent-vllm-qwen3-vl-8b-fp8-gpu-engine-service.<ns>.svc.cluster.local,http://eva-agent-vllm-qwen3-vl-8b-fp8-gpu-engine-service.<ns>.svc.cluster.local,http://eva-agent-vllm-qwen3-vl-8b-fp8-gpu-engine-service.<ns>.svc.cluster.local,http://eva-agent-vllm-qwen3-vl-8b-fp8-gpu-engine-service.<ns>.svc.cluster.local,http://eva-agent-vllm-qwen3-vl-8b-fp8-mig-engine-service.<ns>.svc.cluster.local"
  staticModels: "qwen3-vl-8b-instruct-fp8,qwen3-vl-8b-instruct-fp8,qwen3-vl-8b-instruct-fp8,qwen3-vl-8b-instruct-fp8,qwen3-vl-8b-instruct-fp8,qwen3-vl-8b-instruct-fp8,qwen3-vl-8b-instruct-fp8"
servingEngineSpec:
  modelSpec:
    - name: "qwen3-vl-8b-fp8-gpu"
      replicaCount: 2
      requestGPU: 1
    - name: "qwen3-vl-8b-fp8-mig"
      replicaCount: 1
      requestGPU: 1 # placeholder; patch with MIG
```
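The 6:1 repetition can be generated rather than hand-typed. A sketch with hypothetical engine Service URLs (substitute the real ones from your release):

```bash
# Hypothetical dedicated/shared engine Service URLs
dedicated="http://eva-agent-vllm-qwen3-vl-8b-fp8-gpu-engine-service.eva-agent.svc.cluster.local"
shared="http://eva-agent-vllm-qwen3-vl-8b-fp8-mig-engine-service.eva-agent.svc.cluster.local"
model="qwen3-vl-8b-instruct-fp8"

# Repeat the dedicated URL 6 times, then append the shared URL once;
# staticModels repeats the model name in the same order
backends=""; models=""
for i in 1 2 3 4 5 6; do
  backends="${backends}${dedicated},"
  models="${models}${model},"
done
backends="${backends}${shared}"
models="${models}${model}"

echo "$backends"
echo "$models"
```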
- MIG patch (post-renderer): replace `nvidia.com/gpu` → `nvidia.com/mig-*` only for the MIG engine Deployment.
  - Do not request `nvidia.com/gpu` and `nvidia.com/mig-*` at the same time in a single pod. For the MIG engine, remove the plain GPU resource.
  - Example (48GB GPU; replace with your actual MIG profile name):
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: eva-agent-vllm-qwen3-vl-8b-fp8-mig-deployment-vllm
spec:
  template:
    spec:
      containers:
        - name: vllm-container
          resources:
            limits:
              nvidia.com/mig-2g.24gb: 1
            requests:
              nvidia.com/mig-2g.24gb: 1
```
- If patching in a single release is difficult:
  - Release A (normal GPU): Router enabled, `replicaCount: 2`, `requestGPU: 1`
  - Release B (MIG): Router disabled, `replicaCount: 1`, patch `nvidia.com/mig-*` via the post-renderer
- Setup C — Split k3s VMs + GPU passthrough
  - Split the host into two VMs and assign GPUs via passthrough (2 GPUs / 1 GPU).
  - If you set up k3s nodes on each VM, the scheduler recognizes physically separated GPU pools.
  - VM-A (2 GPUs, dedicated vLLM)
    - `requestGPU: 1`, `replicaCount: 2` (2 engines, 1 GPU each)
  - VM-B (1 GPU, EVA Vision + shared vLLM)
    - Node for EVA Vision
    - For vLLM, comment out `requestGPU` and tune `gpuMemoryUtilization`
  - Router distribution: 6:1 (VM-A dedicated : VM-B shared)
    - `routerSpec.routingLogic: "roundrobin"`
    - `routerSpec.staticBackends`: 6x `VM-A` + 1x `VM-B`
- Checkpoints
  - Check available MIG profiles: `kubectl describe node | grep nvidia.com/mig`
  - If no MIG profiles exist, operate with Setup A
- Add other server setups here (e.g., L40s x1, multi-GPU, etc.).
Download Configuration Files (eva-agent-vllm)
Download the values templates for vLLM.
Among the templates below, the common values.yaml and values-k3s.yaml are based on an NVIDIA A6000 GPU server, and values-aws.yaml is based on a server with one NVIDIA L40s GPU.
You must adjust the vLLM-related settings based on the table above and your GPU server specifications.
```bash
RELEASE_VERSION="2.6.0"
BASE_URL="https://raw.githubusercontent.com/mellerikat/eva-agent/chartmuseum/release/${RELEASE_VERSION}"

# vLLM values
curl -L "$BASE_URL/eva-agent-vllm/values.yaml" -o eva-agent-vllm/values.yaml
curl -L "$BASE_URL/eva-agent-vllm/values-k3s.yaml" -o eva-agent-vllm/values-k3s.yaml
curl -L "$BASE_URL/eva-agent-vllm/values-aws.yaml" -o eva-agent-vllm/values-aws.yaml
curl -L "$BASE_URL/eva-agent-vllm/values-ncp.yaml" -o eva-agent-vllm/values-ncp.yaml
```
Step 4: Install Qdrant and vLLM
After updating Qdrant and vLLM values to match your deployment server, install them using the script.
You can pass the options below as needed.
- `--namespace <ns>`: installation namespace (default: `eva-agent`)
- `--release <ver>`: release version (default: `2.6.0`)
- `--base-dir <dir>`: base directory for values/plugin directories (default: current directory)
- `--qdrant-chart-version <ver>`: Qdrant chart version (default: `1.16.3`)
- `--vllm-chart-version <ver>`: vLLM chart version (default: `0.1.8`)
- `--qdrant-values <file>`: additional Qdrant values file (can be used multiple times)
- `--vllm-values <file>`: additional vLLM values file (can be used multiple times)
For the full option list and defaults, run `./install_eva_agent_dependencies.sh --help`.
```bash
# Download the install script
curl -L "https://raw.githubusercontent.com/mellerikat/eva-agent/chartmuseum/install_eva_agent_dependencies.sh" \
  -o install_eva_agent_dependencies.sh
chmod +x install_eva_agent_dependencies.sh

# Run after editing values (Qdrant uses a post-renderer; vLLM uses the custom chart directly)
# k3s example
./install_eva_agent_dependencies.sh \
  --qdrant-values eva-agent-qdrant/values-k3s.yaml \
  --vllm-values eva-agent-vllm/values-k3s.yaml

# AWS example
./install_eva_agent_dependencies.sh \
  --qdrant-values eva-agent-qdrant/values-aws.yaml \
  --vllm-values eva-agent-vllm/values-aws.yaml

# NCP example
./install_eva_agent_dependencies.sh \
  --qdrant-values eva-agent-qdrant/values-ncp.yaml \
  --vllm-values eva-agent-vllm/values-ncp.yaml
```