Manage Nodegroup
Detailed Steps
For more details on variables, refer to the Terminology page.
- Nodegroup Overview
- Nodegroups are only used when running training jobs.
- Nodegroups are created per Workspace (billing is separated per Workspace).
- Nodegroup Types (Label)
- For convenience, each Nodegroup type is selected by a familiar label.
- {PROJECT_NODEGROUP_SPEC}: low, standard, high, low-gpu, standard-gpu, high-gpu
- Nodegroup Specifications
- {PROJECT_NODEGROUP_EC2_NAME}: Defines EC2 performance based on the Nodegroup type.
- {PROJECT_NODEGROUP_MAX}: Defines the maximum number of EC2 instances that can be run simultaneously based on the training service requirements.
- availabilityZones: The usable Availability Zones depend on {PROJECT_NODEGROUP_EC2_NAME}, because not every zone offers every EC2 instance type.
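Since {PROJECT_NODEGROUP_SPEC} accepts only the six labels listed above, a typo can be caught before any infrastructure is created. The following is a minimal sketch; `is_valid_spec` and `VALID_NODEGROUP_SPECS` are hypothetical names introduced here for illustration and are not part of AI Conductor.

```shell
# Labels listed above for {PROJECT_NODEGROUP_SPEC}.
VALID_NODEGROUP_SPECS="low standard high low-gpu standard-gpu high-gpu"

# is_valid_spec is a hypothetical helper (not part of AI Conductor).
# It returns 0 when the given label is one of the six supported values.
is_valid_spec() {
  for _spec in $VALID_NODEGROUP_SPECS; do
    [ "$_spec" = "$1" ] && return 0
  done
  return 1
}
```

For example, `is_valid_spec "${PROJECT_NODEGROUP_SPEC}" || echo "unknown Nodegroup type"` reports a label that is not in the list.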
1. Prerequisites
Ensure that the environment setup for installation is completed. (Refer to 1. Setup Installation Environment)
export AWS_CLUSTER_NAME=
export AWS_DEFAULT_REGION=
export AWS_DEFAULT_REGION_ALIAS=
export INFRA_NAME=
export DEPLOY_ENV=
export AIC_BACKEND_URL=
export PROJECT_NAME=
export WORKSPACE_NAME=${PROJECT_NAME}-ws
export KUBEFLOW_NAMESPACE_NAME=aic-ns-${WORKSPACE_NAME}
export KUBEFLOW_USER_NAME=aic-user-${WORKSPACE_NAME}
export KUBEFLOW_USER_PASSWD='$2a$12$A6GAI7xf1CjfPCF3MnycvuDcXUdP4O.Ruo7PvQUFUkmKGSYcCiieS'
export KUBEFLOW_USER_UNIQUE_ID=$(date +"%Y%m%d%H%M%S")
export PROJECT_S3_BUCKET_NAME=s3-${AWS_DEFAULT_REGION_ALIAS}-${INFRA_NAME}-${DEPLOY_ENV}-${PROJECT_NAME}
export PROJECT_NODEGROUP_SPEC=standard
export PROJECT_NODEGROUP_LABEL=${PROJECT_NAME}-ws-${PROJECT_NODEGROUP_SPEC}
export PROJECT_NODEGROUP_NAME=ng-${AWS_DEFAULT_REGION_ALIAS}-aicond-${PROJECT_NAME}-ws-${PROJECT_NODEGROUP_SPEC}
export PROJECT_NODEGROUP_DESIRED_SIZE=
export PROJECT_NODEGROUP_MIN=0
export PROJECT_NODEGROUP_MAX=
export PROJECT_NODEGROUP_EC2_NAME=
export PROJECT_NODEGROUP_EC2_VCPU=
export PROJECT_NODEGROUP_EC2_MEM=
export PROJECT_NODEGROUP_EC2_GPU=
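Several of the variables above are left blank and must be filled in by hand. A small check such as the following sketch can list the ones that are still empty before continuing. `check_required_vars` is a hypothetical helper introduced here, not a tool shipped with the installation environment.

```shell
# check_required_vars is a hypothetical helper: it prints the name of every
# listed variable that is unset or empty, and returns non-zero if any is missing.
check_required_vars() {
  _missing=0
  for _name in "$@"; do
    # Look up the variable whose name is stored in $_name.
    eval "_value=\${$_name:-}"
    if [ -z "$_value" ]; then
      echo "missing: $_name"
      _missing=1
    fi
  done
  return "$_missing"
}

# Example call (commented out so that sourcing this block has no side effects):
# check_required_vars AWS_CLUSTER_NAME AWS_DEFAULT_REGION PROJECT_NAME \
#   PROJECT_NODEGROUP_DESIRED_SIZE PROJECT_NODEGROUP_MAX PROJECT_NODEGROUP_EC2_NAME
```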
2. Adding the Nodegroup
- Create the Nodegroup infrastructure.
- Create the create-nodegroup.yaml file that defines the Nodegroup.
NOTE: The "propagateASGTags: true" setting is mandatory.
cat <<EOT > create-nodegroup.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
managedNodeGroups:
- amiFamily: AmazonLinux2
  desiredCapacity: ${PROJECT_NODEGROUP_DESIRED_SIZE}
  disableIMDSv1: false
  disablePodIMDS: false
  iam:
    withAddonPolicies:
      albIngress: false
      appMesh: false
      appMeshPreview: false
      autoScaler: true
      awsLoadBalancerController: false
      certManager: false
      cloudWatch: false
      ebs: false
      efs: false
      externalDNS: false
      fsx: false
      imageBuilder: false
      xRay: false
  instanceSelector: {}
  instanceType: ${PROJECT_NODEGROUP_EC2_NAME}
  labels:
    aic-role: ${PROJECT_NODEGROUP_LABEL}
    alpha.eksctl.io/cluster-name: ${AWS_CLUSTER_NAME}
    alpha.eksctl.io/nodegroup-name: ${PROJECT_NODEGROUP_NAME}
  maxSize: ${PROJECT_NODEGROUP_MAX}
  minSize: ${PROJECT_NODEGROUP_MIN}
  name: ${PROJECT_NODEGROUP_NAME}
  availabilityZones: ["${AWS_DEFAULT_REGION}a", "${AWS_DEFAULT_REGION}c"]
  privateNetworking: true
  releaseVersion: ""
  securityGroups:
    withLocal: null
    withShared: null
  ssh:
    allow: false
    publicKeyPath: ""
  tags:
    alpha.eksctl.io/nodegroup-name: ${PROJECT_NODEGROUP_NAME}
    alpha.eksctl.io/nodegroup-type: managed
  volumeIOPS: 3000
  volumeSize: 50
  volumeThroughput: 125
  volumeType: gp3
  propagateASGTags: true
metadata:
  name: ${AWS_CLUSTER_NAME}
  region: ${AWS_DEFAULT_REGION}
EOT
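Because the heredoc expands shell variables, any variable left empty in the Prerequisites ends up as a key with no value in create-nodegroup.yaml. The sketch below looks for that case before the file is passed to eksctl. `check_nodegroup_yaml` is a hypothetical helper, and the list of keys it inspects is an assumption about which fields matter most here.

```shell
# check_nodegroup_yaml is a hypothetical helper: it reports keys in the
# generated file that are absent or have nothing after the colon.
check_nodegroup_yaml() {
  _file="$1"
  _rc=0
  for _key in desiredCapacity instanceType maxSize minSize; do
    # A key counts as filled when at least one non-space character follows "key:".
    if ! grep -Eq "^[[:space:]]*${_key}:[[:space:]]*[^[:space:]]" "$_file"; then
      echo "empty or missing: ${_key}"
      _rc=1
    fi
  done
  return "$_rc"
}
```

Running `check_nodegroup_yaml create-nodegroup.yaml` prints nothing and returns 0 when all four fields carry a value.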
Create the Nodegroup using the following command:
eksctl create nodegroup --config-file=create-nodegroup.yaml
Troubleshooting: 'AccessConfig'
error getting cluster stack template: failed to parse GetStackTemplate response: json: unknown field "AccessConfig"
If this error occurs, update eksctl to the latest version and run the command again.
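Once the create command finishes, the result can be checked from the command line. The sketch below uses `eksctl get nodegroup` and `kubectl get nodes` with the `aic-role` label from create-nodegroup.yaml. The `run` wrapper and the `DRY_RUN` variable are hypothetical conveniences added here so the commands can be previewed without touching the cluster.

```shell
# run prints the command when DRY_RUN=1 and executes it otherwise.
# (Hypothetical wrapper, added only for this example.)
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# verify_nodegroup shows the Nodegroup as eksctl sees it, then lists the
# nodes that carry the aic-role label set in create-nodegroup.yaml.
verify_nodegroup() {
  run eksctl get nodegroup --cluster "${AWS_CLUSTER_NAME}" \
    --region "${AWS_DEFAULT_REGION}" --name "${PROJECT_NODEGROUP_NAME}"
  run kubectl get nodes -l "aic-role=${PROJECT_NODEGROUP_LABEL}"
}
```

If PROJECT_NODEGROUP_DESIRED_SIZE is 0, the node list is expected to be empty until a training job scales the Nodegroup up.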
- Register the created Nodegroup in the AI Conductor Workspace.
- Access {AIC_BACKEND_URL} and log in with an ADMIN user.
- Add the Nodegroup information with POST /api/v1/workspaces/{workspace_id}/exespecs.
- workspace_id: Look it up with GET /api/v1/workspaces.
- name: {PROJECT_NODEGROUP_SPEC}
- vcpu: {PROJECT_NODEGROUP_EC2_VCPU}
- ram_gb: {PROJECT_NODEGROUP_EC2_MEM}
- gpu: {PROJECT_NODEGROUP_EC2_GPU}
Example of adding two Nodegroups:
[
  {
    "name": "standard",
    "vcpu": 2,
    "ram_gb": 8,
    "gpu": 0
  },
  {
    "name": "high",
    "vcpu": 8,
    "ram_gb": 32,
    "gpu": 0
  }
]
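The same request can also be scripted. The following is a rough sketch, not the documented procedure: `build_exespec_payload` and `post_exespec` are hypothetical helpers, `WORKSPACE_ID` is assumed to hold the id returned by GET /api/v1/workspaces, and `AIC_TOKEN` is an assumed bearer token for an ADMIN user. The actual authentication scheme of the backend may differ.

```shell
# build_exespec_payload is a hypothetical helper that renders the request body
# from the Prerequisites variables as a one-element JSON array.
build_exespec_payload() {
  printf '[{"name": "%s", "vcpu": %s, "ram_gb": %s, "gpu": %s}]\n' \
    "${PROJECT_NODEGROUP_SPEC}" \
    "${PROJECT_NODEGROUP_EC2_VCPU}" \
    "${PROJECT_NODEGROUP_EC2_MEM}" \
    "${PROJECT_NODEGROUP_EC2_GPU}"
}

# post_exespec sends the body with curl.
# ASSUMPTIONS: WORKSPACE_ID and AIC_TOKEN are set by the operator, and the
# backend accepts a bearer token; neither is confirmed by this guide.
post_exespec() {
  build_exespec_payload | curl -sS -X POST \
    -H "Authorization: Bearer ${AIC_TOKEN}" \
    -H "Content-Type: application/json" \
    --data @- \
    "${AIC_BACKEND_URL}/api/v1/workspaces/${WORKSPACE_ID}/exespecs"
}
```

Calling `build_exespec_payload` on its own is a convenient way to inspect the body before sending it.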
3. Updating the Nodegroup
- Apply the modified Nodegroup settings to the AI Conductor Workspace.
- Access {AIC_BACKEND_URL} and log in with an ADMIN user.
- Modify the Nodegroup information using PATCH /api/v1/workspaces/{workspace_id}/exespecs.
- workspace_id: Look it up with GET /api/v1/workspaces.
- name: {PROJECT_NODEGROUP_SPEC}
- NOTE: Use the names of the existing Nodegroups.
- vcpu: {PROJECT_NODEGROUP_EC2_VCPU}
- ram_gb: {PROJECT_NODEGROUP_EC2_MEM}
- gpu: {PROJECT_NODEGROUP_EC2_GPU}
Example of modifying two Nodegroups:
[
  {
    "name": "standard",
    "vcpu": 4,
    "ram_gb": 16,
    "gpu": 0
  },
  {
    "name": "high",
    "vcpu": 16,
    "ram_gb": 64,
    "gpu": 0
  }
]
4. Deleting the Nodegroup
- Delete the Nodegroup infrastructure using one of the following methods.
- Use create-nodegroup.yaml to delete the Nodegroup.
eksctl delete nodegroup --config-file=create-nodegroup.yaml --approve
- Or delete the Nodegroup by name using the eksctl command.
eksctl delete nodegroup --cluster ${AWS_CLUSTER_NAME} --region ${AWS_DEFAULT_REGION} --name ${PROJECT_NODEGROUP_NAME}
- Remove the deleted Nodegroup from the AI Conductor Workspace.
- Access {AIC_BACKEND_URL} and log in with an ADMIN user.
- Delete the Nodegroup information using POST /api/v1/workspaces/{workspace_id}/exespecs/delete.
- workspace_id: Look it up with GET /api/v1/workspaces.
- name: {PROJECT_NODEGROUP_SPEC}
- NOTE: Use the names of the existing Nodegroups.
Example of deleting two Nodegroups:
[
  {
    "name": "standard"
  },
  {
    "name": "high"
  }
]
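When several Nodegroups are removed at once, the delete body can be generated from a list of names rather than typed by hand. This is a minimal sketch; `build_delete_payload` is a hypothetical helper, and it assumes the names contain no characters that need JSON escaping (true for the six labels used in this guide).

```shell
# build_delete_payload is a hypothetical helper: it prints a JSON array with
# one {"name": ...} object per argument.
build_delete_payload() {
  _sep=""
  printf '['
  for _name in "$@"; do
    printf '%s{"name": "%s"}' "$_sep" "$_name"
    _sep=", "
  done
  printf ']\n'
}
```

For instance, `build_delete_payload standard high` produces the same body as the example above, on a single line.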