Version: Next

Manage Nodegroup

Updated 2024.10.02

Topics

  1. Prerequisites
  2. Add New Nodegroup
  3. Update Nodegroup
  4. Delete Nodegroup


Detailed Steps

For more details on variables, refer to the Terminology page.

  • Nodegroup Overview
    • Nodegroups are only used when running training jobs.
    • Nodegroups are created per Workspace (Billing is separated).
  • Nodegroup Types (Label)
    • For convenience, each Nodegroup type is selected by a familiar label.
      • {PROJECT_NODEGROUP_SPEC}: low, standard, high, low-gpu, standard-gpu, high-gpu
  • Nodegroup Specifications
    • {PROJECT_NODEGROUP_EC2_NAME}: Defines EC2 performance based on the Nodegroup type.
    • {PROJECT_NODEGROUP_MAX}: Defines the maximum number of EC2 instances that can be run simultaneously based on the training service requirements.
    • availabilityZones are determined by {PROJECT_NODEGROUP_EC2_NAME}, because an instance type may not be offered in every zone.
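
The label list above lends itself to a quick sanity check before any resources are created. The following is a minimal sketch; the function name is_valid_nodegroup_spec is hypothetical and not part of any AI Conductor tooling.

```shell
# Hypothetical helper: succeeds only when the argument is one of the
# Nodegroup labels listed for {PROJECT_NODEGROUP_SPEC}.
is_valid_nodegroup_spec() {
  case "$1" in
    low|standard|high|low-gpu|standard-gpu|high-gpu) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage: warn when the chosen label is not recognized.
# "standard" is only a fallback for illustration.
spec="${PROJECT_NODEGROUP_SPEC:-standard}"
if is_valid_nodegroup_spec "$spec"; then
  echo "Nodegroup spec '$spec' is valid"
else
  echo "Unknown Nodegroup spec '$spec'" >&2
fi
```

Running the check early avoids creating a Nodegroup whose name and label do not match any spec the Workspace knows about.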

1. Prerequisites

Ensure that the installation environment has been set up (see 1. Setup Installation Environment), then define the following variables.

export AWS_CLUSTER_NAME=
export AWS_DEFAULT_REGION=
export AWS_DEFAULT_REGION_ALIAS=
export INFRA_NAME=
export DEPLOY_ENV=
export AIC_BACKEND_URL=
export PROJECT_NAME=
export WORKSPACE_NAME=${PROJECT_NAME}-ws
export KUBEFLOW_NAMESPACE_NAME=aic-ns-${WORKSPACE_NAME}
export KUBEFLOW_USER_NAME=aic-user-${WORKSPACE_NAME}
export KUBEFLOW_USER_PASSWD='$2a$12$A6GAI7xf1CjfPCF3MnycvuDcXUdP4O.Ruo7PvQUFUkmKGSYcCiieS'
export KUBEFLOW_USER_UNIQUE_ID=$(date +"%Y%m%d%H%M%S")
export PROJECT_S3_BUCKET_NAME=s3-${AWS_DEFAULT_REGION_ALIAS}-${INFRA_NAME}-${DEPLOY_ENV}-${PROJECT_NAME}
export PROJECT_NODEGROUP_SPEC=standard
export PROJECT_NODEGROUP_LABEL=${PROJECT_NAME}-ws-${PROJECT_NODEGROUP_SPEC}
export PROJECT_NODEGROUP_NAME=ng-${AWS_DEFAULT_REGION_ALIAS}-aicond-${PROJECT_NAME}-ws-${PROJECT_NODEGROUP_SPEC}
export PROJECT_NODEGROUP_DESIRED_SIZE=
export PROJECT_NODEGROUP_MIN=0
export PROJECT_NODEGROUP_MAX=
export PROJECT_NODEGROUP_EC2_NAME=
export PROJECT_NODEGROUP_EC2_VCPU=
export PROJECT_NODEGROUP_EC2_MEM=
export PROJECT_NODEGROUP_EC2_GPU=
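
Several of the variables above are left empty and must be filled in by hand, so a preflight check can catch omissions before eksctl is invoked. The sketch below is an illustration only: check_nodegroup_env is a hypothetical helper, and the values in the usage example (cluster name, region, instance type) are placeholders rather than recommended settings.

```shell
# Hypothetical preflight helper: reports every required variable that is
# unset or empty, then checks MIN <= DESIRED_SIZE <= MAX.
check_nodegroup_env() (
  missing=0
  for name in AWS_CLUSTER_NAME AWS_DEFAULT_REGION AWS_DEFAULT_REGION_ALIAS \
              PROJECT_NAME PROJECT_NODEGROUP_SPEC PROJECT_NODEGROUP_EC2_NAME \
              PROJECT_NODEGROUP_DESIRED_SIZE PROJECT_NODEGROUP_MIN \
              PROJECT_NODEGROUP_MAX; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  [ "$missing" -eq 0 ] || return 1

  if [ "$PROJECT_NODEGROUP_MIN" -le "$PROJECT_NODEGROUP_DESIRED_SIZE" ] &&
     [ "$PROJECT_NODEGROUP_DESIRED_SIZE" -le "$PROJECT_NODEGROUP_MAX" ]; then
    return 0
  fi
  echo "sizes must satisfy MIN <= DESIRED_SIZE <= MAX" >&2
  return 1
)

# Usage with placeholder values, run in a subshell so nothing leaks out.
(
  export AWS_CLUSTER_NAME=demo-cluster AWS_DEFAULT_REGION=ap-northeast-2
  export AWS_DEFAULT_REGION_ALIAS=an2 PROJECT_NAME=demo
  export PROJECT_NODEGROUP_SPEC=standard PROJECT_NODEGROUP_EC2_NAME=m5.large
  export PROJECT_NODEGROUP_DESIRED_SIZE=1 PROJECT_NODEGROUP_MIN=0
  export PROJECT_NODEGROUP_MAX=3
  check_nodegroup_env && echo "environment looks complete"
)
```

The size comparison mirrors the constraint an Auto Scaling group places on its minimum, desired, and maximum capacity.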


2. Add New Nodegroup

  • Create the Nodegroup infrastructure.

    • Create the create-nodegroup.yaml file that defines the Nodegroup.

      NOTE: The "propagateASGTags: true" setting is mandatory.

      cat <<EOT > create-nodegroup.yaml
      apiVersion: eksctl.io/v1alpha5
      kind: ClusterConfig
      managedNodeGroups:
      - amiFamily: AmazonLinux2
        desiredCapacity: ${PROJECT_NODEGROUP_DESIRED_SIZE}
        disableIMDSv1: false
        disablePodIMDS: false
        iam:
          withAddonPolicies:
            albIngress: false
            appMesh: false
            appMeshPreview: false
            autoScaler: true
            awsLoadBalancerController: false
            certManager: false
            cloudWatch: false
            ebs: false
            efs: false
            externalDNS: false
            fsx: false
            imageBuilder: false
            xRay: false
        instanceSelector: {}
        instanceType: ${PROJECT_NODEGROUP_EC2_NAME}
        labels:
          aic-role: ${PROJECT_NODEGROUP_LABEL}
          alpha.eksctl.io/cluster-name: ${AWS_CLUSTER_NAME}
          alpha.eksctl.io/nodegroup-name: ${PROJECT_NODEGROUP_NAME}
        maxSize: ${PROJECT_NODEGROUP_MAX}
        minSize: ${PROJECT_NODEGROUP_MIN}
        name: ${PROJECT_NODEGROUP_NAME}
        availabilityZones: ["${AWS_DEFAULT_REGION}a", "${AWS_DEFAULT_REGION}c"]
        privateNetworking: true
        releaseVersion: ""
        securityGroups:
          withLocal: null
          withShared: null
        ssh:
          allow: false
          publicKeyPath: ""
        tags:
          alpha.eksctl.io/nodegroup-name: ${PROJECT_NODEGROUP_NAME}
          alpha.eksctl.io/nodegroup-type: managed
        volumeIOPS: 3000
        volumeSize: 50
        volumeThroughput: 125
        volumeType: gp3
        propagateASGTags: true
      metadata:
        name: ${AWS_CLUSTER_NAME}
        region: ${AWS_DEFAULT_REGION}
      EOT
    • Create the Nodegroup using the following command:

      eksctl create nodegroup --config-file=create-nodegroup.yaml

      Troubleshooting ('AccessConfig'): If the command fails with the following error, update eksctl to a newer version and run the command again.

      error getting cluster stack template: failed to parse GetStackTemplate response: json: unknown field "AccessConfig"

  • Register the created Nodegroup in the AI Conductor Workspace.

    • Access {AIC_BACKEND_URL} and log in with an ADMIN user.

    • Add the Nodegroup information with POST /api/v1/workspaces/{workspace_id}/exespecs.

      • workspace_id: You can check it through GET /api/v1/workspaces.
      • name: {PROJECT_NODEGROUP_SPEC}
      • vcpu: {PROJECT_NODEGROUP_EC2_VCPU}
      • ram_gb: {PROJECT_NODEGROUP_EC2_MEM}
      • gpu: {PROJECT_NODEGROUP_EC2_GPU}

      Example of adding two Nodegroups:

      [
        {
          "name": "standard",
          "vcpu": 2,
          "ram_gb": 8,
          "gpu": 0
        },
        {
          "name": "high",
          "vcpu": 8,
          "ram_gb": 32,
          "gpu": 0
        }
      ]
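
The request body for the POST call can be assembled from the variables defined in 1. Prerequisites instead of being typed by hand. This is a sketch under assumptions: exespec_entry is a hypothetical helper, the numeric fields are assumed to be integers as in the example above, and the fallback values after ":-" are placeholders.

```shell
# Hypothetical helper: prints one exespecs entry as JSON.
# Arguments: name, vcpu, ram_gb, gpu (the last three must be integers).
exespec_entry() {
  printf '{"name": "%s", "vcpu": %d, "ram_gb": %d, "gpu": %d}' "$1" "$2" "$3" "$4"
}

# Build a one-element request body from the prerequisite variables.
# The fallbacks are for illustration only; set the real values first.
payload="[$(exespec_entry \
  "${PROJECT_NODEGROUP_SPEC:-standard}" \
  "${PROJECT_NODEGROUP_EC2_VCPU:-2}" \
  "${PROJECT_NODEGROUP_EC2_MEM:-8}" \
  "${PROJECT_NODEGROUP_EC2_GPU:-0}")]"
echo "$payload"
```

The printed array can then be pasted as the body of POST /api/v1/workspaces/{workspace_id}/exespecs.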


3. Update Nodegroup

  • Apply the modified Nodegroup to the AI Conductor Workspace.
    • Access {AIC_BACKEND_URL} and log in with an ADMIN user.

    • Modify the Nodegroup information using PATCH /api/v1/workspaces/{workspace_id}/exespecs.

      • workspace_id: You can check it through GET /api/v1/workspaces.
      • name: {PROJECT_NODEGROUP_SPEC}
        • NOTE: Use the names of the existing Nodegroups.
      • vcpu: {PROJECT_NODEGROUP_EC2_VCPU}
      • ram_gb: {PROJECT_NODEGROUP_EC2_MEM}
      • gpu: {PROJECT_NODEGROUP_EC2_GPU}

      Example of modifying two Nodegroups:

      [
        {
          "name": "standard",
          "vcpu": 4,
          "ram_gb": 16,
          "gpu": 0
        },
        {
          "name": "high",
          "vcpu": 16,
          "ram_gb": 64,
          "gpu": 0
        }
      ]
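
For scripted updates, the PATCH request could be sent with curl rather than through the browser. The sketch below rests on assumptions this guide does not confirm: that the backend accepts a bearer token (held here in a hypothetical AIC_TOKEN variable) and a JSON body. The function name call_exespecs, the workspace id 12, and the file name update.json are also illustrative.

```shell
# Hypothetical wrapper around curl for the exespecs endpoint.
# Assumption: the backend authenticates with "Authorization: Bearer <token>".
# Arguments: HTTP method, workspace id, path to a JSON payload file.
call_exespecs() {
  curl --fail --silent --show-error \
    --request "$1" \
    --header "Authorization: Bearer ${AIC_TOKEN}" \
    --header "Content-Type: application/json" \
    --data @"$3" \
    "${AIC_BACKEND_URL}/api/v1/workspaces/$2/exespecs"
}

# Usage (not executed here): apply the specs stored in update.json
# to the workspace whose id is 12.
#   call_exespecs PATCH 12 update.json
```

With --fail, curl returns a non-zero exit status on HTTP errors, so a calling script can stop when the update is rejected.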


4. Delete Nodegroup

  • Delete the Nodegroup infrastructure.

    • Option 1: Delete the Nodegroup using create-nodegroup.yaml.
      eksctl delete nodegroup --config-file=create-nodegroup.yaml --approve
    • Option 2: Delete the Nodegroup by name using the eksctl command.
      eksctl delete nodegroup --cluster ${AWS_CLUSTER_NAME} --region ${AWS_DEFAULT_REGION} --name ${PROJECT_NODEGROUP_NAME}
  • Remove the deleted Nodegroup from the AI Conductor Workspace.

    • Access {AIC_BACKEND_URL} and log in with an ADMIN user.

    • Delete the Nodegroup information using POST /api/v1/workspaces/{workspace_id}/exespecs/delete.

      • workspace_id: You can check it through GET /api/v1/workspaces.
      • name: {PROJECT_NODEGROUP_SPEC}
        • NOTE: Use the names of the existing Nodegroups.

      Example of deleting two Nodegroups:

      [
        {
          "name": "standard"
        },
        {
          "name": "high"
        }
      ]
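
The delete body contains only names, so it can be generated from a plain list. A minimal sketch follows; delete_payload is a hypothetical helper, and it assumes the names contain no characters that need JSON escaping, which holds for the labels used in this guide.

```shell
# Hypothetical helper: turns a list of Nodegroup names into the JSON array
# used by the delete request.
delete_payload() (
  sep=""
  printf '['
  for name in "$@"; do
    printf '%s{"name": "%s"}' "$sep" "$name"
    sep=", "
  done
  printf ']\n'
)

delete_payload standard high
# prints: [{"name": "standard"}, {"name": "high"}]
```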