GCR Parameter

Updated 2024.05.17

Overview of experimental_plan.yaml

To apply AI content to your data, you need to input the data information and the desired content functions into the experimental_plan.yaml file. After installing the AI content in the solution folder, you can find a pre-written experimental_plan.yaml file for each content under the solution folder. By entering 'data information' and modifying/adding 'user arguments' provided by each asset in this YAML file, you can execute ALO to generate a data analysis model with the desired settings.

Structure of experimental_plan.yaml

The experimental_plan.yaml includes various settings necessary to run ALO. By modifying the 'data path' and 'user arguments' among these settings, you can use the AI content immediately.

Inputting Data Paths (external_path)

  • The external_path parameter is used to specify the path of files to be loaded or the path where files will be saved. If save_train_artifacts_path and save_inference_artifacts_path are not specified, the modeling artifacts will be saved in the default paths train_artifacts and inference_artifacts folders, respectively.
external_path:
- load_train_data_path: ./solution/sample_data/train
- load_inference_data_path: ./solution/sample_data/test
- save_train_artifacts_path:
- save_inference_artifacts_path:
| Parameter Name | DEFAULT | Description and Options |
|:---:|:---:|:---:|
| load_train_data_path | ./sample_data/train/ | Enter the folder path where the training data is located (do not include the csv file name). All csv files under the specified path are concatenated. |
| load_inference_data_path | ./sample_data/test/ | Enter the folder path where the inference data is located (do not include the csv file name). All csv files under the specified path are concatenated. |

*All files under the specified path, including those in subfolders, are concatenated.
*All columns in the files to be concatenated must be identical.
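The loading rules above (every csv file under the folder, subfolders included, with identical columns) can be pictured with a small standard-library sketch. It is only an illustration, not ALO's actual loader: the function name `load_csv_folder` is hypothetical, and treating a different column order as a mismatch is an assumption.

```python
import csv
from pathlib import Path


def load_csv_folder(folder, encoding="utf-8"):
    """Read every .csv under `folder` (subfolders included) into one list of rows."""
    header = None
    rows = []
    # sorted() keeps the concatenation order deterministic across runs
    for path in sorted(Path(folder).rglob("*.csv")):
        with path.open(newline="", encoding=encoding) as f:
            reader = csv.DictReader(f)
            columns = reader.fieldnames
            if header is None:
                header = columns
            elif columns != header:
                # Assumption: "identical columns" also means the same order
                raise ValueError(f"{path} has columns {columns}, expected {header}")
            rows.extend(reader)
    return header, rows
```

Calling `load_csv_folder("./solution/sample_data/train")` would return the shared header and all rows as dictionaries, or raise an error as soon as one file has different columns.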

User Parameters (user_parameters)

  • The step under user_parameters refers to the asset name. For example, step: input refers to the input asset stage.
  • args refers to the user arguments of the input asset (step: input). User arguments are data analysis-related setting parameters provided by each asset. Refer to the User arguments description below for details.
user_parameters:
    - train_pipeline:
        - step: input
          args:
            - file_type
            ...
          ui_args:
            ...
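To make the nesting of user_parameters, pipeline, step, and args easier to follow, here is a rough sketch of how the section might look once parsed into Python objects, with a helper that looks up the args of one step. The `plan` literal, the `readiness` step name, the assumption that args is a list of mappings, and the helper `get_step_args` are all illustrative and not part of ALO.

```python
# Hypothetical in-memory form of the user_parameters section after YAML parsing.
plan = {
    "user_parameters": [
        {
            "train_pipeline": [
                {"step": "input", "args": [{"file_type": "csv", "encoding": "utf-8"}]},
                {"step": "readiness", "args": [{"y_column": "target"}]},
            ]
        }
    ]
}


def get_step_args(plan, pipeline, step):
    """Return the merged args of one asset step, or raise KeyError if it is absent."""
    for entry in plan["user_parameters"]:
        for asset in entry.get(pipeline, []):
            if asset["step"] == step:
                merged = {}
                # Assumption: args is a list of mappings; merge them into one dict
                for item in asset.get("args") or []:
                    merged.update(item)
                return merged
    raise KeyError(f"step {step!r} not found in {pipeline!r}")
```

For example, `get_step_args(plan, "train_pipeline", "input")` returns the file_type and encoding settings of the input asset.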

User arguments explanation

What are User arguments?

User arguments are parameters that control how each asset operates. They are entered under args in the respective asset step of the experimental_plan.yaml. Each asset in the AI content pipeline provides user arguments to apply various functions to the data. Refer to the guide below to change or add user arguments and create a model that fits your data. User arguments are divided into 'Required arguments', which are pre-written in the experimental_plan.yaml, and 'Custom arguments', which users can add by referring to the guide.

Required arguments

  • Required arguments are the basic arguments that are immediately visible in the experimental_plan.yaml. Most required arguments have default values pre-set in the YAML file.
  • Users must enter values for the data-related arguments among the required arguments in the experimental_plan.yaml (e.g., x_columns, y_column).

Custom arguments

  • Custom arguments are functions provided by the asset but not listed in the experimental_plan.yaml. Users can add these arguments to the YAML file's respective asset args.

The GCR pipeline consists of Input - Readiness - Graph - Modeling (train/inference) - Output assets, and the user arguments are configured differently for each asset's function. First, try modeling with the default required arguments settings in the experimental_plan.yaml, and then add user arguments to create a GCR model that fits your data perfectly!


Summary of User arguments

Below is a summary of the user arguments for GCR. Click on the 'Argument Name' to navigate to its detailed explanation.

Default

  • The 'Default' column indicates the default value of the user argument.
  • If there is no default value, it is marked with '-'.
  • If the default value is to leave it empty, it is marked as ' '.
  • If there is logic behind the default value, it is marked as 'Refer to the description'. Click on the 'Argument Name' to see the detailed explanation.

ui_args

  • The 'ui_args' column indicates whether the ui_args function is supported, allowing the argument value to be changed in the AI Conductor UI.
  • O: If you enter the argument name under ui_args in the experimental_plan.yaml, you can change the argument value in the AI Conductor UI.
  • X: The ui_args function is not supported.
  • For a detailed explanation of ui_args, refer to the Write UI Parameter guide.

User Configuration Required

  • The 'User Configuration Required' column indicates whether the user must check and change the argument before running the AI content.
  • O: Generally, task and data-related information that users need to input before modeling.
  • X: If the user does not change the value, the default value is used for modeling.
| Asset Name | Argument Type | Argument Name | Default | Description | User Configuration Required | ui_args |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Input | Required | file_type | csv | Input data file extension. | X | O |
| Input | Required | encoding | utf-8 | Input data encoding type. | X | O |
| Readiness | Required | x_columns | ' ' | List of x column names to be used for training. If left blank, all columns except y_column are used. | X | O |
| Readiness | Required | drop_columns | ' ' | List of column names to exclude from x columns. | X | O |
| Readiness | Required | y_column | - | Name of the y column. | O | O |
| Graph | Required | dimension | 32 | Number of dimensions for graph embeddings. | X | O |
| Graph | Required | num_epochs | 10 | Number of training epochs for graph embeddings algorithm. | X | O |
| Graph | Required | num_partitions | 1 | Number of partitions to divide the input data for embedding. | X | O |
| Graph | Required | use_gpu | False | Whether to use GPU for graph embedding in a GPU-available environment. | X | X |
| Graph | Custom | workers | 1 | Number of processes for parallel execution during graph embedding. | X | X |
| Graph | Custom | custom_connection_lhs | ' ' | Left-hand columns to be connected based on domain knowledge. | X | X |
| Graph | Custom | custom_connection_rhs | ' ' | Right-hand columns to be connected based on domain knowledge. | X | X |
| Graph | Custom | comparator | dot | Function to compare the similarity of two embeddings during graph embedding. | X | X |
| Graph | Custom | loss_fn | softmax | Loss function for training during graph embedding. | X | X |
| Graph | Custom | lr | 0.01 | Learning rate for training during graph embedding. | X | X |
| Graph | Custom | batch_size | 1000 | Batch size for training during graph embedding. | X | X |
| Train | Required | task | classification | Type of prediction task. | X | O |
| Train | Required | eval_metric | f1_score | Evaluation metric for selecting the best model during HPO. | X | O |
| Train | Required | num_hpo | 20 | Number of HPO trials. | X | O |
| Inference | Required | global_xai | False | Whether to perform global XAI during inference. | X | O |
| Inference | Required | local_xai | False | Whether to perform local XAI during inference. | X | O |

Detailed Explanation of User arguments

Input asset

file_type

Specify the file extension of the input data. Currently, AI Solution development only supports csv files.

  • Argument type: Required
  • Input type
    • string
  • Possible values
    • csv (default)
  • Usage
    • file_type: csv  
  • ui_args: O

encoding

Specify the encoding type of the input data. Currently, AI Solution development only supports utf-8 encoding.

  • Argument type: Required
  • Input type
    • string
  • Possible values
    • utf-8 (default)
  • Usage
    • encoding: utf-8
  • ui_args: O

Readiness asset

x_columns

Enter the list of x column names in the dataframe. If left blank, all columns except y_column are used as x columns.

  • Argument type: Required
  • Input type
    • list
  • Possible values
    • Empty (default) or list of column names
  • Usage
    • x_columns: [col1, col2]
  • ui_args: O

drop_columns

Enter the list of column names to exclude from x columns in the dataframe. If left blank, it means there are no columns to exclude.

  • Argument type: Required
  • Input type
    • list
  • Possible values
    • Empty (default) or list of column names
  • Usage
    • drop_columns: [col1, col2]
  • ui_args: O

y_column

Enter the name of the y column (label column) in the dataframe. The user must input the appropriate column name according to the data.

  • Argument type: Required
  • Input type
    • string
  • Possible values
    • column name
  • Usage
    • y_column: target
  • ui_args: O
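The interaction between x_columns, drop_columns, and y_column described above can be sketched as a small function. This is a reading of the documented rules under stated assumptions, not the Readiness asset's actual code: the name `resolve_x_columns` is hypothetical, and raising an error for unknown column names is an assumption.

```python
def resolve_x_columns(all_columns, y_column, x_columns=None, drop_columns=None):
    """Return the feature columns implied by x_columns, drop_columns and y_column."""
    if y_column not in all_columns:
        raise ValueError(f"y_column {y_column!r} is not in the data")
    # Blank x_columns means: use every column except the label column
    candidates = list(x_columns) if x_columns else [c for c in all_columns if c != y_column]
    missing = [c for c in candidates if c not in all_columns]
    if missing:
        # Assumption: unknown names are treated as an error
        raise ValueError(f"x_columns not found in the data: {missing}")
    dropped = set(drop_columns or [])
    return [c for c in candidates if c not in dropped and c != y_column]
```

With columns `["a", "b", "c", "target"]` and `y_column="target"`, leaving both lists blank selects `a`, `b`, and `c`, while `drop_columns=["b"]` leaves `a` and `c`.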

Graph asset

dimension

Set the number of dimensions into which each column of the input data is embedded during graph embedding. A higher dimension separates the vectors better, which tends to improve model accuracy, but it requires more memory and a longer embedding time.

  • Argument type: Required
  • Input type
    • int
  • Possible values
    • 4, 8, 16, 32 (default), 64, 128, 256, 512, 1024
  • Usage
    • dimension: 32
  • ui_args: O

num_epochs

Set the number of training epochs for graph embedding. Typically, the loss value saturates after about 10 epochs.

  • Argument type: Required
  • Input type
    • int
  • Possible values
    • 1~100, 10 (default)
  • Usage
    • num_epochs: 10
  • ui_args: O

num_partitions

GCR can perform graph embedding by dividing the entire input data into multiple pieces, reducing peak memory usage, and enabling operation in environments with limited memory. This argument determines into how many pieces the entire input data will be divided for embedding. The larger the num_partitions, the smaller the peak memory required, but the longer the time required to complete embedding the entire input data.

  • Argument type: Required
  • Input type
    • int
  • Possible values
    • 1 (default), 2, 4, 8, 16, 32, 64, 128, 256, 512
  • Usage
    • num_partitions: 1
  • ui_args: O
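The possible values listed for dimension, num_epochs, and num_partitions can be checked before a run with a short helper. The sketch below only encodes the value lists from this page; the function name `check_graph_args` is hypothetical and ALO may validate these arguments differently.

```python
ALLOWED_DIMENSIONS = {2 ** k for k in range(2, 11)}   # 4, 8, ..., 1024
ALLOWED_PARTITIONS = {2 ** k for k in range(0, 10)}   # 1, 2, ..., 512


def check_graph_args(dimension=32, num_epochs=10, num_partitions=1):
    """Return a list of human-readable problems; an empty list means all values are allowed."""
    errors = []
    if dimension not in ALLOWED_DIMENSIONS:
        errors.append(f"dimension must be one of {sorted(ALLOWED_DIMENSIONS)}")
    if not 1 <= num_epochs <= 100:
        errors.append("num_epochs must be between 1 and 100")
    if num_partitions not in ALLOWED_PARTITIONS:
        errors.append(f"num_partitions must be one of {sorted(ALLOWED_PARTITIONS)}")
    return errors
```

Calling `check_graph_args()` with the defaults returns an empty list, while `check_graph_args(dimension=48)` reports one problem.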

use_gpu

If GPU usage is available, setting use_gpu to True allows using the GPU for graph embedding.

  • Argument type: Required
  • Input type
    • boolean
  • Possible values
    • True, False (default)
  • Usage
    • use_gpu: False
  • ui_args: X

workers

Specify the number of processes for parallel execution during graph embedding.

  • Argument type: Custom
  • Input type
    • int
  • Possible values
    • 0~inf, 1 (default)
  • Usage
    • workers: 1
  • ui_args: X

custom_connection_lhs

GCR and other graph-powered machine learning models improve accuracy by extracting useful information hidden in the data through graph representation learning (i.e., graph embedding). How well this information can be extracted depends heavily on whether the graph shape (topology) suits the characteristics of the data. The topology defines the relationships between data points, that is, the relationships between columns in the table-formatted input data.

By default, GCR provides a radial graph topology in which each sample's index is the central node, connected to each column node by an edge. If you have domain knowledge about the input data and can define additional relationships between columns that extract more effective information, use the custom_connection arguments. For example, if it is more effective to connect columns X1 and X2, and X3 and X4, specify [X1, X3] for custom_connection_lhs and [X2, X4] for custom_connection_rhs to update the topology.

Note that the number of unique values in each right-hand column (X2, X4) must be equal to or greater than num_partitions; otherwise, an error occurs during graph embedding. It is therefore safest to use custom connections when num_partitions is 1.

  • Argument type: Custom
  • Input type
    • list
  • Possible values
    • [] (default) or [X1, X2, ...]
    • Here, X1, X2 are the left-hand columns to be additionally connected.
  • Usage
    • custom_connection_lhs: [X1, X2, ...]
  • ui_args: X

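custom_connection_rhs

Specify the right-hand columns of the additional connections. Each entry is paired by position with the entry at the same index in custom_connection_lhs. See custom_connection_lhs above for the full description.
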
  • Argument type: Custom
  • Input type
    • list
  • Possible values
    • [] (default) or [X3, X4, ...]
    • Here, X3, X4 are the right-hand columns to be additionally connected.
  • Usage
    • custom_connection_rhs: [X3, X4, ...]
  • ui_args: X
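The pairing of the two lists and the num_partitions constraint can be illustrated with a short pre-check. This is a sketch under assumptions rather than GCR's own validation: the function name `check_custom_connections` is hypothetical, and the data is represented as a plain list of row dictionaries.

```python
def check_custom_connections(rows, lhs, rhs, num_partitions=1):
    """Pair lhs/rhs columns by position and verify the right-hand columns."""
    if len(lhs) != len(rhs):
        raise ValueError("custom_connection_lhs and custom_connection_rhs must have the same length")
    for column in rhs:
        unique_values = {row[column] for row in rows}
        # The number of unique right-hand values must be >= num_partitions
        if len(unique_values) < num_partitions:
            raise ValueError(
                f"column {column!r} has {len(unique_values)} unique values, "
                f"fewer than num_partitions={num_partitions}"
            )
    return list(zip(lhs, rhs))
```

With `lhs=["X1", "X3"]` and `rhs=["X2", "X4"]`, the function returns the pairs (X1, X2) and (X3, X4), matching the example in the custom_connection_lhs description.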

comparator

GCR's graph embedding adjusts the distances between nodes in the vector space based on their topology similarity over the specified number of epochs. The comparator is the function that measures this similarity. The supported comparators are dot, cos, l2, and squared_l2, with dot as the default. The appropriate choice depends on the nature of the problem being solved.

  • Argument type: Custom
  • Input type
    • string
  • Possible values
    • dot (default), cos, l2, squared_l2
  • Usage
    • comparator: dot
  • ui_args: X
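For intuition, the four comparator names can be written out as plain similarity functions on two embedding vectors. This is a minimal sketch, not GCR's implementation: it assumes that the distance-based comparators (l2, squared_l2) are negated so that a larger score always means "more similar", and the `COMPARATORS` lookup table is purely illustrative.

```python
import math


def dot(u, v):
    """Dot product: grows with both alignment and vector length."""
    return sum(a * b for a, b in zip(u, v))


def cos(u, v):
    """Cosine similarity: alignment only, independent of vector length."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def squared_l2(u, v):
    """Negative squared Euclidean distance (assumption: negated to act as a similarity)."""
    return -sum((a - b) ** 2 for a, b in zip(u, v))


def l2(u, v):
    """Negative Euclidean distance (assumption: negated to act as a similarity)."""
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


COMPARATORS = {"dot": dot, "cos": cos, "l2": l2, "squared_l2": squared_l2}
```

A setting such as `comparator: cos` would then correspond to `COMPARATORS["cos"]` in this sketch.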

loss_fn

GCR's graph embedding uses a negative sampling technique. The given input data is treated as positive samples, and the comparator is trained to increase their similarity. Simultaneously, negative samples (unlikely hypothetical data) are generated to decrease their similarity. The function that determines the difference in similarity between positive and negative samples across the entire graph is the loss_fn. Possible loss_fn options include ranking, logistic, and softmax, and the appropriate choice depends on the nature of the problem being solved.

  • Argument type: Custom
  • Input type
    • string
  • Possible values
    • ranking, logistic, softmax (default)
  • Usage
    • loss_fn: softmax
  • ui_args: X
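The three loss names can be sketched as functions of one positive-sample score and a list of negative-sample scores, where the scores come from the comparator. These are common textbook forms given for illustration only; GCR's exact formulas may differ, and the `margin` parameter and its default are assumptions.

```python
import math


def _softplus(x):
    """log(1 + exp(x))."""
    return math.log1p(math.exp(x))


def ranking_loss(pos, negs, margin=0.1):
    """Margin ranking: penalize each negative scoring within `margin` of the positive."""
    return sum(max(0.0, margin - pos + n) for n in negs)


def logistic_loss(pos, negs):
    """Binary logistic: push the positive score up and every negative score down."""
    return _softplus(-pos) + sum(_softplus(n) for n in negs)


def softmax_loss(pos, negs):
    """Cross-entropy of picking the positive among positive + negatives."""
    scores = [pos] + list(negs)
    top = max(scores)  # subtract the maximum for numerical stability
    log_sum = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_sum - pos
```

In all three forms the loss shrinks as the positive score rises above the negative scores, which is the behavior the description above attributes to negative sampling.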

lr

The learning rate applied during GCR's graph embedding. GCR performs graph embedding through PyTorch-based deep learning, so this argument plays the same role as the learning rate in general deep learning.

  • Argument type: Custom
  • Input type
    • float
  • Possible values
    • A real number greater than 0 and less than 1. 0.01 (default)
  • Usage
    • lr: 0.01
  • ui_args: X

batch_size

The batch size applied during GCR's graph embedding. GCR performs graph embedding through PyTorch-based deep learning, so this argument plays the same role as the batch size in general deep learning.

  • Argument type: Custom
  • Input type
    • int
  • Possible values
    • A positive integer. 1000 (default)
  • Usage
    • batch_size: 1000
  • ui_args: X

Train asset

task

Specify whether GCR's task is classification or regression.

  • Argument type: Required
  • Input type
    • string
  • Possible values
    • classification (default), regression
  • Usage
    • task: classification
  • ui_args: O

eval_metric

Select the evaluation metric for choosing the best model during HPO. If the task argument is classification, f1_score, accuracy, precision, and recall can be selected. If the task is regression, only rmse can be selected.

  • Argument type: Required
  • Input type
    • string
  • Possible values
    • If task is classification: f1_score (default), accuracy, precision, recall
    • If task is regression: rmse
  • Usage
    • eval_metric: f1_score
  • ui_args: O
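Because the allowed eval_metric values depend on task, a quick consistency check before running HPO can catch a mismatched pair. The sketch below encodes only the combinations listed above; the names `ALLOWED_METRICS` and `check_eval_metric` are hypothetical and not part of the Train asset.

```python
ALLOWED_METRICS = {
    "classification": ("f1_score", "accuracy", "precision", "recall"),
    "regression": ("rmse",),
}


def check_eval_metric(task, eval_metric):
    """Return eval_metric if it is valid for the task, otherwise raise ValueError."""
    if task not in ALLOWED_METRICS:
        raise ValueError(f"task must be one of {sorted(ALLOWED_METRICS)}")
    if eval_metric not in ALLOWED_METRICS[task]:
        raise ValueError(
            f"eval_metric {eval_metric!r} is not available for task {task!r}; "
            f"choose from {list(ALLOWED_METRICS[task])}"
        )
    return eval_metric
```

Note that switching task to regression while leaving the default `eval_metric: f1_score` in place fails this check, so both arguments should be changed together.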

num_hpo

Specify the number of trials for HPO.

  • Argument type: Required
  • Input type
    • int
  • Possible values
    • A positive integer. 20 (default)
  • Usage
    • num_hpo: 20
  • ui_args: O

Inference asset

global_xai

Specify whether to generate a global XAI result report file for the train set. If enabled, the file train_artifacts/models/train/global_feature_importance.csv is created in the working directory where ALO's main.py is located.

  • Argument type: Required
  • Input type
    • boolean
  • Possible values
    • True, False (default)
  • Usage
    • global_xai: False
  • ui_args: O

local_xai

Specify whether to perform local XAI for the inference set. If enabled, LIME-based XAI results are generated for all samples in the inference set and added as new columns to the input data. The added columns are as follows. This feature is currently only provided for classification tasks.

The example below shows the local XAI output for binary classification with inference data having columns X1~X9.

| Sample Index | classificationResult | label category scores | top 5 reasons (column names and their values for the current sample) |
|:---:|:---:|:---:|:---:|
| 0 | 0 | 0.77, 0.23 | X1=0.1, X3=0.7, X4='A', X5='S', X9=0.02 |
| 1 | 0 | 0.65, 0.35 | X3=0.6, X2=0.2, X1=0.7, X4='B', X8=0.01 |
| 2 | 1 | 0.83, 0.17 | X4='B', X5='P', X9=0.07, X7='S', X1=0.3 |

  • Argument type: Required
  • Input type
    • boolean
  • Possible values
    • True, False (default)
  • Usage
    • local_xai: False
  • ui_args: O