GCR Parameter
Overview of experimental_plan.yaml
To apply AI content to your data, you need to enter the data information and the desired content functions into the experimental_plan.yaml file. After installing the AI content in the solution folder, you can find a pre-written experimental_plan.yaml file for each content under the solution folder. By entering the 'data information' and modifying/adding the 'user arguments' provided by each asset in this YAML file, you can run ALO to generate a data analysis model with the desired settings.
Structure of experimental_plan.yaml
The experimental_plan.yaml includes the various settings needed to run ALO. By modifying the 'data path' and 'user arguments' among these settings, you can use the AI content immediately.
Inputting Data Paths (external_path)
- The external_path parameter is used to specify the path of files to be loaded or the path where files will be saved. If save_train_artifacts_path and save_inference_artifacts_path are not specified, the modeling artifacts are saved to the default paths, the train_artifacts and inference_artifacts folders, respectively.
external_path:
- load_train_data_path: ./solution/sample_data/train
- load_inference_data_path: ./solution/sample_data/test
- save_train_artifacts_path:
- save_inference_artifacts_path:
Parameter Name | DEFAULT | Description and Options |
---|---|---|
load_train_data_path | ./sample_data/train/ | Enter the folder path where the training data is located (do not include the csv file name). All csv files under the specified path are concatenated. |
load_inference_data_path | ./sample_data/test/ | Enter the folder path where the inference data is located (do not include the csv file name). All csv files under the specified path are concatenated. |
*All files under the specified path, including those in subfolders, are concatenated.
*All columns in the files to be concatenated must be identical.
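If you want the artifacts saved somewhere other than the default folders, fill in the save paths as well. A minimal sketch of a filled-in external_path (the save paths shown are placeholders, not shipped defaults):
external_path:
- load_train_data_path: ./solution/sample_data/train
- load_inference_data_path: ./solution/sample_data/test
- save_train_artifacts_path: ./custom_artifacts/train/        # placeholder path
- save_inference_artifacts_path: ./custom_artifacts/inference/  # placeholder path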
User Parameters (user_parameters)
- The step under user_parameters refers to the asset name. For example, step: input refers to the input asset stage.
- args refers to the user arguments of that asset (step: input). User arguments are data analysis-related setting parameters provided by each asset. Refer to the User arguments description below for details.
user_parameters:
- train_pipeline:
    - step: input
      args:
        - file_type
          ...
      ui_args:
        ...
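As a hedged sketch of the skeleton above once it is filled in, using the input asset's required arguments and their defaults documented later in this guide (the indentation and the names listed under ui_args are assumptions; check your installed experimental_plan.yaml):
user_parameters:
- train_pipeline:
    - step: input
      args:
        - file_type: csv      # default documented below
          encoding: utf-8     # default documented below
      ui_args:
        - file_type           # assumed: argument names exposed to the AI Conductor UI
        - encoding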
User arguments explanation
What are User arguments?
User arguments are parameters for setting the operations of each asset, which are entered under args in the respective asset steps of the experimental_plan.yaml. Each asset in the AI content pipeline provides user arguments to apply various functions to the data. Refer to the guide below to change or add user arguments to create a model that fits your data.
User arguments are divided into 'required arguments' that are pre-written in the experimental_plan.yaml and 'custom arguments' that users can add by referring to the guide.
Required arguments
- Required arguments are the basic arguments that are immediately visible in the experimental_plan.yaml. Most required arguments have default values pre-set in the YAML file.
- Users must enter values for the data-related arguments among the required arguments in the experimental_plan.yaml (e.g., x_columns, y_column).
Custom arguments
- Custom arguments are functions provided by the asset but not listed in the experimental_plan.yaml. Users can add these arguments under the respective asset's args in the YAML file (see the sketch below).
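For example, to add the workers custom argument (described later in this guide) to the graph asset, append it under that step's args. A hedged sketch, assuming the graph asset's step is named graph in your experimental_plan.yaml:
- step: graph               # step name assumed
  args:
    - dimension: 32         # required argument, default value
      num_epochs: 10        # required argument, default value
      workers: 2            # custom argument added by the user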
The GCR pipeline consists of the Input - Readiness - Graph - Modeling (train/inference) - Output assets, and the user arguments are configured differently for each asset's function. First, try modeling with the default required argument settings in the experimental_plan.yaml, and then add user arguments to create a GCR model that fits your data perfectly!
Summary of User arguments
Below is a summary of the user arguments for GCR. Click on the 'Argument Name' to navigate to its detailed explanation.
Default
- The 'Default' column indicates the default value of the user argument.
- If there is no default value, it is marked with '-'.
- If the default value is to leave it empty, it is marked as ' '.
- If there is logic behind the default value, it is marked as 'Refer to the description'. Click on the 'Argument Name' to see the detailed explanation.
ui_args
- The 'ui_args' column indicates whether the ui_args function is supported, allowing the argument value to be changed in the AI Conductor UI.
- O: If you enter the argument name under ui_args in the experimental_plan.yaml, you can change the argument value in the AI Conductor UI.
- X: The ui_args function is not supported.
- For a detailed explanation of ui_args, please refer to the following guide: Write UI Parameter
User Configuration Required
- The 'User Configuration Required' column indicates whether the user must check and change the argument before running the AI content.
- O: Generally, task and data-related information that users need to input before modeling.
- X: If the user does not change the value, the default value is used for modeling.
Asset Name | Argument Type | Argument Name | Default | Description | User Configuration Required | ui_args |
---|---|---|---|---|---|---|
Input | Required | file_type | csv | Input data file extension. | X | O |
Input | Required | encoding | utf-8 | Input data encoding type. | X | O |
Readiness | Required | x_columns | ' ' | List of x column names to be used for training. If left blank, all columns except y_column are used. | X | O |
Readiness | Required | drop_columns | ' ' | List of column names to exclude from x columns. | X | O |
Readiness | Required | y_column | - | Name of the y column. | O | O |
Graph | Required | dimension | 32 | Number of dimensions for graph embeddings. | X | O |
Graph | Required | num_epochs | 10 | Number of training epochs for graph embeddings algorithm. | X | O |
Graph | Required | num_partitions | 1 | Number of partitions to divide the input data for embedding. | X | O |
Graph | Required | use_gpu | False | Whether to use GPU for graph embedding in a GPU-available environment. | X | X |
Graph | Custom | workers | 1 | Number of processes for parallel execution during graph embedding. | X | X |
Graph | Custom | custom_connection_lhs | ' ' | Left-hand columns to be connected based on domain knowledge. | X | X |
Graph | Custom | custom_connection_rhs | ' ' | Right-hand columns to be connected based on domain knowledge. | X | X |
Graph | Custom | comparator | dot | Function to compare the similarity of two embeddings during graph embedding. | X | X |
Graph | Custom | loss_fn | softmax | Loss function for training during graph embedding. | X | X |
Graph | Custom | lr | 0.01 | Learning rate for training during graph embedding. | X | X |
Graph | Custom | batch_size | 1000 | Batch size for training during graph embedding. | X | X |
Train | Required | task | classification | Type of prediction task. | X | O |
Train | Required | eval_metric | f1_score | Evaluation metric for selecting the best model during HPO. | X | O |
Train | Required | num_hpo | 20 | Number of HPO trials. | X | O |
Inference | Required | global_xai | False | Whether to perform global XAI during inference. | X | O |
Inference | Required | local_xai | False | Whether to perform local XAI during inference. | X | O |
Detailed Explanation of User arguments
Input asset
file_type
Specify the file extension of the input data. Currently, AI Solution development only supports csv files.
- Argument type: Required
- Input type
- string
- Possible values
- csv (default)
- Usage
- file_type: csv
- ui_args: O
encoding
Specify the encoding type of the input data. Currently, AI Solution development only supports utf-8 encoding.
- Argument type: Required
- Input type
- string
- Possible values
- utf-8 (default)
- Usage
- encoding: utf-8
- ui_args: O
Readiness asset
x_columns
Enter the list of x column names in the dataframe. If left blank, all columns except y_column are used as x columns.
- Argument type: Required
- Input type
- list
- Possible values
- Empty (default) or list of column names
- Usage
- x_columns: [col1, col2]
- ui_args: O
drop_columns
Enter the list of column names to exclude from x columns in the dataframe. If left blank, it means there are no columns to exclude.
- Argument type: Required
- Input type
- list
- Possible values
- Empty (default) or list of column names
- Usage
- drop_columns: [col1, col2]
- ui_args: O
y_column
Enter the name of the y column (label column) in the dataframe. The user must input the appropriate column name according to the data.
- Argument type: Required
- Input type
- string
- Possible values
- column name
- Usage
- y_column: target
- ui_args: O
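Putting the three readiness arguments together, a hedged sketch of the readiness step (the step name readiness and the column names are placeholders for your own data):
- step: readiness           # step name assumed
  args:
    - x_columns: [col1, col2, col3, col4]   # placeholder column names
      drop_columns: [col4]                  # placeholder column name
      y_column: target                      # placeholder label column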
Graph asset
dimension
Determine how many dimensions each column in the input data will be embedded into during graph embedding. Higher dimensions give better vector separation, which can increase model accuracy but requires more memory and a longer embedding time.
- Argument type: Required
- Input type
- int
- Possible values
- 4, 8, 16, 32 (default), 64, 128, 256, 512, 1024
- Usage
- dimension: 32
- ui_args: O
num_epochs
Determine how many training epochs are run during graph embedding. Typically, the loss value saturates after about 10 epochs.
- Argument type: Required
- Input type
- int
- Possible values
- 1~100 (default: 10)
- Usage
- num_epochs: 10
- ui_args: O
num_partitions
GCR can perform graph embedding by dividing the entire input data into multiple pieces, reducing peak memory usage, and enabling operation in environments with limited memory. This argument determines into how many pieces the entire input data will be divided for embedding. The larger the num_partitions, the smaller the peak memory required, but the longer the time required to complete embedding the entire input data.
- Argument type: Required
- Input type
- int
- Possible values
- 1 (default), 2, 4, 8, 16, 32, 64, 128, 256, 512
- Usage
- num_partitions: 1
- ui_args: O
use_gpu
If a GPU is available, setting use_gpu to True uses the GPU for graph embedding.
- Argument type: Required
- Input type
- boolean
- Possible values
- True, False (default)
- Usage
- use_gpu: False
- ui_args: X
workers
Specify the number of processes for parallel execution during graph embedding.
- Argument type: Custom
- Input type
- int
- Possible values
- 0~inf, 1 (default)
- Usage
- workers: 1
- ui_args: X
custom_connection_lhs
GCR and other graph-powered machine learning models improve model accuracy by extracting useful information hidden in the data through graph representation learning (i.e., graph embedding). How well this useful information can be extracted is greatly influenced by the graph shape, i.e., the topology, appropriate for the data characteristics. The topology defines the relationships between data points, i.e., the relationships between columns in the table-formatted input data. GCR provides a default radial graph topology with each sample's index as the central node, connected to each column node through edges.
However, if the user has domain knowledge about the input data and can define additional relationships between columns to extract more effective information, the custom_connection arguments are provided. For example, if it is deemed more effective to connect columns X1 and X2, and X3 and X4, you can specify [X1, X3] for custom_connection_lhs and [X2, X4] for custom_connection_rhs to update the topology (see the example after custom_connection_rhs below). Note that the number of unique values in the right-hand columns (X2, X4) must be equal to or greater than num_partitions; otherwise, an error will occur during graph embedding. It is therefore safest to apply this method when num_partitions is 1.
- Argument type: Custom
- Input type
- list
- Possible values
- [] (default) or [X1, X2, ...]
- Here, X1, X2 are the left-hand columns to be additionally connected.
- Usage
- custom_connection_lhs: [X1, X2, ...]
- ui_args: X
custom_connection_rhs
Enter the right-hand columns that are paired, in order, with the columns listed in custom_connection_lhs. See custom_connection_lhs for a detailed explanation.
- Argument type: Custom
- Input type
- list
- Possible values
- [] (default) or [X3, X4, ...]
- Here, X3, X4 are the right-hand columns to be additionally connected.
- Usage
- custom_connection_rhs: [X3, X4, ...]
- ui_args: X
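The X1-X2 and X3-X4 connections described under custom_connection_lhs would be written together as follows. A hedged sketch, assuming the graph step is named graph and keeping num_partitions at 1 as recommended:
- step: graph               # step name assumed
  args:
    - num_partitions: 1
      custom_connection_lhs: [X1, X3]   # left-hand columns (placeholders)
      custom_connection_rhs: [X2, X4]   # right-hand columns paired with the lhs entries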
comparator
GCR's graph embedding adjusts the distances between nodes in the vector space based on their topology similarity over a specified number of epochs. The function that determines the similarity of the topology is the comparator. Comparators supported include dot, cos, l2, and squared_l2, with dot as the default. The choice of comparator depends on the nature of the problem being solved.
- Argument type: Custom
- Input type
- string
- Possible values
- dot (default), cos, l2, squared_l2
- Usage
- comparator: dot
- ui_args: X
loss_fn
GCR's graph embedding uses a negative sampling technique. The given input data is treated as positive samples, and the comparator is trained to increase their similarity. Simultaneously, negative samples (unlikely hypothetical data) are generated to decrease their similarity. The function that determines the difference in similarity between positive and negative samples across the entire graph is the loss_fn. Possible loss_fn options include ranking, logistic, and softmax, and the appropriate choice depends on the nature of the problem being solved.
- Argument type: Custom
- Input type
- string
- Possible values
- ranking, logistic, softmax (default)
- Usage
- loss_fn: softmax
- ui_args: X
lr
The learning rate applied during GCR's graph embedding. Since GCR performs graph embedding with PyTorch-based deep learning, this is the same learning rate concept used in general deep learning.
- Argument type: Custom
- Input type
- float
- Possible values
- A real number greater than 0 and less than 1. 0.01 (default)
- Usage
- lr: 0.01
- ui_args: X
batch_size
The batch size applied during GCR's graph embedding. Since GCR performs graph embedding with PyTorch-based deep learning, this is the same batch size concept used in general deep learning.
- Argument type: Custom
- Input type
- int
- Possible values
- A positive integer. 1000 (default)
- Usage
- batch_size: 1000
- ui_args: X
Train asset
task
Specify whether GCR's task is classification or regression.
- Argument type: Required
- Input type
- string
- Possible values
- classification (default), regression
- Usage
- task: classification
- ui_args: O
eval_metric
Select the evaluation metric for choosing the best model during HPO. If the task argument is classification, f1_score, accuracy, precision, and recall can be selected. If the task is regression, only rmse can be selected.
- Argument type: Required
- Input type
- string
- Possible values
- If task is classification: f1_score (default), accuracy, precision, recall
- If task is regression: rmse
- Usage
- eval_metric: f1_score
- ui_args: O
num_hpo
Specify the number of trials for HPO.
- Argument type: Required
- Input type
- int
- Possible values
- A positive integer. 20 (default)
- Usage
- num_hpo: 20
- ui_args: O
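Putting the train asset's required arguments together, a hedged sketch with the documented default values (the step name train is an assumption; check your experimental_plan.yaml):
- step: train               # step name assumed
  args:
    - task: classification
      eval_metric: f1_score
      num_hpo: 20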
Inference asset
global_xai
Specify whether to generate a global XAI result report file for the train set. If enabled, the file train_artifacts/models/train/global_feature_importance.csv is created in the working directory where ALO's main.py is located.
- Argument type: Required
- Input type
- boolean
- Possible values
- True, False (default)
- Usage
- global_xai: False
- ui_args: O
local_xai
Specify whether to perform local XAI for the inference set. If enabled, LIME-based XAI results are generated for all samples in the inference set and added as new columns to the input data. The added columns are as follows. This feature is currently only provided for classification tasks.
The example below shows the local XAI output for binary classification with inference data having columns X1~X9.
Sample Index | classificationResult | label category scores | top 5 reasons (column names and their values for the current sample) |
:---:|:---:|:---:|:---:|
0 | 0 | 0.77, 0.23 | X1=0.1, X3=0.7, X4='A', X5='S', X9=0.02 |
1 | 0 | 0.65, 0.35 | X3=0.6, X2=0.2, X1=0.7, X4='B', X8=0.01 |
2 | 1 | 0.83, 0.17 | X4='B', X5='P', X9=0.07, X7='S', X1=0.3 |
- Argument type: Required
- Input type
- boolean
- Possible values
- True, False (default)
- Usage
- local_xai: False
- ui_args: O
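Putting the inference asset's arguments together, a hedged sketch that enables both XAI reports under the inference pipeline (the step name inference is an assumption; check your experimental_plan.yaml):
- step: inference           # step name assumed
  args:
    - global_xai: True
      local_xai: True       # local XAI is currently supported for classification tasks only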