GCR Parameter
Overview of experimental_plan.yaml
To apply AI content to your data, you need to enter the data information and the desired content functions into the experimental_plan.yaml file. After installing the AI content in the solution folder, you can find a pre-written experimental_plan.yaml file for each content under the solution folder. By entering the 'data information' and modifying/adding the 'user arguments' provided by each asset in this YAML file, you can run ALO to generate a data analysis model with the desired settings.
Structure of experimental_plan.yaml
The experimental_plan.yaml includes the various settings needed to run ALO. By modifying the 'data path' and 'user arguments' among these settings, you can use the AI content immediately.
Inputting Data Paths (external_path)
- The external_path parameter is used to specify the path of files to be loaded or the path where files will be saved. If save_train_artifacts_path and save_inference_artifacts_path are not specified, the modeling artifacts are saved to the default paths, the train_artifacts and inference_artifacts folders, respectively.
external_path:
- load_train_data_path: ./solution/sample_data/train
- load_inference_data_path: ./solution/sample_data/test
- save_train_artifacts_path:
- save_inference_artifacts_path:
Parameter Name | DEFAULT | Description and Options |
---|---|---|
load_train_data_path | ./sample_data/train/ | Enter the folder path where the training data is located (do not include the csv file name). All csv files under the specified path are concatenated. |
load_inference_data_path | ./sample_data/test/ | Enter the folder path where the inference data is located (do not include the csv file name). All csv files under the specified path are concatenated. |
*All files under the specified path, including those in subfolders, are concatenated.
*All columns in the files to be concatenated must be identical.
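If you want the artifacts saved somewhere other than the default folders, fill in the save paths as well. A minimal sketch of a filled-in external_path (the save paths shown are placeholders, not shipped defaults):
external_path:
- load_train_data_path: ./solution/sample_data/train
- load_inference_data_path: ./solution/sample_data/test
- save_train_artifacts_path: ./custom_artifacts/train/        # placeholder path
- save_inference_artifacts_path: ./custom_artifacts/inference/  # placeholder path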
User Parameters (user_parameters)
- The step under user_parameters refers to the asset name. For example, step: input refers to the input asset stage.
- args refers to the user arguments of that asset (step: input). User arguments are data analysis-related setting parameters provided by each asset. Refer to the User arguments description below for details.
user_parameters:
- train_pipeline:
    - step: input
      args:
        - file_type
          ...
      ui_args:
        ...
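As a hedged sketch of the skeleton above once it is filled in, using the input asset's required arguments and their defaults documented later in this guide (the indentation and the names listed under ui_args are assumptions; check your installed experimental_plan.yaml):
user_parameters:
- train_pipeline:
    - step: input
      args:
        - file_type: csv      # default documented below
          encoding: utf-8     # default documented below
      ui_args:
        - file_type           # assumed: argument names exposed to the AI Conductor UI
        - encoding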
User arguments explanation
What are User arguments?
User arguments are parameters for setting the operations of each asset, which are entered under args in the respective asset steps of the experimental_plan.yaml. Each asset in the AI content pipeline provides user arguments to apply various functions to the data. Refer to the guide below to change or add user arguments to create a model that fits your data.
User arguments are divided into 'required arguments' that are pre-written in the experimental_plan.yaml and 'custom arguments' that users can add by referring to the guide.
Required arguments
- Required arguments are the basic arguments that are immediately visible in the experimental_plan.yaml. Most required arguments have default values pre-set in the YAML file.
- Users must enter values for the data-related arguments among the required arguments in the experimental_plan.yaml (e.g., x_columns, y_column).
Custom arguments
- Custom arguments are functions provided by the asset but not listed in the experimental_plan.yaml. Users can add these arguments under the respective asset's args in the YAML file (see the sketch below).
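For example, to add the workers custom argument (described later in this guide) to the graph asset, append it under that step's args. A hedged sketch, assuming the graph asset's step is named graph in your experimental_plan.yaml:
- step: graph               # step name assumed
  args:
    - dimension: 32         # required argument, default value
      num_epochs: 10        # required argument, default value
      workers: 2            # custom argument added by the user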
The GCR pipeline consists of the Input - Readiness - Graph - Modeling (train/inference) - Output assets, and the user arguments are configured differently for each asset's function. First, try modeling with the default required argument settings in the experimental_plan.yaml, and then add user arguments to create a GCR model that fits your data perfectly!
Summary of User arguments
Below is a summary of the user arguments for GCR. Click on the 'Argument Name' to navigate to its detailed explanation.
Default
- The 'Default' column indicates the default value of the user argument.
- If there is no default value, it is marked with '-'.
- If the default value is to leave it empty, it is marked as ' '.
- If there is logic behind the default value, it is marked as 'Refer to the description'. Click on the 'Argument Name' to see the detailed explanation.
ui_args
- The 'ui_args' column indicates whether the ui_args function is supported, allowing the argument value to be changed in the AI Conductor UI.
- O: If you enter the argument name under ui_args in the experimental_plan.yaml, you can change the argument value in the AI Conductor UI.
- X: The ui_args function is not supported.
- For a detailed explanation of ui_args, please refer to the following guide: Write UI Parameter
User Configuration Required
- The 'User Configuration Required' column indicates whether the user must check and change the argument before running the AI content.
- O: Generally, task and data-related information that users need to input before modeling.
- X: If the user does not change the value, the default value is used for modeling.
Asset Name | Argument Type | Argument Name | Default | Description | User Configuration Required | ui_args |
---|---|---|---|---|---|---|
Input | Required | file_type | csv | Input data file extension. | X | O |
Input | Required | encoding | utf-8 | Input data encoding type. | X | O |
Readiness | Required | x_columns | ' ' | List of x column names to be used for training. If left blank, all columns except y_column are used. | X | O |
Readiness | Required | drop_columns | ' ' | List of column names to exclude from x columns. | X | O |
Readiness | Required | y_column | - | Name of the y column. | O | O |
Graph | Required | dimension | 32 | Number of dimensions for graph embeddings. | X | O |
Graph | Required | num_epochs | 10 | Number of training epochs for graph embeddings algorithm. | X | O |
Graph | Required | num_partitions | 1 | Number of partitions to divide the input data for embedding. | X | O |
Graph | Required | use_gpu | False | Whether to use GPU for graph embedding in a GPU-available environment. | X | X |
Graph | Custom | workers | 1 | Number of processes for parallel execution during graph embedding. | X | X |
Graph | Custom | custom_connection_lhs | ' ' | Left-hand columns to be connected based on domain knowledge. | X | X |
Graph | Custom | custom_connection_rhs | ' ' | Right-hand columns to be connected based on domain knowledge. | X | X |
Graph | Custom | comparator | dot | Function to compare the similarity of two embeddings during graph embedding. | X | X |
Graph | Custom | loss_fn | softmax | Loss function for training during graph embedding. | X | X |
Graph | Custom | lr | 0.01 | Learning rate for training during graph embedding. | X | X |
Graph | Custom | batch_size | 1000 | Batch size for training during graph embedding. | X | X |
Train | Required | task | classification | Type of prediction task. | X | O |
Train | Required | eval_metric | f1_score | Evaluation metric for selecting the best model during HPO. | X | O |
Train | Required | num_hpo | 20 | Number of HPO trials. | X | O |
Inference | Required | global_xai | False | Whether to perform global XAI during inference. | X | O |
Inference | Required | local_xai | False | Whether to perform local XAI during inference. | X | O |
Detailed Explanation of User arguments
Input asset
file_type
Specify the file extension of the input data. Currently, AI Solution development only supports csv files.
- Argument type: Required
- Input type
- string
- Possible values
- csv (default)
- Usage
- file_type: csv
- ui_args: O
encoding
Specify the encoding type of the input data. Currently, AI Solution development only supports utf-8 encoding.
- Argument type: Required
- Input type
- string
- Possible values
- utf-8 (default)
- Usage
- encoding: utf-8
- ui_args: O
Readiness asset
x_columns
Enter the list of x column names in the dataframe. If left blank, all columns except y_column are used as x columns.
- Argument type: Required
- Input type
- list
- Possible values
- Empty (default) or list of column names
- Usage
- x_columns: [col1, col2]
- ui_args: O
drop_columns
Enter the list of column names to exclude from x columns in the dataframe. If left blank, it means there are no columns to exclude.
- Argument type: Required
- Input type
- list
- Possible values
- Empty (default) or list of column names
- Usage
- drop_columns: [col1, col2]
- ui_args: O
y_column
Enter the name of the y column (label column) in the dataframe. The user must input the appropriate column name according to the data.
- Argument type: Required
- Input type
- string
- Possible values
- column name
- Usage
- y_column: target
- ui_args: O
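Putting the three readiness arguments together, a hedged sketch of the readiness step (the step name readiness and the column names are placeholders for your own data):
- step: readiness           # step name assumed
  args:
    - x_columns: [col1, col2, col3, col4]   # placeholder column names
      drop_columns: [col4]                  # placeholder column name
      y_column: target                      # placeholder label column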
Graph asset
dimension
Determine how many dimensions each column in the input data will be embedded into during graph embedding. Higher dimensions give better vector separation, which can increase model accuracy but requires more memory and a longer embedding time.
- Argument type: Required
- Input type
- int
- Possible values
- 4, 8, 16, 32 (default), 64, 128, 256, 512, 1024
- Usage
- dimension: 32
- ui_args: O
num_epochs
Determine how many training epochs are run during graph embedding. Typically, the loss value saturates after about 10 epochs.
- Argument type: Required
- Input type
- int
- Possible values
- 1~100 (default: 10)
- Usage
- num_epochs: 10
- ui_args: O
num_partitions
GCR can perform graph embedding by dividing the entire input data into multiple pieces, reducing peak memory usage, and enabling operation in environments with limited memory. This argument determines into how many pieces the entire input data will be divided for embedding. The larger the num_partitions, the smaller the peak memory required, but the longer the time required to complete embedding the entire input data.
- Argument type: Required
- Input type
- int
- Possible values
- 1 (default), 2, 4, 8, 16, 32, 64, 128, 256, 512
- Usage
- num_partitions: 1
- ui_args: O
use_gpu
If a GPU is available, setting use_gpu to True uses the GPU for graph embedding.
- Argument type: Required
- Input type
- boolean
- Possible values
- True, False (default)
- Usage
- use_gpu: False
- ui_args: X
workers
Specify the number of processes for parallel execution during graph embedding.
- Argument type: Custom
- Input type
- int
- Possible values
- 0~inf, 1 (default)
- Usage
- workers: 1
- ui_args: X
custom_connection_lhs
GCR and other graph-powered machine learning models improve model accuracy by extracting useful information hidden in the data through graph representation learning (i.e., graph embedding). How well this useful information can be extracted is greatly influenced by the graph shape, i.e., the topology, appropriate for the data characteristics. The topology defines the relationships between data points, i.e., the relationships between columns in the table-formatted input data. GCR provides a default radial graph topology with each sample's index as the central node, connected to each column node through edges.
However, if the user has domain knowledge about the input data and can define additional relationships between columns to extract more effective information, the custom_connection arguments are provided. For example, if it is deemed more effective to connect columns X1 and X2, and X3 and X4, you can specify [X1, X3] for custom_connection_lhs and [X2, X4] for custom_connection_rhs to update the topology (see the example after custom_connection_rhs below). Note that the number of unique values in the right-hand columns (X2, X4) must be equal to or greater than num_partitions; otherwise, an error will occur during graph embedding. It is therefore safest to apply this method when num_partitions is 1.
- Argument type: Custom
- Input type
- list
- Possible values
- [] (default) or [X1, X2, ...]
- Here, X1, X2 are the left-hand columns to be additionally connected.
- Usage
- custom_connection_lhs: [X1, X2, ...]
- ui_args: X
custom_connection_rhs
Enter the right-hand columns that are paired, in order, with the columns listed in custom_connection_lhs. See custom_connection_lhs for a detailed explanation.
- Argument type: Custom
- Input type
- list
- Possible values
- [] (default) or [X3, X4, ...]
- Here, X3, X4 are the right-hand columns to be additionally connected.
- Usage
- custom_connection_rhs: [X3, X4, ...]
- ui_args: X
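The X1-X2 and X3-X4 connections described under custom_connection_lhs would be written together as follows. A hedged sketch, assuming the graph step is named graph and keeping num_partitions at 1 as recommended:
- step: graph               # step name assumed
  args:
    - num_partitions: 1
      custom_connection_lhs: [X1, X3]   # left-hand columns (placeholders)
      custom_connection_rhs: [X2, X4]   # right-hand columns paired with the lhs entries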
comparator
GCR's graph embedding adjusts the distances between nodes in the vector space based on their topology similarity over a specified number of epochs. The function that determines the similarity of the topology is the comparator. Comparators supported include dot, cos, l2, and squared_l2, with dot as the default. The choice of comparator depends on the nature of the problem being solved.
- Argument type: Custom
- Input type
- string
- Possible values
- dot (default), cos, l2, squared_l2
- Usage
- comparator: dot
- ui_args: X
loss_fn
GCR's graph embedding uses a negative sampling technique. The given input data is treated as positive samples, and the comparator is trained to increase their similarity. Simultaneously, negative samples (unlikely hypothetical data) are generated to decrease their similarity. The function that determines the difference in similarity between positive and negative samples across the entire graph is the loss_fn. Possible loss_fn options include ranking, logistic, and softmax, and the appropriate choice depends on the nature of the problem being solved.
- Argument type: Custom
- Input type
- string
- Possible values
- ranking, logistic, softmax (default)
- Usage
- loss_fn: softmax
- ui_args: X
lr
The learning rate applied during GCR's graph embedding. Since GCR performs graph embedding with PyTorch-based deep learning, this is the same learning rate concept used in general deep learning.
- Argument type: Custom
- Input type
- float
- Possible values
- A real number greater than 0 and less than 1. 0.01 (default)
- Usage
- lr: 0.01
- ui_args: X
batch_size
The batch size applied during GCR's graph embedding. Since GCR performs graph embedding with PyTorch-based deep learning, this is the same batch size concept used in general deep learning.
- Argument type: Custom
- Input type
- int
- Possible values
- A positive integer. 1000 (default)
- Usage
- batch_size: 1000
- ui_args: X
Train asset
task
Specify whether GCR's task is classification or regression.
- Argument type: Required
- Input type
- string
- Possible values
- classification (default), regression
- Usage
- task: classification
- ui_args: O
eval_metric
Select the evaluation metric for choosing the best model during HPO. If the task argument is classification, f1_score, accuracy, precision, and recall can be selected. If the task is regression, only rmse can be selected.
- Argument type: Required
- Input type
- string
- Possible values
- If task is classification: f1_score (default), accuracy, precision, recall
- If task is regression: rmse
- Usage
- eval_metric: f1_score
- ui_args: O
num_hpo
Specify the number of trials for HPO.
- Argument type: Required
- Input type
- int
- Possible values
- A positive integer. 20 (default)
- Usage
- num_hpo: 20
- ui_args: O
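Putting the train asset's required arguments together, a hedged sketch with the documented default values (the step name train is an assumption; check your experimental_plan.yaml):
- step: train               # step name assumed
  args:
    - task: classification
      eval_metric: f1_score
      num_hpo: 20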
Inference asset
global_xai
Specify whether to generate a global XAI result report file for the train set. If enabled, the file train_artifacts/models/train/global_feature_importance.csv is created in the working directory where ALO's main.py is located.
- Argument type: Required
- Input type
- boolean
- Possible values
- True, False (default)
- Usage
- global_xai: False
- ui_args: O
local_xai
Specify whether to perform local XAI for the inference set. If enabled, LIME-based XAI results are generated for all samples in the inference set and added as new columns to the input data. The added columns are as follows. This feature is currently only provided for classification tasks.
The example below shows the local XAI output for binary classification with inference data having columns X1~X9.
Sample Index | classificationResult | label category scores | top 5 reasons (column names and their values for the current sample) |
:---:|:---:|:---:|:---:|
0 | 0 | 0.77, 0.23 | X1=0.1, X3=0.7, X4='A', X5='S', X9=0.02 |
1 | 0 | 0.65, 0.35 | X3=0.6, X2=0.2, X1=0.7, X4='B', X8=0.01 |
2 | 1 | 0.83, 0.17 | X4='B', X5='P', X9=0.07, X7='S', X1=0.3 |
- Argument type: Required
- Input type
- boolean
- Possible values
- True, False (default)
- Usage
- local_xai: False
- ui_args: O
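Putting the inference asset's arguments together, a hedged sketch that enables both XAI reports under the inference pipeline (the step name inference is an assumption; check your experimental_plan.yaml):
- step: inference           # step name assumed
  args:
    - global_xai: True
      local_xai: True       # local XAI is currently supported for classification tasks only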