Version: Next

TCR Parameter

Updated 2024.12.27

Overview of experimental_plan.yaml

To apply AI Contents to your data, you need to provide information about the data and the functionalities you wish to use in the experimental_plan.yaml file. Once you install AI Contents in the solution folder, you can find the pre-written experimental_plan.yaml file for each content under the solution folder. By inputting 'data information' and modifying/adding 'user arguments' for each asset in this YAML file, you can execute ALO to create a data analysis model with the desired settings.

Structure of experimental_plan.yaml

The experimental_plan.yaml file contains various settings required to run ALO. By modifying the 'data path' and 'user arguments' parts, you can immediately use AI Contents.

Input Data Path (external_path)

The external_path parameter is used to specify the path of the file to be loaded or the path where the file will be saved. If save_train_artifacts_path and save_inference_artifacts_path are not specified, the modeling artifacts will be saved in the default folders train_artifacts and inference_artifacts, respectively.

external_path:
- load_train_data_path: ./solution/sample_data/train
- load_inference_data_path: ./solution/sample_data/test
- save_train_artifacts_path:
- save_inference_artifacts_path:
| Parameter Name | Default | Description |
| --- | --- | --- |
| load_train_data_path | ./sample_data/train/ | Specify the folder path where the training data is located. (Do not enter the CSV file name.) All CSV files in the specified path will be concatenated. |
| load_inference_data_path | ./sample_data/test/ | Specify the folder path where the inference data is located. (Do not enter the CSV file name.) All CSV files in the specified path will be concatenated. |

* Files in subfolders under the specified path will also be included.
* All columns in the files to be concatenated must be identical.
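The loading rule above (concatenate every CSV under the folder, subfolders included, requiring identical columns) can be sketched in Python. `load_folder` is a hypothetical helper for illustration, not the actual ALO loader:

```python
import csv
from pathlib import Path

def load_folder(path):
    """Concatenate all CSV files under `path`, including subfolders.
    Raises if any file's columns differ from the first file's columns."""
    rows, header = [], None
    for f in sorted(Path(path).rglob("*.csv")):
        with open(f, newline="", encoding="utf-8") as fh:
            reader = csv.reader(fh)
            cols = next(reader)
            if header is None:
                header = cols
            elif cols != header:
                # All files to be concatenated must have identical columns.
                raise ValueError(f"column mismatch in {f.name}")
            rows.extend(reader)
    return header, rows
```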

User Parameters (user_parameters)

  • The step under user_parameters refers to the asset name. For example, step: input indicates the input asset stage.
  • args refers to the user arguments of the input asset (step: input). User arguments are data analysis-related setting parameters provided for each asset. See the User Arguments Description below for more details.
user_parameters:
    - train_pipeline:
        - step: input
          args:
              - file_type
              ...
          ui_args:
              ...

User Arguments Description

What are User Arguments?

User arguments are parameters for configuring the operation of each asset, written under the args of each asset step in the experimental_plan.yaml. Each asset constituting the AI Contents pipeline provides user arguments to apply various functionalities to the data. Users can refer to this guide to modify and add user arguments to create a model that fits their data.

User arguments are divided into 'required arguments' pre-written in the experimental_plan.yaml and 'custom arguments' added by the user referring to the guide.

Required Arguments

  • Required arguments are the basic arguments immediately visible in the experimental_plan.yaml. Most required arguments have default values set in the YAML file.
  • For some required arguments related to data, users must set the values. (e.g., x_columns, y_column)

Custom Arguments

  • Custom arguments are functionalities provided by assets but not written in the experimental_plan.yaml. Users can add them under the args of each asset in the YAML file.

The TCR pipeline consists of Input - Readiness - Preprocess - Modeling (train/inference) - Output assets, and user arguments are configured differently for each asset's functionality. First, try modeling with the required arguments settings written in the experimental_plan.yaml and then add user arguments to create a TCR model that perfectly fits your data!


Summary of User Arguments

Below is a summary of TCR's user arguments. Click on the 'Argument Name' to move to the detailed explanation of the argument. Currently, TCR provides user arguments only for the train pipeline. The inference pipeline automatically adopts the arguments settings used in the train pipeline. Therefore, you only need to write user arguments in the train pipeline.

Default

  • The 'Default' item indicates the default value of the user argument.
  • If there is no default value, it is indicated by '-'.
  • If there is a logic-based default, it is indicated by 'Refer to description'. Click on the 'Argument Name' to see the detailed explanation.

ui_args

  • The 'ui_args' in the table below indicate whether the ui_args functionality supports changing the argument values from the AI Conductor UI.
  • O: The argument value can be changed from the AI Conductor UI by writing the argument name under ui_args in the experimental_plan.yaml.
  • X: The ui_args functionality is not supported.
  • For detailed explanation of ui_args, refer to the Write UI Parameter guide.

Mandatory User Setting

  • The 'Mandatory User Setting' column indicates whether users must check and change the user argument to run AI Contents.
  • O: Typically, it contains arguments where users must input information about the task and data before modeling.
  • X: If the user does not change the value, modeling will proceed with the default value.
| Asset Name | Argument Type | Argument Name | Default | Description | Mandatory User Setting | ui_args |
| --- | --- | --- | --- | --- | --- | --- |
| Input | Required | file_type | csv | Specify the file extension of the input data. | X | O |
| Input | Required | encoding | utf-8 | Specify the encoding type of the input data. | X | O |
| Readiness | Required | x_columns | - | Enter the names of the x columns for training. | O | O |
| Readiness | Required | y_column | - | Enter the name of the y column. | O | O |
| Readiness | Required | task_type | classification | Specify whether it is a classification or regression task. | O | O |
| Readiness | Required | target_label | _major | Enter the class name used as the metric calculation criterion for HPO. | X | X |
| Readiness | Required | column_types | auto | Enter the column types (categorical/numeric). 'auto' provides an automatic column type classification feature. | X | X |
| Readiness | Required | report | True | A summary CSV for the train/inference data will be generated. | X | O |
| Readiness | Custom | drop_x_columns | - | Use this instead of x_columns if there are many column names to enter. | X | O |
| Readiness | Custom | groupkey_columns | - | Group the dataframe based on the value of the entered column. | X | O |
| Readiness | Custom | min_rows | See the description | Specify the minimum number of rows required for training. | X | X |
| Readiness | Custom | cardinality | 50 | Specify the cardinality value for classifying categorical columns. | X | X |
| Readiness | Custom | num_cat_split | 10 | Adjust the classification criteria used in the automatic column type classification. | X | X |
| Readiness | Custom | ignore_new_category | False | Handle the situation where new categorical values are encountered during inference. | X | X |
| Preprocess | Custom | save_original_columns | True | Decide whether to retain the original training columns (x_columns) in the preprocess asset result dataframe. | X | O |
| Preprocess | Custom | categorical_encoding | {binary: all} | Specify the encoding method to apply to categorical columns. | X | X |
| Preprocess | Custom | handle_missing | See the description | Specify the missing value handling method to apply to columns. | X | X |
| Preprocess | Custom | numeric_outlier | - | Select the method for removing outliers from numeric columns. | X | X |
| Preprocess | Custom | numeric_scaler | - | Select the scaling method to apply to numeric columns. | X | X |
| Sampling | Required | data_split | {method: cross_validation, options: 3} | Select the method for splitting the train/validation set during HPO. | X | X |
| Sampling | Custom | over_sampling | - | Apply an over-sampling method for the y column labels. | X | X |
| Sampling | Custom | under_sampling | - | Apply an under-sampling method for the y column labels. | X | X |
| Sampling | Custom | random_state | - | A random seed is specified to obtain consistent results when performing sampling. | X | X |
| Train | Required | evaluation_metric | auto | Select the evaluation metric to choose the best model during HPO. | X | O |
| Train | Required | shapley_value | False | Decide whether to calculate and output shapley values in output.csv. | X | O |
| Train | Required | output_type | all | Decide whether to output minimal columns (modeling results) in output.csv. | X | O |
| Train | Custom | model_list | [rf, gbm, lgbm, cb, xgb] | Select the models to compare during HPO. | X | X |
| Train | Custom | hpo_settings | See the description | Modify parameters for the models in model_list. | X | X |
| Train | Custom | shapley_sampling | 10000 | Specify the sampling rate for calculating shapley values. | X | X |
| Train | Custom | multiprocessing | False | Specify whether to use multiprocessing. | X | O |
| Train | Custom | num_cpu_core | 3 | Specify the number of CPU cores to use during multiprocessing. | X | O |

Detailed Description of User Arguments

Input asset

file_type

Specify the file extension of the input data. Currently, only CSV files are supported for AI Solution development.

  • Argument Type: Required
  • Input Type
    • string
  • Possible Values
    • csv (default)
  • Usage
    • file_type: csv  
  • ui_args: O

encoding

Specify the encoding type of the input data. Currently, only utf-8 encoding is supported for AI Solution development.

  • Argument Type: Required
  • Input Type
    • string
  • Possible Values
    • utf-8 (default)
  • Usage
    • encoding: utf-8
  • ui_args: O

Readiness asset

x_columns

Enter the names of the x columns in the dataframe for training in list format. Users must input this according to their data. If there are many column names to enter, you can use the custom argument drop_x_columns to designate the entire dataframe columns as training columns. Only one of x_columns and drop_x_columns should be used (Remove or comment out the unused argument in the YAML file).

  • Argument Type: Required
  • Input Type
    • list
  • Possible Values
    • List of column names
  • Usage
    • x_columns: [col1, col2]
  • ui_args: O

y_column

Enter the name of the y column in the dataframe. Users must input this according to their data.

  • Argument Type: Required
  • Input Type
    • string
  • Possible Values
    • Column name
  • Usage
    • y_column: target
  • ui_args: O

task_type

Specify the type of the solution task (classification/regression). Users must check and set the value according to the purpose of the task.

  • Argument Type: Required
  • Input Type
    • string
  • Possible Values
    • classification (default)
    • regression
  • Usage
    • task_type: classification
  • ui_args: O

target_label

When training a classification model, this specifies the class of the y_column to be used as the evaluation metric calculation criterion during HPO. For example, if evaluation_metric is precision and target_label is 1, the model with the highest precision value for label 1 is selected as the best model. It does not work when task_type is regression.

  • Argument Type: Required
  • Input Type
    • string
    • list
  • Possible Values
    • _major (default)
      • Select the class with the most occurrences in the y_column. (Both binary and multiclass are possible)
    • _minor
      • Select the class with the fewest occurrences in the y_column. (Both binary and multiclass are possible)
    • _all
      • Use all class names in the y_column as the criterion. (Only multiclass is possible)
    • Class name
      • Enter one of the class names in the y_column. (Both binary and multiclass are possible)
      • e.g., target_label: setosa
    • List of class names
      • Enter multiple class names from the y_column as a list. (Only multiclass is possible)
      • e.g., target_label: [setosa, versicolor]
  • Usage
    • target_label: _major
  • ui_args: X

column_types

Enter whether the training columns (x_columns) are numeric or categorical. The default value 'auto' lets the readiness asset automatically classify each of the x_columns as numeric or categorical. If certain columns must always be treated as numeric or categorical, use the column_types argument as follows.

  • e.g., column_types: {categorical_columns: [col1, col2]}
  • e.g., column_types: {numeric_columns: [col1, col2]}
  • e.g., column_types: {categorical_columns: [col1], numeric_columns: [col2]}

Columns entered in column_types are always classified as the specified type, while columns not entered will be automatically classified by the auto logic as numeric or categorical.
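The merge between user-fixed types and the auto logic can be sketched as follows. `resolve_column_types` and `auto_classify` are hypothetical names standing in for the readiness asset's internals:

```python
def resolve_column_types(x_columns, column_types, auto_classify):
    """Columns listed in column_types keep the specified type; every other
    training column falls back to the automatic classifier."""
    fixed = {}
    if isinstance(column_types, dict):
        for col in column_types.get("categorical_columns", []):
            fixed[col] = "categorical"
        for col in column_types.get("numeric_columns", []):
            fixed[col] = "numeric"
    # column_types: auto (a string) leaves `fixed` empty, so every column is auto-classified.
    return {c: fixed.get(c, auto_classify(c)) for c in x_columns}
```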

  • Argument Type: Required
  • Input Type
    • string
    • dictionary
  • Possible Values
    • auto (default)
    • {categorical_columns: List of column names, numeric_columns: List of column names}
  • Usage
    • column_types: auto
  • ui_args: X

report

It determines whether to generate a summary CSV file for the input data (train/inference). The column type (categorical/numeric), category information, cardinality, statistics, number of missing values, and missing rate are recorded. If ignore_new_category is set to True and categorical values not seen during training are encountered during inference, a 'new-categories' column is added to the report.

  • Argument Type: Required
  • Input Type
    • boolean
  • Possible Values
    • True (default)
      • {train/inference}_artifacts/extra_output/readiness/report.csv will be generated.
    • False
      • report.csv will not be generated.
  • Usage
    • report: True
  • ui_args: O

drop_x_columns

If there are many column names to enter, use drop_x_columns instead of x_columns: it designates all dataframe columns as training columns after dropping the specified ones. Only one of x_columns and drop_x_columns should be used (remove or comment out the unused argument in the YAML file). When drop_x_columns is [], all dataframe columns except groupkey_columns and y_column are used as training columns. When a list of columns to drop is entered, the remaining columns except groupkey_columns and y_column are used as training columns. e.g., If the full column set is x0,x1,x2,x3,x4,y and drop_x_columns=[x0], groupkey_columns=[x1], y_column=y, the training columns will be x2,x3,x4.

  • Argument Type: Custom
  • Input Type
    • list
  • Possible Values
    • []
      • Enter an empty list to use all dataframe columns except groupkey_columns and y_column as training columns.
    • List of column names
      • Use the remaining columns except groupkey_columns, y_column, and the list of column names as training columns.
  • Usage
    • drop_x_columns: []
  • ui_args: O

groupkey_columns

The groupkey function analyzes data by grouping it based on the value of specific columns. If the 'groupkey col' column is designated as the groupkey column in the table below, data with 'groupkey col' value A and B are modeled separately. In this case, 'groupkey col' is the groupkey column, and A,B are groupkeys. Use the groupkey_columns argument to apply the groupkey function.

| x0 | ... | x10 | groupkey col |
| --- | --- | --- | --- |
| ... | ... | ... | A |
| ... | ... | ... | A |
| ... | ... | ... | B |
| ... | ... | ... | B |
| ... | ... | ... | A |

If you enter multiple column names in groupkey_columns, the readiness asset generates a single integrated groupkey column by concatenating the values of the groupkey_columns. For example, if you enter groupkey_columns: [Gender, Pclass], a new groupkey column 'Gender_Pclass' is added to the input dataframe. Note that for classification, the y_column's class types must be the same for each groupkey; groups (groupkeys) that do not meet this condition are excluded from training. For example, if the y_column values are A, B, C but a specific group's y_column values are only A and B, that group is excluded. Other groups that fail the training conditions are likewise excluded by the readiness asset.
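The integrated-groupkey construction can be sketched as below. `add_groupkey` is a hypothetical helper illustrating the concatenation rule, not the readiness asset's actual code:

```python
def add_groupkey(rows, groupkey_columns):
    """Add an integrated groupkey column whose name and values are the
    groupkey column names/values joined with '_'."""
    name = "_".join(groupkey_columns)
    for row in rows:  # rows: list of dicts standing in for a dataframe
        row[name] = "_".join(str(row[c]) for c in groupkey_columns)
    return name
```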

  • Argument Type: Custom
  • Input Type
    • list
  • Possible Values
    • List of column names
  • Usage
    • groupkey_columns: [col1, col2]
  • ui_args: O

min_rows

It specifies the minimum number of rows required for training; if the training data does not meet it, an error is raised. The default depends on task_type: 30 for classification and 100 for regression. With the defaults, classification requires at least 30 rows per y label, and regression requires at least 100 rows in total. If the user sets min_rows, the same rule applies with that value: for classification, every y label must have at least min_rows rows; for regression, the total row count must be at least min_rows. For example, with min_rows: 50 in experimental_plan.yaml, classification requires at least 50 rows per y label, and regression proceeds only if there are at least 50 rows in total.

When using the groupkey function (groupkey_columns), groups (groupkeys) that do not meet the min_rows condition are excluded from the training. If all groupkeys do not meet the min_rows condition, an error is raised.

  • Argument Type: Custom
  • Input Type
    • int
  • Possible Values
    • default
      • 30 (classification, there must be at least 30 instances of each y_column class)
      • 100 (regression, the total number of instances must be at least 100)
    • Number value
  • Usage
    • min_rows: 50
  • ui_args: X
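The min_rows rule above can be sketched as a small check. `meets_min_rows` is a hypothetical helper following the defaults described in this section:

```python
from collections import Counter

def meets_min_rows(y_values, task_type, min_rows=None):
    """Classification: every y label needs at least `min_rows` rows
    (default 30). Regression: the total row count needs at least
    `min_rows` rows (default 100)."""
    if task_type == "classification":
        threshold = 30 if min_rows is None else min_rows
        return all(n >= threshold for n in Counter(y_values).values())
    threshold = 100 if min_rows is None else min_rows
    return len(y_values) >= threshold
```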

cardinality

The cardinality argument specifies the cardinality condition that categorical_columns must meet when using the automatic classification feature for categorical/numeric columns (column_types: auto). If the unique values of a categorical column are equal to or less than the cardinality argument value, the column is classified as a categorical column. If the unique values exceed the cardinality argument value, the column is not classified as a categorical column and is excluded from the training columns.

  • Argument Type: Custom
  • Input Type
    • int
  • Possible Values
    • 50 (default)
    • Number value
  • Usage
    • cardinality: 50
  • ui_args: X
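The cardinality rule amounts to a unique-value count check. `classify_by_cardinality` below is a hypothetical illustration of the rule, not the asset's implementation:

```python
def classify_by_cardinality(values, cardinality=50):
    """A candidate categorical column stays categorical if its unique-value
    count is <= cardinality; otherwise it is excluded from the training
    columns."""
    return "categorical" if len(set(values)) <= cardinality else "excluded"
```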

num_cat_split

The num_cat_split argument specifies the N value used in the automatic column type classification logic (column_types: auto) to determine the top N most frequent values for classifying whether a column is numeric or categorical.

  • Argument Type: Custom
  • Input Type
    • int
  • Possible Values
    • 10 (default)
    • Number value
  • Usage
    • num_cat_split: 10
  • ui_args: X

ignore_new_category

Handle the situation where new categorical values are encountered during inference. For example, if onehot encoding is applied to categorical columns during training, the model learns the columns based on the onehot encoded columns of the training data. If new categorical values not used in training are encountered during inference, they cannot be processed with the existing onehot encoding columns. Therefore, if there is a high likelihood of encountering new categorical values during inference, it is recommended to use the ignore_new_category argument to control the action.

  • Argument Type: Custom
  • Input Type
    • boolean
    • float
  • Possible Values
    • False (default)
      • An error occurs if new categorical values not used in training are encountered during inference.
    • True
      • During inference, if unseen categorical values are encountered, they are treated as missing values and inference proceeds. (Missing values are handled according to the missing value handling logic in the preprocessing asset.)
      • When using CatBoost encoding (see categorical_encoding), unseen categorical data can be encoded without being treated as missing values.
    • 0-1 float value
      • e.g., 0.3
      • If the proportion of rows with unseen categorical values in the entire dataset is 0.3 or less, those values are treated as missing and inference proceeds.
      • If the proportion exceeds 0.3, an error occurs. (If groupkey columns exist, the offending groups are excluded instead.)
  • Usage
    • ignore_new_category: False
  • ui_args: X
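The three settings can be summarized in a small decision function. `handle_new_categories` is a hypothetical sketch of the behavior described above, not the readiness asset's actual code:

```python
def handle_new_categories(n_new_rows, n_total_rows, ignore_new_category):
    """Decide what to do with rows containing unseen categorical values
    at inference time, per the ignore_new_category setting."""
    if ignore_new_category is False:
        if n_new_rows > 0:
            raise ValueError("unseen categorical values encountered")
        return "proceed"
    if ignore_new_category is True:
        return "treat_as_missing"
    # Float threshold: proceed only if the unseen-row proportion is small enough.
    ratio = n_new_rows / n_total_rows
    if ratio <= ignore_new_category:
        return "treat_as_missing"
    raise ValueError("too many unseen categorical values")
```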

Preprocess asset

save_original_columns

The Preprocess asset applies various preprocessing methods to the training columns (x_columns). The save_original_columns argument decides whether to retain the original training columns (x_columns) in the result dataframe of the preprocess asset. Regardless of the save_original_columns setting, the training columns used in the next asset are replaced with the preprocessed columns.

  • Argument Type: Custom
  • Input Type
    • boolean
  • Possible Values
    • True (default)
      • Retain the original training columns (x_columns) along with the preprocessed columns in the next asset.
      • The original and preprocessed y columns are included in the dataframe.
    • False
      • Only retain the preprocessed columns in the next asset. (Remove the original x_columns)
      • The original and preprocessed y columns are included in the dataframe.
  • Usage
    • save_original_columns: True
  • ui_args: O

categorical_encoding

The categorical_encoding argument specifies the encoding method to apply to categorical columns. Enter it in the format of {method: value}. The 'value' can be a list of columns or 'all' to apply the method to all categorical columns. The currently supported categorical encoding methods are listed below. categorical_encoding applies only to the training columns (x_columns), so enter only categorical columns that belong to x_columns. For task_type: classification, the y column always uses label encoding; this cannot be changed.

  • binary: binary encoding
  • catboost: catboost encoding
  • onehot: onehot encoding
  • label: label encoding

By default, binary encoding is applied to all categorical training columns. When using categorical_encoding, if some columns are specified, the remaining columns are automatically applied with the default rule (binary). e.g., categorical_encoding: {label: [col1]} applies label encoding to col1, and binary encoding to the remaining categorical columns.
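The fallback-to-binary behavior can be sketched as a per-column method map. `resolve_encoding` is a hypothetical helper, not the preprocess asset's implementation:

```python
def resolve_encoding(categorical_cols, categorical_encoding=None):
    """Columns named in the spec keep their method; every other categorical
    training column falls back to the default binary rule."""
    plan = {c: "binary" for c in categorical_cols}
    for method, cols in (categorical_encoding or {}).items():
        targets = categorical_cols if cols == "all" else cols
        for c in targets:
            plan[c] = method
    return plan
```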

  • Argument Type: Custom
  • Input Type
    • dictionary
  • Possible Values
    • default
      • x_columns
        • {binary: all}
      • y_column
        • label encoding applied
    • {method1: list of columns, method2: list of columns}
  • Usage
    • categorical_encoding: {binary: [col1], catboost: [col2]}
  • ui_args: X

handle_missing

The handle_missing argument specifies the method for handling missing values in categorical and numeric columns. Enter it in the format of {method: value}. The 'value' can be a list of columns, 'categorical_all', 'numeric_all', or 'all'. If handle_missing is not specified, the default logic handles missing values in the training columns. When only some columns are specified, the remaining training columns fall back to the default rule. handle_missing applies only to the training columns (x_columns), so enter only columns that belong to x_columns. Rows with a missing y value are automatically removed during the train pipeline.

  • Applicable to categorical columns only
    • {method: value} value can be 'categorical columns list' or 'categorical_all' (categorical_all applies to all categorical columns)
    • frequent: Fill missing values with the most frequent value in the column.
  • Applicable to numeric columns only
    • {method: value} value can be 'numeric columns list' or 'numeric_all' (numeric_all applies to all numeric columns)
    • mean: Fill missing values with the mean value of the column.
    • median: Fill missing values with the median value of the column.
    • interpolation: Fill missing values with the average of surrounding values in the column.
  • Applicable to all column types
    • {method: value} value can be 'list of columns', 'all', 'categorical_all', 'numeric_all' (all applies to all columns)
    • drop: Remove rows with missing values in the column.
    • fill_value: Fill missing values with the specified 'value'; write the method as fill_{value} (e.g., fill_0 fills missing values with 0).

Examples of using categorical_all, numeric_all, and all types are as follows.

  • handle_missing: {frequent: categorical_all, fill_0: numeric_all}
    • Apply frequent to categorical columns (only applicable to categorical methods), and fill missing values with 0 for numeric columns.
  • handle_missing: {fill_0: categorical_all, fill_1: numeric_all}
    • Fill missing values with 0 for categorical columns and 1 for numeric columns.
  • handle_missing: {fill_0: all}
    • Fill missing values with 0 for all columns.
  • handle_missing: {fill_0: numeric_all}
    • Fill missing values with 0 for numeric columns, and apply the default logic for categorical columns.

categorical_all and numeric_all can be used together, but categorical_all and all, numeric_all and all cannot be used together.
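Resolving the 'value' side of a handle_missing entry into concrete columns can be sketched as below. `expand_targets` is a hypothetical helper illustrating the spec, not the asset's code:

```python
def expand_targets(value, categorical_cols, numeric_cols):
    """Expand 'all', 'categorical_all', 'numeric_all', or an explicit
    column list into the concrete target columns."""
    if value == "all":
        return categorical_cols + numeric_cols
    if value == "categorical_all":
        return list(categorical_cols)
    if value == "numeric_all":
        return list(numeric_cols)
    return list(value)
```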

  • Argument Type: Custom
  • Input Type
    • dictionary
  • Possible Values
    • default
      • x_columns
        • {frequent: categorical_all, median: numeric_all}
      • y_column
        • Apply drop
    • {method1: list of columns, method2: list of columns}
  • Usage
    • handle_missing: {fill_1: [col1], fill_2: [col2]}
  • ui_args: X

numeric_outlier

The numeric_outlier argument specifies the method for removing outliers from numeric columns. Enter it in the format of {method: value}. The 'value' can be a list of columns or 'all' to apply the method to all numeric columns. The currently supported outlier removal methods are listed below. numeric_outlier applies only to the training columns (x_columns), so enter only numeric columns that belong to x_columns.

  • normal: Remove outliers beyond 3 standard deviations from the current data distribution.

numeric_outlier has no default value. In other words, if the user does not register it in the experimental_plan.yaml, no method is applied.
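The 'normal' method is a 3-standard-deviation cutoff, which can be sketched on a single column. `remove_outliers_normal` is a hypothetical illustration (using the population standard deviation), not the asset's implementation:

```python
from statistics import mean, pstdev

def remove_outliers_normal(values):
    """Drop values farther than 3 standard deviations from the column mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return list(values)  # constant column: nothing to remove
    return [v for v in values if abs(v - mu) <= 3 * sigma]
```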

  • Argument Type: Custom
  • Input Type
    • dictionary
  • Possible Values
    • No default
    • {method: list of columns}
  • Usage
    • numeric_outlier: {normal: [col1, col2]}
  • ui_args: X

numeric_scaler

The numeric_scaler argument specifies the scaling method to apply to numeric columns. Enter it in the format of {method: value}. The 'value' can be a list of columns or 'all' to apply the method to all numeric columns. The currently supported scaling methods are listed below. numeric_scaler applies only to the training columns (x_columns), so enter only numeric columns that belong to x_columns.

  • standard: Scaling using mean and standard deviation. z=(x-u)/s (u: mean, s: std)
  • minmax: Scaling to maintain the distribution with a maximum value of 1 and a minimum value of 0.
  • robust: Scaling using median and interquartile range instead of mean and variance.
  • maxabs: Scaling to have the maximum absolute value of 1, with 0 remaining as 0.
  • normalizer: Normalization is performed per row instead of per column. Scaling is done so that the Euclidean distance of all features within a row is 1.

numeric_scaler has no default value. In other words, if the user does not register it in the experimental_plan.yaml, no method is applied.
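Two of the methods (standard and minmax) can be sketched on a single column. `scale` is a hypothetical helper illustrating the formulas above, not the asset's implementation:

```python
from statistics import mean, pstdev

def scale(values, method):
    """Apply standard (z=(x-u)/s) or minmax ((x-min)/(max-min)) scaling."""
    if method == "standard":
        mu, s = mean(values), pstdev(values)
        return [(v - mu) / s for v in values]
    if method == "minmax":
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]
    raise ValueError(f"unsupported method: {method}")
```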

  • Argument Type: Custom
  • Input Type
    • dictionary
  • Possible Values
    • No default
    • {method: list of columns}
  • Usage
    • numeric_scaler: {standard: [col1], minmax: [col2]}
  • ui_args: X  

Sampling asset

data_split

The data_split argument specifies the method for splitting the train/validation set during HPO. Enter it in the format of {method: method, options: value}. The possible 'method - value' combinations are as follows.

  • cross validation
    • {method: cross_validation, options: 3}
    • Use the cross-validation method, where options represent the kfold value. The example above is set to kfold 3.
  • train/test split
    • {method: train_test, options: 0.3}
    • Split the data into train/validation sets using sampling. The options value represents the proportion of the validation set. The example above splits the data with a 7:3 train:validation ratio.

You can check which cross-validation set each data point belongs to and whether it was used in the train or validation set in the 'data_split' column of output.csv.
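The two split methods can be sketched by assigning a per-row label, in the spirit of the 'data_split' column of output.csv. `assign_data_split` is a hypothetical illustration, not the sampling asset's code:

```python
import random

def assign_data_split(n_rows, method, options, seed=0):
    """cross_validation: label each row with a fold index (options = kfold).
    train_test: label each row train/valid (options = validation proportion)."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    labels = [None] * n_rows
    if method == "cross_validation":
        for pos, i in enumerate(idx):
            labels[i] = f"fold{pos % options}"
    else:  # train_test
        n_valid = int(n_rows * options)
        for pos, i in enumerate(idx):
            labels[i] = "valid" if pos < n_valid else "train"
    return labels
```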

  • Argument Type: Required
  • Input Type
    • dictionary
  • Possible Values
    • {method: cross_validation, options: 3} (default)
    • {method: method, options: value}
  • Usage
    • data_split: {method: cross_validation, options: 3}
  • ui_args: X      

over_sampling

Apply an over-sampling method for the y_column labels. The over_sampling argument has two types based on the method for calculating the number of samples.

  1. ratio: Over-sample the y_column labels to the specified ratio.
     over_sampling: {method: random, label: B, ratio: 2}
     # Randomly over-sample label B to be 2 times.
  2. compare: Over-sample the y_column labels to be multiply times of the target label.
     over_sampling: {method: random, label: B, compare: {target: A, multiply: 10}}
     # Randomly over-sample label B to be 10 times of label A.

Enter it in the format of {key: value}. Each key and value is described below.

key: method

  • Enter the over-sampling method. The possible methods are as follows.
    • random: Apply random over-sampling.
    • smote: Apply the smote method for over-sampling.

key: label

  • Enter the label of the y_column to apply the sampling method.
    • Single label value. e.g., A
    • If applying to multiple labels, enter them in a list. e.g., [A, B]

key: ratio (type1)

  • Over-sample each 'label' to the specified ratio.
    • Enter a float value. e.g., 2.5
    • Over-sampling is not applied if the ratio is less than or equal to 1.

key: compare (type2)

  • Over-sample each 'label' to be multiply times of the target label. Enter a sub-dictionary.
    • sub_key: target
      • Enter the label to determine the number of samples.
      • Single label value. e.g., compare: {target: C ...}
    • sub_key: multiply
      • Over-sample each label to be multiply times of the target label.
      • Enter a float value. e.g., label: [A, B], compare: {target: C, multiply: 10} - Over-sample labels A and B to be 10 times of label C.
      • If entering labels in a list, you can also enter multiply values in a list to apply them to each label. e.g., label: [A, B], compare: {target: C, multiply: [2, 3]}: Over-sample label A to be 2 times, and label B to be 3 times of label C.
      • Over-sampling is not applied if the label already has more samples than the computed target. For example, if over-sampling would target 100 samples but there are already 200, over-sampling is not applied.

over_sampling has no default value. In other words, if the user does not register it in the experimental_plan.yaml, no method is applied.
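The target-count arithmetic for the ratio and compare types can be sketched as below. `oversample_target_count` is a hypothetical helper that only computes the target count, not the sampling itself:

```python
def oversample_target_count(counts, label, ratio=None, compare=None):
    """ratio type: target = current count * ratio.
    compare type: target = target label's count * multiply.
    Over-sampling is skipped if the label already has at least that many."""
    if ratio is not None:
        target = int(counts[label] * ratio)
    else:
        target = int(counts[compare["target"]] * compare["multiply"])
    return max(target, counts[label])
```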

  • Argument Type: Custom
  • Input Type
    • dictionary
  • Possible Values
    • Enter the dictionary format described above.
    • ratio type
      • {method: method, label: label name, ratio: float greater than 1}
      • e.g., {method: smote, label: A, ratio: 10} - Over-sample label A of the y_column to be 10 times using smote.
      • e.g., {method: random, label: [A,B], ratio: 10} - Over-sample labels A and B of the y_column to be 10 times using random over-sampling.
      • e.g., {method: smote, label: [A,B], ratio: [10,12]} - Over-sample label A to be 10 times and label B to be 12 times using smote.
    • compare type
      • {method: method, label: label name, compare: {target: label name, multiply: float}}
      • e.g., {method: random, label: A, compare: {target: C, multiply: 5}} - Randomly over-sample label A to be 5 times of label C.
      • e.g., {method: random, label: [A,B], compare: {target: C, multiply: 5}} - Randomly over-sample labels A and B to be 5 times of label C.
      • e.g., {method: random, label: [A,B], compare: {target: C, multiply: [5,10]}} - Randomly over-sample label A to be 5 times and label B to be 10 times of label C.
  • Usage
    • over_sampling: {method: smote, label: A, ratio: 10}
  • ui_args: X

under_sampling

Apply an under-sampling method for the y_column labels. The under_sampling argument has two types based on the method for calculating the number of samples.

  1. ratio: Under-sample the y_column labels to the specified ratio.
     under_sampling: {method: random, label: B, ratio: 0.5}
     # Randomly under-sample label B to be 0.5 times.
  2. compare: Under-sample the y_column labels to be multiply times of the target label.
     under_sampling: {method: random, label: B, compare: {target: A, multiply: 2}}
     # Randomly under-sample label B to be 2 times of label A.
     # Under-sampling is applied only if the computed count is less than label B's current count.

Enter it in the format of {key: value}. Each key and value is described below.

key: method

  • Enter the under-sampling method. The possible methods are as follows.
    • random: Apply random under-sampling.
    • nearmiss: Apply the NearMiss method, which under-samples by keeping the instances closest to the minority class (those hardest to distinguish from it).

key: label

  • Enter the label of the y_column to apply the sampling method.
    • Single label value. e.g., A
    • If applying to multiple labels, enter them in a list. e.g., [A, B]

key: ratio (type1)

  • Under-sample each 'label' to the specified ratio.
    • Enter a float value. e.g., 0.7
    • Under-sampling is not applied if the ratio is greater than or equal to 1.

key: compare (type2)

  • Under-sample each 'label' to be multiply times of the target label. Enter a sub-dictionary.
    • sub_key: target
      • Enter the label to determine the number of samples.
      • Single label value. e.g., compare: {target: C ...}
    • sub_key: multiply
      • Under-sample each label to be multiply times of the target label.
      • Enter a float value. e.g., label: [A,B], compare: {target: C, multiply: 0.5} - Under-sample labels A and B to be 0.5 times of label C.
      • If entering labels in a list, you can also enter multiply values in a list to apply them to each label. e.g., label: [A, B], compare: {target: C, multiply: [0.2, 0.3]}: Under-sample label A to be 0.2 times, and label B to be 0.3 times of label C.
      • Under-sampling is not applied if the label already has fewer samples than the target value. For example, if you want to reduce a label to 100 samples but it already has only 90, under-sampling is not applied.

under_sampling has no default value. In other words, if the user does not register it in the experimental_plan.yaml, no method is applied.

  • Argument Type: Custom
  • Input Type
    • dictionary
  • Possible Values
    • Enter the dictionary format described above.
    • ratio type
      • {method: method, label: label name, ratio: float less than 1}
      • e.g., {method: nearmiss, label: A, ratio: 0.5} - Under-sample label A of the y_column to be 0.5 times using nearmiss.
      • e.g., {method: random, label: [A,B], ratio: 0.5} - Under-sample labels A and B of the y_column to be 0.5 times using random under-sampling.
      • e.g., {method: random, label: [A,B], ratio: [0.5,0.3]} - Under-sample label A to be 0.5 times and label B to be 0.3 times using random under-sampling.
    • compare type
      • {method: method, label: label name, compare: {target: label name, multiply: float}}
      • e.g., {method: random, label: A, compare: {target: C, multiply: 0.5}} - Randomly under-sample label A to be 0.5 times of label C.
      • e.g., {method: random, label: [A,B], compare: {target: C, multiply: 0.5}} - Randomly under-sample labels A and B to be 0.5 times of label C.
      • e.g., {method: random, label: [A,B], compare: {target: C, multiply: [0.5,0.2]}} - Randomly under-sample label A to be 0.5 times and label B to be 0.2 times of label C.
  • Usage
    • under_sampling: {method: nearmiss, label: A, ratio: 0.5}
  • ui_args: X
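
The no-op rules above (a ratio at or above 1, or a compare target the label is already at or below) can be sketched as a small check. undersample_target is a hypothetical helper for illustration, not part of TCR.

```python
def undersample_target(count, ratio=None, target_count=None, multiply=None):
    """Return the reduced sample count for one label, or None when
    under-sampling would not apply. Hypothetical helper mirroring the
    rules above."""
    if ratio is not None:
        if ratio >= 1:              # ratio type: only ratios below 1 shrink the label
            return None
        return int(count * ratio)
    goal = int(target_count * multiply)  # compare type
    if goal >= count:               # label is already at or below the requested size
        return None
    return goal
```

For instance, a label with 90 samples and a compare target of 100 (target label of 50 rows, multiply 2) is left as-is, matching the example above.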

random_state

By specifying the random_state value, you can obtain the same results each time sampling is performed.

  • Argument Type: Custom
  • Input Type
    • int
  • Possible Values
    • Positive integer
  • Usage
    • random_state: 123
  • ui_args: X      

Train asset

evaluation_metric

Select the evaluation metric to choose the best model during HPO. The default value 'auto' uses accuracy for classification and mse for regression. If multiple models have the same evaluation_metric value during the HPO process, the priority of the models will be determined in the following order.

  • When the evaluation_metric values are the same:
    • For classification, compare the other metrics in the order of accuracy, f1, recall, and precision (if accuracy is selected, compare values in the order of f1, recall, and precision).
    • For regression, compare the other metrics in the order of r2, mse, mae, and rmse.
  • When all evaluation metric values are the same:
    • The smaller the model size, the higher the priority. If model sizes are similar, they will be sorted in the order of rf, lgbm, gbm, xgb, and cb models.

However, if all evaluation metric values are the same and the user has manually added a model, the user-added model will have the highest priority.
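
The tie-break described above can be sketched as a sort key. This is a hypothetical illustration for classification only (all four classification metrics are higher-is-better); the model dictionaries, metric values, and the size field are invented for the example and are not TCR's internal representation.

```python
# Hypothetical sketch of the classification tie-break: compare the chosen
# metric first, then the remaining metrics in the fixed accuracy > f1 >
# recall > precision order, and finally prefer the smaller model.
CLF_ORDER = ["accuracy", "f1", "recall", "precision"]

def ranking_key(model, chosen="accuracy"):
    rest = [m for m in CLF_ORDER if m != chosen]
    scores = [model[chosen]] + [model[m] for m in rest]
    # Negate scores so that min()/sorted() (ascending) puts the best model first.
    return tuple(-s for s in scores) + (model["size"],)

models = [
    {"name": "rf",   "accuracy": 0.90, "f1": 0.88, "recall": 0.85, "precision": 0.91, "size": 40},
    {"name": "lgbm", "accuracy": 0.90, "f1": 0.89, "recall": 0.84, "precision": 0.90, "size": 12},
]
best = min(models, key=ranking_key)  # accuracy ties, so f1 decides
```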

  • Argument Type: Required
  • Input Type
    • string
  • Possible Values
    • auto (default)
      • task_type: classification: accuracy
      • task_type: regression: mse
    • task_type: classification
      • accuracy
      • f1
      • recall
      • precision
    • task_type: regression
      • mse
      • r2
      • mae
      • rmse
  • Usage
    • evaluation_metric: auto
  • ui_args: O

shapley_value

Decide whether to calculate and output shapley values in output.csv. If shapley_value is set to True, a shap value summary plot is saved at {result folder}/extra_output/train/summary_plot.png. The summary plot helps to see the impact of each feature on the class.

  • Argument Type: Required
  • Input Type
    • boolean
  • Possible Values
    • False (default)
      • Do not calculate shapley values.
    • True
      • Calculate shapley values.
  • Usage
    • shapley_value: False
  • ui_args: O

output_type

Decide whether to output minimal columns (modeling results) or all columns in output.csv. The modeling result columns are as follows.

  • prob_{y class name},...
    • The probability that the model classifies the data as a specific class. Columns are created for each class.
  • pred_{y column name}
    • The predicted y value column of the model.
  • shap_{training column name}
    • If shapley_value is True, shapley value columns are output. Columns are created for each training column (x_columns).

If output_type is set to 'all', the entire data and modeling result columns are output to output.csv. If output_type is set to 'simple', only the modeling result columns are output to output.csv. If the size of the data used for analysis is large, setting output_type to 'simple' can reduce the size of the output.csv file.

  • Argument Type: Required
  • Input Type
    • string
  • Possible Values
    • all (default)
      • Output the entire data and modeling result columns in output.csv.
    • simple
      • Output only the modeling result columns in output.csv.
  • Usage
    • output_type: all
  • ui_args: O

model_list

Enter the models to compare during HPO in list format. Currently, TCR includes five tree-based models, and if the user does not add the model_list argument, HPO is performed on all five models. The currently supported default models in TCR are as follows.

  • rf: Random Forest
  • gbm: Gradient Boosting Machine
  • lgbm: Light Gradient Boosting Machine
  • cb: CatBoost
  • xgb: Extreme Gradient Boosting (XGBoost)

If you enter an empty list ([]), the default ([rf, gbm, lgbm, cb, xgb]) is applied. Even if you enter values in hpo_settings, the model names must be included in the model_list to be added to the HPO. To add a newly created model to the HPO list by writing a model template during solution development, add the model abbreviation to the model_list.

  • Argument Type: Custom
  • Input Type
    • list
  • Possible Values
    • [rf, gbm, lgbm, cb, xgb] (default. [] has the same behavior)
  • Usage
    • model_list: [rf, gbm, lgbm, cb, xgb]
  • ui_args: X

hpo_settings

Modify parameters for the models in the model_list. Enter it in the format of {model_name: {parameter1: search list, tcr_param_mix: 'one_to_one'}}.

{rf: {max_depth: [100, 300, 500], n_estimators: [300, 400, 500], min_sample_leaf: 3, tcr_param_mix: one_to_one}}

In the example above, max_depth examines the values 100, 300, and 500, and n_estimators examines the values 300, 400, and 500. If the parameter value is a single number instead of a list, that value is fixed for the parameter. The possible values and functions of 'tcr_param_mix' are as follows.

  • one_to_one
    • Each element corresponds one-to-one. The number of elements in the parameter values must be the same.
    • For one_to_one, the example above becomes {max_depth: 100, n_estimators: 300, min_sample_leaf: 3}, {max_depth: 300, n_estimators: 400, min_sample_leaf: 3}, {max_depth: 500, n_estimators: 500, min_sample_leaf: 3}.
  • all
    • Perform HPO with all combinations of the parameter lists.
    • For all, the example above becomes {max_depth: 100, n_estimators: 300, min_sample_leaf: 3}, {max_depth: 100, n_estimators: 400, min_sample_leaf: 3},...,{max_depth: 500, n_estimators: 500, min_sample_leaf: 3}.
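
The two tcr_param_mix modes correspond to zip-style pairing versus a full Cartesian product. expand_params below is a hypothetical sketch, not TCR code; it reproduces the expansions shown above (scalar values stay fixed, lists are searched).

```python
from itertools import product

def expand_params(grid, mix="one_to_one"):
    """Expand an hpo_settings parameter dict into concrete candidates.
    Hypothetical sketch: scalars are fixed, lists are swept, and mix
    selects zip-style pairing ('one_to_one') or a full grid ('all')."""
    fixed = {k: v for k, v in grid.items() if not isinstance(v, list)}
    swept = {k: v for k, v in grid.items() if isinstance(v, list)}
    if not swept:
        return [fixed]
    keys = list(swept)
    if mix == "one_to_one":
        combos = zip(*swept.values())    # lists must have equal length
    else:                                # "all": Cartesian product
        combos = product(*swept.values())
    return [dict(fixed, **dict(zip(keys, c))) for c in combos]

grid = {"max_depth": [100, 300, 500], "n_estimators": [300, 400, 500], "min_sample_leaf": 3}
```

With the rf example above, one_to_one yields 3 candidate parameter sets and all yields 9.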

If a model is in model_list but not in hpo_settings, the default parameters in the model file are used. In other words, if the model_list is the default (5 models) and {rf: {max_depth: [100, 300, 500], n_estimators: [300, 400, 500], min_sample_leaf: 3, tcr_param_mix: one_to_one}} is entered, the other 4 models use the default parameters in the model file.

  • Argument Type: Custom
  • Input Type
    • dictionary
  • Possible Values
    • Use the default parameter set in the model file (default)
    • {model_name: {parameter1: search list, tcr_param_mix: one_to_one or all}}
  • Usage
    • hpo_settings: {rf: {max_depth: [100, 300], n_estimators: 300, tcr_param_mix: one_to_one}}
  • ui_args: X      

shapley_sampling

When shapley_value is set to True, you can sample a subset of the data to calculate the shapley values instead of using all data. If there are many data points, calculating shapley values for every point can take a long time, so sampling reduces training time.

  • Argument Type: Custom
  • Input Type
    • float
    • int
  • Possible Values
    • 10000 (default)
    • Float between 0 and 1 (exclusive)
      • Sample the specified proportion of data.
    • 1
      • Sample all data.
    • Integer greater than 1
      • Sample the specified number of data points.
  • Usage
    • shapley_sampling: 10000
  • ui_args: X
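
The three value ranges above can be sketched as a small resolver. resolve_sample_size is a hypothetical helper for illustration; capping an absolute count at the data size is an assumption for the sketch, not documented TCR behavior.

```python
def resolve_sample_size(value, n_rows):
    """Translate a shapley_sampling value into a row count, per the rules
    above: a float in (0, 1) is a proportion, 1 means all rows, and an
    integer above 1 is an absolute count (capped at the data size here)."""
    if 0 < value < 1:
        return int(n_rows * value)
    if value == 1:
        return n_rows
    return min(int(value), n_rows)
```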

multiprocessing

Specify whether to use multiprocessing. The default value False means multiprocessing is not used. Currently, multiprocessing is not recommended for Mellerikat.

  • Argument Type: Custom
  • Input Type
    • boolean
  • Possible Values
    • False (default)
    • True
  • Usage
    • multiprocessing: False
  • ui_args: O

num_cpu_core

Specify the number of CPU cores to use during multiprocessing.

  • Argument Type: Custom
  • Input Type
    • int
  • Possible Values
    • 3 (default)
    • Integer greater than 0
  • Usage
    • num_cpu_core: 3
  • ui_args: O

TCR Version: 2.2.3