Version: Next

TCR Parameter

Updated 2025.04.29

experimental_plan.yaml Explanation

To apply AI Contents to your data, write information about the data and the Contents features you want to use in the experimental_plan.yaml file. When you install AI Contents in the solution folder, a default experimental_plan.yaml for each content is created under the solution folder. By entering the data information in this YAML file and modifying or adding the 'user arguments' provided for each function, you can run ALO to create a data analysis model with the desired settings.

experimental_plan.yaml structure

experimental_plan.yaml contains the various settings required to run ALO. Once you modify the 'dataset_uri' and 'function' parts of the train/inference sections, you can start using AI Contents right away.

Enter data path ('dataset_uri')

  • The parameters under 'train' and 'inference' specify the paths of the files to load and the files to save.
train:
  dataset_uri: [train_dataset/] # Data folder or folder list (no file paths)
  # dataset_uri: s3://mellerikat-test/tmp/alo/ # Example 1) All folders and files under the S3 key (prefix)
  artifact_uri: train_artifact/
  pipeline: [input, readiness, preprocess, sampling, train] # Execution target: list of functions

inference:
  dataset_uri: inference_dataset/
  # model_uri: model_artifacts/n100_depth5.pkl # Load an already trained model
  artifact_uri: inference_artifact/ # Optional) Files stored under pipeline['artifact']['workspace'] are compressed and uploaded as inference.tar.gz under this path
  pipeline: [input, readiness, preprocess, sampling, inference]
| Parameter Name | Default | Description & Options |
|---|---|---|
| dataset_uri (train) | [train_dataset/] | Enter the folder path where the training data is located (do not enter a csv file name). All csv files under the entered path are concatenated. |
| dataset_uri (inference) | inference_dataset/ | Enter the folder path where the inference data is located (do not enter a csv file name). All csv files under the entered path are concatenated. |

*All files in subfolders of the entered path are imported and merged. *All columns in the merged files must have the same names.
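The merge rule above can be sketched as follows; this is an illustrative stand-in (not ALO's actual loader), using row dictionaries in place of loaded csv files:

```python
def merge_tables(tables):
    """Merge the rows of several loaded CSV files into one table, mirroring
    how all files under dataset_uri are concatenated. Illustrative sketch:
    every file must expose exactly the same column names, or loading fails."""
    merged = []
    columns = None
    for rows in tables:
        for row in rows:
            if columns is None:
                columns = set(row.keys())
            if set(row.keys()) != columns:
                raise ValueError("All merged files must have the same column names")
            merged.append(row)
    return merged
```
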

User Parameter ('function')

  • The fields under 'function' are function names. 'function: input' below means the input step.
  • 'argument' holds the user arguments of that function ('function: input'). User arguments are data analysis settings provided for each function. For details, see the User arguments explanation below.

function: # Required) user-defined functions
  input:
    def: pipeline.input # {python filename}.{function name}
    argument:
      file_type: csv
      encoding: utf-8
  ...

User arguments explained

What are User arguments?

User arguments are parameters that set the behavior of each function, written under the 'argument' of each function in the experimental_plan.yaml. Each function that makes up the AI Contents pipeline provides user arguments so that users can apply various features to their data. Users can change and add user arguments, following the guide below, to model according to their data. User arguments are divided into 'Required arguments', which are pre-written in the experimental_plan.yaml, and 'Custom arguments', which the user adds after consulting the guide.

Required arguments

  • Required arguments are the default arguments that are shown directly in the experimental_plan.yaml. Most required arguments have a default value set in the YAML file.
  • Data-related arguments in the experimental_plan.yaml must be set by the user. (ex. x_columns, y_column)

Custom arguments

  • Custom arguments are not written in the experimental_plan.yaml by default, but they are provided by each function and can be used in addition. Use them by adding them to the 'argument' section of the relevant function in the YAML file.

TCR's pipeline is composed of Input - Readiness - Preprocess - Modeling (train/inference) functions, and the available user arguments differ according to the role of each function. Start by modeling with the required argument settings in your experimental_plan.yaml, then add user arguments to create a TCR model that fits your data!

train:
  ...
  pipeline: [input, readiness, preprocess, sampling, train] # list of functions to be executed

inference:
  ...
  pipeline: [input, readiness, preprocess, sampling, inference] # list of functions to be executed

Summary of User arguments

Below is a summary of the user arguments in TCR. Click an 'Argument name' to jump to its detailed description. Currently, TCR only provides user arguments for the train pipeline; the inference pipeline automatically fetches and reuses the argument settings used in train, so you only need to write user arguments for the train pipeline.

Default

  • The 'Default' field is the default value of the user argument.
  • If there is no default value, it is written as '-'.
  • If the default involves logic, it is marked 'Note the explanation'. Click the argument name to see the detailed description.

ui_args

  • The 'ui_args' in the table below indicates whether the 'ui_args' function is supported, which allows you to change the argument value in the UI of AI Conductor.
  • O: If you enter the argument name under 'ui_args' in the experimental_plan.yaml, you can change the arguments value in the AI Conductor UI.
  • X: Doesn't support the 'ui_args' feature.
  • For a more detailed explanation of 'ui_args', please check out the following guide: [Write UI Parameter](../../alo/alo-v3/register_ai_solution/write_ui_parameter)

User settings required?

  • In the table below, 'User setup required?' marks the user arguments that the user must check and change for AI Contents to work.
  • O: Arguments where you enter data-related information; check them before modeling.
  • X: If the user does not change the value, modeling proceeds with the default value.
| Step Name | Argument type | Argument | Default | Description | User setup required? | ui_args |
|---|---|---|---|---|---|---|
| Input | Required | file_type | csv | Enter the file extension of the input data. | X | O |
| Input | Required | encoding | utf-8 | Enter the encoding type of the input data. | X | O |
| Readiness | Required | x_columns | - | Enter the names of the x columns to train on. | O | O |
| Readiness | Required | y_column | - | Enter the y column name. | O | O |
| Readiness | Required | task_type | classification | Classification/Regression. | O | O |
| Readiness | Required | target_label | _major | Enter the class name used as the basis for calculating the metric in HPO. | X | X |
| Readiness | Required | column_types | auto | Enter column type (categorical/numeric) information ('auto' classifies column types automatically). | X | X |
| Readiness | Required | report | True | Output a summary csv for train/inference data. | X | O |
| Readiness | Custom | drop_x_columns | - | Use instead of x_columns when there are many column names to enter. | X | O |
| Readiness | Custom | groupkey_columns | - | Group the dataframe based on the values of the entered columns. | X | O |
| Readiness | Custom | min_rows | Note the explanation | Specifies the minimum number of rows required at training time. | X | X |
| Readiness | Custom | cardinality | 50 | Categorical columns are classified based on the entered value. | X | X |
| Readiness | Custom | num_cat_split | 10 | Value used when automatically classifying column types; adjusts the classification criteria. | X | X |
| Readiness | Custom | ignore_new_category | False | Controls behavior when new category values appear during inference. | X | X |
| Preprocess | Custom | save_original_columns | True | Decide whether to keep the original training columns (x_columns) in the Preprocess asset's resulting dataframe. | X | O |
| Preprocess | Custom | categorical_encoding | {binary: all} | Specifies the encoding methodology applied to categorical columns. | X | X |
| Preprocess | Custom | handle_missing | Note the explanation | Specifies how missing values are handled per column. | X | X |
| Preprocess | Custom | numeric_outlier | - | Select the outlier removal method applied to numeric columns. | X | X |
| Preprocess | Custom | numeric_scaler | - | Select the scaling method applied to numeric columns. | X | X |
| Sampling | Required | data_split | {method: cross_validation, options: 3} | Select the train/validation split methodology for HPO. | X | X |
| Sampling | Custom | over_sampling | - | Over-sampling methodology per label of the y column. | X | X |
| Sampling | Custom | under_sampling | - | Under-sampling methodology per label of the y column. | X | X |
| Sampling | Custom | random_state | - | Specify a random seed to get reproducible sampling results. | X | X |
| Train | Required | evaluation_metric | auto | Determine the evaluation metric for selecting the best model in HPO. | X | O |
| Train | Required | shapley_value | False | Determines whether to compute Shapley values and add them to output.csv. | X | O |
| Train | Required | output_type | all | Choose whether output.csv contains all columns or only the minimal (modeling result) columns. | X | O |
| Train | Custom | model_list | [rf, gbm, lgbm, cb, xgb] | Select the models to compare in HPO. | X | X |
| Train | Custom | hpo_settings | Note the explanation | Change the parameters of the models in model_list. | X | X |
| Train | Custom | shapley_sampling | 10000 | Number of rows sampled when extracting Shapley values. | X | X |
| Train | Custom | multiprocessing | False | Enter whether multiprocessing is enabled. | X | O |
| Train | Custom | num_cpu_core | 3 | Enter the number of CPU cores to use for multiprocessing. | X | O |

User arguments in detail

Input asset

file_type

Enter the file extension of the input data. Currently, only csv files are supported for AI Solution development.

  • Argument type: Required
  • Input type: string
  • Enterable values
  • csv (default)
  • Usage
  • file_type: csv
  • ui_args: O

encoding

Enter the encoding type of the input data. Currently, only utf-8 encoding is supported for AI Solution development.

  • Argument type: Required
  • Input type: string
  • Enterable values
  • utf-8 (default)
  • Usage
  • encoding: utf-8
  • ui_args: O

Readiness asset

x_columns

Enter the name of the training target x column in the Dataframe in the form of a list. The user must enter the input data correctly. If you have a large number of columns to enter, you can use the custom argument drop_x_columns to target the entire DataFrame for training. However, you must use either x_columns or drop_x_columns (arguments that do not use either should be removed or commented out in the YAML file).  

  • Argument type: Required
  • Input type: list
  • Enterable values
  • Column name list
  • Usage
  • x_columns: [col1, col2]
  • ui_args: O

y_column

Enter the name of the y column in the dataframe (a single column). The user must enter it correctly for the input data.

  • Argument type: Required
  • Input type: string
  • Enterable values
  • Column name
  • Usage
  • y_column: target
  • ui_args: O

task_type

Enter the type of solution task (classification/regression). Make sure the value matches the task you are solving.

  • Argument type: Required
  • Input type: string
  • Enterable values
  • classification (default)
  • regression
  • Usage
  • task_type: classification
  • ui_args: O

target_label

When training a classification model, determines the class of the y_column used as the basis for calculating the model evaluation metric during HPO. For example, if evaluation_metric is precision and target_label is 1, the model with the highest precision for label 1 is selected as the best model. This argument has no effect when task_type is regression.

  • Argument type: Required
  • Input type: string, list
  • Enterable values
  • _major (default)
  • Selects the class with the most rows in the y_column. (binary, multiclass)
  • _minor
  • Selects the class with the fewest rows in the y_column. (binary, multiclass)
  • _all
  • Uses all class names in the y_column. (multiclass only)
  • A label value
  • Enter one class name of the y_column. (binary, multiclass) ex) target_label: setosa
  • A list of label values
  • Enter multiple y_column classes. (multiclass only) ex) target_label: [setosa, versicolor]
  • Usage
  • target_label: _major
  • ui_args: X

column_types

Enter whether each training column (x_columns) type is numeric or categorical. If you use the default value 'auto', the readiness function automatically classifies each of the x_columns as numeric or categorical. If you need to always treat a particular column as numeric or categorical, use the column_types argument as shown below.

  • ex) column_types: {categorical_columns: [col1, col2]}
  • ex) column_types: {numeric_columns: [col1, col2]}
  • ex) column_types: {categorical_columns: [col1], numeric_columns: [col2]}

Columns entered into the column_types are categorical or numeric columns as specified by the user, and columns that are not entered are automatically classified as numeric/categorical columns by auto logic.

  • Argument type: Required
  • Input type: string, dictionary
  • Enterable values
  • auto (default)
  • {categorical_columns: column name list, numeric_columns: column name list}
  • Usage
  • column_types: auto
  • ui_args: X

report

Decide whether to create a summary csv file for the input data (train/inference). The data type (categorical/numeric), category information, cardinality, statistics, missing value count, and missing percentage are recorded. If ignore_new_category is True and the inference data contains category values that were not seen during training, a 'new-categories' column is also created.

  • Argument type: Required
  • Input type: boolean
  • Enterable values
  • True (default)
  • Creates {train/inference}_artifacts/extra_output/readiness/report.csv.
  • False
  • Does not create report.csv.
  • Usage
  • report: True
  • ui_args: O

drop_x_columns

If you have a large number of column names to enter, you can use drop_x_columns instead of x_columns: the entire set of dataframe columns is loaded and only the listed columns are dropped, the remainder becoming the training columns. You must use exactly one of x_columns and drop_x_columns (remove or comment out the unused argument in the YAML file). When drop_x_columns is [], all dataframe columns except groupkey_columns and y_column are used as training columns. If you enter a list of columns to exclude, all columns except groupkey_columns, y_column, and the listed columns are used. ex) With columns x0,x1,x2,x3,x4,y and drop_x_columns=[x0], groupkey_columns=[x1], y_column=y, the training columns are x2,x3,x4.

  • Argument type: Custom
  • Input type: list
  • Enterable values
  • []
  • Uses every dataframe column as a training column except groupkey_columns and y_column.
  • Column name list
  • Uses the remaining dataframe columns as training columns, excluding groupkey_columns, y_column, and the listed columns.
  • Usage
  • drop_x_columns: []
  • ui_args: O

groupkey_columns

The groupkey feature analyzes data by grouping it on the values of specific columns. In the table below, if you specify 'groupkey col' as the groupkey column, the rows with value A and the rows with value B in the 'groupkey col' column are modeled separately. Here 'groupkey col' is the groupkey column, and A and B are the groupkeys. The groupkey_columns argument enables this feature.

| x0 | ... | x10 | groupkey col |
|---|---|---|---|
| ... | ... | ... | A |
| ... | ... | ... | A |
| ... | ... | ... | B |
| ... | ... | ... | B |
| ... | ... | ... | A |

If you enter multiple column names as a list in groupkey_columns, the readiness function creates a single unified groupkey column that concatenates the values of the groupkey_columns. For example, if you enter 'groupkey_columns: [Gender, Pclass]', a new groupkey column named 'Gender_Pclass' is added to the input dataframe. For classification, the set of y_column classes must be the same in every groupkey: groups whose y_column classes differ are excluded from training. This means that when the y_column values consist of A, B, C, a group whose y_column contains only A and B is excluded from training. Groups that fail other training criteria in the readiness function are likewise excluded.

  • Argument type: Custom
  • Input type: list
  • Enterable values
  • Column name list
  • Usage
  • groupkey_columns: [col1, col2]
  • ui_args: O
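The unified-groupkey behavior described above can be sketched like this; `build_groupkey` is a hypothetical illustration inferred from the 'Gender_Pclass' example, not TCR source code:

```python
def build_groupkey(rows, groupkey_columns):
    """Sketch of the unified groupkey column (assumed behavior): the values of
    groupkey_columns are joined per row, and the new column is named by joining
    the column names, e.g. 'Gender_Pclass'."""
    name = "_".join(groupkey_columns)
    for row in rows:
        row[name] = "_".join(str(row[c]) for c in groupkey_columns)
    return name
```
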

min_rows

Specifies the minimum number of rows required at training time; if the training data does not meet it, an error is raised. The default depends on task_type: 30 for classification and 100 for regression. With the defaults, classification requires at least 30 rows per y label, and regression requires at least 100 rows in total. If the user enters a min_rows value, classification requires at least that many rows per y label, and regression requires at least that many rows in total. For example, adding 'min_rows: 50' to the experimental_plan.yaml means training only proceeds when there are at least 50 rows per y label for classification, or at least 50 rows in total for regression.

If you use the groupkey feature ([groupkey_columns](#groupkey_columns)), a groupkey that does not meet the min_rows condition is excluded from training. If no groupkey meets the min_rows condition, an error is raised.

  • Argument type: Custom
  • Input type: int
  • Enterable values
  • default
  • 30 (classification: at least 30 rows per class of the y_column)
  • 100 (regression: at least 100 rows in total)
  • Numeric value
  • Usage
  • min_rows: 50
  • ui_args: X
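As a rough sketch of the documented min_rows rule (a hypothetical helper, not TCR's implementation):

```python
from collections import Counter

def check_min_rows(y_values, task_type, min_rows=None):
    """Sketch of the min_rows check (assumed logic): classification needs at
    least min_rows rows per y label, regression needs min_rows rows in total.
    Defaults follow the documented 30/100 values."""
    if min_rows is None:
        min_rows = 30 if task_type == "classification" else 100
    if task_type == "classification":
        counts = Counter(y_values)
        return all(n >= min_rows for n in counts.values())
    return len(y_values) >= min_rows
```
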

cardinality

The cardinality condition that categorical columns must meet when the categorical/numeric auto-classification feature is used. A column is finally classified as categorical only if its number of unique values is at most the cardinality argument value. If a column's unique-value count exceeds the cardinality argument, it is not classified as categorical and is excluded from the training columns.

  • Argument type: Custom
  • Input type: int
  • Enterable values
  • 50 (default)
  • Numeric value
  • Usage
  • cardinality: 50
  • ui_args: X

num_cat_split

When using the categorical/numeric column auto-classification feature (column_types: auto), columns are classified by checking whether the top-N most frequent values are numeric or object. num_cat_split specifies the N value.

  • Argument type: Custom
  • Input type: int
  • Enterable values
  • 10 (default)
  • Numeric value
  • Usage
  • num_cat_split: 10
  • ui_args: X
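The auto-classification rule that num_cat_split (together with cardinality) describes could look roughly like this; `guess_column_type` is a hypothetical illustration, not the readiness source:

```python
from collections import Counter

def guess_column_type(values, num_cat_split=10, cardinality=50):
    """Illustrative sketch of the auto column-type logic (not TCR source):
    check whether the top num_cat_split most frequent values all parse as
    numbers; non-numeric columns must also satisfy the cardinality limit."""
    def is_number(v):
        try:
            float(v)
            return True
        except (TypeError, ValueError):
            return False
    top = [v for v, _ in Counter(values).most_common(num_cat_split)]
    if all(is_number(v) for v in top):
        return "numeric"
    if len(set(values)) <= cardinality:
        return "categorical"
    return "excluded"  # too many unique values for a categorical column
```
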

ignore_new_category

Controls the behavior when a value that was not seen during training appears in a categorical column during inference. For example, if you apply onehot encoding to a categorical column in train, the model learns the onehot-encoded columns from the train data. If a new category value appears during inference, it cannot be processed by the trained onehot-encoding columns. Therefore, if unseen category values are likely to appear during inference, use the ignore_new_category argument to control the behavior.

  • Argument type: Custom
  • Input type: boolean, float
  • Enterable values
  • False (default)
  • If a category value not seen during training appears during inference, an error is raised.
  • True
  • If a category value not seen during training appears during inference, it is treated as a missing value and inference proceeds. (Missing values are processed by the handling logic of the preprocess function.)
  • Catboost encoding can encode unseen category data without missing-value processing; see categorical_encoding.
  • Float value between 0 and 1, ex) 0.3
  • If the share of rows with unseen category values is less than 0.3 of the total data, they are treated as missing values and inference proceeds.
  • If the share is greater than or equal to 0.3, an error is raised. (If a groupkey column is used, the affected group is excluded instead.)
  • Usage
  • ignore_new_category: False
  • ui_args: X
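The three ignore_new_category modes can be summarized in a small sketch; `new_category_action` is a hypothetical helper whose semantics are assumed from the description above:

```python
def new_category_action(new_rows, total_rows, ignore_new_category=False):
    """Sketch of ignore_new_category semantics (assumed): False raises on any
    unseen category, True imputes it as a missing value, and a float threshold
    imputes only while the affected-row share stays below the threshold."""
    if ignore_new_category is False:
        return "error" if new_rows else "ok"
    if ignore_new_category is True:
        return "impute"
    share = new_rows / total_rows
    return "impute" if share < float(ignore_new_category) else "error"
```
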

Preprocess asset

save_original_columns

The preprocess function applies various preprocessing methodologies to the training columns (x_columns). save_original_columns chooses whether the original x_columns are kept in the preprocess function's resulting dataframe and passed to the next function. Regardless of this setting, subsequent functions use the preprocessed training columns as the training target.

  • Argument type: Custom
  • Input type: boolean
  • Enterable values
  • True (default)
  • Passes the columns unused for training + the original training columns (x_columns) + the preprocessed training columns to the next function.
  • The y column and the preprocessed y column are included in the dataframe.
  • False
  • Passes the columns unused for training + the preprocessed training columns to the next function. (Original x_columns deleted)
  • The y column and the preprocessed y column are included in the dataframe.
  • Usage
  • save_original_columns: True
  • ui_args: O

categorical_encoding

categorical_encoding specifies the encoding methodology to apply to the categorical columns. Enter it as a dictionary of {methodology: value}, where 'value' is a column list or 'all' ('all' meaning every categorical column). The currently supported categorical encodings are listed below. categorical_encoding only works on the categorical columns among the training columns (x_columns). For classification task_type, label encoding is always applied to the y column and cannot be changed.

  • binary: binary encoding
  • catboost: catboost encoding
  • onehot: onehot encoding
  • label: label encoding

Currently, by default, binary encoding is applied to all categorical columns for training. When using categorical_encoding, if you specify some columns, the rest of the columns will automatically be subject to the default rule (binary). For example, if categorical_encoding: {label: [col1]}, then binary encoding is applied to all categorical columns except col1.

  • Argument type: Custom
  • Input type: dictionary
  • Enterable values
  • default
  • x_columns: {binary: all}
  • y_column: label encoding
  • {Methodology 1: column list, Methodology 2: column list}
  • Usage
  • categorical_encoding: {binary: [col1], catboost: [col2]}
  • ui_args: X
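The fallback-to-binary rule described above can be sketched as follows; `resolve_encoding_plan` is a hypothetical helper (assumed logic, not the preprocess source):

```python
def resolve_encoding_plan(categorical_columns, categorical_encoding):
    """Sketch of how per-column encodings could be resolved (assumed logic):
    columns named in categorical_encoding get that method, and every remaining
    categorical column falls back to the default binary rule."""
    plan = {}
    for method, cols in categorical_encoding.items():
        targets = categorical_columns if cols == "all" else cols
        for col in targets:
            plan[col] = method
    for col in categorical_columns:
        plan.setdefault(col, "binary")
    return plan
```
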

handle_missing

handle_missing specifies how missing values are handled for categorical and numeric columns. Enter it as a dictionary of {methodology: value}. Categorical and numeric columns support different methodologies. For 'value', you can use a column list, 'categorical_all', 'numeric_all', or 'all'. Unless the user specifies a methodology, handle_missing applies default logic to the missing values of the training columns; when you enter only some columns, the remaining columns automatically follow the default rule. handle_missing only works on the training columns (x_columns). In the train pipeline, rows with a missing y column value are automatically deleted.

  • Methodologies for categorical columns only
  • For 'value' you can enter a categorical column list or 'categorical_all' (all categorical columns).
  • frequent: Fills missing values with the most frequent value in the column.
  • Methodologies for numeric columns only
  • For 'value' you can enter a numeric column list or 'numeric_all' (all numeric columns).
  • mean: Fills missing values with the mean of the column.
  • median: Fills missing values with the median of the column.
  • interpolation: Fills each missing value with the average of the neighboring values in the column.
  • Methodologies applicable to all column types
  • For 'value' you can enter a column list, 'all', 'categorical_all', or 'numeric_all' ('all' meaning every column).
  • drop: Removes rows with missing values in the column.
  • fill_{value}: Fills missing values in the column with the entered value.

categorical_all, numeric_all, and all are used as follows.

  • handle_missing: {frequent: categorical_all, fill_0: numeric_all}
  • Fills categorical columns with the most frequent value (available only for the categorical methodology) and fills the missing values of numeric columns with 0.
  • handle_missing: {fill_0: categorical_all, fill_1: numeric_all}
  • Fills missing values with 0 for categorical columns and 1 for numeric columns.
  • handle_missing: {fill_0: all}
  • Fills missing values in all columns with 0.
  • handle_missing: {fill_0: numeric_all}
  • Numeric columns are filled with 0; categorical columns follow the default logic.

categorical_all and numeric_all can be used together, but categorical_all and all, or numeric_all and all, cannot be used together.

  • Argument type: Custom
  • Input type: dictionary
  • Enterable values
  • default
  • x_columns: {frequent: categorical_all, median: numeric_all}
  • y_column: drop applied
  • {Methodology 1: column list, Methodology 2: column list}
  • Usage
  • handle_missing: {fill_1: [col1], fill_2: [col2]}
  • ui_args: X
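A minimal sketch of several of the methodologies above on a single column, assuming None marks a missing value (`handle_missing_column` is illustrative only, not TCR's preprocessing code; the fill_{value} case inserts the raw text here):

```python
from collections import Counter

def handle_missing_column(values, method):
    """Sketch of a few handle_missing methods on one column, with None as the
    missing marker (illustrative; fill_<value> inserts the raw text here)."""
    present = [v for v in values if v is not None]
    if method == "drop":
        return present
    if method == "frequent":
        fill = Counter(present).most_common(1)[0][0]
    elif method == "mean":
        fill = sum(present) / len(present)
    elif method == "median":
        s = sorted(present)
        m = len(s) // 2
        fill = s[m] if len(s) % 2 else (s[m - 1] + s[m]) / 2
    elif method.startswith("fill_"):
        fill = method[len("fill_"):]
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in values]
```
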

numeric_outlier

Select the outlier removal method to apply to the numeric columns. Enter it as a dictionary of {methodology: value}, where 'value' is a column list or 'all' ('all' meaning every numeric column). The currently supported outlier removal methodologies are listed below. numeric_outlier only works on the numeric columns among the training columns (x_columns).

  • normal: Removes outliers greater than 3 sigma from the current data distribution

numeric_outlier has no default value; if the user does not register it in experimental_plan.yaml, no methodology is applied.

  • Argument type: Custom
  • Input type: dictionary
  • Enterable values
  • No default
  • {Methodology: column list}
  • Usage
  • numeric_outlier: {normal: [col1, col2]}
  • ui_args: X

numeric_scaler

Select the scaling method to apply to the numeric columns. Enter it as a dictionary of {methodology: value}, where 'value' is a column list or 'all' ('all' meaning every numeric column). The currently supported scaling methodologies are listed below. numeric_scaler only works on the numeric columns among the training columns (x_columns).

  • standard: Scales using the mean and standard deviation. z=(x-u)/s (u: mean, s: std)
  • minmax: Scales so the maximum value is 1 and the minimum value is 0.
  • robust: Scales using the median and quartiles instead of the mean and variance.
  • maxabs: Scales the data so the maximum absolute value is 1 and 0 stays 0.
  • normalizer: Normalizes per row rather than per column, scaling so the Euclidean norm of each row is 1.

numeric_scaler has no default value; if the user does not register it in experimental_plan.yaml, no methodology is applied.

  • Argument type: Custom
  • Input type: dictionary
  • Enterable values
  • No default
  • {Methodology: column list}
  • Usage
  • numeric_scaler: {standard: [col1], minmax: [col2]}
  • ui_args: X

Sampling asset

data_split

Select the methodology for building the train/validation sets for HPO. Enter a dictionary of {method: methodology, options: value}. The possible methodology/value combinations are:

  • cross validation
  • {method: cross_validation, options: 3}
  • Uses a cross-validation methodology, where options is the k of k-fold; the example above sets k to 3.
  • train/test split
  • {method: train_test, options: 0.3}
  • Divides the data into train/validation sets by sampling. In options, enter the fraction of the validation set; the example above trains with train:validation = 7:3.

You can check which cross-validation fold each row belonged to, or whether it was used as a train or validation row, in the 'data_split' column of output.csv.

  • Argument type: Required
  • Input type: dictionary
  • Enterable values
  • {method: cross_validation, options: 3} (default)
  • {method: methodology, options: value}
  • Usage
  • data_split: {method: cross_validation, options: 3}
  • ui_args: X
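The two data_split forms can be sketched as a simple fold-assignment routine; `assign_folds` and its fold labels are hypothetical (assumed logic, not the sampling source):

```python
import random

def assign_folds(n_rows, method="cross_validation", options=3, seed=0):
    """Sketch of data_split row assignment (assumed logic): cross_validation
    deals shuffled rows into k folds, train_test carves off a validation
    fraction. Mirrors {method: ..., options: ...} from the YAML."""
    rng = random.Random(seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)
    if method == "cross_validation":
        k = int(options)
        return {i: f"fold{pos % k}" for pos, i in enumerate(idx)}
    n_valid = int(n_rows * float(options))  # options = validation fraction
    return {i: ("validation" if pos < n_valid else "train")
            for pos, i in enumerate(idx)}
```
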

over_sampling

Applies an over-sampling methodology to labels of the y column. over_sampling arguments come in two types, depending on how the number of rows to sample is calculated.

  1. ratio: Sampling so that the label of the y_column reaches the entered multiple of its count
over_sampling: {
  method: random,
  label: B,
  ratio: 2
}
# Random over sampling so that label B is doubled
  2. compare: Sampling so that the label of the y_column is a multiple of the compare target label
over_sampling: {
  method: random,
  label: B,
  compare: {
    target: A,
    multiply: 10
  }
}
# Sampling so that label B is 10 times A

Write the dictionary with the following keys and values.

key: method

  • Enter the over sampling methodology. The available methodologies are:
  • random: Applies random over sampling.
  • smote: Applies the SMOTE methodology for over sampling.

key: label

  • Enter the label of the y_column to apply the sampling methodology.
  • 1 label value. ex) A
  • If there are multiple label values to apply sampling to, write it as a list. ex) [A, B]

key: ratio (type 1)

  • Samples each 'label' so that its count becomes the entered multiple of its current count.
  • Enter a float value. ex) 2.5
  • If you enter a number of 1 or less, sampling is not applied.

key: compare (type 2)

  • Samples a label to be n times the target label. Write a sub-dictionary. ex) compare: {target: C, multiply: 10}
  • sub_key: target
  • Enter the label used as the basis for determining the number of rows to sample.
  • One label value. ex) compare: {target: C ...}
  • sub_key: multiply
  • Samples each 'label' to multiply times the target's count. Enter a float value. ex) label: [A,B], compare: {target: C, multiply: 10} oversamples so that A and B are each 10 times C.
  • If 'label' is a list and multiply is also a list, each multiply value is applied to the corresponding label. ex) label: [A, B], compare: {target: C, multiply: [2, 3]}: A is oversampled to 2 times C, B to 3 times C.
  • If a label already contains more rows than the computed target, it is not oversampled. For example, if over sampling would produce 100 rows but the label already has 200, over sampling is not applied.
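The target-count rules for ratio and compare, including the do-not-oversample exception above, can be sketched as follows; `oversample_target` is a hypothetical helper with assumed semantics:

```python
def oversample_target(counts, label, ratio=None, compare=None):
    """Sketch of the target row count for one label under over_sampling
    (assumed semantics): ratio multiplies the label's own count, compare
    multiplies the target label's count; a label already above the target
    is left unchanged."""
    current = counts[label]
    if ratio is not None:
        target = int(current * ratio)
    else:
        target = int(counts[compare["target"]] * compare["multiply"])
    return max(current, target)
```
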

over_sampling has no default value; if the user does not register it in experimental_plan.yaml, no methodology is applied.

  • Argument type: Custom
  • Input type
  • dictionary
  • Enterable values
  • Write the dictionary format above.
  • ratio type
  • {method: methodology, label: label name, ratio: float}
  • ex) {method: smote, label: A, ratio: 10} - smote over sampling label A of y_column by 10 times.
  • ex) {method: random, label: [A,B], ratio: 10} - Random oversampling of label A and B of y_column 10 times.
  • ex) {method: smote, label: [A,B], ratio: [10,12]} - smote over sampling label A by 10 times and B by 12 times in y_column.
  • compare type
  • {method: methodology, label: label name, compare: {target: label name, multiply: float}}
  • ex) {method: random, label: A, compare: {target: C, multiply: 5}} - Random oversampling of label A in the y_column to be 5 times the number of label C.
  • ex) {method: random, label: [A,B], compare: {target: C, multiply: 5}} - Random oversampling of label A and B in the y_column so that they are 5 times the number of label C.
  • ex) {method: random, label: [A,B], compare: {target: C, multiply: [5,10]}} - Random over sampling so that label A in y_column becomes 5 times label C and B becomes 10 times label C.
  • usage
  • over_sampling: {method: smote, label: A, ratio: 10}
  • ui_args: X
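The ratio and compare rules above boil down to computing a target row count per label. Below is a minimal Python sketch of that arithmetic, including the rule that over sampling never shrinks a label. This is illustrative only; `oversample_targets` is a hypothetical helper, not part of TCR.

```python
def oversample_targets(counts, labels, ratio=None, compare=None):
    """Compute per-label target counts for over sampling.

    counts:  {label: current row count}
    labels:  a single label or a list of labels (the 'label' key)
    ratio:   float or list of floats (type 1)
    compare: {"target": label, "multiply": float or list} (type 2)
    """
    labels = labels if isinstance(labels, list) else [labels]
    targets = {}
    for i, lab in enumerate(labels):
        if ratio is not None:
            r = ratio[i] if isinstance(ratio, list) else ratio
            desired = int(counts[lab] * r)
        else:  # compare: multiply times the target label's count
            m = compare["multiply"]
            m = m[i] if isinstance(m, list) else m
            desired = int(counts[compare["target"]] * m)
        # Over sampling only adds rows: a ratio of 1 or less, or a label
        # that already exceeds the desired count, is left unchanged.
        targets[lab] = max(counts[lab], desired)
    return targets
```

For example, `oversample_targets({"A": 10}, "A", ratio=2.5)` yields a target of 25 rows for label A.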

under_sampling

Applies the under sampling methodology to y_column. under_sampling arguments are divided into 2 types, depending on how the number of rows to sample is calculated.

  1. ratio: Sampling so that the label of the y_column becomes the entered ratio
under_sampling: {
    method: random,
    label: B,
    ratio: 0.5
}
# Random under sampling so that label B becomes 0.5 times its current count
  2. compare: Sampling so that the label of the y_column becomes multiply times the compare target label
under_sampling: {
    method: random,
    label: B,
    compare: {
        target: A,
        multiply: 2
    }
}
# Under sampling so that label B becomes twice label A
# In this case, 2 times the count of A must be less than the count of B for under sampling to apply

Write the argument as a dictionary, with each key and value as follows.

key: method

  • Enter the under sampling methodology. The available methodologies are shown below.
  • random: Applies random under sampling.
  • nearmiss: Samples the data that is hardest to distinguish between the minority and majority classes (rows that lie close to each other).

key: label

  • Enter the label of the y_column to apply the sampling methodology.
  • 1 label value. ex) A
  • If there are multiple label values to apply sampling to, write it as a list. ex) [A, B]

key: ratio(type1)

  • Under-samples each 'label' so that its count becomes 'ratio' times its current count.
  • Enter a float value. ex) 0.7
  • If you enter a value greater than 1, sampling is not applied.

key: compare(type2)

  • Samples so that the label becomes n times the target label. Written as a sub-dictionary. ex) compare: {target: C, multiply: 0.5}
  • sub_key: target
  • Enter the label used as the reference when determining the number of rows to sample.
  • 1 label value. ex) compare: {target: C ...}
  • sub_key: multiply
  • Under-samples the label to multiply times the target label's count.
  • Enter a float value. ex) label: [A,B], compare: {target: C, multiply: 0.5} - under-samples so that A and B each become 0.5 times C.
  • If both 'label' and multiply are written as lists, each multiply value is applied to the corresponding label. ex) label: [A, B], compare: {target: C, multiply: [0.2, 0.3]} - A is under-sampled to 0.2 times C, B to 0.3 times C.
  • If the label already contains fewer rows than the computed target, under sampling is not applied. For example, if under sampling would reduce a label to 100 rows but it already has only 90, it is left as-is.

under_sampling has no default value. If you do not enter it in experimental_plan.yaml, no under sampling methodology is applied.

  • Argument type: Custom
  • Input type
  • dictionary
  • Enterable values
  • Write the dictionary format above.
  • ratio type
  • {method: methodology, label: label name, ratio: float less than 1}
  • ex) {method: nearmiss, label: A, ratio: 0.5} - Samples label A of y_column by 0.5 times.
  • ex) {method: random, label: [A,B], ratio: 0.5} - Random under sampling of label A and B by 0.5 times in y_column.
  • ex) {method: random, label: [A,B], ratio: [0.5,0.3]} - Random under sampling of label A by 0.5 times and B by 0.3 times in y_column.
  • compare type
  • {method: methodology, label: label name, compare: {target: label name, multiply: float}}
  • ex) {method: random, label: A, compare: {target: C, multiply: 0.5}} - Random under sampling of label A in the y_column so that it is 0.5 times the number of label C.
  • ex) {method: random, label: [A,B], compare: {target: C, multiply: 0.5}} - Random under sampling of label A and B of the y_column so that they are 0.5 times the number of label C.
  • ex) {method: random, label: [A,B], compare: {target: C, multiply: [0.5,0.2]}} - Random under sampling so that label A in y_column becomes 0.5 times label C and B becomes 0.2 times label C.
  • usage
  • under_sampling: {method: nearmiss, label: A, ratio: 0.5}
  • ui_args: X
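Under sampling mirrors the over sampling arithmetic, with the guard reversed: rows are only ever removed, never added. A minimal Python sketch (illustrative only; `undersample_targets` is a hypothetical helper, not part of TCR):

```python
def undersample_targets(counts, labels, ratio=None, compare=None):
    """Compute per-label target counts for under sampling.

    counts:  {label: current row count}
    labels:  a single label or a list of labels (the 'label' key)
    ratio:   float or list of floats (type 1)
    compare: {"target": label, "multiply": float or list} (type 2)
    """
    labels = labels if isinstance(labels, list) else [labels]
    targets = {}
    for i, lab in enumerate(labels):
        if ratio is not None:
            r = ratio[i] if isinstance(ratio, list) else ratio
            desired = int(counts[lab] * r)
        else:  # compare: multiply times the target label's count
            m = compare["multiply"]
            m = m[i] if isinstance(m, list) else m
            desired = int(counts[compare["target"]] * m)
        # Under sampling never adds rows: if the label already has fewer
        # rows than the computed target, it is left unchanged.
        targets[lab] = min(counts[lab], desired)
    return targets
```

For example, with labels A: 90 rows and C: 50 rows, a compare rule of `{target: C, multiply: 2}` computes a target of 100 for A, so A stays at 90 and no under sampling occurs.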

random_state

If you specify a random_state value, every sampling run produces the same result.

  • Argument type: Custom
  • Input type
  • int
  • Enterable values
  • Positive integer
  • usage
  • random_state: 123
  • ui_args: X
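The effect of fixing random_state can be illustrated with Python's standard library: seeding a dedicated generator makes every sampling draw repeatable. This is a sketch of the principle, not TCR code; `sample_rows` is a hypothetical helper.

```python
import random

def sample_rows(n_rows, k, random_state=None):
    """Pick k row indices; a fixed random_state makes the draw repeatable."""
    rng = random.Random(random_state)  # seeded generator, independent of global state
    return rng.sample(range(n_rows), k)
```

Two calls with the same seed return the identical index list, which is what makes sampling-based experiments reproducible.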

Train asset

evaluation_metric

Select the evaluation metric used to choose the best model during HPO. With the default value 'auto', accuracy is used when task_type is classification and mse when it is regression. If several models have the same evaluation_metric value during HPO, the model is chosen according to the following priority.

  • When the evaluation_metric values are the same:
  • For classification, the remaining metrics (excluding the evaluation_metric) are compared per model in the order accuracy, f1, recall, precision. (If you selected accuracy, the values are compared in the order f1, recall, precision.)
  • For regression, the remaining metrics (excluding the evaluation_metric) are compared per model in the order r2, mse, mae, rmse.
  • When all metrics are equal:
  • The smaller model is selected; if the model sizes are similar, models are preferred in the order RF, LGBM, GBM, XGB, CB.

However, when all evaluation metrics are the same, a model the user added themselves has the highest priority.

  • Argument type: Required

  • Input type
  • string
  • Enterable values
  • auto (default)
  • When task_type is classification: accuracy
  • When task_type is regression: MSE
  • task_type: When it comes to classification
  • accuracy
  • f1
  • recall
  • precision
  • task_type: When regression
  • mse
  • r2
  • mae
  • rmse
  • usage
  • evaluation_metric: auto
  • ui_args: O
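The tie-break described above amounts to comparing models on a tuple of metrics, primary first. A sketch for the classification case, where higher is better for every metric (`best_model` is a hypothetical helper; a regression variant would have to invert the comparison for mse, mae, and rmse, where lower is better):

```python
# Classification tie-break order from the documentation.
CLS_ORDER = ["accuracy", "f1", "recall", "precision"]

def best_model(scores, primary="accuracy"):
    """Pick the best model name from scores: {model: {metric: value}}.

    Models tied on the primary metric are compared on the remaining
    metrics in CLS_ORDER, matching the priority described above.
    """
    order = [primary] + [m for m in CLS_ORDER if m != primary]
    return max(scores, key=lambda name: tuple(scores[name][m] for m in order))
```

For example, two models tied on accuracy are separated by their f1 scores, then recall, then precision.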

shapley_value

Calculates the Shapley value and decides whether it is written to output.csv. When the Shapley value is calculated (shapley_value: True), a summary plot is also stored at the path {Output folder}/extra_output/train/summary_plot.png. You can use the summary plot to see how each feature affects each class.

  • Argument type: Required
  • Input type
  • boolean
  • Enterable values
  • False (default)
  • Doesn't calculate shapley value.
  • True
  • Calculate the Shapley value.
  • usage
  • shapley_value: False
  • ui_args: O

output_type

Determines whether output.csv contains only the modeling result columns or all columns of the data. The modeling result columns are shown below.

  • prob_{y class name}, ...
  • The probability with which the model assigns the row to that class. One column is created per class.
  • pred_{y column name}
  • The y-value column predicted by the model.
  • shap_{training column name}
  • When shapley_value is True, Shapley value columns are written, one per training column (x_columns).

If you set the output_type to 'all', the entire data from the train/inference asset and the modeling result columns will be stored in the output.csv. If you write the output_type as 'simple', only the columns of the modeling results will be stored in the output.csv. If the data you use for analysis is large, you can reduce the output.csv file size by setting the output_type to "simple".

  • Argument type: Required
  • Input type
  • string
  • Enterable values
  • all (default)
  • Save both the data from the asset and the modeling result columns in the output.csv.
  • simple
  • Save only the modeling result columns to output.csv.
  • usage
  • output_type: all
  • ui_args: O
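The column layout of output.csv under both settings can be sketched as follows. This is illustrative; `output_columns` is a hypothetical helper that follows the naming scheme above, not TCR code.

```python
def output_columns(x_columns, y_column, classes, output_type="all", shapley=False):
    """Columns written to output.csv, following the documented naming scheme."""
    # Modeling result columns: per-class probabilities, then the prediction.
    result = [f"prob_{c}" for c in classes] + [f"pred_{y_column}"]
    if shapley:
        # One shapley column per training column when shapley_value is True.
        result += [f"shap_{c}" for c in x_columns]
    # 'all' keeps the input data columns in front of the modeling results;
    # 'simple' keeps only the modeling results (smaller output.csv).
    return (x_columns + result) if output_type == "all" else result
```

With large datasets, switching from 'all' to 'simple' drops the input columns and shrinks the file accordingly.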

model_list

Enter the models to compare during HPO in the form of a list. TCR currently ships with 5 tree-series models, and HPO is performed for all 5 unless the user adds a model_list argument. The default models currently available in TCR are as follows.

  • rf: random forest
  • gbm: gradient boosting machine
  • lgbm: light gradient boosting machine
  • cb: catboost
  • xgb: extreme gradient boosting

If you enter an empty list ([]) for model_list, the default ([rf, gbm, lgbm, cb, xgb]) is used. A model listed in hpo_settings but missing from model_list is not added to the HPO. If you create a new model from the model template during solution development and want it included in the HPO list, you must add the model's name to model_list.

  • Argument type: Custom
  • Input type
  • list
  • Enterable values
  • [rf, gbm, lgbm, cb, xgb] (default; entering [] behaves the same)
  • usage
  • model_list: [rf, gbm, lgbm, cb, xgb]
  • ui_args: X

hpo_settings

Changes the hyperparameters of the models in [model_list](#model_list). Written as {model name: {parameter1: search list, tcr_param_mix: 'one_to_one'}}.

{rf: {max_depth: [100, 300, 500], n_estimators: [300, 400, 500], min_sample_leaf: 3, tcr_param_mix: one_to_one}}

In the example above, the search candidates are 100, 300, 500 for max_depth and 300, 400, 500 for n_estimators. min_sample_leaf is given the single value 3; when a parameter's value is a number rather than a list, that parameter is frozen at that value. The values that can be entered for 'tcr_param_mix', and what they do, are as follows.

  • one_to_one
  • Each element corresponds 1:1 when running HPO. If parameter values are lists, they must all have the same number of elements.
  • In one_to_one, the example above would be {max_depth: 100, n_estimators: 300, min_sample_leaf: 3}, {max_depth: 300, n_estimators: 400, min_sample_leaf: 3}, {max_depth: 500, n_estimators: 500, min_sample_leaf: 3}.
  • all
  • Proceed with HPO with any combination of the parameter list you entered.
  • With all, the example above runs HPO over every combination of max_depth in [100, 300, 500] and n_estimators in [300, 400, 500], with min_sample_leaf fixed at 3: {max_depth: 100, n_estimators: 300, min_sample_leaf: 3}, {max_depth: 100, n_estimators: 400, min_sample_leaf: 3}, ..., {max_depth: 500, n_estimators: 500, min_sample_leaf: 3} (9 combinations in total).

For models that are in model_list but not in hpo_settings, the default parameters listed in the model file are used. In other words, if model_list is the default (5 models) and hpo_settings is {rf: {max_depth: [100, 300, 500], n_estimators: [300, 400, 500], min_sample_leaf: 3, tcr_param_mix: one_to_one}}, the 4 models other than rf use the default parameters in their model files.

  • Argument type: Custom
  • Input type
  • dictionary
  • Enterable values
  • Using the default parameter set in the model file (default)
  • {model name: {parameter1: search list, tcr_param_mix: one_to_one or all}}
  • usage
  • hpo_settings: {rf: {max_depth: [100, 300], n_estimators: 300, tcr_param_mix: one_to_one}}
  • ui_args: X      
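The expansion of one hpo_settings entry into candidate parameter sets under one_to_one and all can be sketched as follows. This is an illustrative reimplementation of the documented behavior, not the TCR source; `expand_params` is a hypothetical helper.

```python
from itertools import product

def expand_params(spec):
    """Expand one model's hpo_settings entry into candidate parameter dicts.

    Scalars are frozen at their value; lists are search ranges, combined
    according to tcr_param_mix (one_to_one pairs elements 1:1, all takes
    the full Cartesian product).
    """
    spec = dict(spec)  # don't mutate the caller's dict
    mix = spec.pop("tcr_param_mix", "all")
    lists = {k: v for k, v in spec.items() if isinstance(v, list)}
    fixed = {k: v for k, v in spec.items() if not isinstance(v, list)}
    if mix == "one_to_one":
        n = len(next(iter(lists.values()))) if lists else 1
        assert all(len(v) == n for v in lists.values()), "lists must match in length"
        return [{**fixed, **{k: v[i] for k, v in lists.items()}} for i in range(n)]
    # 'all': every combination of the list-valued parameters
    keys = list(lists)
    return [{**fixed, **dict(zip(keys, combo))}
            for combo in product(*(lists[k] for k in keys))]
```

On the rf example above, one_to_one yields 3 candidate parameter sets while all yields 9, with min_sample_leaf frozen at 3 in every candidate.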

shapley_sampling

When shapley_value is True, you can compute the Shapley value on a sample of the data instead of all of it. Computing Shapley values for every row takes a long time on large datasets, so sampling part of the data reduces the training time.

  • Argument type: Custom
  • Input type
  • float
  • int
  • Enterable values
  • 10000 (default)
  • Float between 0 and 1
  • Samples that fraction of the rows.
  • 1
  • Samples all rows.
  • int greater than 1
  • Samples that many rows.
  • usage
  • shapley_sampling: 10000
  • ui_args: X
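How the shapley_sampling value is interpreted (fraction vs. count) can be sketched as follows. This is an illustrative reading of the rules above, not TCR code; in particular, capping an integer setting at the number of available rows is an assumption.

```python
def shapley_sample_size(n_rows, setting=10000):
    """Number of rows used for the Shapley computation (illustrative sketch)."""
    if setting == 1:
        return n_rows                      # 1 means use every row
    if isinstance(setting, float) and 0 < setting < 1:
        return int(n_rows * setting)       # fraction of the data
    return min(int(setting), n_rows)       # absolute count (cap is an assumption)
```

For a 50,000-row dataset, the default of 10000 samples 10,000 rows, while 0.2 samples the same number via the fractional form.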

multiprocessing

Enter whether to enable multiprocessing. The default value is False, meaning multiprocessing is not used, and Mellerikat does not currently recommend using multiprocessing.

  • Argument type: Custom
  • Input type
  • Boolean
  • Enterable values
  • False (default)
  • True
  • usage
  • multiprocessing: False
  • ui_args: O

num_cpu_core

Enter the number of CPU cores to use in multiprocessing.

  • Argument type: Custom
  • Input type
  • int
  • Enterable values
  • 3 (default)
  • ints greater than 0
  • usage
  • num_cpu_core: 3
  • ui_args: O

TCR Version: 3.0.0