Version: Next

TCR Parameter

Updated 2025.04.29

experimental_plan.yaml Explanation

To apply AI Contents to your data, write information about the data and the Contents features you want to use in the experimental_plan.yaml file. When you install AI Contents in the solution folder, a default experimental_plan.yaml for each content is created under the solution folder. By entering the data information in this YAML file and modifying or adding the 'user arguments' provided for each function, you can run ALO to create a data analysis model with the desired settings.

experimental_plan.yaml structure

experimental_plan.yaml contains the various settings required to run ALO. Once you modify the 'dataset_uri' and 'function' parts of the train/inference sections, you can start using AI Contents right away.

Enter data path ('dataset_uri')

  • The parameters under 'train' and 'inference' specify the paths of the files to load and the files to save.
train:
  dataset_uri: [train_dataset/] # Data folder or folder list (no file paths)
  # dataset_uri: s3://mellerikat-test/tmp/alo/ # Example 1) All folders and files under the S3 key (prefix)
  artifact_uri: train_artifact/
  pipeline: [input, readiness, preprocess, sampling, train] # Execution target: list of functions

inference:
  dataset_uri: inference_dataset/
  # model_uri: model_artifacts/n100_depth5.pkl # Load an already trained model
  artifact_uri: inference_artifact/ # Optional) Files stored under pipeline['artifact']['workspace'] are compressed and uploaded as inference.tar.gz under this path
  pipeline: [input, readiness, preprocess, sampling, inference]
| Parameter Name | Default | Description & Options |
|---|---|---|
| dataset_uri (train) | [train_dataset/] | Enter the folder path where the training data is located (do not enter a csv file name). All csv files under the entered path are concatenated. |
| dataset_uri (inference) | inference_dataset/ | Enter the folder path where the inference data is located (do not enter a csv file name). All csv files under the entered path are concatenated. |

*All files in subfolders of the entered path are imported and merged. *All columns in the merged files must have the same names.
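The merge rule above can be sketched as follows; this is an illustrative stand-in (not ALO's actual loader), using row dictionaries in place of loaded csv files:

```python
def merge_tables(tables):
    """Merge the rows of several loaded CSV files into one table, mirroring
    how all files under dataset_uri are concatenated. Illustrative sketch:
    every file must expose exactly the same column names, or loading fails."""
    merged = []
    columns = None
    for rows in tables:
        for row in rows:
            if columns is None:
                columns = set(row.keys())
            if set(row.keys()) != columns:
                raise ValueError("All merged files must have the same column names")
            merged.append(row)
    return merged
```
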

User Parameter ('function')

  • The fields under 'function' are function names. 'function: input' below means the input step.
  • 'argument' holds the user arguments of that function ('function: input'). User arguments are data analysis settings provided for each function. For details, see the User arguments explanation below.

function: # Required) user-defined functions
  input:
    def: pipeline.input # {python filename}.{function name}
    argument:
      file_type: csv
      encoding: utf-8
  ...

User arguments explained

What are User arguments?

User arguments are parameters that set the behavior of each function, written under the 'argument' of each function in the experimental_plan.yaml. Each function that makes up the AI Contents pipeline provides user arguments so that users can apply various features to their data. Users can change and add user arguments, following the guide below, to model according to their data. User arguments are divided into 'Required arguments', which are pre-written in the experimental_plan.yaml, and 'Custom arguments', which the user adds after consulting the guide.

Required arguments

  • Required arguments are the default arguments that are shown directly in the experimental_plan.yaml. Most required arguments have a default value set in the YAML file.
  • Data-related arguments in the experimental_plan.yaml must be set by the user. (ex. x_columns, y_column)

Custom arguments

  • Custom arguments are not written in the experimental_plan.yaml by default, but they are provided by each function and can be used in addition. Use them by adding them to the 'argument' section of the relevant function in the YAML file.

TCR's pipeline is composed of Input - Readiness - Preprocess - Modeling (train/inference) functions, and the available user arguments differ according to the role of each function. Start by modeling with the required argument settings in your experimental_plan.yaml, then add user arguments to create a TCR model that fits your data!

train:
  ...
  pipeline: [input, readiness, preprocess, sampling, train] # list of functions to be executed

inference:
  ...
  pipeline: [input, readiness, preprocess, sampling, inference] # list of functions to be executed

Summary of User arguments

Below is a summary of the user arguments in TCR. Click an 'Argument name' to jump to its detailed description. Currently, TCR only provides user arguments for the train pipeline; the inference pipeline automatically fetches and reuses the argument settings used in train, so you only need to write user arguments for the train pipeline.

Default

  • The 'Default' field is the default value of the user argument.
  • If there is no default value, it is written as '-'.
  • If the default involves logic, it is marked 'Note the explanation'. Click the argument name to see the detailed description.

ui_args

  • The 'ui_args' in the table below indicates whether the 'ui_args' function is supported, which allows you to change the argument value in the UI of AI Conductor.
  • O: If you enter the argument name under 'ui_args' in the experimental_plan.yaml, you can change the arguments value in the AI Conductor UI.
  • X: Doesn't support the 'ui_args' feature.
  • For a more detailed explanation of 'ui_args', please check out the following guide: [Write UI Parameter](../../alo/alo-v3/register_ai_solution/write_ui_parameter)

User settings required?

  • In the table below, 'User setup required?' marks the user arguments that the user must check and change for AI Contents to work.
  • O: Arguments where you enter data-related information; check them before modeling.
  • X: If the user does not change the value, modeling proceeds with the default value.
| Step Name | Argument type | Argument | Default | Description | User setup required? | ui_args |
|---|---|---|---|---|---|---|
| Input | Required | file_type | csv | Enter the file extension of the input data. | X | O |
| Input | Required | encoding | utf-8 | Enter the encoding type of the input data. | X | O |
| Readiness | Required | x_columns | - | Enter the names of the x columns to train on. | O | O |
| Readiness | Required | y_column | - | Enter the y column name. | O | O |
| Readiness | Required | task_type | classification | Classification/Regression. | O | O |
| Readiness | Required | target_label | _major | Enter the class name used as the basis for calculating the metric in HPO. | X | X |
| Readiness | Required | column_types | auto | Enter column type (categorical/numeric) information ('auto' classifies column types automatically). | X | X |
| Readiness | Required | report | True | Output a summary csv for train/inference data. | X | O |
| Readiness | Custom | drop_x_columns | - | Use instead of x_columns when there are many column names to enter. | X | O |
| Readiness | Custom | groupkey_columns | - | Group the dataframe based on the values of the entered columns. | X | O |
| Readiness | Custom | min_rows | Note the explanation | Specifies the minimum number of rows required at training time. | X | X |
| Readiness | Custom | cardinality | 50 | Categorical columns are classified based on the entered value. | X | X |
| Readiness | Custom | num_cat_split | 10 | Value used when automatically classifying column types; adjusts the classification criteria. | X | X |
| Readiness | Custom | ignore_new_category | False | Controls behavior when new category values appear during inference. | X | X |
| Preprocess | Custom | save_original_columns | True | Decide whether to keep the original training columns (x_columns) in the Preprocess asset's resulting dataframe. | X | O |
| Preprocess | Custom | categorical_encoding | {binary: all} | Specifies the encoding methodology applied to categorical columns. | X | X |
| Preprocess | Custom | handle_missing | Note the explanation | Specifies how missing values are handled per column. | X | X |
| Preprocess | Custom | numeric_outlier | - | Select the outlier removal method applied to numeric columns. | X | X |
| Preprocess | Custom | numeric_scaler | - | Select the scaling method applied to numeric columns. | X | X |
| Sampling | Required | data_split | {method: cross_validation, options: 3} | Select the train/validation split methodology for HPO. | X | X |
| Sampling | Custom | over_sampling | - | Over-sampling methodology per label of the y column. | X | X |
| Sampling | Custom | under_sampling | - | Under-sampling methodology per label of the y column. | X | X |
| Sampling | Custom | random_state | - | Specify a random seed to get reproducible sampling results. | X | X |
| Train | Required | evaluation_metric | auto | Determine the evaluation metric for selecting the best model in HPO. | X | O |
| Train | Required | shapley_value | False | Determines whether to compute Shapley values and add them to output.csv. | X | O |
| Train | Required | output_type | all | Choose whether output.csv contains all columns or only the minimal (modeling result) columns. | X | O |
| Train | Custom | model_list | [rf, gbm, lgbm, cb, xgb] | Select the models to compare in HPO. | X | X |
| Train | Custom | hpo_settings | Note the explanation | Change the parameters of the models in model_list. | X | X |
| Train | Custom | shapley_sampling | 10000 | Number of rows sampled when extracting Shapley values. | X | X |
| Train | Custom | multiprocessing | False | Enter whether multiprocessing is enabled. | X | O |
| Train | Custom | num_cpu_core | 3 | Enter the number of CPU cores to use for multiprocessing. | X | O |

User arguments in detail

Input asset

file_type

Enter the file extension of the input data. Currently, only csv files are supported for AI Solution development.

  • Argument type: Required
  • Input type: string
  • Enterable values
  • csv (default)
  • Usage
  • file_type: csv
  • ui_args: O

encoding

Enter the encoding type of the input data. Currently, only utf-8 encoding is supported for AI Solution development.

  • Argument type: Required
  • Input type: string
  • Enterable values
  • utf-8 (default)
  • Usage
  • encoding: utf-8
  • ui_args: O

Readiness asset

x_columns

Enter the name of the training target x column in the Dataframe in the form of a list. The user must enter the input data correctly. If you have a large number of columns to enter, you can use the custom argument drop_x_columns to target the entire DataFrame for training. However, you must use either x_columns or drop_x_columns (arguments that do not use either should be removed or commented out in the YAML file).  

  • Argument type: Required
  • Input type: list
  • Enterable values
  • Column name list
  • Usage
  • x_columns: [col1, col2]
  • ui_args: O

y_column

Enter the name of the y column in the dataframe (a single column). The user must enter it correctly for the input data.

  • Argument type: Required
  • Input type: string
  • Enterable values
  • Column name
  • Usage
  • y_column: target
  • ui_args: O

task_type

Enter the type of solution task (classification/regression). Make sure the value matches the task you are solving.

  • Argument type: Required
  • Input type: string
  • Enterable values
  • classification (default)
  • regression
  • Usage
  • task_type: classification
  • ui_args: O

target_label

When training a classification model, determines the class of the y_column used as the basis for calculating the model evaluation metric during HPO. For example, if evaluation_metric is precision and target_label is 1, the model with the highest precision for label 1 is selected as the best model. This argument has no effect when task_type is regression.

  • Argument type: Required
  • Input type: string, list
  • Enterable values
  • _major (default)
  • Selects the class with the most rows in the y_column. (binary, multiclass)
  • _minor
  • Selects the class with the fewest rows in the y_column. (binary, multiclass)
  • _all
  • Uses all class names in the y_column. (multiclass only)
  • A label value
  • Enter one class name of the y_column. (binary, multiclass) ex) target_label: setosa
  • A list of label values
  • Enter multiple y_column classes. (multiclass only) ex) target_label: [setosa, versicolor]
  • Usage
  • target_label: _major
  • ui_args: X

column_types

Enter whether each training column (x_columns) type is numeric or categorical. If you use the default value 'auto', the readiness function automatically classifies each of the x_columns as numeric or categorical. If you need to always treat a particular column as numeric or categorical, use the column_types argument as shown below.

  • ex) column_types: {categorical_columns: [col1, col2]}
  • ex) column_types: {numeric_columns: [col1, col2]}
  • ex) column_types: {categorical_columns: [col1], numeric_columns: [col2]}

Columns entered into the column_types are categorical or numeric columns as specified by the user, and columns that are not entered are automatically classified as numeric/categorical columns by auto logic.

  • Argument type: Required
  • Input type: string, dictionary
  • Enterable values
  • auto (default)
  • {categorical_columns: column name list, numeric_columns: column name list}
  • Usage
  • column_types: auto
  • ui_args: X

report

Decide whether to create a summary csv file for the input data (train/inference). The data type (categorical/numeric), category information, cardinality, statistics, missing value count, and missing percentage are recorded. If ignore_new_category is True and the inference data contains category values that were not seen during training, a 'new-categories' column is also created.

  • Argument type: Required
  • Input type: boolean
  • Enterable values
  • True (default)
  • Creates {train/inference}_artifacts/extra_output/readiness/report.csv.
  • False
  • Does not create report.csv.
  • Usage
  • report: True
  • ui_args: O

drop_x_columns

If you have a large number of column names to enter, you can use drop_x_columns instead of x_columns: the entire set of dataframe columns is loaded and only the listed columns are dropped, the remainder becoming the training columns. You must use exactly one of x_columns and drop_x_columns (remove or comment out the unused argument in the YAML file). When drop_x_columns is [], all dataframe columns except groupkey_columns and y_column are used as training columns. If you enter a list of columns to exclude, all columns except groupkey_columns, y_column, and the listed columns are used. ex) With columns x0,x1,x2,x3,x4,y and drop_x_columns=[x0], groupkey_columns=[x1], y_column=y, the training columns are x2,x3,x4.

  • Argument type: Custom
  • Input type: list
  • Enterable values
  • []
  • Uses every dataframe column as a training column except groupkey_columns and y_column.
  • Column name list
  • Uses the remaining dataframe columns as training columns, excluding groupkey_columns, y_column, and the listed columns.
  • Usage
  • drop_x_columns: []
  • ui_args: O

groupkey_columns

The groupkey feature analyzes data by grouping it on the values of specific columns. In the table below, if you specify 'groupkey col' as the groupkey column, the rows with value A and the rows with value B in the 'groupkey col' column are modeled separately. Here 'groupkey col' is the groupkey column, and A and B are the groupkeys. The groupkey_columns argument enables this feature.

| x0 | ... | x10 | groupkey col |
|---|---|---|---|
| ... | ... | ... | A |
| ... | ... | ... | A |
| ... | ... | ... | B |
| ... | ... | ... | B |
| ... | ... | ... | A |

If you enter multiple column names as a list in groupkey_columns, the readiness function creates a single unified groupkey column that concatenates the values of the groupkey_columns. For example, if you enter 'groupkey_columns: [Gender, Pclass]', a new groupkey column named 'Gender_Pclass' is added to the input dataframe. For classification, the set of y_column classes must be the same in every groupkey: groups whose y_column classes differ are excluded from training. This means that when the y_column values consist of A, B, C, a group whose y_column contains only A and B is excluded from training. Groups that fail other training criteria in the readiness function are likewise excluded.

  • Argument type: Custom
  • Input type: list
  • Enterable values
  • Column name list
  • Usage
  • groupkey_columns: [col1, col2]
  • ui_args: O
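The unified-groupkey behavior described above can be sketched like this; `build_groupkey` is a hypothetical illustration inferred from the 'Gender_Pclass' example, not TCR source code:

```python
def build_groupkey(rows, groupkey_columns):
    """Sketch of the unified groupkey column (assumed behavior): the values of
    groupkey_columns are joined per row, and the new column is named by joining
    the column names, e.g. 'Gender_Pclass'."""
    name = "_".join(groupkey_columns)
    for row in rows:
        row[name] = "_".join(str(row[c]) for c in groupkey_columns)
    return name
```
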

min_rows

Specifies the minimum number of rows required at training time; if the training data does not meet it, an error is raised. The default depends on task_type: 30 for classification and 100 for regression. With the defaults, classification requires at least 30 rows per y label, and regression requires at least 100 rows in total. If the user enters a min_rows value, classification requires at least that many rows per y label, and regression requires at least that many rows in total. For example, adding 'min_rows: 50' to the experimental_plan.yaml means training only proceeds when there are at least 50 rows per y label for classification, or at least 50 rows in total for regression.

If you use the groupkey feature ([groupkey_columns](#groupkey_columns)), a groupkey that does not meet the min_rows condition is excluded from training. If no groupkey meets the min_rows condition, an error is raised.

  • Argument type: Custom
  • Input type: int
  • Enterable values
  • default
  • 30 (classification: at least 30 rows per class of the y_column)
  • 100 (regression: at least 100 rows in total)
  • Numeric value
  • Usage
  • min_rows: 50
  • ui_args: X
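As a rough sketch of the documented min_rows rule (a hypothetical helper, not TCR's implementation):

```python
from collections import Counter

def check_min_rows(y_values, task_type, min_rows=None):
    """Sketch of the min_rows check (assumed logic): classification needs at
    least min_rows rows per y label, regression needs min_rows rows in total.
    Defaults follow the documented 30/100 values."""
    if min_rows is None:
        min_rows = 30 if task_type == "classification" else 100
    if task_type == "classification":
        counts = Counter(y_values)
        return all(n >= min_rows for n in counts.values())
    return len(y_values) >= min_rows
```
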

cardinality

The cardinality condition that categorical columns must meet when the categorical/numeric auto-classification feature is used. A column is finally classified as categorical only if its number of unique values is at most the cardinality argument value. If a column's unique-value count exceeds the cardinality argument, it is not classified as categorical and is excluded from the training columns.

  • Argument type: Custom
  • Input type: int
  • Enterable values
  • 50 (default)
  • Numeric value
  • Usage
  • cardinality: 50
  • ui_args: X

num_cat_split

When using the categorical/numeric column auto-classification feature (column_types: auto), columns are classified by checking whether the top-N most frequent values are numeric or object. num_cat_split specifies the N value.

  • Argument type: Custom
  • Input type: int
  • Enterable values
  • 10 (default)
  • Numeric value
  • Usage
  • num_cat_split: 10
  • ui_args: X
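The auto-classification rule that num_cat_split (together with cardinality) describes could look roughly like this; `guess_column_type` is a hypothetical illustration, not the readiness source:

```python
from collections import Counter

def guess_column_type(values, num_cat_split=10, cardinality=50):
    """Illustrative sketch of the auto column-type logic (not TCR source):
    check whether the top num_cat_split most frequent values all parse as
    numbers; non-numeric columns must also satisfy the cardinality limit."""
    def is_number(v):
        try:
            float(v)
            return True
        except (TypeError, ValueError):
            return False
    top = [v for v, _ in Counter(values).most_common(num_cat_split)]
    if all(is_number(v) for v in top):
        return "numeric"
    if len(set(values)) <= cardinality:
        return "categorical"
    return "excluded"  # too many unique values for a categorical column
```
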

ignore_new_category

Controls the behavior when a value that was not seen during training appears in a categorical column during inference. For example, if you apply onehot encoding to a categorical column in train, the model learns the onehot-encoded columns from the train data. If a new category value appears during inference, it cannot be processed by the trained onehot-encoding columns. Therefore, if unseen category values are likely to appear during inference, use the ignore_new_category argument to control the behavior.

  • Argument type: Custom
  • Input type: boolean, float
  • Enterable values
  • False (default)
  • If a category value not seen during training appears during inference, an error is raised.
  • True
  • If a category value not seen during training appears during inference, it is treated as a missing value and inference proceeds. (Missing values are processed by the handling logic of the preprocess function.)
  • Catboost encoding can encode unseen category data without missing-value processing; see categorical_encoding.
  • Float value between 0 and 1, ex) 0.3
  • If the share of rows with unseen category values is less than 0.3 of the total data, they are treated as missing values and inference proceeds.
  • If the share is greater than or equal to 0.3, an error is raised. (If a groupkey column is used, the affected group is excluded instead.)
  • Usage
  • ignore_new_category: False
  • ui_args: X
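The three ignore_new_category modes can be summarized in a small sketch; `new_category_action` is a hypothetical helper whose semantics are assumed from the description above:

```python
def new_category_action(new_rows, total_rows, ignore_new_category=False):
    """Sketch of ignore_new_category semantics (assumed): False raises on any
    unseen category, True imputes it as a missing value, and a float threshold
    imputes only while the affected-row share stays below the threshold."""
    if ignore_new_category is False:
        return "error" if new_rows else "ok"
    if ignore_new_category is True:
        return "impute"
    share = new_rows / total_rows
    return "impute" if share < float(ignore_new_category) else "error"
```
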

Preprocess asset

save_original_columns

The preprocess function applies various preprocessing methodologies to the training columns (x_columns). save_original_columns chooses whether the original x_columns are kept in the preprocess function's resulting dataframe and passed to the next function. Regardless of this setting, subsequent functions use the preprocessed training columns as the training target.

  • Argument type: Custom
  • Input type: boolean
  • Enterable values
  • True (default)
  • Passes the columns unused for training + the original training columns (x_columns) + the preprocessed training columns to the next function.
  • The y column and the preprocessed y column are included in the dataframe.
  • False
  • Passes the columns unused for training + the preprocessed training columns to the next function. (Original x_columns deleted)
  • The y column and the preprocessed y column are included in the dataframe.
  • Usage
  • save_original_columns: True
  • ui_args: O

categorical_encoding

categorical_encoding specifies the encoding methodology to apply to the categorical columns. Enter it as a dictionary of {methodology: value}, where 'value' is a column list or 'all' ('all' meaning every categorical column). The currently supported categorical encodings are listed below. categorical_encoding only works on the categorical columns among the training columns (x_columns). For classification task_type, label encoding is always applied to the y column and cannot be changed.

  • binary: binary encoding
  • catboost: catboost encoding
  • onehot: onehot encoding
  • label: label encoding

Currently, by default, binary encoding is applied to all categorical columns for training. When using categorical_encoding, if you specify some columns, the rest of the columns will automatically be subject to the default rule (binary). For example, if categorical_encoding: {label: [col1]}, then binary encoding is applied to all categorical columns except col1.

  • Argument type: Custom
  • Input type: dictionary
  • Enterable values
  • default
  • x_columns: {binary: all}
  • y_column: label encoding
  • {Methodology 1: column list, Methodology 2: column list}
  • Usage
  • categorical_encoding: {binary: [col1], catboost: [col2]}
  • ui_args: X
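The fallback-to-binary rule described above can be sketched as follows; `resolve_encoding_plan` is a hypothetical helper (assumed logic, not the preprocess source):

```python
def resolve_encoding_plan(categorical_columns, categorical_encoding):
    """Sketch of how per-column encodings could be resolved (assumed logic):
    columns named in categorical_encoding get that method, and every remaining
    categorical column falls back to the default binary rule."""
    plan = {}
    for method, cols in categorical_encoding.items():
        targets = categorical_columns if cols == "all" else cols
        for col in targets:
            plan[col] = method
    for col in categorical_columns:
        plan.setdefault(col, "binary")
    return plan
```
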

handle_missing

handle_missing specifies how missing values are handled for categorical and numeric columns. Enter it as a dictionary of {methodology: value}. Categorical and numeric columns support different methodologies. For 'value', you can use a column list, 'categorical_all', 'numeric_all', or 'all'. Unless the user specifies a methodology, handle_missing applies default logic to the missing values of the training columns; when you enter only some columns, the remaining columns automatically follow the default rule. handle_missing only works on the training columns (x_columns). In the train pipeline, rows with a missing y column value are automatically deleted.

  • Methodologies for categorical columns only
  • For 'value' you can enter a categorical column list or 'categorical_all' (all categorical columns).
  • frequent: Fills missing values with the most frequent value in the column.
  • Methodologies for numeric columns only
  • For 'value' you can enter a numeric column list or 'numeric_all' (all numeric columns).
  • mean: Fills missing values with the mean of the column.
  • median: Fills missing values with the median of the column.
  • interpolation: Fills each missing value with the average of the neighboring values in the column.
  • Methodologies applicable to all column types
  • For 'value' you can enter a column list, 'all', 'categorical_all', or 'numeric_all' ('all' meaning every column).
  • drop: Removes rows with missing values in the column.
  • fill_{value}: Fills missing values in the column with the entered value.

categorical_all, numeric_all, and all are used as follows.

  • handle_missing: {frequent: categorical_all, fill_0: numeric_all}
  • Fills categorical columns with the most frequent value (available only for the categorical methodology) and fills the missing values of numeric columns with 0.
  • handle_missing: {fill_0: categorical_all, fill_1: numeric_all}
  • Fills missing values with 0 for categorical columns and 1 for numeric columns.
  • handle_missing: {fill_0: all}
  • Fills missing values in all columns with 0.
  • handle_missing: {fill_0: numeric_all}
  • Numeric columns are filled with 0; categorical columns follow the default logic.

categorical_all and numeric_all can be used together, but categorical_all and all, or numeric_all and all, cannot be used together.

  • Argument type: Custom
  • Input type: dictionary
  • Enterable values
  • default
  • x_columns: {frequent: categorical_all, median: numeric_all}
  • y_column: drop applied
  • {Methodology 1: column list, Methodology 2: column list}
  • Usage
  • handle_missing: {fill_1: [col1], fill_2: [col2]}
  • ui_args: X
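A minimal sketch of several of the methodologies above on a single column, assuming None marks a missing value (`handle_missing_column` is illustrative only, not TCR's preprocessing code; the fill_{value} case inserts the raw text here):

```python
from collections import Counter

def handle_missing_column(values, method):
    """Sketch of a few handle_missing methods on one column, with None as the
    missing marker (illustrative; fill_<value> inserts the raw text here)."""
    present = [v for v in values if v is not None]
    if method == "drop":
        return present
    if method == "frequent":
        fill = Counter(present).most_common(1)[0][0]
    elif method == "mean":
        fill = sum(present) / len(present)
    elif method == "median":
        s = sorted(present)
        m = len(s) // 2
        fill = s[m] if len(s) % 2 else (s[m - 1] + s[m]) / 2
    elif method.startswith("fill_"):
        fill = method[len("fill_"):]
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in values]
```
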

numeric_outlier

Select the outlier removal method to apply to the numeric columns. Enter it as a dictionary of {methodology: value}, where 'value' is a column list or 'all' ('all' meaning every numeric column). The currently supported outlier removal methodologies are listed below. numeric_outlier only works on the numeric columns among the training columns (x_columns).

  • normal: Removes outliers greater than 3 sigma from the current data distribution

numeric_outlier has no default value; if the user does not register it in experimental_plan.yaml, no methodology is applied.

  • Argument type: Custom
  • Input type: dictionary
  • Enterable values
  • No default
  • {Methodology: column list}
  • Usage
  • numeric_outlier: {normal: [col1, col2]}
  • ui_args: X

numeric_scaler

Select the scaling method to apply to the numeric columns. Enter it as a dictionary of {methodology: value}, where 'value' is a column list or 'all' ('all' meaning every numeric column). The currently supported scaling methodologies are listed below. numeric_scaler only works on the numeric columns among the training columns (x_columns).

  • standard: Scales using the mean and standard deviation. z=(x-u)/s (u: mean, s: std)
  • minmax: Scales so the maximum value is 1 and the minimum value is 0.
  • robust: Scales using the median and quartiles instead of the mean and variance.
  • maxabs: Scales the data so the maximum absolute value is 1 and 0 stays 0.
  • normalizer: Normalizes per row rather than per column, scaling so the Euclidean norm of each row is 1.

numeric_scaler has no default value; if the user does not register it in experimental_plan.yaml, no methodology is applied.

  • Argument type: Custom
  • Input type: dictionary
  • Enterable values
  • No default
  • {Methodology: column list}
  • Usage
  • numeric_scaler: {standard: [col1], minmax: [col2]}
  • ui_args: X

Sampling asset

data_split

Select the methodology for building the train/validation sets for HPO. Enter a dictionary of {method: methodology, options: value}. The possible methodology/value combinations are:

  • cross validation
  • {method: cross_validation, options: 3}
  • Uses a cross-validation methodology, where options is the k of k-fold; the example above sets k to 3.
  • train/test split
  • {method: train_test, options: 0.3}
  • Divides the data into train/validation sets by sampling. In options, enter the fraction of the validation set; the example above trains with train:validation = 7:3.

You can check which cross-validation fold each row belonged to, or whether it was used as a train or validation row, in the 'data_split' column of output.csv.

  • Argument type: Required
  • Input type: dictionary
  • Enterable values
  • {method: cross_validation, options: 3} (default)
  • {method: methodology, options: value}
  • Usage
  • data_split: {method: cross_validation, options: 3}
  • ui_args: X
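The two data_split forms can be sketched as a simple fold-assignment routine; `assign_folds` and its fold labels are hypothetical (assumed logic, not the sampling source):

```python
import random

def assign_folds(n_rows, method="cross_validation", options=3, seed=0):
    """Sketch of data_split row assignment (assumed logic): cross_validation
    deals shuffled rows into k folds, train_test carves off a validation
    fraction. Mirrors {method: ..., options: ...} from the YAML."""
    rng = random.Random(seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)
    if method == "cross_validation":
        k = int(options)
        return {i: f"fold{pos % k}" for pos, i in enumerate(idx)}
    n_valid = int(n_rows * float(options))  # options = validation fraction
    return {i: ("validation" if pos < n_valid else "train")
            for pos, i in enumerate(idx)}
```
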

over_sampling

Applies an over-sampling methodology to labels of the y column. over_sampling arguments come in two types, depending on how the number of rows to sample is calculated.

  1. ratio: Sampling so that the label of the y_column reaches the entered multiple of its count
over_sampling: {
  method: random,
  label: B,
  ratio: 2
}
# Random over sampling so that label B is doubled
  2. compare: Sampling so that the label of the y_column is a multiple of the compare target label
over_sampling: {
  method: random,
  label: B,
  compare: {
    target: A,
    multiply: 10
  }
}
# Sampling so that label B is 10 times A

Write the dictionary with the following keys and values.

key: method

  • Enter the over sampling methodology. The available methodologies are:
  • random: Applies random over sampling.
  • smote: Applies the SMOTE methodology for over sampling.

key: label

  • Enter the label of the y_column to apply the sampling methodology.
  • 1 label value. ex) A
  • If there are multiple label values to apply sampling to, write it as a list. ex) [A, B]

key: ratio (type 1)

  • Samples each 'label' so that its count becomes the entered multiple of its current count.
  • Enter a float value. ex) 2.5
  • If you enter a number of 1 or less, sampling is not applied.

key: compare (type 2)

  • Samples a label to be n times the target label. Write a sub-dictionary. ex) compare: {target: C, multiply: 10}
  • sub_key: target
  • Enter the label used as the basis for determining the number of rows to sample.
  • One label value. ex) compare: {target: C ...}
  • sub_key: multiply
  • Samples each 'label' to multiply times the target's count. Enter a float value. ex) label: [A,B], compare: {target: C, multiply: 10} oversamples so that A and B are each 10 times C.
  • If 'label' is a list and multiply is also a list, each multiply value is applied to the corresponding label. ex) label: [A, B], compare: {target: C, multiply: [2, 3]}: A is oversampled to 2 times C, B to 3 times C.
  • If a label already contains more rows than the computed target, it is not oversampled. For example, if over sampling would produce 100 rows but the label already has 200, over sampling is not applied.
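The target-count rules for ratio and compare, including the do-not-oversample exception above, can be sketched as follows; `oversample_target` is a hypothetical helper with assumed semantics:

```python
def oversample_target(counts, label, ratio=None, compare=None):
    """Sketch of the target row count for one label under over_sampling
    (assumed semantics): ratio multiplies the label's own count, compare
    multiplies the target label's count; a label already above the target
    is left unchanged."""
    current = counts[label]
    if ratio is not None:
        target = int(current * ratio)
    else:
        target = int(counts[compare["target"]] * compare["multiply"])
    return max(current, target)
```
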

over_sampling has no default value; if the user does not register it in experimental_plan.yaml, no methodology is applied.

  • Argument type: Custom
  • Input type
  • dictionary
  • Enterable values
  • Write the dictionary format above.
  • ratio type
  • {method: methodology, label: label name, ratio: float}
  • ex) {method: smote, label: A, ratio: 10} - smote over sampling label A of y_column by 10 times.
  • ex) {method: random, label: [A,B], ratio: 10} - Random oversampling of label A and B of y_column 10 times.
  • ex) {method: smote, label: [A,B], ratio: [10,12]} - smote over sampling label A by 10 times and B by 12 times in y_column.
  • compare type
  • {method: methodology, label: label name, compare: {target: label name, multiply: float}}
  • ex) {method: random, label: A, compare: {target: C, multiply: 5}} - Random oversampling of label A in the y_column to be 5 times the number of label C.
  • ex) {method: random, label: [A,B], compare: {target: C, multiply: 5}} - Random oversampling of label A and B in the y_column so that they are 5 times the number of label C.
  • ex) {method: random, label: [A,B], compare: {target: C, multiply: [5,10]}} - Random over sampling so that label A in y_column becomes 5 times label C and B becomes 10 times label C.
  • usage
  • over_sampling: {method: smote, label: A, ratio: 10}
  • ui_args: X
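The ratio and compare rules above boil down to computing a target row count per label. Below is a minimal Python sketch of that arithmetic, including the rule that over sampling never shrinks a label. This is illustrative only; `oversample_targets` is a hypothetical helper, not part of TCR.

```python
def oversample_targets(counts, labels, ratio=None, compare=None):
    """Compute per-label target counts for over sampling.

    counts:  {label: current row count}
    labels:  a single label or a list of labels (the 'label' key)
    ratio:   float or list of floats (type 1)
    compare: {"target": label, "multiply": float or list} (type 2)
    """
    labels = labels if isinstance(labels, list) else [labels]
    targets = {}
    for i, lab in enumerate(labels):
        if ratio is not None:
            r = ratio[i] if isinstance(ratio, list) else ratio
            desired = int(counts[lab] * r)
        else:  # compare: multiply times the target label's count
            m = compare["multiply"]
            m = m[i] if isinstance(m, list) else m
            desired = int(counts[compare["target"]] * m)
        # Over sampling only adds rows: a ratio of 1 or less, or a label
        # that already exceeds the desired count, is left unchanged.
        targets[lab] = max(counts[lab], desired)
    return targets
```

For example, `oversample_targets({"A": 10}, "A", ratio=2.5)` yields a target of 25 rows for label A.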

under_sampling

Applies the under sampling methodology to y_column. under_sampling arguments are divided into 2 types, depending on how the number of rows to sample is calculated.

  1. ratio: Sampling so that the label of the y_column becomes the entered ratio
under_sampling: {
    method: random,
    label: B,
    ratio: 0.5
}
# Random under sampling so that label B becomes 0.5 times its current count
  2. compare: Sampling so that the label of the y_column becomes multiply times the compare target label
under_sampling: {
    method: random,
    label: B,
    compare: {
        target: A,
        multiply: 2
    }
}
# Under sampling so that label B becomes twice label A
# In this case, 2 times the count of A must be less than the count of B for under sampling to apply

Write the argument as a dictionary, with each key and value as follows.

key: method

  • Enter the under sampling methodology. The available methodologies are shown below.
  • random: Applies random under sampling.
  • nearmiss: Samples the data that is hardest to distinguish between the minority and majority classes (rows that lie close to each other).

key: label

  • Enter the label of the y_column to apply the sampling methodology.
  • 1 label value. ex) A
  • If there are multiple label values to apply sampling to, write it as a list. ex) [A, B]

key: ratio(type1)

  • Under-samples each 'label' so that its count becomes 'ratio' times its current count.
  • Enter a float value. ex) 0.7
  • If you enter a value greater than 1, sampling is not applied.

key: compare(type2)

  • Samples so that the label becomes n times the target label. Written as a sub-dictionary. ex) compare: {target: C, multiply: 0.5}
  • sub_key: target
  • Enter the label used as the reference when determining the number of rows to sample.
  • 1 label value. ex) compare: {target: C ...}
  • sub_key: multiply
  • Under-samples the label to multiply times the target label's count.
  • Enter a float value. ex) label: [A,B], compare: {target: C, multiply: 0.5} - under-samples so that A and B each become 0.5 times C.
  • If both 'label' and multiply are written as lists, each multiply value is applied to the corresponding label. ex) label: [A, B], compare: {target: C, multiply: [0.2, 0.3]} - A is under-sampled to 0.2 times C, B to 0.3 times C.
  • If the label already contains fewer rows than the computed target, under sampling is not applied. For example, if under sampling would reduce a label to 100 rows but it already has only 90, it is left as-is.

under_sampling has no default value. If you do not enter it in experimental_plan.yaml, no under sampling methodology is applied.

  • Argument type: Custom
  • Input type
  • dictionary
  • Enterable values
  • Write the dictionary format above.
  • ratio type
  • {method: methodology, label: label name, ratio: float less than 1}
  • ex) {method: nearmiss, label: A, ratio: 0.5} - Samples label A of y_column by 0.5 times.
  • ex) {method: random, label: [A,B], ratio: 0.5} - Random under sampling of label A and B by 0.5 times in y_column.
  • ex) {method: random, label: [A,B], ratio: [0.5,0.3]} - Random under sampling of label A by 0.5 times and B by 0.3 times in y_column.
  • compare type
  • {method: methodology, label: label name, compare: {target: label name, multiply: float}}
  • ex) {method: random, label: A, compare: {target: C, multiply: 0.5}} - Random under sampling of label A in the y_column so that it is 0.5 times the number of label C.
  • ex) {method: random, label: [A,B], compare: {target: C, multiply: 0.5}} - Random under sampling of label A and B of the y_column so that they are 0.5 times the number of label C.
  • ex) {method: random, label: [A,B], compare: {target: C, multiply: [0.5,0.2]}} - Random under sampling so that label A in y_column becomes 0.5 times label C and B becomes 0.2 times label C.
  • usage
  • under_sampling: {method: nearmiss, label: A, ratio: 0.5}
  • ui_args: X
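Under sampling mirrors the over sampling arithmetic, with the guard reversed: rows are only ever removed, never added. A minimal Python sketch (illustrative only; `undersample_targets` is a hypothetical helper, not part of TCR):

```python
def undersample_targets(counts, labels, ratio=None, compare=None):
    """Compute per-label target counts for under sampling.

    counts:  {label: current row count}
    labels:  a single label or a list of labels (the 'label' key)
    ratio:   float or list of floats (type 1)
    compare: {"target": label, "multiply": float or list} (type 2)
    """
    labels = labels if isinstance(labels, list) else [labels]
    targets = {}
    for i, lab in enumerate(labels):
        if ratio is not None:
            r = ratio[i] if isinstance(ratio, list) else ratio
            desired = int(counts[lab] * r)
        else:  # compare: multiply times the target label's count
            m = compare["multiply"]
            m = m[i] if isinstance(m, list) else m
            desired = int(counts[compare["target"]] * m)
        # Under sampling never adds rows: if the label already has fewer
        # rows than the computed target, it is left unchanged.
        targets[lab] = min(counts[lab], desired)
    return targets
```

For example, with labels A: 90 rows and C: 50 rows, a compare rule of `{target: C, multiply: 2}` computes a target of 100 for A, so A stays at 90 and no under sampling occurs.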

random_state

If you specify a random_state value, every sampling run produces the same result.

  • Argument type: Custom
  • Input type
  • int
  • Enterable values
  • Positive integer
  • usage
  • random_state: 123
  • ui_args: X
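The effect of fixing random_state can be illustrated with Python's standard library: seeding a dedicated generator makes every sampling draw repeatable. This is a sketch of the principle, not TCR code; `sample_rows` is a hypothetical helper.

```python
import random

def sample_rows(n_rows, k, random_state=None):
    """Pick k row indices; a fixed random_state makes the draw repeatable."""
    rng = random.Random(random_state)  # seeded generator, independent of global state
    return rng.sample(range(n_rows), k)
```

Two calls with the same seed return the identical index list, which is what makes sampling-based experiments reproducible.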

Train asset

evaluation_metric

Select the evaluation metric used to choose the best model during HPO. With the default value 'auto', accuracy is used when task_type is classification and mse when it is regression. If several models have the same evaluation_metric value during HPO, the model is chosen according to the following priority.

  • When the evaluation_metric values are the same:
  • For classification, the remaining metrics (excluding the evaluation_metric) are compared per model in the order accuracy, f1, recall, precision. (If you selected accuracy, the values are compared in the order f1, recall, precision.)
  • For regression, the remaining metrics (excluding the evaluation_metric) are compared per model in the order r2, mse, mae, rmse.
  • When all metrics are equal:
  • The smaller model is selected; if the model sizes are similar, models are preferred in the order RF, LGBM, GBM, XGB, CB.

However, when all evaluation metrics are the same, a model the user added themselves has the highest priority.

  • Argument type: Required

  • Input type
  • string
  • Enterable values
  • auto (default)
  • When task_type is classification: accuracy
  • When task_type is regression: MSE
  • task_type: When it comes to classification
  • accuracy
  • f1
  • recall
  • precision
  • task_type: When regression
  • mse
  • r2
  • mae
  • rmse
  • usage
  • evaluation_metric: auto
  • ui_args: O
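The tie-break described above amounts to comparing models on a tuple of metrics, primary first. A sketch for the classification case, where higher is better for every metric (`best_model` is a hypothetical helper; a regression variant would have to invert the comparison for mse, mae, and rmse, where lower is better):

```python
# Classification tie-break order from the documentation.
CLS_ORDER = ["accuracy", "f1", "recall", "precision"]

def best_model(scores, primary="accuracy"):
    """Pick the best model name from scores: {model: {metric: value}}.

    Models tied on the primary metric are compared on the remaining
    metrics in CLS_ORDER, matching the priority described above.
    """
    order = [primary] + [m for m in CLS_ORDER if m != primary]
    return max(scores, key=lambda name: tuple(scores[name][m] for m in order))
```

For example, two models tied on accuracy are separated by their f1 scores, then recall, then precision.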

shapley_value

Calculates the Shapley value and decides whether it is written to output.csv. When the Shapley value is calculated (shapley_value: True), a summary plot is also stored at the path {Output folder}/extra_output/train/summary_plot.png. You can use the summary plot to see how each feature affects each class.

  • Argument type: Required
  • Input type
  • boolean
  • Enterable values
  • False (default)
  • Doesn't calculate shapley value.
  • True
  • Calculate the Shapley value.
  • usage
  • shapley_value: False
  • ui_args: O

output_type

Determines whether output.csv contains only the modeling result columns or all columns of the data. The modeling result columns are shown below.

  • prob_{y class name}, ...
  • The probability with which the model assigns the row to that class. One column is created per class.
  • pred_{y column name}
  • The y-value column predicted by the model.
  • shap_{training column name}
  • When shapley_value is True, Shapley value columns are written, one per training column (x_columns).

If you set the output_type to 'all', the entire data from the train/inference asset and the modeling result columns will be stored in the output.csv. If you write the output_type as 'simple', only the columns of the modeling results will be stored in the output.csv. If the data you use for analysis is large, you can reduce the output.csv file size by setting the output_type to "simple".

  • Argument type: Required
  • Input type
  • string
  • Enterable values
  • all (default)
  • Save both the data from the asset and the modeling result columns in the output.csv.
  • simple
  • Save only the modeling result columns to output.csv.
  • usage
  • output_type: all
  • ui_args: O
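The column layout of output.csv under both settings can be sketched as follows. This is illustrative; `output_columns` is a hypothetical helper that follows the naming scheme above, not TCR code.

```python
def output_columns(x_columns, y_column, classes, output_type="all", shapley=False):
    """Columns written to output.csv, following the documented naming scheme."""
    # Modeling result columns: per-class probabilities, then the prediction.
    result = [f"prob_{c}" for c in classes] + [f"pred_{y_column}"]
    if shapley:
        # One shapley column per training column when shapley_value is True.
        result += [f"shap_{c}" for c in x_columns]
    # 'all' keeps the input data columns in front of the modeling results;
    # 'simple' keeps only the modeling results (smaller output.csv).
    return (x_columns + result) if output_type == "all" else result
```

With large datasets, switching from 'all' to 'simple' drops the input columns and shrinks the file accordingly.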

model_list

Enter the models to compare during HPO in the form of a list. TCR currently ships with 5 tree-series models, and HPO is performed for all 5 unless the user adds a model_list argument. The default models currently available in TCR are as follows.

  • rf: random forest
  • gbm: gradient boosting machine
  • lgbm: light gradient boosting machine
  • cb: catboost
  • xgb: extreme gradient boosting

If you enter an empty list ([]) for model_list, the default ([rf, gbm, lgbm, cb, xgb]) is used. A model listed in hpo_settings but missing from model_list is not added to the HPO. If you create a new model from the model template during solution development and want it included in the HPO list, you must add the model's name to model_list.

  • Argument type: Custom
  • Input type
  • list
  • Enterable values
  • [rf, gbm, lgbm, cb, xgb] (default; entering [] behaves the same)
  • usage
  • model_list: [rf, gbm, lgbm, cb, xgb]
  • ui_args: X

hpo_settings

Changes the hyperparameters of the models in [model_list](#model_list). Written as {model name: {parameter1: search list, tcr_param_mix: 'one_to_one'}}.

{rf: {max_depth: [100, 300, 500], n_estimators: [300, 400, 500], min_sample_leaf: 3, tcr_param_mix: one_to_one}}

In the example above, the search candidates are 100, 300, 500 for max_depth and 300, 400, 500 for n_estimators. min_sample_leaf is given the single value 3; when a parameter's value is a number rather than a list, that parameter is frozen at that value. The values that can be entered for 'tcr_param_mix', and what they do, are as follows.

  • one_to_one
  • Each element corresponds 1:1 when running HPO. If parameter values are lists, they must all have the same number of elements.
  • In one_to_one, the example above would be {max_depth: 100, n_estimators: 300, min_sample_leaf: 3}, {max_depth: 300, n_estimators: 400, min_sample_leaf: 3}, {max_depth: 500, n_estimators: 500, min_sample_leaf: 3}.
  • all
  • Proceed with HPO with any combination of the parameter list you entered.
  • With all, the example above runs HPO over every combination of max_depth in [100, 300, 500] and n_estimators in [300, 400, 500], with min_sample_leaf fixed at 3: {max_depth: 100, n_estimators: 300, min_sample_leaf: 3}, {max_depth: 100, n_estimators: 400, min_sample_leaf: 3}, ..., {max_depth: 500, n_estimators: 500, min_sample_leaf: 3} (9 combinations in total).

For models that are in model_list but not in hpo_settings, the default parameters listed in the model file are used. In other words, if model_list is the default (5 models) and hpo_settings is {rf: {max_depth: [100, 300, 500], n_estimators: [300, 400, 500], min_sample_leaf: 3, tcr_param_mix: one_to_one}}, the 4 models other than rf use the default parameters in their model files.

  • Argument type: Custom
  • Input type
  • dictionary
  • Enterable values
  • Using the default parameter set in the model file (default)
  • {model name: {parameter1: search list, tcr_param_mix: one_to_one or all}}
  • usage
  • hpo_settings: {rf: {max_depth: [100, 300], n_estimators: 300, tcr_param_mix: one_to_one}}
  • ui_args: X      
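The expansion of one hpo_settings entry into candidate parameter sets under one_to_one and all can be sketched as follows. This is an illustrative reimplementation of the documented behavior, not the TCR source; `expand_params` is a hypothetical helper.

```python
from itertools import product

def expand_params(spec):
    """Expand one model's hpo_settings entry into candidate parameter dicts.

    Scalars are frozen at their value; lists are search ranges, combined
    according to tcr_param_mix (one_to_one pairs elements 1:1, all takes
    the full Cartesian product).
    """
    spec = dict(spec)  # don't mutate the caller's dict
    mix = spec.pop("tcr_param_mix", "all")
    lists = {k: v for k, v in spec.items() if isinstance(v, list)}
    fixed = {k: v for k, v in spec.items() if not isinstance(v, list)}
    if mix == "one_to_one":
        n = len(next(iter(lists.values()))) if lists else 1
        assert all(len(v) == n for v in lists.values()), "lists must match in length"
        return [{**fixed, **{k: v[i] for k, v in lists.items()}} for i in range(n)]
    # 'all': every combination of the list-valued parameters
    keys = list(lists)
    return [{**fixed, **dict(zip(keys, combo))}
            for combo in product(*(lists[k] for k in keys))]
```

On the rf example above, one_to_one yields 3 candidate parameter sets while all yields 9, with min_sample_leaf frozen at 3 in every candidate.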

shapley_sampling

When shapley_value is True, you can compute the Shapley value on a sample of the data instead of all of it. Computing Shapley values for every row takes a long time on large datasets, so sampling part of the data reduces the training time.

  • Argument type: Custom
  • Input type
  • float
  • int
  • Enterable values
  • 10000 (default)
  • Float between 0 and 1
  • Samples that fraction of the rows.
  • 1
  • Samples all rows.
  • int greater than 1
  • Samples that many rows.
  • usage
  • shapley_sampling: 10000
  • ui_args: X
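How the shapley_sampling value is interpreted (fraction vs. count) can be sketched as follows. This is an illustrative reading of the rules above, not TCR code; in particular, capping an integer setting at the number of available rows is an assumption.

```python
def shapley_sample_size(n_rows, setting=10000):
    """Number of rows used for the Shapley computation (illustrative sketch)."""
    if setting == 1:
        return n_rows                      # 1 means use every row
    if isinstance(setting, float) and 0 < setting < 1:
        return int(n_rows * setting)       # fraction of the data
    return min(int(setting), n_rows)       # absolute count (cap is an assumption)
```

For a 50,000-row dataset, the default of 10000 samples 10,000 rows, while 0.2 samples the same number via the fractional form.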

multiprocessing

Enter whether to enable multiprocessing. The default value is False, meaning multiprocessing is not used, and Mellerikat does not currently recommend using multiprocessing.

  • Argument type: Custom
  • Input type
  • Boolean
  • Enterable values
  • False (default)
  • True
  • usage
  • multiprocessing: False
  • ui_args: O

num_cpu_core

Enter the number of CPU cores to use in multiprocessing.

  • Argument type: Custom
  • Input type
  • int
  • Enterable values
  • 3 (default)
  • ints greater than 0
  • usage
  • num_cpu_core: 3
  • ui_args: O

TCR Version: 3.0.0