TCR Parameter
Overview of experimental_plan.yaml
To apply AI Contents to your data, you need to provide information about the data and the functionalities you wish to use in the experimental_plan.yaml file. Once you install AI Contents, you can find the pre-written experimental_plan.yaml file for each content under the solution folder. By entering the data information and modifying/adding user arguments for each asset in this YAML file, you can execute ALO to create a data analysis model with the desired settings.
Structure of experimental_plan.yaml
The experimental_plan.yaml file contains various settings required to run ALO. By modifying the 'data path' and 'user arguments' parts, you can immediately use AI Contents.
Input Data Path (external_path)
The external_path parameter is used to specify the path of the file to be loaded or the path where the file will be saved. If save_train_artifacts_path and save_inference_artifacts_path are not specified, the modeling artifacts will be saved in the default folders train_artifacts and inference_artifacts, respectively.
external_path:
- load_train_data_path: ./solution/sample_data/train
- load_inference_data_path: ./solution/sample_data/test
- save_train_artifacts_path:
- save_inference_artifacts_path:
Parameter Name | Default | Description |
---|---|---|
load_train_data_path | ./sample_data/train/ | Specify the folder path where the training data is located. (Do not enter the CSV file name.) All CSV files in the specified path will be concatenated. |
load_inference_data_path | ./sample_data/test/ | Specify the folder path where the inference data is located. (Do not enter the CSV file name.) All CSV files in the specified path will be concatenated. |

* Files in subfolders under the specified path will also be included.
* All columns in the files to be concatenated must be identical.
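As a hedged illustration of the two rules above (recursive search, identical columns), the loading behavior might be sketched as follows. load_and_concat is a hypothetical helper written for this guide, not ALO's actual input asset:

```python
import csv
import tempfile
from pathlib import Path

def load_and_concat(folder):
    """Concatenate every CSV found under `folder`, including subfolders.
    All files must share an identical header row, or an error is raised."""
    header, rows = None, []
    for path in sorted(Path(folder).rglob("*.csv")):
        with path.open(newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            cols = next(reader)
            if header is None:
                header = cols
            elif cols != header:
                raise ValueError(f"{path}: columns {cols} differ from {header}")
            rows.extend(reader)
    return header, rows

# Demonstrate with a throwaway folder that contains a nested subfolder.
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "sub").mkdir()
    (Path(d) / "a.csv").write_text("col1,col2\n1,2\n", encoding="utf-8")
    (Path(d) / "sub" / "b.csv").write_text("col1,col2\n3,4\n", encoding="utf-8")
    header, rows = load_and_concat(d)
```

Note that files in the subfolder are picked up and appended in path order, and a file with a different header would abort the load.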
User Parameters (user_parameters)
- The step under user_parameters refers to the asset name. For example, step: input indicates the input asset stage.
- args refers to the user arguments of that asset (step: input). User arguments are data analysis-related setting parameters provided for each asset. See the User Arguments Description below for more details.
user_parameters:
    - train_pipeline:
        - step: input
          args:
            - file_type
            ...
          ui_args:
            ...
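Once parsed (for example with yaml.safe_load), the layout above becomes nested lists and dictionaries. A minimal sketch of looking up one asset's arguments, assuming the structure shown in the snippet (get_args is a hypothetical helper, not part of ALO):

```python
# A parsed experimental_plan.yaml yields nested lists/dicts; shown here
# as a Python literal for illustration (structure assumed from the guide).
plan = {
    "user_parameters": [
        {"train_pipeline": [
            {"step": "input",
             "args": [{"file_type": "csv", "encoding": "utf-8"}]},
            {"step": "readiness",
             "args": [{"x_columns": ["col1", "col2"], "y_column": "target"}]},
        ]},
    ],
}

def get_args(plan, pipeline, step):
    """Return the args dict for `step` in `pipeline` (hypothetical helper)."""
    for entry in plan["user_parameters"]:
        for asset in entry.get(pipeline, []):
            if asset["step"] == step:
                # In this layout, args is a one-element list holding a dict.
                return asset["args"][0]
    return {}

input_args = get_args(plan, "train_pipeline", "input")
```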
User Arguments Description
What are User Arguments?
User arguments are parameters for configuring the operation of each asset, written under the args of each asset step in the experimental_plan.yaml. Each asset constituting the AI Contents pipeline provides user arguments to apply various functionalities to the data. Users can refer to this guide to modify and add user arguments to create a model that fits their data.
User arguments are divided into 'required arguments' pre-written in the experimental_plan.yaml and 'custom arguments' added by the user referring to the guide.
Required Arguments
- Required arguments are the basic arguments immediately visible in the experimental_plan.yaml. Most required arguments have default values set in the YAML file.
- For some required arguments related to data, users must set the values. (e.g., x_columns, y_column)
Custom Arguments
- Custom arguments are functionalities provided by assets but not written in the experimental_plan.yaml. Users can add them under the args of each asset in the YAML file.
The TCR pipeline consists of Input - Readiness - Preprocess - Modeling (train/inference) - Output assets, and user arguments are configured differently for each asset's functionality. First, try modeling with the required arguments settings written in the experimental_plan.yaml and then add user arguments to create a TCR model that perfectly fits your data!
Summary of User Arguments
Below is a summary of TCR's user arguments. Click on the 'Argument Name' to move to the detailed explanation of the argument. Currently, TCR provides user arguments only for the train pipeline. The inference pipeline automatically adopts the arguments settings used in the train pipeline. Therefore, you only need to write user arguments in the train pipeline.
Default
- The 'Default' item indicates the default value of the user argument.
- If there is no default value, it is indicated by '-'.
- If there is a logic-based default, it is indicated by 'Refer to description'. Click on the 'Argument Name' to see the detailed explanation.
ui_args
- The 'ui_args' column in the table below indicates whether the ui_args functionality, which allows changing the argument value from the AI Conductor UI, is supported.
- O: The argument value can be changed from the AI Conductor UI by writing the argument name under ui_args in the experimental_plan.yaml.
- X: The ui_args functionality is not supported.
- For a detailed explanation of ui_args, refer to the Write UI Parameter guide.
Mandatory User Setting
- The 'Mandatory User Setting' column indicates whether users must check and change the user argument to run AI Contents.
- O: Arguments for which users typically must input information about the task and data before modeling.
- X: If the user does not change the value, modeling will proceed with the default value.
Asset Name | Argument Type | Argument Name | Default | Description | Mandatory User Setting | ui_args |
---|---|---|---|---|---|---|
Input | Required | file_type | csv | Specify the file extension of the input data. | X | O |
Input | Required | encoding | utf-8 | Specify the encoding type of the input data. | X | O |
Readiness | Required | x_columns | - | Enter the names of the x columns for training. | O | O |
Readiness | Required | y_column | - | Enter the name of the y column. | O | O |
Readiness | Required | task_type | classification | Specify whether it is a classification or regression task. | O | O |
Readiness | Required | target_label | _major | Enter the class name used as the metric calculation criterion for HPO. | X | X |
Readiness | Required | column_types | auto | Enter the column types (categorical/numeric). 'auto' provides an automatic column type classification feature. | X | X |
Readiness | Required | report | True | Decide whether to generate a summary CSV for the train/inference data. | X | O |
Readiness | Custom | drop_x_columns | - | Use this instead of x_columns if there are many column names to enter. | X | O |
Readiness | Custom | groupkey_columns | - | Group the dataframe based on the value of the entered column. | X | O |
Readiness | Custom | min_rows | See the description | Specify the minimum number of rows required for training. | X | X |
Readiness | Custom | cardinality | 50 | Specify the cardinality value for classifying categorical columns. | X | X |
Readiness | Custom | num_cat_split | 10 | Adjust the classification criteria used in the automatic column type classification. | X | X |
Readiness | Custom | ignore_new_category | False | Handle the situation where new categorical values are encountered during inference. | X | X |
Preprocess | Custom | save_original_columns | True | Decide whether to retain the original training columns (x_columns) in the preprocess asset result dataframe. | X | O |
Preprocess | Custom | categorical_encoding | {binary: all} | Specify the encoding method to apply to categorical columns. | X | X |
Preprocess | Custom | handle_missing | See the description | Specify the missing value handling method to apply to columns. | X | X |
Preprocess | Custom | numeric_outlier | - | Select the method for removing outliers from numeric columns. | X | X |
Preprocess | Custom | numeric_scaler | - | Select the scaling method to apply to numeric columns. | X | X |
Sampling | Required | data_split | {method: cross_validation, options: 3} | Select the method for splitting the train/validation set during HPO. | X | X |
Sampling | Custom | over_sampling | - | Apply an over-sampling method for the y column labels. | X | X |
Sampling | Custom | under_sampling | - | Apply an under-sampling method for the y column labels. | X | X |
Sampling | Custom | random_state | - | Specify a random seed to obtain consistent sampling results. | X | X |
Train | Required | evaluation_metric | auto | Select the evaluation metric to choose the best model during HPO. | X | O |
Train | Required | shapley_value | False | Decide whether to calculate and output shapley values in output.csv. | X | O |
Train | Required | output_type | all | Decide whether to output minimal columns (modeling results) in output.csv. | X | O |
Train | Custom | model_list | [rf, gbm, lgbm, cb, xgb] | Select the models to compare during HPO. | X | X |
Train | Custom | hpo_settings | See the description | Modify parameters for the models in model_list. | X | X |
Train | Custom | shapley_sampling | 10000 | Specify the sampling rate for calculating shapley values. | X | X |
Train | Custom | multiprocessing | False | Specify whether to use multiprocessing. | X | O |
Train | Custom | num_cpu_core | 3 | Specify the number of CPU cores to use during multiprocessing. | X | O |
Detailed Description of User Arguments
Input asset
file_type
Specify the file extension of the input data. Currently, only CSV files are supported for AI Solution development.
- Argument Type: Required
- Input Type
- string
- Possible Values
- csv (default)
- Usage
- file_type: csv
- ui_args: O
encoding
Specify the encoding type of the input data. Currently, only utf-8 encoding is supported for AI Solution development.
- Argument Type: Required
- Input Type
- string
- Possible Values
- utf-8 (default)
- Usage
- encoding: utf-8
- ui_args: O
Readiness asset
x_columns
Enter the names of the x columns in the dataframe for training in list format. Users must input this according to their data. If there are many column names to enter, you can use the custom argument drop_x_columns to designate the entire dataframe columns as training columns. Only one of x_columns and drop_x_columns should be used (Remove or comment out the unused argument in the YAML file).
- Argument Type: Required
- Input Type
- list
- Possible Values
- List of column names
- Usage
- x_columns: [col1, col2]
- ui_args: O
y_column
Enter the name of the y column in the dataframe. Users must input this according to their data.
- Argument Type: Required
- Input Type
- string
- Possible Values
- Column name
- Usage
- y_column: target
- ui_args: O
task_type
Specify the type of the solution task (classification/regression). Users must check and set the value according to the purpose of the task.
- Argument Type: Required
- Input Type
- string
- Possible Values
- classification (default)
- regression
- Usage
- task_type: classification
- ui_args: O
target_label
When training a classification model, this specifies the class of the y_column to be used as the evaluation metric calculation criterion during HPO. For example, if evaluation_metric is precision and target_label is 1, the model with the highest precision value for label 1 is selected as the best model. It does not work when task_type is regression.
- Argument Type: Required
- Input Type
- string
- list
- Possible Values
- _major (default)
- Select the class with the most occurrences in the y_column. (Both binary and multiclass are possible)
- _minor
- Select the class with the fewest occurrences in the y_column. (Both binary and multiclass are possible)
- _all
- Use all class names in the y_column as the criterion. (Only multiclass is possible)
- Class name
- Enter one of the class names in the y_column. (Both binary and multiclass are possible)
- e.g., target_label: setosa
- List of class names
- list: Enter multiple class names in the y_column. (Only multiclass is possible)
- e.g., target_label: [setosa, versicolor]
- Usage
- target_label: _major
- ui_args: X
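The resolution of the special keywords above can be sketched as follows. resolve_target_labels is a hypothetical illustration of the documented behavior, not TCR's actual code:

```python
from collections import Counter

def resolve_target_labels(y_values, target_label):
    """Resolve a target_label setting to concrete class names:
    _major -> most frequent class, _minor -> least frequent class,
    _all -> every class; otherwise use the given name(s) as-is."""
    counts = Counter(y_values)
    if target_label == "_major":
        return [max(counts, key=counts.get)]
    if target_label == "_minor":
        return [min(counts, key=counts.get)]
    if target_label == "_all":
        return sorted(counts)
    if isinstance(target_label, list):
        return target_label
    return [target_label]

# Illustrative multiclass y_column with unequal class counts.
y = ["setosa"] * 5 + ["versicolor"] * 3 + ["virginica"] * 2
```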
column_types
Enter whether the training columns (x_columns) are numeric or categorical. The default value 'auto' allows the readiness asset to automatically classify whether the x_columns are numeric or categorical. If you need to always specify certain columns as numeric or categorical, use the column_types argument as follows.
- e.g., column_types: {categorical_columns: [col1, col2]}
- e.g., column_types: {numeric_columns: [col1, col2]}
- e.g., column_types: {categorical_columns: [col1], numeric_columns: [col2]}
Columns entered in column_types are always classified as the specified type, while columns not entered will be automatically classified by the auto logic as numeric or categorical.
- Argument Type: Required
- Input Type
- string
- dictionary
- Possible Values
- auto (default)
- {categorical_columns: List of column names, numeric_columns: List of column names}
- Usage
- column_types: auto
- ui_args: X
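The override-then-auto behavior described above might be sketched like this. The "parses as float" auto rule here is a simplifying assumption for illustration; TCR's actual auto logic (see num_cat_split) is more involved:

```python
def assign_column_types(columns, sample_values, column_types):
    """Apply user overrides from column_types first; remaining columns
    fall back to a naive auto rule (numeric if every sampled value
    parses as a float). Sketch only, not the actual TCR logic."""
    if column_types == "auto":
        column_types = {}
    cat = set(column_types.get("categorical_columns", []))
    num = set(column_types.get("numeric_columns", []))
    result = {}
    for col in columns:
        if col in cat:
            result[col] = "categorical"
        elif col in num:
            result[col] = "numeric"
        else:
            try:
                [float(v) for v in sample_values[col]]
                result[col] = "numeric"
            except ValueError:
                result[col] = "categorical"
    return result

# col3 looks numeric at first glance but is pinned categorical by the user.
types = assign_column_types(
    ["col1", "col2", "col3"],
    {"col1": ["a", "b"], "col2": ["1", "2"], "col3": ["3", "x"]},
    {"categorical_columns": ["col3"]},
)
```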
report
Determines whether to generate a summary CSV file for the input data (train/inference). The column type (categorical/numeric), category information, cardinality, statistics, number of missing values, and missing rate are recorded. If ignore_new_category is set to True and category values not seen during training are encountered during inference, a 'new-categories' column is added to the report.
- Argument Type: Required
- Input Type
- boolean
- Possible Values
- True (default)
- {train/inference}_artifacts/extra_output/readiness/report.csv will be generated.
- False
- report.csv will not be generated.
- Usage
- report: True
- ui_args: O
drop_x_columns
If there are many column names to enter, use drop_x_columns instead of x_columns to designate the entire dataframe columns as training columns after dropping specific columns. Only one of x_columns and drop_x_columns should be used (Remove or comment out the unused argument in the YAML file). When drop_x_columns is [], all dataframe columns except groupkey_columns and y_column are used as training columns. When entering a list of columns to drop in drop_x_columns, the remaining columns except groupkey_columns and y_column are used as training columns. e.g., If the entire columns are x0,x1,x2,x3,x4,y and drop_x_columns=[x0], groupkey_columns=[x1], y_column=y, the training columns will be x2,x3,x4.
- Argument Type: Custom
- Input Type
- list
- Possible Values
- []
- Enter an empty list to use all dataframe columns except groupkey_columns and y_column as training columns.
- List of column names
- Use the remaining columns except groupkey_columns, y_column, and the list of column names as training columns.
- Usage
- drop_x_columns: []
- ui_args: O
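The documented rule can be reproduced with the example above. resolve_x_columns is a hypothetical helper written to mirror the description, not TCR's actual code:

```python
def resolve_x_columns(all_columns, drop_x_columns, groupkey_columns, y_column):
    """Start from every dataframe column, then exclude the dropped
    columns, the groupkey columns, and the y column."""
    excluded = set(drop_x_columns) | set(groupkey_columns) | {y_column}
    return [c for c in all_columns if c not in excluded]

# The guide's example: columns x0..x4,y with drop_x_columns=[x0],
# groupkey_columns=[x1], y_column=y leaves x2,x3,x4 for training.
cols = resolve_x_columns(
    ["x0", "x1", "x2", "x3", "x4", "y"],
    drop_x_columns=["x0"], groupkey_columns=["x1"], y_column="y",
)
```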
groupkey_columns
The groupkey function analyzes data by grouping it based on the value of specific columns. If the 'groupkey col' column is designated as the groupkey column in the table below, data with 'groupkey col' value A and B are modeled separately. In this case, 'groupkey col' is the groupkey column, and A,B are groupkeys. Use the groupkey_columns argument to apply the groupkey function.
x0 | ... | x10 | groupkey col |
---|---|---|---|
.. | .. | .. | A |
.. | .. | .. | A |
.. | .. | .. | B |
.. | .. | .. | B |
.. | .. | .. | A |
If you enter multiple column names in groupkey_columns, the readiness asset generates a single integrated groupkey column by concatenating each value of groupkey_columns. For example, if you enter groupkey_columns: [Gender, Pclass], a new groupkey column 'Gender_Pclass' is added to the input dataframe. However, for classification, the y_column's class types must be the same for each groupkey. Groups (groupkeys) that do not meet this condition are excluded from the training. Therefore, if the y_column values are A, B, C and a specific group's y_column values are only A, B, that group is excluded from the training. Other groups that do not meet the training conditions are also excluded by the readiness asset.
- Argument Type: Custom
- Input Type
- list
- Possible Values
- List of column names
- Usage
- groupkey_columns: [col1, col2]
- ui_args: O
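The concatenation and exclusion rules above might be sketched as follows. Both helpers are hypothetical illustrations of the documented behavior, not the readiness asset's actual code:

```python
from collections import defaultdict

def add_groupkey(rows, groupkey_columns):
    """Join the groupkey column values into one key, e.g. 'Gender_Pclass'."""
    for row in rows:
        row["groupkey"] = "_".join(str(row[c]) for c in groupkey_columns)
    return rows

def valid_groups(rows, y_column):
    """For classification, keep only groups whose y classes cover the
    full class set (sketch of the documented exclusion rule)."""
    all_classes = {r[y_column] for r in rows}
    per_group = defaultdict(set)
    for r in rows:
        per_group[r["groupkey"]].add(r[y_column])
    return {g for g, classes in per_group.items() if classes == all_classes}

data = [
    {"Gender": "F", "Pclass": 1, "y": "A"},
    {"Gender": "F", "Pclass": 1, "y": "B"},
    {"Gender": "M", "Pclass": 2, "y": "A"},  # missing class B -> excluded
]
data = add_groupkey(data, ["Gender", "Pclass"])
kept = valid_groups(data, "y")
```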
min_rows
Specifies the minimum number of rows required for training; if the training data does not meet it, an error is raised. The default values are 30 for classification and 100 for regression. With the defaults, classification requires at least 30 rows per y label, and regression requires at least 100 rows in total. If the user inputs a min_rows value, classification requires at least that many rows per y label, and regression requires at least that many rows in total. For example, with min_rows: 50 added to experimental_plan.yaml, classification requires at least 50 rows for each y label, and regression proceeds only if there are at least 50 rows in total.
When using the groupkey function (groupkey_columns), groups (groupkeys) that do not meet the min_rows condition are excluded from the training. If no groupkey meets the min_rows condition, an error is raised.
- Argument Type: Custom
- Input Type
- int
- Possible Values
- default
- 30 (classification, there must be at least 30 instances of each y_column class)
- 100 (regression, the total number of instances must be at least 100)
- Number value
- Usage
- min_rows: 50
- ui_args: X
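The documented row-count checks can be sketched as follows (meets_min_rows is a hypothetical helper mirroring the description above):

```python
from collections import Counter

def meets_min_rows(y_values, task_type, min_rows=None):
    """Classification: every y label needs at least min_rows rows
    (default 30). Regression: the total row count must reach
    min_rows (default 100)."""
    if task_type == "classification":
        threshold = 30 if min_rows is None else min_rows
        return all(n >= threshold for n in Counter(y_values).values())
    threshold = 100 if min_rows is None else min_rows
    return len(y_values) >= threshold

# min_rows: 50 with labels A x50 and B x40 -> B falls short.
ok = meets_min_rows(["A"] * 50 + ["B"] * 40, "classification", min_rows=50)
```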
cardinality
The cardinality argument specifies the cardinality condition that categorical_columns must meet when using the automatic categorical/numeric column classification feature (column_types: auto). If a candidate categorical column's number of unique values is less than or equal to the cardinality value, the column is classified as categorical. If the number of unique values exceeds the cardinality value, the column is not classified as categorical and is excluded from the training columns.
- Argument Type: Custom
- Input Type
- int
- Possible Values
- 50 (default)
- Number value
- Usage
- cardinality: 50
- ui_args: X
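The documented cardinality rule might be sketched like this (split_by_cardinality is a hypothetical illustration, not the readiness asset's code):

```python
def split_by_cardinality(candidate_categoricals, cardinality=50):
    """Columns whose unique-value count is <= cardinality stay
    categorical; the rest are dropped from the training columns."""
    kept, dropped = [], []
    for col, values in candidate_categoricals.items():
        (kept if len(set(values)) <= cardinality else dropped).append(col)
    return kept, dropped

# 'low' has 2 unique values; 'high' has 100 and exceeds the threshold.
kept, dropped = split_by_cardinality(
    {"low": ["a", "b", "a"], "high": [str(i) for i in range(100)]},
    cardinality=50,
)
```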