TCR Features
Feature Overview
TCR's pipeline
A pipeline in AI Contents is a combination of functions, each of which is a functional unit. Both the train pipeline and the inference pipeline consist of five steps.
Train pipeline
Input - Readiness - Preprocess - Sampling - Train
Inference pipeline
Input - Readiness - Preprocess - Sampling - Inference
Input step
Reads the files in the path specified by the user in experimental_plan.yaml and converts them into a dataframe.
Readiness step
Checks whether the data entered by the user into the TCR is suitable for TCR modeling.
Preprocess step
Applies the data preprocessing methods required for TCR modeling, such as categorical column encoding and missing value handling, to the data.
Sampling step
If the data labels are imbalanced, applies over-sampling or under-sampling to balance the data.
Train step
Performs HPO with the five models built into TCR, selects the best model, trains it, and returns the output.
Inference step
Uses the model generated in the Train step to run inference on the inference data and returns the output.
Tips for using TCR
Check the log file after running TCR in auto mode
If you enter only the data path, X column, and Y column information in experimental_plan.yaml and run ALO, TCR inspects and preprocesses the input data. You can then review the results in the log files. The log file is 'workspace/tcr/log/process.log'; in process.log you can check the output log of each function. The excerpt below is an example of the log generated by the Readiness function when running TCR. In pipeline.log, you can check the output log of each step.
# This is the result of the readiness function of the train_dataset/train.csv titanic data of TCR.
***************************** Invoke Pipline Function *****************************
* Target File : /home/gy90.moon/tcr_v3_check3/pipeline.py
* function[name] : readiness
* function[name].def : pipeline.readiness
* function[name].argument : {'x_columns': ['input_x0', 'input_x1', 'input_x2', 'input_x3'], 'y_column': 'target', 'task_type': 'classification', 'target_label': '_major', 'column_types': 'auto'}
* summary :
***********************************************************************************
[2025-04-30 00:23:10,257|root|DEBUG|logger.py(182)|decorator()] -------------------- Finish readinesspipline(0.02)
[2025-04-30 00:23:10,257|root|DEBUG|logger.py(173)|decorator()] -------------------- Start preprocess pipline
[2025-04-30 00:23:10,883|root|DEBUG|pipeline.py(95)|preprocess()] preprocess
[2025-04-30 00:23:10,883|root|INFO|preprocess.py(537)|save_info()] categorical_columns: []
[2025-04-30 00:23:10,883|root|INFO|preprocess.py(537)|save_info()] numeric_columns: ['input_x0', 'input_x1', 'input_x2', 'input_x3']
[2025-04-30 00:23:10,887|root|INFO|preprocess.py(537)|save_info()] shape of input data before filtering: (147, 10)
[2025-04-30 00:23:10,887|root|INFO|preprocess.py(537)|save_info()] shape of input_data after filtering: (147, 10)
[2025-04-30 00:23:10,889|root|INFO|preprocess.py(537)|save_info()] non_groupkey dataframe missing rate: 0.0
[2025-04-30 00:23:10,889|root|ERROR|preprocess.py(545)|save_error()] There seems to be an issue with creating the folder path: /home/gy90.moon/tcr_v3_check3/.workspace/tcr/model_artifacts. Please check if the path is correct and accessible.
[2025-04-30 00:23:10,891|root|INFO|tabular_preprocess.py(403)|save_info()] >>>>> Starting missing value handling (handle_missing).
[2025-04-30 00:23:10,891|root|INFO|tabular_preprocess.py(403)|save_info()] Applying median missing value handling methodology to the ['prep_input_x0', 'prep_input_x1', 'prep_input_x2', 'prep_input_x3'] column(s).
[2025-04-30 00:23:10,891|root|INFO|tabular_preprocess.py(403)|save_info()] >>>>> Starting the categorical encoding process.
[2025-04-30 00:23:10,900|root|INFO|tabular_preprocess.py(403)|save_info()] >>>>> Starting missing value handling (handle_missing).
[2025-04-30 00:23:10,900|root|INFO|tabular_preprocess.py(403)|save_info()] Applying drop missing value handling methodology to the ['prep_target'] column(s).
[2025-04-30 00:23:10,902|root|INFO|tabular_preprocess.py(403)|save_info()] >>>>> Starting the categorical encoding process.
[2025-04-30 00:23:10,902|root|INFO|tabular_preprocess.py(403)|save_info()] Applying label encoding methodology to the ['prep_target'] column(s).
[2025-04-30 00:23:10,910|root|WARNING|alo.py(804)|make_summary()] [PIPELINE] Missing 'summary' key in 'preprocess' function result. Creating default summary.
[2025-04-30 00:23:10,910|root|WARNING|alo.py(814)|make_summary()] [PIPELINE] Missing required keys ['note', 'result', 'score'] in summary dict. Adding default values.
- Check the result of Readiness's column type classification
- After running ALO, check the log to see if the entered x column is well categorized into numeric and categorical columns.
- If a column is misclassified, specify the column type by entering the misclassified column in the user argument column_types in readiness.
- Check the cardinality test results of Readiness
- TCR sets a cardinality condition (default: 50) for categorical columns; if the number of categories of a categorical column exceeds the cardinality condition, the column is excluded from the training columns. As shown above, a log is output stating that columns that do not meet the cardinality condition are excluded from x_columns.
- If the cardinality of the categorical data you are using is higher than the default value of 50, modify the cardinality user argument of readiness accordingly.
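The cardinality check itself is easy to reproduce outside of TCR. The snippet below is a minimal sketch (not TCR's code), assuming a pandas dataframe df and a list of candidate categorical columns; it flags the columns whose category count exceeds the default limit of 50, which is what the readiness log reports.

import pandas as pd

def high_cardinality_columns(df: pd.DataFrame, categorical_columns, cardinality_limit=50):
    # Return the categorical columns whose number of categories exceeds the limit.
    return [col for col in categorical_columns
            if df[col].nunique(dropna=True) > cardinality_limit]

# Columns returned here would be excluded from x_columns by readiness,
# unless the cardinality user argument is raised in experimental_plan.yaml.
df = pd.DataFrame({"city": [f"city_{i}" for i in range(100)], "grade": ["A", "B"] * 50})
print(high_cardinality_columns(df, ["city", "grade"]))  # ['city']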
How to minimize memory when using large data
When using TCR, the amount of data passed from step to step grows depending on the categorical encoding preprocessing, the data split methodology used for HPO, and over sampling. If you are working with large files, you can reduce memory usage with the settings below.
- Preprocess: change the categorical encoding methodology to catboost
- The default categorical encoding methodology is binary encoding. Binary encoding does not increase the number of columns as much as one-hot encoding, but the more categorical columns there are and the higher their cardinality, the larger the data becomes. You can prevent the number of columns from growing by setting the user argument categorical_encoding: {catboost: all} to apply catboost encoding.
- Sampling: change the data split methodology to train test split
- The output passed from the sampling step to the train step is the entire dataframe used for retraining plus the dataset list for HPO.
- When using cross validation, the data is copied once per fold before being passed to the next step, so switching to train test split reduces the amount of data that is copied.
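As an illustration of why the encoding choice matters for memory, the sketch below uses the third-party category_encoders package (not part of TCR) to compare the number of columns produced by binary encoding and catboost encoding on a high-cardinality column; catboost encoding keeps a single numeric column per categorical column regardless of cardinality.

import numpy as np
import pandas as pd
import category_encoders as ce  # third-party package, assumed installed

rng = np.random.default_rng(0)
X = pd.DataFrame({"device_id": rng.integers(0, 1000, size=5000).astype(str)})
y = pd.Series(rng.integers(0, 2, size=5000))

binary_cols = ce.BinaryEncoder(cols=["device_id"]).fit_transform(X, y).shape[1]
catboost_cols = ce.CatBoostEncoder(cols=["device_id"]).fit_transform(X, y).shape[1]
print(binary_cols, catboost_cols)  # binary grows with log2(cardinality); catboost stays at 1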
Function Details
Train pipeline: Input step
The Input step takes all the files under the user data path specified in experimental_plan.yaml and reads them into a single dataframe. Enter the user data path for train and inference under 'dataset_uri' in experimental_plan.yaml. The data path must be a folder path, not a file name. For example, the train pipeline fetches data from the 'dataset_uri' path under the train entry.
train:
dataset_uri: [train_dataset/] # Data folder, folder list (no file types)
inference:
dataset_uri: inference_dataset/
- If there is a folder under the path entered by the user, all the data in the folder will also be read and combined.
- All files under the path entered by the user must have the same columns.
- For a detailed explanation of experimental_plan.yaml and of the input function parameters, please refer to the TCR Parameter Guide.
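The folder-to-dataframe behavior of the Input step can be pictured with plain pandas. The snippet below is a simplified sketch, not TCR's actual implementation; it assumes CSV files and that every file (including files in subfolders) has the same columns, as required above.

from pathlib import Path
import pandas as pd

def read_dataset_folder(folder: str) -> pd.DataFrame:
    # Read every CSV under the folder (recursively) and concatenate into one dataframe.
    files = sorted(Path(folder).rglob("*.csv"))
    return pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

# df = read_dataset_folder("train_dataset/")  # the same folder as dataset_uri in experimental_plan.yaml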
Train pipeline: Readiness step
The Readiness step checks that the data you use for training/inference is suitable for TCR modeling. The Readiness step performs the necessary inspections for each of the train and inference pipelines, and the items inspected for each pipeline are as follows.
Checklist
- Check if the column name created in experimental_plan.yaml exists in the data.
The readiness step user arguments that take column names are listed below. The readiness step checks that the column names the user entered for these arguments actually exist in the dataframe. For detailed usage of each argument, please refer to the TCR Parameter Guide.
- x_columns: Column to be trained
- y_column: Label column
- groupkey_columns: Groups the dataframe based on the values of the entered column.
- drop_x_columns: Uses all dataframe columns as training columns except the columns entered in drop_x_columns. Use this instead of x_columns if you have many columns to include.
The Readiness step provides the groupkey feature. The groupkey function is a way to analyze data by grouping it based on the values of a specific column. In the table below, if you specify 'groupkey col' as the groupkey column, the rows with value A and the rows with value B in the 'groupkey col' column are modeled separately. In this case, 'groupkey col' is the groupkey column, and A and B are groupkeys. The groupkey_columns argument enables the groupkey function (see the sketch after the table below).
x0 | ... | x10 | groupkey col |
---|---|---|---|
.. | .. | .. | A |
.. | .. | .. | A |
.. | .. | .. | B |
.. | .. | .. | B |
.. | .. | .. | A |
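Conceptually, the groupkey function behaves like a pandas groupby over the groupkey column, with one model trained per group. The snippet below only illustrates that idea (it is not TCR's code), assuming a dataframe with a 'groupkey col' column as in the table above.

import pandas as pd

df = pd.DataFrame({
    "x0": [1, 2, 3, 4, 5],
    "x10": [10, 20, 30, 40, 50],
    "groupkey col": ["A", "A", "B", "B", "A"],
})

# One sub-dataframe (and, in TCR, one model) per groupkey value.
for groupkey, group_df in df.groupby("groupkey col"):
    print(groupkey, len(group_df))  # A 3 / B 2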
- Check user argument combinations
Checks for user arguments that cannot be used together; if the user enters an invalid combination, an error is thrown and the pipeline stops.
- If the x_columns contains groupkey_columns columns, it will be treated as an error.
- If the x_columns contains y_column, it will be treated as an error.
- If a column is entered in both drop_x_columns and x_columns at the same time, it will be treated as an error.
- Investigate the column types of the x columns
The Readiness step uses built-in logic to determine whether each user-specified x column is a categorical column or a numeric column. Once categorical/numeric columns are classified in the readiness step, the next preprocess step applies the appropriate preprocessing methodology to categorical and numeric columns. The readiness step classifies column types through the classification logic shown in the image below. If the automatic classification logic misclassifies the type of a column, the user can modify the readiness user argument column_types to specify that certain columns are categorical or numeric. In addition, the 'common topN cast' in this logic can be adjusted by modifying the num_cat_split argument in the yaml file.
(Figure: column type classification logic)
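The exact decision flow is the one in the figure above and is tuned with num_cat_split; purely as a simplified illustration of the idea (not TCR's actual rules), a naive classifier might treat object-dtype columns and numeric columns with very few distinct values as categorical:

import pandas as pd

def naive_column_types(df: pd.DataFrame, x_columns, few_unique_threshold=10):
    # Rough sketch only: object dtype or very few distinct values -> categorical, otherwise numeric.
    # The threshold is arbitrary; TCR's real logic is what the figure above describes.
    types = {}
    for col in x_columns:
        if df[col].dtype == object or df[col].nunique(dropna=True) <= few_unique_threshold:
            types[col] = "categorical"
        else:
            types[col] = "numeric"
    return types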
- Inspect the minimum number of rows required for training
If the minimum number of rows required for classification or regression modeling is not available, the readiness step raises an error and stops the pipeline. This is to ensure minimum modeling performance. The current basic logic is as follows.
- For classification
- Check that each label in the y column has at least 30 rows.
- For regression
- Check that the total number of rows is at least 100.
You can use the min_rows user argument to change the minimum number of rows required for training.
<When using the groupkey function> The minimum number of rows required for training is checked per groupkey. If some groupkeys do not meet the training criteria but others do, the readiness step does not throw an error and passes only the trainable groupkey data to the next step. If no groupkey meets the training conditions, an error is thrown.
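A minimal sketch of the minimum-row check under the defaults described above (30 rows per label for classification, 100 rows in total for regression); TCR's own check additionally runs per groupkey, and min_rows changes the thresholds.

import pandas as pd

def check_min_rows(df: pd.DataFrame, y_column: str, task_type: str, min_rows=None):
    if task_type == "classification":
        limit = min_rows or 30
        counts = df[y_column].value_counts()
        too_small = counts[counts < limit]
        if not too_small.empty:
            raise ValueError(f"labels with fewer than {limit} rows: {list(too_small.index)}")
    else:  # regression
        limit = min_rows or 100
        if len(df) < limit:
            raise ValueError(f"need at least {limit} rows, got {len(df)}")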
- Inspect whether the y column consists of a single value
Because the y column cannot be modeled if it consists of a single value, the readiness step throws an error if the y column contains only one value.
- X-column missing value inspection
If all of the data in a column designated as an x column consists of missing values, the column is excluded from the x columns.
<When using the groupkey function> If you use the groupkey feature, you may encounter a situation where certain columns consist entirely of missing values within some groupkeys.
x0 | ... | x10 | groupkey col |
---|---|---|---|
.. | .. | 1 | A |
.. | .. | 2 | A |
.. | .. | NaN | B |
.. | .. | NaN | B |
.. | .. | 2 | A |
In the above case, instead of deleting column 'x10', the missing 'x10' values in groupkey B are filled with the following logic (see the sketch after this list).
- If the missing column is a categorical column
- Fill the missing values in groupkey B with the most frequent value of x10.
- If the missing column is a numeric column
- Fill the missing values in groupkey B with the median value of x10.
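The fill rule above can be expressed with a pandas groupby. The snippet below is a sketch of the idea only (not TCR's code), assuming a numeric column 'x10'; for a categorical column the median would be replaced by the most frequent value.

import pandas as pd

df = pd.DataFrame({
    "x10": [1.0, 2.0, None, None, 2.0],
    "groupkey col": ["A", "A", "B", "B", "A"],
})

# Groupkey B has 'x10' entirely missing, so instead of dropping the column,
# fill those rows with the median of 'x10' (the mode would be used for a categorical column).
entirely_missing = df.groupby("groupkey col")["x10"].transform(lambda s: s.isna().all())
df.loc[entirely_missing, "x10"] = df["x10"].median()
print(df)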
- When using the groupkey column function, check the classes of the y column
If you use groupkey columns, the y columns of all groupkeys must have the same class type. That is, the y-column class type in groupkey A and the y-column class type in groupkey B must be the same. Groupkeys that do not meet this criterion are excluded from training.
Train pipeline: Preprocess step
In the preprocess step, data casting and preprocessing methodologies are applied to the training data. The preprocess step provides the following four preprocessing functions.
- Handling missing values (argument: [handle_missing](./parameter#handle_missing))
- Applicable to categorical columns: frequent (Frequency Value Filling)
- Applicable to numeric columns: mean, median, interpolation
- Applicable to all columns: drop (delete missing rows), fill_{value} (fill in missing values with {value})
- Categorical encoding (argument: [categorical_encoding](./parameter#categorical_encoding))
- binary, catboost, onehot, label encoding
- Numeric data scaling (argument: [numeric_scaler](./parameter#numeric_scaler))
- standard, minmax, robust, maxabs, normalizer
- Numeric data outlier removal (argument: [numeric_outlier](./parameter#numeric_outlier))
- Normal (removes outliers greater than 3 sigma from the current data distribution)
The preprocess step has default rules and applies them to the categorical and numeric columns identified by the readiness step. Data casting to categorical and numeric types is applied, followed by the categorical encoding and missing value handling methodologies. The default preprocessing rules of the preprocess step are as follows:
<Default preprocessing applied to x columns>
- Missing value handling
- Categorical columns are filled with the most frequent value. (frequent)
- Numeric columns are filled with the median value. (median)
- Categorical encoding
- Binary encoding is applied.
<Default preprocessing applied to the y column>
- Missing value handling
- Rows with missing values are dropped. (drop; cannot be changed via user arguments)
- Categorical encoding
- Label encoding is applied. (applied for classification; cannot be changed via user arguments)
The default logic applied to the x columns can be modified by adding arguments to the yaml file. For the specific syntax, please click the methodology links above and refer to the argument explanations in the TCR Parameter Guide.
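For readers who want a concrete picture of what the default rules amount to, the sketch below reproduces their spirit with scikit-learn and the third-party category_encoders package (this is not TCR's internal code): most-frequent imputation plus binary encoding for categorical x columns, and median imputation for numeric x columns. Handling of the y column (dropping missing rows and label encoding for classification) is done separately.

import pandas as pd
from sklearn.impute import SimpleImputer
import category_encoders as ce  # third-party package, assumed installed

def default_x_preprocess(df, categorical_columns, numeric_columns, y=None):
    out = df.copy()
    if categorical_columns:
        # frequent: fill missing categorical values with the most frequent value, then binary-encode.
        out[categorical_columns] = SimpleImputer(strategy="most_frequent").fit_transform(out[categorical_columns])
        out = ce.BinaryEncoder(cols=categorical_columns).fit_transform(out, y)
    if numeric_columns:
        # median: fill missing numeric values with the column median.
        out[numeric_columns] = SimpleImputer(strategy="median").fit_transform(out[numeric_columns])
    return out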
Train pipeline: Sampling step
The sampling step has two main functions: the data split function, which samples the datasets used for HPO in the train step, and the sampling function, which applies a sampling methodology to handle imbalanced data.
- Data split
In the sampling step, the datasets required for HPO in the train step are generated and delivered to the train step. There are two methodologies, cross validation and train test split, and you can switch between them with the data_split argument in the YAML file.
- cross validation
- For HPO, the cross-validation methodology splits the data into train sets and validation sets.
- The YAML file uses the cross validation / 3 fold option as the default value.
- `[[train_fold1, validation_fold1], [train_fold2, validation_fold2], ...]` is passed to the next step.
- When the data split function and the sampling function are used together, the sampling methodology is applied to the train set of each fold.
- Since the data is copied once per fold and passed to the next step, the number of folds should be chosen with memory in mind.
- train test split
- Splits the data into a train set and a validation set at a set ratio.
- When the data split function and the sampling function are used together, the sampling methodology is applied to the train set.
- `[[train set, validation set]]` is passed to the next step.
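The shape of what is handed to the train step can be sketched with scikit-learn splitters (illustrative only, not TCR's code): a list of [train, validation] pairs, one pair per fold for cross validation and a single pair for train test split.

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, train_test_split

df = pd.DataFrame({"x": np.arange(12), "y": np.arange(12) % 2})

# cross validation: one [train, validation] pair per fold
kf = KFold(n_splits=3, shuffle=True, random_state=0)
cv_splits = [[df.iloc[tr], df.iloc[va]] for tr, va in kf.split(df)]

# train test split: a single [train, validation] pair
train_df, valid_df = train_test_split(df, test_size=0.2, random_state=0)
tts_splits = [[train_df, valid_df]]
print(len(cv_splits), len(tts_splits))  # 3 1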
- Sampling
You can over-sample or under-sample based on a specific label in the y column. You select a sampling label and enter a sampling ratio.
- over sampling
- The random and smote methodologies are available.
- For details on how to use the argument, see over_sampling.
- under sampling
- The random and nearmiss methodologies are available.
- For details on how to use the argument, see under_sampling.
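The methodologies named above correspond to standard techniques in the third-party imbalanced-learn package. The sketch below shows them outside of TCR, assuming imbalanced-learn is installed; inside TCR you only set the over_sampling / under_sampling user arguments.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE        # over sampling: smote (RandomOverSampler for random)
from imblearn.under_sampling import NearMiss    # under sampling: nearmiss (RandomUnderSampler for random)

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

X_over, y_over = SMOTE(random_state=0).fit_resample(X, y)
X_under, y_under = NearMiss().fit_resample(X, y)
print("after SMOTE:", Counter(y_over), "after NearMiss:", Counter(y_under))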
Train pipeline: Train step
TCR's train/inference steps have a total of five built-in models. Data analysts selected five frequently used classification/regression models and a parameter set for each model. TCR's model list and parameter sets are as follows:
TCR Built-in Models
- rf: random forest
- (max_depth: 6, n_estimators: 300), (max_depth: 6, n_estimators: 500)
- gbm: gradient boosting machine
- (max_depth: 5, n_estimators: 300), (max_depth: 6, n_estimators: 400), (max_depth: 7, n_estimators: 500)
- lgbm: light gradient boosting machine
- (max_depth: 5, n_estimators: 300), (max_depth: 7, n_estimators: 400), (max_depth: 9, n_estimators: 500)
- cb: catboost
- (max_depth: 5, n_estimators: 100), (max_depth: 7, n_estimators: 300), (max_depth: 9, n_estimators: 500)
- xgb: Extreme Gradient Boosting
- (max_depth: 5, n_estimators: 300), (max_depth: 6, n_estimators: 400), (max_depth: 7, n_estimators: 500)
How to add a model
If you want to use a model other than the five models built into TCR, please refer to the code below to create the model file you want. By adding the created model file to the model file path in TCR, you can easily run HPO over both the newly added models and the models already built into TCR. However, we recommend only adding models provided by scikit-learn; the model library must provide fit, load, and predict functions. Below is the code of the Random forest model file built into TCR.
< Random forest file code (rf.py) of TCR base model >
from sklearn.ensemble import RandomForestClassifier # model file import
from .classifier import Classfier
# Set the default parameter. Refer to the parameter documentation of the model you want to use.
DEFAULT_PARAM = {
    'max_depth': 6,
    'n_estimators': [300, 500],
    'random_state': 1234,
    'n_jobs': 1,
    'tcr_param_mix': 'one_to_one',
}

class TCR_model(Classfier):
    def __init__(self, model_name, model_type, param_dict):
        model = RandomForestClassifier(**param_dict)  # call the imported model
        super().__init__(model_name, model_type, model)
The code below is an example of using the rf.py above to generate a decision tree model file for scikit-learn.
<Decision Tree Model File Generation (dt.py)>
from sklearn.tree import DecisionTreeClassifier # (1) model import
from .classifier import Classfier
DEFAULT_PARAM = {  # (2) Modify DEFAULT_PARAM with the parameters provided by the DecisionTreeClassifier
    'max_depth': [3, 6],
    'min_samples_split': [2, 3],
    'tcr_param_mix': 'one_to_one'
}

class TCR_model(Classfier):
    def __init__(self, model_name, model_type, param_dict):
        model = DecisionTreeClassifier(**param_dict)  # (3) Partial modification of model call
        super().__init__(model_name, model_type, model)
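As an aside, the parameter pairing behavior of 'tcr_param_mix' used in DEFAULT_PARAM (and referenced in the steps below) can be pictured with plain Python: 'one_to_one' pairs list elements by position, like zip, while 'all' takes every combination, like a Cartesian product.

from itertools import product

param_grid = {"max_depth": [3, 6], "min_samples_split": [2, 3]}

# 'one_to_one': pair elements at the same position -> (3, 2) and (6, 3)
one_to_one = list(zip(param_grid["max_depth"], param_grid["min_samples_split"]))

# 'all': every combination of list elements -> 4 parameter sets in total
all_pairs = list(product(param_grid["max_depth"], param_grid["min_samples_split"]))
print(one_to_one, all_pairs)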
Create and add model files in the following order:
- Copy the code above to create a 'Modelname.py' file.
- Modify the model import section at the top.
- Create a DEFAULT_PARAM with the parameter values provided by the model.
- Check the parameters provided by the model library and write the search range for the parameters you want to find with HPO in DEFAULT_PARAM.
- Each parameter can be written as a single value or as a list.
- 'tcr_param_mix': 'one_to_one' creates parameter pairs from list elements at the same position. (In the example above, two parameter sets are created for max_depth and min_samples_split: (3,2) and (6,3).)
- 'tcr_param_mix': 'all' creates parameter pairs from every combination of the list elements of each parameter. (In the example above, four parameter sets are created for max_depth and min_samples_split: (3,2), (6,2), (3,3), and (6,3).)
- Modify the model call portion of the class TCR_model.
- Assign the imported model to the model variable.
- Implement fit, load, and predict functions if necessary (you don't need to do so if you use the scikit-learn model).
- Place the 'Modelname.py' file in the path below.
- To access TCR's code, you first need to run TCR once.
- If it is a classification model, place 'Modelname.py' under ./tcr_modeling/src_tcr/classification_model/.
- For regression models, place 'modelname.py' under ./tcr_modeling/src_tcr/regression_model/.
- Add the train step argument model_list in experimental_plan.yaml and run ALO.
- Include the 'model name' of 'Modelname.py' in the model_list value. e.g., model_list: [model name], model_list: [model name, rf] (the 'model name' model and the built-in rf are used)
HPO function
TCR's HPO approach is as follows:
- Data partitioning for HPO
- Cross-validation or random sampling to distinguish train/validation sets. The default is 5 fold cross-validation, and can be set with the data_split argument.
- Compare candidate model performance based on evaluation metrics
- The default evaluation metric is accuracy for classification and MSE for regression. The user can specify the evaluation metric via the evaluation_metric argument.
- You can choose a label to emphasize when evaluating with the metric. This is the readiness step's [target_label](./parameter#target_label) argument. For some tasks, you need to evaluate the model with a focus on a specific class of the y column. For example, if correctly detecting NG data is important, setting target_label to NG and evaluation_metric to recall selects, among the models in TCR, the model parameters with the highest recall for the NG class as the best model & parameters.
- If multiple models have the same evaluation_metric value during HPO, models are prioritized according to the following rules.
- When the evaluation_metric values are the same:
- For classification, the remaining metrics (excluding the evaluation_metric) are compared per model in the order accuracy, f1, recall, precision. (If you selected accuracy, the values are compared in the order f1, recall, precision.)
- For Regression, compare the remaining metrics by model except for evaluation_metric in the order of R2, MSE, MAE, and RMSE.
- When all metrics are equal:
- The model with the smaller size is selected; if the model sizes are similar, models are selected in the order RF, LGBM, GBM, XGB, CB.
- However, if the user has added a model themselves, the user-added model has the highest priority when all metrics are equal.
- Retrain the selected model with its selected parameter settings on the entire training data
- During HPO, the model is not trained on a portion of the total training data (the validation set). Therefore, after the best model is selected, the model needs to be retrained on the whole data.
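A minimal sketch of the metric-based selection and tie-breaking for classification (illustrative only; TCR's internal logic also considers model size and user-added models): sort candidates by the chosen evaluation metric, then by the remaining metrics in the stated order.

# Each candidate: (model_name, metrics_dict) collected during HPO.
candidates = [
    ("rf",   {"accuracy": 0.91, "f1": 0.88, "recall": 0.86, "precision": 0.90}),
    ("lgbm", {"accuracy": 0.91, "f1": 0.90, "recall": 0.87, "precision": 0.92}),
    ("xgb",  {"accuracy": 0.89, "f1": 0.88, "recall": 0.85, "precision": 0.91}),
]

evaluation_metric = "accuracy"
tie_breakers = [m for m in ["accuracy", "f1", "recall", "precision"] if m != evaluation_metric]

best_name, best_metrics = max(
    candidates,
    key=lambda c: tuple(c[1][m] for m in [evaluation_metric, *tie_breakers]),
)
print(best_name)  # 'lgbm' wins the accuracy tie on f1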
By modifying or adding user arguments, users can run various HPO modeling experiments with TCR. The modeling-related user arguments are listed below; follow the links for detailed explanations.
- evaluation_metric
- Specify the evaluation metric to use at HPO.
- model_list
- You can choose only some models to proceed with HPO.
- data_split
- You can specify the data split methodology for HPO.
- hpo_settings
- You can specify a parameter set for each model.
XAI function
TCR provides a shapley value calculation function that lets you check how each variable affects the y value for each row. If you set the shapley_value argument to True, you can see the shapley value of each training column in output.csv, the analysis output of TCR.
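Shapley values for tree models can be computed with the shap package; the standalone sketch below is only an illustration of the concept (whether TCR uses shap internally is not stated here). In TCR itself you only set shapley_value: True and read the per-column values from output.csv.

import shap
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True, as_frame=True)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # per-row, per-feature contribution to the prediction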
Inference pipeline: Input step
It works the same way as in the train pipeline; the inference pipeline fetches data from the 'dataset_uri' path under the inference entry.
inference:
dataset_uri: inference_dataset/
Inference pipeline: Readiness step
Checklist
- Categorical data inspection during inference
For the categorical columns among the x columns used for training, categorical encoding is applied in the preprocess step. If a categorical column contains a value at inference time that was not seen during training, it cannot be encoded and therefore cannot be inferred. The readiness step therefore checks categorical columns for new values at inference. By default, the readiness step throws an error and terminates the pipeline when a new value appears during inference. This is to notify users that category values not seen during training have appeared at inference time, and to recommend retraining. You can change this default behavior by adding the user argument ignore_new_category to the yaml file: if you add 'ignore_new_category: True', the new values are handled and the inference result is still produced. (A sketch of this check follows the checklist below.)
- Check for new groupkey values during inference (applies when using the groupkey function)
If a groupkey value appears that differs from the groupkey values used for training, it cannot be inferred because there is no model trained for that groupkey. Therefore, new groupkey values are excluded from inference.
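The new-category check above boils down to a set difference between the categories seen at training time and those appearing at inference. A minimal sketch (not TCR's code), assuming pandas dataframes for the train and inference data:

import pandas as pd

train_df = pd.DataFrame({"color": ["red", "blue", "red"]})
infer_df = pd.DataFrame({"color": ["blue", "green"]})

known = set(train_df["color"].dropna().unique())
new_values = set(infer_df["color"].dropna().unique()) - known
if new_values:
    # Default behavior: stop and recommend retraining.
    # With ignore_new_category: True, these values would be handled and inference would continue.
    raise ValueError(f"unseen categories at inference: {new_values}")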
Inference pipeline: Preprocess step
Loads the scikit-learn-based preprocessing model created in the preprocess step of the train pipeline and applies preprocessing to the inference data.
Inference pipeline: Inference step
In the inference step, the model trained in the train step is loaded and used to run inference on the inference dataset.
TCR Version: 3.0.0