
TCR Input and Artifacts

Updated 2024.08.07

Data Preparation

Training Data Preparation

To use TCR, prepare tabular data with columns for training features (x columns) and one label column (y column). Currently, only the .csv file format is supported.

Training Dataset Example

| x_0 | x_1 | ... | x_9 | y |
| --- | --- | --- | --- | --- |
| value | value | ... | value | class |
| ... | ... | ... | ... | ... |

Input Data Directory Structure Example

  • To use ALO, separate files for training and inference data are required. Organize the data as shown below.
  • All files in a single folder are combined into a single dataframe by the input asset and used for modeling (files in subfolders are also combined).
  • All files in a folder must have the same column names.
./{train_folder}/
└ train_data1.csv
└ train_data2.csv
└ train_data3.csv
./{inference_folder}/
└ inference_data1.csv
└ inference_data2.csv
└ inference_data3.csv
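
The combining rule above can be sketched with the Python standard library. This is a minimal illustration, not the input asset's actual implementation: the function name `load_folder` and the demo file contents are hypothetical, and the sketch assumes that "same column names" means the same set of names in every file.

```python
import csv
import tempfile
from pathlib import Path


def load_folder(folder):
    """Combine every .csv under `folder` (subfolders included) into one table."""
    expected = None
    rows = []
    for path in sorted(Path(folder).rglob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            header = reader.fieldnames or []
            if expected is None:
                expected = header
            elif set(header) != set(expected):
                # Assumption: files with different column names are rejected.
                raise ValueError(f"{path.name}: columns {header} differ from {expected}")
            rows.extend(reader)
    return expected or [], rows


# Demo on a throwaway folder: one file at the top level, one in a subfolder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "train_data1.csv").write_text("x_0,y\n1,a\n2,b\n", encoding="utf-8")
    (root / "sub" / "train_data2.csv").write_text("x_0,y\n3,a\n", encoding="utf-8")
    header, rows = load_folder(root)

print(header, len(rows))
```

Running a check like this before a pipeline run can surface a mismatched header early, instead of partway through the run.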

Data Requirements

Mandatory Requirements

The input data must meet the following conditions:

| Index | Item | Specification |
| --- | --- | --- |
| 1 | Number of y columns | 1 |
| 2 | Number of classes (labels) in y column | 2 or more |

Additional Requirements

These conditions help ensure minimum performance. The algorithm will still run if these conditions are not met, but performance is not guaranteed.

| Index | Item | Specification | Note |
| --- | --- | --- | --- |
| 1 | Minimum amount of training data | classification: at least 30 samples per class / regression: at least 100 samples | The default values can be adjusted using the min_rows user argument, but it is recommended to use data that meets or exceeds the default values to ensure model performance. |
| 2 | No unseen categorical data during inference | Yes | This behavior can be controlled using the ignore_new_category user argument, but retraining is recommended if this occurs. |
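
As a rough pre-flight check for the classification requirements above, the label column can be counted before training. This is a sketch under stated assumptions: the function name `check_classification_labels` is hypothetical, and it assumes the threshold applies per class, with 30 as the default from the table. It does not reproduce how TCR itself evaluates min_rows.

```python
from collections import Counter


def check_classification_labels(labels, min_rows=30):
    """Return the classes that have fewer than `min_rows` samples.

    Raises ValueError if the mandatory "2 or more classes" rule is violated.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("the y column must contain 2 or more classes")
    return {label: n for label, n in counts.items() if n < min_rows}


# Hypothetical label column: 40 "OK" samples and 12 "NG" samples.
labels = ["OK"] * 40 + ["NG"] * 12
small_classes = check_classification_labels(labels)
print(small_classes)
```

An empty result means every class meets the recommended minimum. A non-empty result does not stop training, but performance is then not guaranteed.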

Artifacts

Executing training/inference generates the following artifacts.

Train Pipeline

./alo/train_artifacts/
└ log/pipeline.log
└ models/train/
    └ best_model.pkl
    └ model_selection.json
    └ model_selection.csv
    └ ...
└ output/
    └ output.csv
└ extra_output
    └ readiness
        └ report.csv
    └ train
        └ eval_result.csv
        └ summary_plot.png (if Shapley value output is enabled)

Inference Pipeline

./alo/inference_artifacts/
└ log/pipeline.log
└ output/inference/
    └ output.csv
└ extra_output
    └ readiness/
        └ report.csv
    └ inference/
        └ eval_result.csv (in cases where the y label is present)
        └ summary_plot.png (if Shapley value output is enabled)
└ score/
    └ inference_summary.yaml

Description of Artifacts

pipeline.log

Logs for each TCR asset in the pipeline. Checking the log file content provides information about the methods applied by each asset.

# Example log from the readiness asset for TCR's sample_data/train_titanic dataset.
...
[2024-04-09 01:53:46,045|USER|INFO|readiness.py(654)|save_info()] Column Pclass is classified as a numeric column.
[2024-04-09 01:53:46,055|USER|WARNING|readiness.py(657)|save_warning()] Column Name has more than 50 unique values, excluding from x_columns.
[2024-04-09 01:53:46,063|USER|INFO|readiness.py(654)|save_info()] Column Sex is classified as a categorical column.
[2024-04-09 01:53:46,072|USER|INFO|readiness.py(654)|save_info()] Column Age is classified as a numeric column.
[2024-04-09 01:53:46,081|USER|INFO|readiness.py(654)|save_info()] Column SibSp is classified as a numeric column.
[2024-04-09 01:53:46,089|USER|INFO|readiness.py(654)|save_info()] Column Parch is classified as a numeric column.
[2024-04-09 01:53:46,099|USER|WARNING|readiness.py(657)|save_warning()] Column Ticket has more than 50 unique values, excluding from x_columns.
[2024-04-09 01:53:46,107|USER|INFO|readiness.py(654)|save_info()] Column Fare is classified as a numeric column.
[2024-04-09 01:53:46,116|USER|WARNING|readiness.py(657)|save_warning()] Column Cabin has more than 50 unique values, excluding from x_columns.
[2024-04-09 01:53:46,126|USER|INFO|readiness.py(654)|save_info()] Column Embarked is classified as a categorical column.
[2024-04-09 01:53:46,135|USER|INFO|readiness.py(654)|save_info()] Among training columns, ['Sex', 'Embarked'] are classified as categorical columns.
[2024-04-09 01:53:46,144|USER|INFO|readiness.py(654)|save_info()] Among training columns, ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare'] are classified as numeric columns.
[2024-04-09 01:53:46,153|USER|INFO|readiness.py(654)|save_info()] Column Survived is classified as a categorical column.
...
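
To pull out only the warnings from a log like the one above, the lines can be parsed with a regular expression. This is a sketch that assumes the format visible in the sample (`[timestamp|logger|level|file(line)|function()] message`). The field names in the pattern are chosen here for illustration and are not part of ALO or TCR.

```python
import re

# Pattern inferred from the sample lines above; field names are illustrative.
LOG_LINE = re.compile(
    r"^\[(?P<time>[^|]+)\|(?P<logger>[^|]+)\|(?P<level>[^|]+)"
    r"\|(?P<source>[^|]+)\|(?P<func>[^\]]+)\] (?P<message>.*)$"
)


def parse_log_line(line):
    """Return a dict of fields for a matching log line, or None otherwise."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


sample = [
    "[2024-04-09 01:53:46,045|USER|INFO|readiness.py(654)|save_info()] "
    "Column Pclass is classified as a numeric column.",
    "[2024-04-09 01:53:46,055|USER|WARNING|readiness.py(657)|save_warning()] "
    "Column Name has more than 50 unique values, excluding from x_columns.",
    "...",
]

records = [r for r in map(parse_log_line, sample) if r is not None]
warnings = [r["message"] for r in records if r["level"] == "WARNING"]
print(warnings)
```

Lines that do not match the pattern, such as the `...` placeholder, are skipped rather than raising an error.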

best_model.pkl

The model file, trained on the entire training data with the best parameters selected through hyperparameter optimization (HPO).

model_selection.json

Contains information related to HPO, including:

  • best_model_info: The best model and parameters selected by HPO.
  • data_split: The data split method used during HPO.
  • evaluation_metric: The evaluation metric.
  • target_label: The y class used as the HPO criterion.
  • x_columns: The x columns used for training.
  • model_score: The HPO model candidates, including model names and parameters.

model_selection.csv

model_selection.csv presents the same content as model_selection.json in tabular form for easier viewing.

output.csv

Contains prediction results from the model, including predicted values and classification probabilities.

  • origin_index column: The index from the user's input dataframe.
  • prep_{x column}: Preprocessed x columns from the preprocess asset. One column per x column.
  • TCR-shap_{x column}: Shapley values for the training data, if enabled. One column per x column.
  • TCR-prob_{y column class name}: The model's predicted probability for each y column class. One column per class.
  • TCR-pred_{y column}: The model's prediction result.
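
Because each group of columns in output.csv shares a prefix, the header can be split by role before further analysis. The sketch below relies only on the prefixes listed above. The function name `split_output_columns` and the demo header (built from the Titanic column names in the log example) are hypothetical.

```python
# Prefixes taken from the column descriptions above.
PREFIXES = {
    "prep_": "prep",
    "TCR-shap_": "shap",
    "TCR-prob_": "prob",
    "TCR-pred_": "pred",
}


def split_output_columns(header):
    """Group output.csv column names by the role implied by their prefix."""
    groups = {"prep": [], "shap": [], "prob": [], "pred": [], "other": []}
    for name in header:
        for prefix, group in PREFIXES.items():
            if name.startswith(prefix):
                groups[group].append(name)
                break
        else:
            # No known prefix: origin_index or an original input column.
            groups["other"].append(name)
    return groups


# Hypothetical header for a binary "Survived" label.
demo_header = [
    "origin_index", "prep_Age", "prep_Fare",
    "TCR-prob_0", "TCR-prob_1", "TCR-pred_Survived",
]
groups = split_output_columns(demo_header)
print(groups)
```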

report.csv

A summary table of the dataset, produced by the readiness asset. The first row of report.csv describes the entire dataset. With ignore_new_category: True, a 'new-categories' column is created if categories that were not used in training are encountered during inference.

  • varname column: The name of the variable.
  • role column: Indicates the columns used in training.
  • dtype: Indicates the data type.
  • TCR-column-type: Indicates the type of data (categorical/numeric).
  • categorical-info: Provides information about the categories within a categorical column.
  • cardinality: The number of unique values in a categorical column.
  • numeric-info-min, max, mean, median: Statistical measures for numeric columns.
  • missing-values-num: The number of missing values.
  • missing-values-ratio: The ratio of missing values.
  • new-categories (inference, ignore_new_category: True): Indicates category data not used in training.

eval_result.csv

eval_result.csv contains evaluation metrics for each class. During inference, it is generated only if the y column is present in the data.

  • label column: Contains the label values of the y column.
  • Accuracy column: Contains the accuracy score (common for all labels).
  • F1-Score column: Contains the F1 score for each label.
  • Recall column: Contains the recall value for each label.
  • Precision column: Contains the precision value for each label.
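
For reference, the columns above correspond to the standard one-vs-rest definitions of these metrics. The following sketch computes them from true and predicted labels. It shows the textbook formulas, not TCR's own code, and the function name `per_class_metrics` and the demo labels are hypothetical.

```python
def per_class_metrics(y_true, y_pred):
    """Return one row per label with Accuracy, F1-Score, Recall, and Precision."""
    pairs = list(zip(y_true, y_pred))
    # Accuracy is computed once over all samples, so it is the same in every row.
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    rows = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        rows.append({
            "label": label, "Accuracy": accuracy,
            "F1-Score": f1, "Recall": recall, "Precision": precision,
        })
    return rows


# Hypothetical labels for four samples.
metrics = per_class_metrics(["OK", "OK", "NG", "NG"], ["OK", "NG", "NG", "NG"])
for row in metrics:
    print(row)
```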

summary_plot.png

A summary plot of Shapley values, if enabled.

  • Shows the impact of each x column's data on the y column class.
  • Computes the impact by averaging the absolute values of the Shapley values.
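
The averaging step can be illustrated as follows. This is a minimal sketch of the "mean of absolute Shapley values" idea only. The function name `mean_abs_shap` and the demo values are hypothetical, and the plotting itself is not reproduced.

```python
def mean_abs_shap(shap_rows):
    """Rank x columns by the mean absolute Shapley value across samples.

    `shap_rows` is a list of dicts mapping an x column name to its Shapley
    value for one sample.
    """
    totals = {}
    for row in shap_rows:
        for column, value in row.items():
            totals[column] = totals.get(column, 0.0) + abs(value)
    count = len(shap_rows)
    means = {column: total / count for column, total in totals.items()}
    # Largest impact first, as in a typical summary plot ordering.
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Hypothetical Shapley values for two samples and two x columns.
ranking = mean_abs_shap([
    {"Age": -0.2, "Fare": 0.1},
    {"Age": 0.4, "Fare": -0.1},
])
print(ranking)
```

Taking absolute values before averaging keeps positive and negative contributions from cancelling out, so a column that pushes predictions strongly in both directions still ranks as influential.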

train_summary.yaml

A summary file shown in the Edge Conductor UI, created using the ALO save_summary API. For details, refer to ALO API: save_summary. The file contains the following information:

  1. Classification
  • result: Indicates whether it is Binary or Multi classification ('Binary-Classification' / 'Multi-Classification').
  • score: Confidence score using entropy.
  • note: Explanation of the score. 'Confidence Score (The closer to 1, the better the predictive performance of the model)'.
  • probability: {}
  2. Regression
  • result: 'regression'.
  • score: The R^2 score between target and predicted values.
  • note:
    • R^2 score < 0:
      • WARNING!! R2 < 0: R2 had returned a negative value. Reconsider your model or data.
    • R^2 score = 1:
      • WARNING!! R2 = 1: The model may be overfitted to the training data. Reconsider your data.
    • 0 < R^2 score < 1:
      • The independent variables (x_columns) and the model account for about {score}% of the data.
  • probability: {}
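
For the regression case, the score and the choice of note can be sketched as below. The R^2 formula is the standard coefficient of determination. The helper names `r2_score` and `r2_note` are hypothetical, the demo values are made up, and the handling of a score of exactly 0 is an assumption, since the list above does not cover that case. The `{score}` placeholder is left unformatted because the source does not state how the percentage is rendered.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the target is constant")
    return 1.0 - ss_res / ss_tot


def r2_note(score):
    """Pick the note text for a given R^2 score (messages quoted from above)."""
    if score < 0:
        return ("WARNING!! R2 < 0: R2 had returned a negative value. "
                "Reconsider your model or data.")
    if score == 1:
        return ("WARNING!! R2 = 1: The model may be overfitted to the "
                "training data. Reconsider your data.")
    # Assumption: a score of exactly 0 falls through to this branch as well.
    return ("The independent variables (x_columns) and the model account "
            "for about {score}% of the data.")


# Hypothetical targets and predictions.
score = r2_score([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(round(score, 4), r2_note(score))
```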

inference_summary.yaml

  1. Classification
  • When there is exactly 1 inference sample

    • result: The predicted label for the inference sample.
      • ex) result: OK
    • score: Confidence score using entropy.
    • note: Explanation of the score. 'Confidence Score (The closer to 1, the better the predictive performance of the model)'.
    • probability: The model's predicted probability for each label, in the form {label: probability}.
      • ex) {NG: 0.2, OK: 0.8}
  • When there are 2 or more inference samples

    • The content is the same as in train_summary.yaml.
  2. Regression: not supported. The file contains the following placeholder values:
  • result: 'regression'.
  • score: 0.
  • note: 'Not supported - 24/03/12'.
  • probability: {}
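
The exact formula behind the entropy-based confidence score is not given on this page. One common way to turn class probabilities into a score between 0 and 1 is to subtract the normalized Shannon entropy from 1, and the sketch below uses that definition purely as an assumption. The function name `summarize_single_prediction` is hypothetical, and the values it produces may differ from what TCR writes to inference_summary.yaml.

```python
import math


def summarize_single_prediction(probability):
    """Build result/score/probability fields for one classified sample.

    ASSUMPTION: score = 1 - H(p) / log(k), where H is the Shannon entropy and
    k is the number of classes. This is not confirmed to be TCR's formula.
    """
    result = max(probability, key=probability.get)
    k = len(probability)
    entropy = -sum(p * math.log(p) for p in probability.values() if p > 0)
    score = 1.0 - entropy / math.log(k) if k > 1 else 1.0
    return {"result": result, "score": score, "probability": probability}


# Probabilities taken from the example above.
summary = summarize_single_prediction({"NG": 0.2, "OK": 0.8})
print(summary["result"], round(summary["score"], 3))
```

Under this definition, a uniform distribution gives a score of 0 and a fully certain prediction gives 1, which matches the direction described in the note text.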

TCR Version: 2.2.1