TCR Input and Artifacts
Data Preparation
Training Data Preparation
To use TCR, prepare tabular data with columns for the training features (x columns) and one label column (y column). Currently, only the .csv file format is supported.
Training Dataset Example
x_0 | x_1 | ... | x_9 | y |
---|---|---|---|---|
value | value | ... | value | class |
... | ... | ... | ... | ... |
Input Data Directory Structure Example
- To use ALO, separate files for training and inference data are required. Organize the data as shown below.
- All files in a single folder are combined into a single dataframe by the input asset and used for modeling (files in subfolders are also combined).
- All files in a folder must have the same column names.
./{train_folder}/
└ train_data1.csv
└ train_data2.csv
└ train_data3.csv
./{inference_folder}/
└ inference_data1.csv
└ inference_data2.csv
└ inference_data3.csv
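The combining behavior described above can be approximated as follows. This is a minimal sketch using only the Python standard library, not the input asset's actual implementation; `load_folder` is a hypothetical helper name, and the column-name check reflects the rule that all files in a folder must share the same columns.

```python
import csv
from pathlib import Path


def load_folder(folder):
    """Read every .csv under `folder` (subfolders included) into one list of row dicts."""
    rows, expected = [], None
    for path in sorted(Path(folder).rglob("*.csv")):
        with path.open(newline="") as f:
            reader = csv.DictReader(f)
            header = reader.fieldnames or []
            if expected is None:
                expected = header
            elif set(header) != set(expected):
                # All files in a folder must have the same column names.
                raise ValueError(f"{path} has different column names: {header}")
            rows.extend(reader)
    return rows
```

Calling `load_folder("./{train_folder}/")` and `load_folder("./{inference_folder}/")` separately mirrors the requirement to keep training and inference data in separate folders.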
Data Requirements
Mandatory Requirements
The input data must meet the following conditions:
Index | Item | Specification |
---|---|---|
1 | Number of y columns | 1 |
2 | Number of classes (labels) in y column | 2 or more |
Additional Requirements
These conditions help ensure minimum performance. The algorithm will still run if these conditions are not met, but performance is not guaranteed.
Index | Item | Specification | Note |
---|---|---|---|
1 | Minimum amount of training data | classification: at least 30 samples per class / regression: at least 100 samples | The default values can be adjusted using the min_rows user argument, but it is recommended to use data that meets or exceeds the default values to ensure model performance. |
2 | No unseen categorical data during inference | Yes | This behavior can be controlled using the ignore_new_category user argument, but retraining is recommended if this occurs. |
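For classification, the requirements above can be checked before running the pipeline. The following is a sketch under stated assumptions: `check_classification_labels` is a hypothetical helper, and its `min_rows` parameter only mirrors the default of 30 samples per class from the table; how TCR applies the min_rows user argument internally is not shown here.

```python
from collections import Counter


def check_classification_labels(labels, min_rows=30):
    """Validate a y column and return the classes below the recommended sample count."""
    counts = Counter(labels)
    if len(counts) < 2:
        # Mandatory requirement: the y column needs 2 or more classes.
        raise ValueError("The y column must contain at least 2 classes.")
    # Additional requirement: fewer samples is allowed, but performance is not guaranteed.
    return {label: n for label, n in counts.items() if n < min_rows}
```

An empty result means every class meets the recommended minimum. For regression, the equivalent check is simply that the number of rows is at least 100.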
Artifacts
Executing training/inference generates the following artifacts.
Train Pipeline
./alo/train_artifacts/
└ log/pipeline.log
└ models/train/
    └ best_model.pkl
    └ model_selection.json
    └ model_selection.csv
    └ ...
└ output/
    └ output.csv
└ extra_output
    └ readiness
        └ report.csv
    └ train
        └ eval_result.csv
        └ summary_plot.png (if Shapley value output is enabled)
Inference Pipeline
./alo/inference_artifacts/
└ log/pipeline.log
└ output/inference/
    └ output.csv
└ extra_output
    └ readiness/
        └ report.csv
    └ inference/
        └ eval_result.csv (in cases where the y label is present)
        └ summary_plot.png (if Shapley value output is enabled)
└ score/
    └ inference_summary.yaml
Description of Artifacts
pipeline.log
Contains the logs for each TCR asset in the pipeline. The log shows which methods each asset applied.
# Example log from the readiness asset for TCR's sample_data/train_titanic dataset.
...
[2024-04-09 01:53:46,045|USER|INFO|readiness.py(654)|save_info()] Column Pclass is classified as a numeric column.
[2024-04-09 01:53:46,055|USER|WARNING|readiness.py(657)|save_warning()] Column Name has more than 50 unique values, excluding from x_columns.
[2024-04-09 01:53:46,063|USER|INFO|readiness.py(654)|save_info()] Column Sex is classified as a categorical column.
[2024-04-09 01:53:46,072|USER|INFO|readiness.py(654)|save_info()] Column Age is classified as a numeric column.
[2024-04-09 01:53:46,081|USER|INFO|readiness.py(654)|save_info()] Column SibSp is classified as a numeric column.
[2024-04-09 01:53:46,089|USER|INFO|readiness.py(654)|save_info()] Column Parch is classified as a numeric column.
[2024-04-09 01:53:46,099|USER|WARNING|readiness.py(657)|save_warning()] Column Ticket has more than 50 unique values, excluding from x_columns.
[2024-04-09 01:53:46,107|USER|INFO|readiness.py(654)|save_info()] Column Fare is classified as a numeric column.
[2024-04-09 01:53:46,116|USER|WARNING|readiness.py(657)|save_warning()] Column Cabin has more than 50 unique values, excluding from x_columns.
[2024-04-09 01:53:46,126|USER|INFO|readiness.py(654)|save_info()] Column Embarked is classified as a categorical column.
[2024-04-09 01:53:46,135|USER|INFO|readiness.py(654)|save_info()] Among training columns, ['Sex', 'Embarked'] are classified as categorical columns.
[2024-04-09 01:53:46,144|USER|INFO|readiness.py(654)|save_info()] Among training columns, ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare'] are classified as numeric columns.
[2024-04-09 01:53:46,153|USER|INFO|readiness.py(654)|save_info()] Column Survived is classified as a categorical column.
...
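Log lines in the format shown above can be parsed to find which columns the readiness asset excluded. This is a minimal sketch: the regular expressions are inferred from the sample lines only, and `excluded_columns` is a hypothetical helper, not part of TCR.

```python
import re

# Pattern inferred from the sample: [time|logger|level|file(line)|function()] message
LOG_LINE = re.compile(
    r"^\[(?P<time>[^|]+)\|(?P<logger>[^|]+)\|(?P<level>[^|]+)"
    r"\|(?P<source>[^|]+)\|(?P<func>[^\]]+)\] (?P<message>.*)$"
)
EXCLUDED = re.compile(r"^Column (?P<column>.+?) has more than \d+ unique values")


def excluded_columns(lines):
    """Return the column names reported as excluded from x_columns in WARNING lines."""
    found = []
    for line in lines:
        entry = LOG_LINE.match(line.strip())
        if entry is None or entry["level"] != "WARNING":
            continue  # skip elided lines ("...") and non-warning entries
        hit = EXCLUDED.match(entry["message"])
        if hit:
            found.append(hit["column"])
    return found
```

Passing the lines of pipeline.log to `excluded_columns` gives a quick list of columns to review before retraining.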
best_model.pkl
The model file, trained on the entire training data using the best parameters selected through hyperparameter optimization (HPO).
model_selection.json
Contains information related to HPO, including:
- best_model_info: The best model and parameters selected by HPO.
- data_split: The data split method used during HPO.
- evaluation_metric: The evaluation metric.
- target_label: The y class used as the HPO criterion.
- x_columns: The training target x columns.
- model_score: The HPO model candidates, including model names and parameters.
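The file can be inspected with a few lines of Python. This sketch assumes the items listed above appear as top-level keys of the JSON object, which this document does not confirm; `load_model_selection` is a hypothetical helper and the structure of each value is left untouched.

```python
import json
from pathlib import Path

# Key names taken from the list above; their presence as top-level keys is an assumption.
EXPECTED_KEYS = [
    "best_model_info",
    "data_split",
    "evaluation_metric",
    "target_label",
    "x_columns",
    "model_score",
]


def load_model_selection(path):
    """Load model_selection.json and report which of the documented keys are absent."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in EXPECTED_KEYS if key not in data]
    return data, missing
```

Based on the directory tree above, the file is expected at ./alo/train_artifacts/models/train/model_selection.json.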
model_selection.csv
A CSV version of model_selection.json, provided for easier viewing. The content is the same as in model_selection.json.
output.csv
Contains prediction results from the model, including predicted values and classification probabilities.
- origin_index column: The index from the user's input dataframe.
- prep_{x column}: Preprocessed x columns from the preprocess asset. One column per x column.
- TCR-shap_{x column}: Shapley values for the training data, if enabled. One column per x column.
- TCR-prob_{y column class name}: The model's predicted probability for each y column class. One column per class.
- TCR-pred_{y column}: The model's prediction result.
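Because each group of columns shares a prefix, the header of output.csv can be split into groups by name. The sketch below relies only on the prefixes listed above; `group_output_columns` and the group names it returns are hypothetical.

```python
def group_output_columns(header):
    """Group output.csv column names by the prefixes documented for TCR output."""
    prefixes = {
        "prep_": "preprocessed",
        "TCR-shap_": "shapley",
        "TCR-prob_": "probability",
        "TCR-pred_": "prediction",
    }
    groups = {name: [] for name in prefixes.values()}
    groups["other"] = []  # origin_index and any original input columns
    for column in header:
        for prefix, name in prefixes.items():
            if column.startswith(prefix):
                groups[name].append(column[len(prefix):])
                break
        else:
            groups["other"].append(column)
    return groups
```

For a header read with `csv.reader`, the "probability" group then lists the y classes and the "prediction" group names the y column.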
report.csv
A summary table of the dataset, produced by the readiness asset. The first row of report.csv describes the entire dataset. During inference with ignore_new_category: True, a 'new-categories' column is created if the data contains categories that were not used in training.
- varname column: The name of the variable.
- role column: Indicates the columns used in training.
- dtype: Indicates the data type.
- TCR-column-type: Indicates the type of data (categorical/numeric).
- categorical-info: Provides information about the categories within a categorical column.
- cardinality: The number of unique values in a categorical column.
- numeric-info-min, max, mean, median: Statistical measures for numeric columns.
- missing-values-num: The number of missing values.
- missing-values-ratio: The ratio of missing values.
- new-categories (inference, ignore_new_category: True): Indicates category data not used in training.
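Several of these fields are simple per-column statistics. The following sketch shows how they could be computed for one categorical column; it is not the readiness asset's code, `column_summary` is a hypothetical helper, and treating `None` and empty strings as missing values is an assumption.

```python
def column_summary(values, train_categories=None):
    """Compute report-style statistics for one categorical column."""
    is_missing = lambda v: v is None or v == ""  # assumed definition of "missing"
    missing = sum(1 for v in values if is_missing(v))
    present = [v for v in values if not is_missing(v)]
    summary = {
        "cardinality": len(set(present)),
        "missing-values-num": missing,
        "missing-values-ratio": missing / len(values) if values else 0.0,
    }
    if train_categories is not None:
        # Inference only: categories that did not appear in the training data.
        summary["new-categories"] = sorted(set(present) - set(train_categories))
    return summary
```

A non-empty "new-categories" result corresponds to the unseen-category situation for which retraining is recommended.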
eval_result.csv
eval_result.csv contains evaluation metrics for each class. During inference, it is generated only if the y column is present.
- label column: Contains the label values of the y column.
- Accuracy column: Contains the accuracy score (the same value for all labels).
- F1-Score column: Contains the F1 score for each label.
- Recall column: Contains the recall value for each label.
- Precision column: Contains the precision value for each label.
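The metrics above follow the standard one-vs-rest definitions. As an illustration, they can be computed from true and predicted labels as shown below; this is a plain-Python sketch of the standard formulas, not TCR's evaluation code, and `per_class_metrics` is a hypothetical name.

```python
def per_class_metrics(y_true, y_pred):
    """Return one row per label with the columns used in eval_result.csv."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)  # shared by all labels
    rows = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        rows.append({
            "label": label,
            "Accuracy": accuracy,
            "F1-Score": f1,
            "Recall": recall,
            "Precision": precision,
        })
    return rows
```

Each returned row repeats the same accuracy value, which matches the note that accuracy is common to all labels.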
summary_plot.png
A summary plot of Shapley values, if enabled.
- Shows the impact of each x column's data on the y column class.
- Computes the impact by averaging the absolute values of the Shapley values.
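The averaging step can be reproduced from the TCR-shap_{x column} values in output.csv. This is a minimal sketch of the mean absolute Shapley value per column; `mean_abs_shap` is a hypothetical helper and the plotting itself is omitted.

```python
def mean_abs_shap(shap_rows):
    """Average |Shapley value| per x column, sorted from most to least impactful.

    shap_rows: list of dicts mapping an x column name to its Shapley value for one row.
    """
    totals = {}
    for row in shap_rows:
        for column, value in row.items():
            totals[column] = totals.get(column, 0.0) + abs(value)
    n = len(shap_rows)
    averaged = ((column, total / n) for column, total in totals.items())
    return dict(sorted(averaged, key=lambda item: item[1], reverse=True))
```

Taking absolute values before averaging keeps positive and negative contributions from cancelling out, so a column with large effects in both directions still ranks high.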
train_summary.yaml
A summary file shown in the Edge Conductor UI, created using the ALO save_summary API. For details, refer to ALO API: save_summary. The file contains the following information:
- Classification
  - result: Indicates whether it is Binary or Multi classification ('Binary-Classification' / 'Multi-Classification').
  - score: Confidence score using entropy.
  - note: Explanation of the score. 'Confidence Score (The closer to 1, the better the predictive performance of the model)'.
  - probability: {}
- Regression
  - result: 'regression'.
  - score: The R^2 score between target and predicted values.
  - note:
    - R^2 score < 0:
      - WARNING!! R2 < 0: R2 had returned a negative value. Reconsider your model or data.
    - R^2 score = 1:
      - WARNING!! R2 = 1: The model may be overfitted to the training data. Reconsider your data.
    - 0 < R^2 score < 1:
      - The independent variables (x_columns) and the model account for about {score}% of the data.
  - probability: {}
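For regression, the score and the choice of note can be illustrated with the standard R^2 definition. The sketch below is not TCR's implementation: `r2_score` and `r2_note_case` are hypothetical helpers, the returned case names are made up for illustration, and the case R^2 = 0 is marked as unspecified because the list above does not cover it.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all target values are identical.")
    return 1 - ss_res / ss_tot


def r2_note_case(score):
    """Map an R^2 score to the note category described for train_summary.yaml."""
    if score < 0:
        return "negative"      # WARNING!! R2 < 0
    if score == 1:
        return "perfect"       # WARNING!! R2 = 1
    if 0 < score < 1:
        return "explained"     # share of the data accounted for
    return "unspecified"       # R^2 = 0 is not covered by the documented cases
```

A negative score means the model predicts worse than always returning the mean of the target values, which is why that case is flagged with a warning.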
inference_summary.yaml
- Classification
  - When the number of inference data records is 1
    - result: The inference data judgment result (label)
      - ex) result: OK
    - score: Confidence score using entropy.
    - note: Explanation of the score. 'Confidence Score (The closer to 1, the better the predictive performance of the model)'.
    - probability: A mapping of the form {label: the model's predicted probability for that label}.
      - ex) {NG: 0.2, OK: 0.8}
  - When the number of inference data records is 2 or more
    - The content is the same as in train_summary.yaml.
- Regression: Not supported. The summary contains the following fixed values:
  - result: 'regression'.
  - score: 0.
  - note: 'Not supported - 24/03/12'.
  - probability: {}
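This document does not give the formula behind the entropy-based confidence score. One common construction, shown here purely as an assumption, is 1 minus the Shannon entropy of the class probabilities normalized by its maximum, so that a uniform prediction scores 0 and a fully certain one scores 1. `entropy_confidence` and `single_record_summary` are hypothetical helpers and may not match the values TCR writes.

```python
import math


def entropy_confidence(probabilities):
    """Assumed formula: 1 - H(p) / log(K), where K is the number of classes."""
    probs = list(probabilities.values())
    if len(probs) < 2:
        raise ValueError("At least 2 classes are required.")
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(len(probs))


def single_record_summary(probabilities):
    """Build the classification fields for a single inference record."""
    label = max(probabilities, key=probabilities.get)  # most probable label
    return {
        "result": label,
        "score": entropy_confidence(probabilities),
        "probability": dict(probabilities),
    }
```

With the example probabilities {NG: 0.2, OK: 0.8}, `single_record_summary` returns OK as the result, along with a score between 0 and 1 under this assumed formula.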
TCR Version: 2.2.1