
TCR Input and Artifacts

Updated 2024.08.07

Data Preparation

Prepare your training data

To use TCR, prepare tabular data containing feature columns (x columns) and label columns (y columns) for training. Currently, only the '.csv' file format is supported.

Training dataset example

| x_0 | x_1 | ... | x_9 | y |
| --- | --- | --- | --- | --- |
| value | value | ... | value | class |
| ... | ... | ... | ... | ... |

Input data directory structure example

  • To use ALO, the train and inference files must be kept separate. Split the data to be used for training and the data to be used for inference, as shown below.
  • All files under a single folder are aggregated by the input asset into a single dataframe, which is then used for modeling. (Files in subfolders under the path are combined as well.)
  • All files in a folder must have the same columns.
./{train_folder}/
└ train_data1.csv
└ train_data2.csv
└ train_data3.csv
./{inference_folder}/
└ inference_data1.csv
└ inference_data2.csv
└ inference_data3.csv
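The aggregation behavior described above can be reproduced with pandas. The sketch below is illustrative only (the folder and shard names are made up here): every `.csv` under a folder, including subfolders, is read and concatenated into one dataframe.

```python
# Sketch of TCR-style input aggregation: all CSV shards under one folder
# (recursively) are combined into a single dataframe. Folder and file
# names below are throwaway examples, not TCR's real paths.
import glob
import os
import tempfile

import pandas as pd

# Create a throwaway train folder with two CSV shards for illustration.
train_dir = tempfile.mkdtemp()
pd.DataFrame({"x_0": [1, 2], "y": ["a", "b"]}).to_csv(
    os.path.join(train_dir, "train_data1.csv"), index=False)
pd.DataFrame({"x_0": [3, 4], "y": ["a", "b"]}).to_csv(
    os.path.join(train_dir, "train_data2.csv"), index=False)

# Recursively collect every .csv and combine; all shards must share the
# same columns, as the requirements above state.
paths = sorted(glob.glob(os.path.join(train_dir, "**", "*.csv"), recursive=True))
df = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)
```

Because the shards are simply concatenated, a mismatch in columns between files would silently produce NaN-filled columns, which is why the same-columns rule above matters.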

Data Requirements

Essential Requirements

The input data must meet the following conditions:

| index | item | spec. |
| --- | --- | --- |
| 1 | Number of y columns | 1 |
| 2 | Number of classes in the y column | 2 or more |

The following conditions ensure minimum performance. Even if they are not met, the algorithm will still run, but performance is difficult to guarantee.

| index | item | spec. | Remarks |
| --- | --- | --- | --- |
| 1 | The minimum amount of training data must be met. | classification: 30 or more rows per class / regression: 100 or more rows | The default can be adjusted with the user argument min_rows, but using more than the default is recommended to ensure model performance. |
| 2 | During inference, categorical data must not contain categories that were not seen during training. | Yes | This behavior can be adjusted with the user argument ignore_new_category, but retraining is recommended when this happens. |
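A pre-flight check mirroring these requirements can be sketched as below. The thresholds (30 rows per class for classification, 100 rows for regression) follow the defaults in the table; the function name `check_tcr_input` is a hypothetical helper, not part of TCR.

```python
# Hedged sketch of validating TCR's input requirements before training.
# `check_tcr_input` is an illustrative helper; `min_rows` mirrors the
# TCR user argument of the same name (0 means "use the default").
import pandas as pd

def check_tcr_input(df, y_column, task="classification", min_rows=0):
    """Return a list of human-readable violations (empty list == OK)."""
    problems = []
    if task == "classification":
        counts = df[y_column].value_counts()
        if counts.size < 2:  # essential requirement: 2 or more classes
            problems.append("y column must contain 2 or more classes")
        threshold = min_rows or 30  # default: 30 rows per class
        problems += [f"class {label!r} has {n} rows (< {threshold})"
                     for label, n in counts.items() if n < threshold]
    else:  # regression: default 100 rows total
        threshold = min_rows or 100
        if len(df) < threshold:
            problems.append(f"only {len(df)} rows (< {threshold})")
    return problems

df = pd.DataFrame({"x_0": range(60), "y": ["ok"] * 30 + ["ng"] * 30})
issues = check_tcr_input(df, "y")  # meets both requirements
```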


Deliverables  

Training/inference will produce the following outputs:

Train pipeline

./alo/train_artifacts/
└ log/pipeline.log
└ models/train/
    └ best_model.pkl
    └ model_selection.json
    └ model_selection.csv
    └ ...
└ output/
    └ output.csv
└ extra_output/
    └ readiness/
        └ report.csv
    └ train/
        └ eval_result.csv
        └ summary_plot.png (when using the Shapley Value output function)

Inference pipeline

./alo/inference_artifacts/
└ log/pipeline.log
└ output/inference/
    └ output.csv
└ extra_output/
    └ readiness/
        └ report.csv
    └ inference/
        └ eval_result.csv (if there is a y label)
        └ summary_plot.png (when using the Shapley Value output function)
└ score/
    └ inference_summary.yaml

The detailed description of each deliverable is as follows.

pipeline.log

The TCR log for each asset in the pipeline is recorded here. Below is an example of the readiness asset log in pipeline.log. By checking the contents of the log file, you can see the methodology applied by each asset.

# Readiness asset log produced by TCR for the sample_data/train_titanic Titanic data.
...
[2024-04-09 01:53:46,045|USER|INFO|readiness.py(654)|save_info()] Pclass columns are classified as numeric columns.
[2024-04-09 01:53:46,055|USER|WARNING|readiness.py(657)|save_warning()] Column Name has more than 50 unique data and is excluded from x_columns.
[2024-04-09 01:53:46,063|USER|INFO|readiness.py(654)|save_info()] The Sex column is classified as a categorical column.
[2024-04-09 01:53:46,072|USER|INFO|readiness.py(654)|save_info()] Age columns are classified as numeric columns.
[2024-04-09 01:53:46,081|USER|INFO|readiness.py(654)|save_info()] SibSp columns are classified as numeric columns.
[2024-04-09 01:53:46,089|USER|INFO|readiness.py(654)|save_info()] Parch columns are classified as numeric columns.
[2024-04-09 01:53:46,099|USER|WARNING|readiness.py(657)|save_warning()] Column Ticket has more than 50 unique data and is excluded from x_columns.
[2024-04-09 01:53:46,107|USER|INFO|readiness.py(654)|save_info()] Fare columns are classified as numeric columns.
[2024-04-09 01:53:46,116|USER|WARNING|readiness.py(657)|save_warning()] Column Cabin has more than 50 unique data and is excluded from x_columns.
[2024-04-09 01:53:46,126|USER|INFO|readiness.py(654)|save_info()] Embarked columns are classified as categorical columns.
[2024-04-09 01:53:46,135|USER|INFO|readiness.py(654)|save_info()] Of the training columns, ['Sex', 'Embarked'] are classified as categorical columns.
[2024-04-09 01:53:46,144|USER|INFO|readiness.py(654)|save_info()] Of the training columns, ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare'] are classified as numeric columns.
[2024-04-09 01:53:46,153|USER|INFO|readiness.py(654)|save_info()] The Survived column is classified as a categorical column.
...
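Since the log line format is consistent ('|'-separated fields with a level such as INFO or WARNING), the readiness warnings can be pulled out with a simple filter. The sketch below uses an inline sample in place of a real pipeline.log.

```python
# Minimal sketch: extract the WARNING lines (e.g. columns excluded from
# x_columns) from a pipeline.log-style text. The sample below copies the
# log format shown above; in practice you would read the real file.
sample_log = """\
[2024-04-09 01:53:46,045|USER|INFO|readiness.py(654)|save_info()] Pclass columns are classified as numeric columns.
[2024-04-09 01:53:46,055|USER|WARNING|readiness.py(657)|save_warning()] Column Name has more than 50 unique data and is excluded from x_columns.
"""

def find_warnings(log_text):
    # Keep only lines whose level field is WARNING.
    return [line for line in log_text.splitlines() if "|WARNING|" in line]

warnings = find_warnings(sample_log)
```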

best_model.pkl

This is the model file trained on the entire training data with the best parameters selected through HPO.
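The file can be loaded with Python's pickle module. The exact estimator class depends on the HPO result, so the sketch below pickles and restores a stand-in object to show the round trip; in practice you would open the best_model.pkl path from train_artifacts instead.

```python
# Sketch of loading best_model.pkl. A stand-in class replaces the real
# HPO-selected estimator so the example is self-contained.
import io
import pickle

class DummyModel:  # stand-in for the HPO-selected estimator
    def predict(self, rows):
        return ["OK" for _ in rows]

buf = io.BytesIO()
pickle.dump(DummyModel(), buf)
buf.seek(0)

# In practice:
#   with open("./alo/train_artifacts/models/train/best_model.pkl", "rb") as f:
#       model = pickle.load(f)
model = pickle.load(buf)
preds = model.predict([[1, 2], [3, 4]])
```

Note that unpickling executes arbitrary code, so only load model files you produced yourself or trust.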

model_selection.json

HPO-related information. The details are as follows.

  • best_model_info: the best model and parameters found by HPO
  • data_split: the data split method applied during HPO
  • evaluation_metric: the evaluation metric
  • target_label: the y class used as the HPO reference
  • x_columns: the x columns used for training
  • model_score: the HPO model candidates, with model name and parameters

model_selection.csv

model_selection.json converted to CSV for easier viewing. The contents are identical to model_selection.json.
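Inspecting the JSON version programmatically is straightforward. The keys below follow the fields listed above; the values are invented placeholders, not real TCR output.

```python
# Hedged sketch of reading model_selection.json. Keys mirror the fields
# documented above; the values here are made-up placeholders.
import json

raw = json.dumps({
    "best_model_info": {"model": "random_forest", "param": {"n_estimators": 100}},
    "data_split": "cross_validation",
    "evaluation_metric": "f1",
    "target_label": "OK",
    "x_columns": ["x_0", "x_1"],
    "model_score": [{"model": "random_forest", "score": 0.91}],
})

# In practice:
#   with open("./alo/train_artifacts/models/train/model_selection.json") as f:
#       selection = json.load(f)
selection = json.loads(raw)
best = selection["best_model_info"]
```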

output.csv

output.csv contains the model's predictions on the train/inference data and, for classification, the class probabilities.

  • origin_index column: the index of the user-provided dataframe.
  • prep_{x column}: x columns preprocessed by the preprocess asset; one per x column.
  • TCR-shap_{x column}: created when the Shapley Value feature is enabled; contains the Shapley Value of the data; one per x column.
  • TCR-prob_{y column class name}: the probability that the model assigns the y column to a specific class; one column per class in the y column.
  • TCR-pred_{y column}: the prediction result of the model.
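Because the generated columns follow a fixed naming scheme, they can be separated by prefix when post-processing output.csv. The dataframe below is an invented stand-in for a real output.csv.

```python
# Sketch: split output.csv columns by their TCR prefixes. Column names
# follow the naming scheme above; the data itself is made up.
import pandas as pd

out = pd.DataFrame({
    "origin_index": [0, 1],
    "prep_x_0": [0.1, 0.9],
    "TCR-prob_OK": [0.8, 0.3],
    "TCR-prob_NG": [0.2, 0.7],
    "TCR-pred_y": ["OK", "NG"],
})

# In practice: out = pd.read_csv("./alo/inference_artifacts/output/inference/output.csv")
prob_cols = [c for c in out.columns if c.startswith("TCR-prob_")]
pred_cols = [c for c in out.columns if c.startswith("TCR-pred_")]
```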

report.csv

This table summarizes information about the data, produced by the readiness asset. The first row of report.csv describes the entire dataset. If ignore_new_category is true and the inference data contains categories that were not seen during training, a 'new-categories' column is created.

  • varname column: the name of the variable.
  • role column: indicates whether the column is used for training.
  • dtype: the data type.
  • TCR-column-type: the column type inferred by TCR. (categorical/numeric)
  • categorical-info: category information for categorical columns.
  • cardinality: the number of unique values in a categorical column.
  • numeric-info-min, max, mean, median: statistics for numeric columns.
  • missing-values-num: the number of missing values.
  • missing-values-ratio: the ratio of missing values.
  • new-categories (inference, ignore_new_category: True): category values that were not seen during training.
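report.csv is a plain table, so a quick data-quality screen is easy to script. The example below flags columns with a high missing ratio; the dataframe and the 0.5 threshold are illustrative, not TCR defaults.

```python
# Sketch: screen report.csv for columns with many missing values.
# Column names follow the report.csv fields above; data and the 0.5
# threshold are illustrative assumptions.
import pandas as pd

report = pd.DataFrame({
    "varname": ["Age", "Cabin", "Sex"],
    "missing-values-ratio": [0.20, 0.77, 0.0],
})

# In practice: report = pd.read_csv(".../extra_output/readiness/report.csv")
flagged = report.loc[report["missing-values-ratio"] > 0.5, "varname"].tolist()
```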

eval_result.csv

eval_result.csv records the evaluation metric for each class. For inference, it is produced only if a y column exists.

  • label column: the label values of the y column.
  • accuracy column: the accuracy score. (common to all labels)
  • f1-score column: the F1 score for each label.
  • recall column: the recall for each label.
  • precision column: the precision for each label.
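Since the metrics are stored per label, macro averages can be derived by averaging the rows. The dataframe below is a made-up stand-in for a real eval_result.csv, and the exact column capitalization in the real file may differ.

```python
# Sketch: derive macro-averaged scores from eval_result.csv-style rows.
# Values are invented; column names follow the description above.
import pandas as pd

eval_result = pd.DataFrame({
    "label": ["OK", "NG"],
    "accuracy": [0.9, 0.9],       # common to all labels
    "f1-score": [0.92, 0.85],
    "recall": [0.95, 0.80],
    "precision": [0.89, 0.91],
})

# In practice: eval_result = pd.read_csv(".../extra_output/train/eval_result.csv")
macro_f1 = eval_result["f1-score"].mean()
macro_recall = eval_result["recall"].mean()
```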

summary_plot.png

The Shapley Value summary plot generated when the Shapley Value feature is used.

  • Shows how much the data in each x column affected the class of the y column.
  • The impact is computed as the mean of the absolute Shapley Values.
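The aggregation behind the plot (mean of absolute Shapley Values per column) can be sketched in a few lines. The Shapley Value matrix below is invented for illustration.

```python
# Sketch of the summary-plot aggregation: the global importance of each
# x column is the mean of its absolute Shapley Values across rows.
# The matrix below is made-up data (rows = samples, columns = x columns).
import numpy as np

shap_values = np.array([
    [ 0.5, -0.1],
    [-0.5,  0.3],
])

mean_abs_impact = np.abs(shap_values).mean(axis=0)
```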

train_summary.yaml

train_summary.yaml is the yaml file shown in the Edge Conductor UI. It is created with ALO's save_summary API. For more information, see the [ALO API: save_summary](../../alo/alo-v2/appendix_alo_api#save_summary) guide. The train_summary.yaml generated by the train pipeline contains the following information:

  1. Classification
    • result: 'Binary-Classification' or 'Multi-Classification', depending on whether the problem is binary or multi-class.
    • score: a confidence score computed using entropy.
    • note: an explanation of the score. 'Confidence Score (The closer to 1, the better the predictive performance of the model)'
    • probability: {}
  2. Regression
    • result: 'regression'
    • score: R2 (coefficient of determination).
    • note: an interpretation of the calculated score.
      • score < 0: 'WARNING!! R2 < 0: R2 had returned minus value. Reconsider your model or data'
      • score = 1: 'WARNING!! R2 = 1: The model may be overfitted to the training data. Reconsider your data'
      • 0 < score < 1: 'The independent variables(x_columns) you used and the model account for about {score}% of the data'
    • probability: {}
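The score interpretation for the regression case can be sketched as a small function. Whether the score is scaled to a percentage in the last message is an assumption here, and `regression_note` is an illustrative helper, not TCR's actual code.

```python
# Hedged sketch of the R2-dependent note selection described above.
# `regression_note` is hypothetical; message texts follow the document.
def regression_note(score):
    if score < 0:
        return "WARNING!! R2 < 0: R2 had returned minus value. Reconsider your model or data"
    if score == 1:
        return "WARNING!! R2 = 1: The model may be overfitted to the training data. Reconsider your data"
    # Assumption: {score} is rendered as a percentage in the message.
    return ("The independent variables(x_columns) you used and the model "
            f"account for about {score * 100:.0f}% of the data")

note = regression_note(0.82)
```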

inference_summary.yaml

  1. Classification
    • When the number of inference data rows is 1:
      • result: the predicted label of the inference data.
        • e.g. result: OK
      • score: a confidence score computed using entropy.
      • note: an explanation of the score. 'Confidence Score (The closer to 1, the better the predictive performance of the model)'
      • probability: filled with {label: the model's classification probability for that label}.
        • e.g. {NG: 0.2, OK: 0.8}
    • When the number of inference data rows is 2 or more:
      • Same as train_summary.yaml.
  2. Regression: not supported. The contents are fixed as follows.
    • result: 'regression'
    • score: 0
    • note: 'Not supported - 24/03/12'
    • probability: {}
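For the single-row classification case, the relationship between the probability map and the result field can be sketched as follows. The probabilities are invented, and how TCR derives the entropy-based score is not reproduced here.

```python
# Sketch: in the single-row case, the summary's result is the label with
# the highest classification probability. Values below are made up.
probability = {"NG": 0.2, "OK": 0.8}

result = max(probability, key=probability.get)  # label with highest probability
summary = {"result": result, "probability": probability}
```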

TCR Version: 3.0.0