GCR Input and Artifacts
Data Preparation
Preparing Training Data
To use GCR, you need tabular data with feature columns (x columns) and a label column (y column) for training. Currently, only the .csv file format is supported.
Preparing Inference Data
The inference dataset for GCR should have the same structure as the training dataset, except that it does not require a label column. Inference data also uses the .csv file format.
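The relationship between feature columns and the label column can be sketched as follows. This is a minimal illustration using only the Python standard library, not GCR's own loader; the function name `split_features_and_label` and the cut-down inline CSV are assumptions made for the example.

```python
import csv
import io


def split_features_and_label(csv_text, label_column):
    """Parse CSV text and return (feature_rows, labels).

    Hypothetical helper for illustration; GCR reads its input through ALO.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    if label_column not in (reader.fieldnames or []):
        raise ValueError(f"label column {label_column!r} not found")
    feature_rows, labels = [], []
    for row in reader:
        # Remove the label from the row so only x columns remain.
        labels.append(row.pop(label_column))
        feature_rows.append(row)
    return feature_rows, labels


# A cut-down training table: three x columns plus the label column "target".
train_csv = (
    "age,workclass,sex,target\n"
    "39,State-gov,Male,1\n"
    "50,Self-emp-not-inc,Female,1\n"
)
x_rows, y = split_features_and_label(train_csv, "target")
```

An inference file would contain the same x columns without `target`, so the same parsing applies minus the label step.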
Example of Training and Inference Datasets
When you clone the GCR AI content from the git repository, a sample_data directory is available in the root directory, containing the train and inference sets. The sample data has the following characteristics:
- Purpose: Classifying whether an adult's income exceeds $50,000.
- Data: Information about adults, including various attributes and income information (32,561 rows with missing values in some variables).
- Source: UCI Machine Learning Repository - Adult Data Set
Below is an example of the training dataset. The label column target is included in the training set but not in the inference set.
Training Dataset Example:
idx | age | workclass | sex | ... | ID | target |
---|---|---|---|---|---|---|
0 | 39 | State-gov | Male | ... | CX0 | 1 |
1 | 50 | Self-emp-not-inc | Female | ... | CX1 | 1 |
... | ... | ... | ... | ... | ... | ... |
26070 | 58 | Private | Female | ... | CX32558 | 0 |
26071 | 22 | Private | Male | ... | CX32559 | 1 |
Inference Dataset Example:
idx | age | workclass | sex | ... | ID |
---|---|---|---|---|---|
0 | 52 | Self-emp-not-inc | Male | ... | CX7 |
1 | 31 | Private | Female | ... | CX8 |
... | ... | ... | ... | ... | ... |
6487 | 27 | Private | Female | ... | CX32556 |
6488 | 52 | Self-emp-inc | Female | ... | CX32560 |
Example of Input Data Directory Structure
- To use ALO, the train and inference files must be kept in separate folders, as shown below.
- All files under a single folder are combined into one dataframe by the input asset and used for modeling. (Files in subfolders are also combined.)
- All data files in a folder must have identical columns.
./{train_folder}/
└ train_data1.csv
└ train_data2.csv
└ train_data3.csv
./{inference_folder}/
└ inference_data1.csv
└ inference_data2.csv
└ inference_data3.csv
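The folder rules above (all files combined, subfolders included, identical columns required) can be sketched roughly as follows. This is not the input asset's actual code: `load_folder` is a hypothetical name, plain lists of dictionaries stand in for a dataframe, and comparing column order as well as column names is an assumption.

```python
import csv
from pathlib import Path


def load_folder(folder):
    """Combine every .csv under `folder` (subfolders included) into one list of rows."""
    combined = []
    expected_columns = None
    for path in sorted(Path(folder).rglob("*.csv")):
        with path.open(newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            columns = reader.fieldnames
            if expected_columns is None:
                # The first file found defines the expected columns.
                expected_columns = columns
            elif columns != expected_columns:
                raise ValueError(
                    f"{path} has columns {columns}, expected {expected_columns}"
                )
            combined.extend(reader)
    return combined
```

Calling `load_folder` once on the train folder and once on the inference folder would yield one table per pipeline.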
Data Requirements
Mandatory Requirements
The input data must meet the following conditions. Refer to the data example above for guidance.
Dataset | Item | Note |
---|---|---|
train set, inference set | A single pipeline input receives one table as input. | The train pipeline requires one train set table, and the inference pipeline requires one inference set table. |
train set | The train pipeline input table needs a label column (target column) for training, which must not have missing values. | The train set must contain a label column as it serves as input to the ML model. |
train set, inference set | The train set and inference set must have the same column structure. | The label column is the only exception: it is not required in the inference set. |
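A pre-flight check for these mandatory requirements might look like the sketch below. It is an illustration only, not GCR's internal validation: the function name, the list-of-dictionaries table representation, and the rule that an empty or whitespace-only string counts as a missing label are all assumptions.

```python
def check_mandatory_requirements(train_rows, inference_rows, label_column):
    """Return a list of problems; an empty list means every check passed."""
    # Assumption of this sketch: both tables contain at least one row.
    if not train_rows or not inference_rows:
        return ["both the train set and the inference set need at least one row"]

    problems = []
    train_columns = list(train_rows[0].keys())

    # The train set needs a label column without missing values.
    if label_column not in train_columns:
        problems.append(f"train set has no label column {label_column!r}")
    else:
        missing = [
            i for i, row in enumerate(train_rows)
            if row.get(label_column) is None
            or str(row.get(label_column)).strip() == ""
        ]
        if missing:
            problems.append(f"label is missing in train rows {missing}")

    # Apart from the label, both sets must share the same columns.
    train_features = [c for c in train_columns if c != label_column]
    inference_features = [c for c in inference_rows[0].keys() if c != label_column]
    if train_features != inference_features:
        problems.append("train and inference feature columns differ")
    return problems
```

Running it before launching a pipeline gives a readable list of issues instead of a failure partway through training.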
Additional Requirements
It is recommended that the train and inference sets meet the following conditions. If these conditions are not met, preprocessing is performed internally.
Item | Note |
---|---|
Remove or replace spaces and other whitespace characters. | Spaces are replaced automatically, so leaving them in does not break the model. e.g., 베스트 샵 -> 베스트_샵 (Best Shop -> Best_Shop) |
Using primarily categorical columns improves speed and performance. | Continuous variables have high selectivity and may contribute little to the embedding. When a variable's values are highly diverse, the same value rarely appears in more than one row, so those rows are unlikely to be connected in the graph and the variable does little to improve performance. |
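Both recommendations can be checked before training with a few lines of code. The sketch below is hypothetical: GCR's internal preprocessing is not shown in this document, so replacing each whitespace character with one underscore and using the ratio of distinct values as a rough selectivity measure are assumptions.

```python
import re


def replace_whitespace(value, replacement="_"):
    """Replace every whitespace character in a string with `replacement`."""
    return re.sub(r"\s", replacement, value)


def unique_ratio(values):
    """Share of distinct values in a column, from just above 0.0 to 1.0.

    A ratio near 1.0 means almost every row has its own value, so few rows
    would share a value and be linked through that column.
    """
    values = list(values)
    if not values:
        raise ValueError("column is empty")
    return len(set(values)) / len(values)


cleaned = replace_whitespace("베스트 샵")
categorical_ratio = unique_ratio(["Private", "Private", "State-gov", "State-gov"])
continuous_ratio = unique_ratio([39.1, 50.4, 58.9, 22.3])
```

Here `categorical_ratio` is 0.5 and `continuous_ratio` is 1.0, which mirrors why a column of mostly unique continuous values adds little to the graph.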
Artifacts
Running the training or inference pipeline will generate the following artifacts.
Train Pipeline
./alo/train_artifacts/
└ log/
  └ experimental_history.json
  └ pipeline.log
  └ process.log
└ models/
  └ input/
    └ input_config.json
  └ readiness/
    └ train_config.pickle
  └ train/
    └ global_feature_importance.csv
    └ graph_embedding.pickle
    └ hparams.json
    └ label_encoder.pkl
    └ lime_explainer.pk
    └ model.json
Inference Pipeline
./alo/inference_artifacts/
└ log/
  └ experimental_history.json
  └ pipeline.log
  └ process.log
└ output/
  └ output.csv
└ score/
  └ inference_summary.yaml
Detailed descriptions of each artifact are as follows:
experimental_history.json
Records changes in training data, code, and parameters.
pipeline.log
Records GCR logs for each asset in the pipeline.
process.log
Records logs for the entire ALO process execution.
input_config.json
Stores the user arguments from the train pipeline so that the inference pipeline can reuse them without the user re-entering them.
train_config.pickle
Contains the x column information from the train pipeline input data. The inference pipeline uses it to validate that the same set of x columns exists in both pipelines.
global_feature_importance.csv
Contains feature importance values calculated using all samples in the train set.
graph_embedding.pickle
File storing graph embeddings (vectors).
hparams.json
File containing hyperparameters determined through HPO during training.
label_encoder.pkl
Encodes labels into integers for internal use during model training. The information is used to restore the original labels during inference.
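The round trip described here, labels to integers for training and integers back to labels at inference, can be sketched as follows. This is a plain-Python illustration, not the contents of label_encoder.pkl; the function name and the example labels "no" and "yes" are made up.

```python
def fit_label_encoder(labels):
    """Build the two lookup structures of a minimal label encoder.

    Returns (classes, to_int): `classes[i]` restores the original label for
    integer i, and `to_int[label]` gives the integer used during training.
    """
    classes = sorted(set(labels))
    to_int = {label: index for index, label in enumerate(classes)}
    return classes, to_int


original = ["no", "yes", "no", "yes", "yes"]
classes, to_int = fit_label_encoder(original)

# Training side: the model only sees integers.
encoded = [to_int[label] for label in original]

# Inference side: integer predictions are mapped back to the original labels.
decoded = [classes[index] for index in encoded]
```

Persisting `classes` alongside the model is what allows predictions to be reported in the original label values.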
lime_explainer.pk
LIME surrogate model trained during the train pipeline.
model.json
Contains the trained XGBoost or flexible DNN model.
output.csv
The original inference set with the final predictions and local XAI results added.
inference_summary.yaml
inference_summary.yaml is displayed in the UI of Edge Conductor. It is generated using ALO's save_summary API; for details, see ALO API: save_summary. The inference_summary.yaml generated by the inference pipeline includes the following information.
- Classification
  - result: Name of the class in the y column used for HPO evaluation, for example, label 1.
  - score: Average probability of the instances predicted as the result label. A low value indicates that the model may need retraining.
  - note:
  - probability:
- Regression
  Indicators for model retraining when the label is not present in the inference set are still being identified.
  - result: 'regression'
  - score:
  - note:
  - probability:
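The classification score described above, the average probability over instances predicted as the result label, can be computed as in the sketch below. It is one interpretation rather than GCR's code: the function name is hypothetical, and each instance is assumed to carry the probability of its predicted class.

```python
def classification_score(predicted_labels, predicted_probabilities, result_label):
    """Average probability of the instances predicted as `result_label`.

    Returns None when no instance was predicted as `result_label`.
    """
    selected = [
        probability
        for label, probability in zip(predicted_labels, predicted_probabilities)
        if label == result_label
    ]
    if not selected:
        return None
    return sum(selected) / len(selected)


# Three instances; two are predicted as label 1, with probabilities 0.75 and 0.25.
score = classification_score([1, 0, 1], [0.75, 0.9, 0.25], result_label=1)
```

In this example the score is 0.5, the mean of 0.75 and 0.25. A score that drifts downward across inference runs would be the kind of signal the score field is meant to surface.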