Version: Next

GCR Input and Artifacts

Updated 2024.05.05

Data Preparation

Preparing Training Data

To use GCR, you need tabular data with feature columns (x columns) and, for training, a label column (y column). Currently, only the .csv file format is supported.

Preparing Inference Data

The inference dataset for GCR should have the same structure as the training dataset, except it does not require a label column. The .csv file format is also used for inference data.

Example of Training and Inference Datasets

When you clone the GCR AI content from the git repository, a sample_data directory is available in the root directory, where you can find the train and inference sets. The sample data has the following characteristics:

Purpose: Classifying whether an adult's income exceeds $50,000.
Data: Information about adults, including various attributes and income information (32,561 rows with missing values in some variables).
Source: UCI Machine Learning Repository - Adult Data Set

Below is an example of the training dataset. The label column target is included in the training set but not in the inference set.

Training Dataset Example:

idx	age	workclass	sex	...	ID	target
0	39	State-gov	Male	...	CX0	1
1	50	Self-emp-not-inc	Female	...	CX1	1
...	...	...	...	...	...	...
26070	58	Private	Female	...	CX32558	0
26071	22	Private	Male	...	CX32559	1

Inference Dataset Example:

idx	age	workclass	sex	...	ID
0	52	Self-emp-not-inc	Male	...	CX7
1	31	Private	Female	...	CX8
...	...	...	...	...	...
6487	27	Private	Female	...	CX32556
6488	52	Self-emp-inc	Female	...	CX32560

Example of Input Data Directory Structure

ALO requires the train files and the inference files to be kept in separate folders, as shown below.
All files under a single folder are combined into one dataframe by the input asset and used for modeling. (Files in subfolders are also combined.)
All data files in a folder must have identical columns.

./{train_folder}/
    └ train_data1.csv
    └ train_data2.csv
    └ train_data3.csv
./{inference_folder}/
    └ inference_data1.csv
    └ inference_data2.csv
    └ inference_data3.csv
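
The folder rule above (every .csv under a folder, subfolders included, is combined into one table, and all files must share the same columns) can be checked before a run with a short script. The following is only a sketch using the Python standard library, not the input asset's actual implementation. The function name load_folder is hypothetical, and treating a different column order as a mismatch is an assumption of this sketch.

```python
import csv
from pathlib import Path


def load_folder(folder):
    """Read every .csv under `folder` (subfolders included) into one list of rows.

    Hypothetical helper for illustration; it is not part of ALO or GCR.
    """
    header, rows = None, []
    for path in sorted(Path(folder).rglob("*.csv")):
        with path.open(newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            columns = reader.fieldnames
            if header is None:
                header = columns  # the first file defines the expected columns
            elif columns != header:
                # Assumption: column order must match as well as column names.
                raise ValueError(f"{path} has columns {columns}, expected {header}")
            rows.extend(reader)
    if header is None:
        raise FileNotFoundError(f"no .csv files found under {folder}")
    return header, rows
```

Calling load_folder once on the train folder and once on the inference folder gives a quick check that each folder would combine cleanly into a single table.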

Data Requirements

Mandatory Requirements

The input data must meet the following conditions. Refer to the data example above for guidance.

Dataset	Item	Note
train set, inference set	A single pipeline input receives one table as input.	The train pipeline requires one train set table, and the inference pipeline requires one inference set table.
train set	The train pipeline input table needs a label column (target column) for training, which must not have missing values.	The train set must contain a label column as it serves as input to the ML model.
train set, inference set	The train set and inference set must have the same column structure.
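
The mandatory requirements in the table can be verified ahead of time. The sketch below assumes the rows have already been read as dictionaries (for example with csv.DictReader) and that the label column is named target, as in the sample data. The function name check_mandatory is hypothetical, and the check compares x columns by name only.

```python
def check_mandatory(train_columns, inference_columns, train_rows, label="target"):
    """Return a list of problems; an empty list means all checks passed.

    Hypothetical helper for illustration; it is not part of ALO or GCR.
    """
    problems = []

    # The train set needs a label column without missing values.
    if label not in train_columns:
        problems.append(f"train set has no label column '{label}'")
    else:
        missing = [i for i, row in enumerate(train_rows)
                   if row.get(label) in (None, "")]
        if missing:
            problems.append(f"label '{label}' is missing in {len(missing)} train row(s)")

    # Apart from the label, both sets must have the same x columns.
    train_x = {c for c in train_columns if c != label}
    inference_x = {c for c in inference_columns if c != label}
    if train_x != inference_x:
        diff = sorted(train_x ^ inference_x)
        problems.append(f"x columns differ between train and inference: {diff}")

    return problems
```

An empty result means the train set has a complete label column and both sets share the same x columns.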

Additional Requirements

It is recommended that the train and inference sets meet the following conditions. If these conditions are not met, preprocessing is performed internally.

Item	Note
Remove or replace spaces and other whitespace characters.	Whitespace is replaced automatically, so leaving it in does not break the model. e.g., 베스트 샵 -> 베스트_샵 ("Best Shop" -> "Best_Shop")
Using primarily categorical columns improves speed and performance.	Continuous variables have high selectivity: their values are so diverse that the same value rarely appears in more than one row. Rows that share no values are unlikely to be connected in the graph, so such columns contribute little to the embedding and do not help performance.
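
Both recommendations can be illustrated in a few lines. This is a rough sketch, not GCR's internal preprocessing: replace_whitespace mimics the automatic replacement under the assumption that each whitespace character becomes one underscore, and repeat_ratio is a hypothetical measure of how often a column's values are shared between rows, which is the property the second row of the table describes.

```python
import re


def replace_whitespace(value, filler="_"):
    """Replace every whitespace character in `value` with `filler`.

    Assumption: one filler per whitespace character; GCR's exact rule may differ.
    """
    return re.sub(r"\s", filler, value)


def repeat_ratio(values):
    """Fraction of rows whose value also appears in at least one other row."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    shared = sum(n for n in counts.values() if n > 1)
    return shared / len(values) if values else 0.0
```

A categorical column such as workclass tends to have a repeat ratio close to 1, while a continuous column with mostly unique values has a ratio close to 0 and gives the graph few connections between rows.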

Artifacts

Running the training or inference pipeline will generate the following artifacts.

Train Pipeline

./alo/train_artifacts/
    └ log/
        └ experimental_history.json
        └ pipeline.log
        └ process.log
    └ models/
        └ input/
            └ input_config.json
        └ readiness/
            └ train_config.pickle
        └ train/
            └ global_feature_importance.csv
            └ graph_embedding.pickle
            └ hparams.json
            └ label_encoder.pkl
            └ lime_explainer.pk
            └ model.json

Inference Pipeline

./alo/inference_artifacts/
    └ log/
        └ experimental_history.json
        └ pipeline.log
        └ process.log
    └ output/
        └ output.csv
    └ score/
        └ inference_summary.yaml

Detailed descriptions of each artifact are as follows:

experimental_history.json

Records changes in training data, code, and parameters.

pipeline.log

Records GCR logs for each asset in the pipeline.

process.log

Records logs for the entire ALO process execution.

input_config.json

Stores user arguments from the train pipeline for reference in the inference pipeline to avoid re-entering the same arguments.

train_config.json

Contains x column information from the train pipeline input data for validation in the inference pipeline to ensure the same x column set exists in both pipelines.

global_feature_importance.csv

Contains feature importance values calculated using all samples in the train set.

graph_embedding.pickle

File storing graph embeddings (vectors).

hparams.json

File containing hyperparameters determined through HPO during training.

label_encoder.pkl

Encodes labels into integers for internal use during model training. The information is used to restore the original labels during inference.
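
The encode-then-restore round trip can be sketched as follows. This illustrates the idea only: the actual contents and format of label_encoder.pkl are not described here, and the function names are hypothetical.

```python
def fit_label_encoder(labels):
    """Map each distinct label to an integer, assigning codes in sorted order."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}


def encode(labels, mapping):
    """Convert original labels to the integers used during training."""
    return [mapping[label] for label in labels]


def decode(codes, mapping):
    """Restore the original labels from integer predictions."""
    inverse = {index: label for label, index in mapping.items()}
    return [inverse[code] for code in codes]
```

The mapping is built once in the train pipeline and reused at inference, so integer predictions can be reported with their original label names.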

lime_explainer.pk

LIME surrogate model trained during the train pipeline.

model.json

Contains the trained XGBoost or flexible DNN model.

output.csv

Inference results with final prediction and local XAI results added to the original inference set.

inference_summary.yaml

Displayed in the Edge Conductor UI. The file is generated with ALO's save_summary API; see ALO API: save_summary for details. The inference_summary.yaml produced by the inference pipeline contains the following information.

Classification

result: Name of the class in the y column used for HPO evaluation. For example, label 1.
score: Average probability value of instances predicted as the result label. A low value indicates that the model may need retraining.
note:
probability:
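
The classification score described above, the average probability over instances predicted as the result label, can be computed as in the sketch below. It assumes each prediction is available as a dictionary mapping class labels to probabilities. That input format and the function name summary_score are assumptions for illustration, not the GCR implementation.

```python
def summary_score(probabilities, result_label):
    """Average predicted probability over instances predicted as `result_label`.

    `probabilities` is a list of dicts mapping each class label to its probability.
    Returns None when no instance was predicted as `result_label`.
    """
    picked = []
    for probs in probabilities:
        predicted = max(probs, key=probs.get)  # class with the highest probability
        if predicted == result_label:
            picked.append(probs[predicted])
    return sum(picked) / len(picked) if picked else None
```

A score that drifts toward the decision boundary over time suggests the model is becoming less confident on new data, which is the retraining signal the score field is meant to surface.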

Regression

For regression, indicators that signal the need for model retraining when the inference set has no label are still being identified.

result: 'regression'
score:
note:
probability:
