
GCR Input and Artifacts

Updated 2024.05.05

Data Preparation

Preparing Training Data

To use GCR, you need tabular data with columns for features (x columns) and a label column (y column) for training. Currently, only the .csv file format is supported.

Preparing Inference Data

The inference dataset must have the same column structure as the training dataset, except that it does not require a label column. Inference data must also be provided in .csv format.
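
For reference, below is a minimal sketch of how the two .csv files could be prepared with pandas. The input file name (adult.csv) and the split ratio are hypothetical; the label column name target follows the sample data described in the next section.

```python
import pandas as pd

# Hypothetical raw file; replace with your own tabular data.
df = pd.read_csv("adult.csv")

# Train set: feature (x) columns plus the label (y) column "target".
train_df = df.sample(frac=0.8, random_state=42)
train_df.to_csv("train_data.csv", index=False)

# Inference set: the same x columns, without the label column.
inference_df = df.drop(train_df.index).drop(columns=["target"])
inference_df.to_csv("inference_data.csv", index=False)
```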

Example of Training and Inference Datasets

When you clone the GCR AI content from the git repository, a sample_data directory is available in the root directory, where you can find the train and inference sets. The sample data has the following characteristics:

  • Purpose: Classifying whether an adult's income exceeds $50,000.
  • Data: Information about adults, including various attributes and income information (32,561 rows with missing values in some variables).
  • Source: UCI Machine Learning Repository - Adult Data Set

Below is an example of the training dataset. The label column target is included in the training set but not in the inference set.

Training Dataset Example:

| idx   | age | workclass        | sex    | ... | ID      | target |
| ----- | --- | ---------------- | ------ | --- | ------- | ------ |
| 0     | 39  | State-gov        | Male   | ... | CX0     | 1      |
| 1     | 50  | Self-emp-not-inc | Female | ... | CX1     | 1      |
| ...   | ... | ...              | ...    | ... | ...     | ...    |
| 26070 | 58  | Private          | Female | ... | CX32558 | 0      |
| 26071 | 22  | Private          | Male   | ... | CX32559 | 1      |

Inference Dataset Example:

| idx  | age | workclass        | sex    | ... | ID      |
| ---- | --- | ---------------- | ------ | --- | ------- |
| 0    | 52  | Self-emp-not-inc | Male   | ... | CX7     |
| 1    | 31  | Private          | Female | ... | CX8     |
| ...  | ... | ...              | ...    | ... | ...     |
| 6487 | 27  | Private          | Female | ... | CX32556 |
| 6488 | 52  | Self-emp-inc     | Female | ... | CX32560 |

Example of Input Data Directory Structure

  • To use ALO, the train and inference files must be separated. Split the data for training and inference as shown below.
  • All files under a single folder are combined into one dataframe by the input asset and used for modeling. Files in subfolders are combined as well (see the sketch after the directory layout below).
  • All data files in a folder must have identical columns.
./{train_folder}/
└ train_data1.csv
└ train_data2.csv
└ train_data3.csv
./{inference_folder}/
└ inference_data1.csv
└ inference_data2.csv
└ inference_data3.csv
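
For reference only, the following is a minimal sketch of the combining behavior described above: every .csv file under a folder, including files in subfolders, is read and concatenated into one dataframe. This is an illustration, not ALO's actual input asset implementation.

```python
from pathlib import Path
import pandas as pd

def load_folder_as_dataframe(folder: str) -> pd.DataFrame:
    """Concatenate all .csv files under `folder` (subfolders included)
    into a single dataframe, mirroring the behavior described above."""
    csv_files = sorted(Path(folder).rglob("*.csv"))
    if not csv_files:
        raise FileNotFoundError(f"No .csv files found under {folder}")
    frames = [pd.read_csv(f) for f in csv_files]  # all files must share identical columns
    return pd.concat(frames, ignore_index=True)

# Folder names are placeholders for {train_folder} and {inference_folder} above.
train_df = load_folder_as_dataframe("./train_folder")
inference_df = load_folder_as_dataframe("./inference_folder")
```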


Data Requirements

Mandatory Requirements

The input data must meet the following conditions. Refer to the data example above for guidance.

| Dataset | Item | Note |
| --- | --- | --- |
| train set, inference set | A single pipeline input receives one table as input. | The train pipeline requires one train set table, and the inference pipeline requires one inference set table. |
| train set | The train pipeline input table needs a label column (target column) for training, and the label column must not have missing values. | The train set must contain a label column because it serves as the input to the ML model. |
| train set, inference set | The train set and inference set must have the same column structure. | |
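
The snippet below is a minimal sketch of how these conditions could be checked before running the pipelines. It assumes pandas dataframes and a label column named target (as in the sample data); it is not part of GCR itself.

```python
import pandas as pd

def check_gcr_inputs(train_df: pd.DataFrame,
                     inference_df: pd.DataFrame,
                     label_col: str = "target") -> None:
    """Validate the mandatory input conditions listed above."""
    # The train set must contain the label column, with no missing values.
    if label_col not in train_df.columns:
        raise ValueError(f"Train set is missing the label column '{label_col}'")
    if train_df[label_col].isna().any():
        raise ValueError(f"Label column '{label_col}' contains missing values")

    # The train and inference sets must share the same x column structure.
    x_cols = set(train_df.columns) - {label_col}
    if set(inference_df.columns) != x_cols:
        raise ValueError("Inference set columns do not match the train set x columns")
```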

Additional Requirements

It is recommended that the train and inference sets meet the following conditions. If these conditions are not met, preprocessing is performed internally.

| Item | Note |
| --- | --- |
| Remove or replace spaces and other whitespace characters. | Whitespace is replaced automatically, so it does not break model operation, e.g., 베스트 샵 -> 베스트_샵. |
| Using primarily categorical columns improves speed and performance. | Continuous variables take highly diverse values, so the same value rarely appears in other rows. Such rows are therefore less likely to be connected and interpreted as part of the graph, and continuous variables contribute little to the embedding or to performance. |
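
Although GCR performs this preprocessing internally, the whitespace replacement can also be done beforehand. Below is a minimal sketch, assuming pandas and string-typed categorical columns.

```python
import pandas as pd

def replace_whitespace(df: pd.DataFrame) -> pd.DataFrame:
    """Replace whitespace in string columns with underscores,
    e.g. '베스트 샵' -> '베스트_샵'."""
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        out[col] = out[col].str.replace(r"\s+", "_", regex=True)
    return out
```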


Artifacts

Running the training or inference pipeline will generate the following artifacts.

Train Pipeline

./alo/train_artifacts/
└ log/
    └ experimental_history.json
    └ pipeline.log
    └ process.log
└ models/
    └ input/
        └ input_config.json
    └ readiness/
        └ train_config.pickle
    └ train/
        └ global_feature_importance.csv
        └ graph_embedding.pickle
        └ hparams.json
        └ label_encoder.pkl
        └ lime_explainer.pk
        └ model.json

Inference Pipeline

./alo/inference_artifacts/
└ log/
    └ experimental_history.json
    └ pipeline.log
    └ process.log
└ output/
    └ output.csv
└ score/
    └ inference_summary.yaml

Detailed descriptions of each artifact are as follows:

experimental_history.json

Records changes in training data, code, and parameters.

pipeline.log

Records GCR logs for each asset in the pipeline.

process.log

Records logs for the entire ALO process execution.

input_config.json

Stores user arguments from the train pipeline for reference in the inference pipeline to avoid re-entering the same arguments.

train_config.pickle

Contains x column information from the train pipeline input data for validation in the inference pipeline to ensure the same x column set exists in both pipelines.

global_feature_importance.csv

Contains feature importance values calculated using all samples in the train set.

graph_embedding.pickle

File storing graph embeddings (vectors).

hparams.json

File containing hyperparameters determined through HPO during training.

label_encoder.pkl

Encodes labels into integers for internal use during model training. The information is used to restore the original labels during inference.
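
The round trip works like a standard label encoder. Below is a minimal sketch using scikit-learn's LabelEncoder for illustration; GCR's internal encoder and the exact pickle contents may differ, and the class names are hypothetical.

```python
import pickle
from sklearn.preprocessing import LabelEncoder

# Training: encode string labels into integers (class names are hypothetical).
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(["<=50K", ">50K", "<=50K"])  # -> [0, 1, 0]

with open("label_encoder.pkl", "wb") as f:
    pickle.dump(encoder, f)

# Inference: restore the original labels from the stored encoder.
with open("label_encoder.pkl", "rb") as f:
    encoder = pickle.load(f)
y_original = encoder.inverse_transform([0, 1])  # -> ["<=50K", ">50K"]
```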

lime_explainer.pk

LIME surrogate model trained during the train pipeline.

model.json

Contains the trained XGBoost or flexible DNN model.

output.csv

Inference results with final prediction and local XAI results added to the original inference set.

inference_summary.yaml

inference_summary.yaml is displayed in the Edge Conductor UI. It is generated with ALO's save_summary API; see ALO API: save_summary for details. The inference_summary.yaml produced by the inference pipeline contains the following information.

  1. Classification
      • result: Name of the class in the y column that was used for HPO evaluation, e.g., label 1.
      • score: Average probability of the instances predicted as the result label. A low value indicates that the model may need retraining.
      • note:
      • probability:
  2. Regression
      If the label is not present in the inference set, indicators for deciding when to retrain the model are still being identified.
      • result: 'regression'
      • score:
      • note:
      • probability:
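
For reference, the summary file can be read with any YAML parser. The sketch below uses PyYAML and the artifact path shown above; the keys follow the field list above, while their exact value formats depend on the pipeline run.

```python
import yaml  # PyYAML

# Read the summary produced by the inference pipeline.
with open("./alo/inference_artifacts/score/inference_summary.yaml") as f:
    summary = yaml.safe_load(f)

print(summary["result"], summary["score"], summary["note"], summary["probability"])
```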