
GCR Features

Updated 2024.05.17

Feature Overview

Pipelines in GCR

An AI Contents pipeline is a combination of assets, each of which is a functional unit. The Train pipeline consists of 5 assets, while the Inference pipeline consists of 4.

Train Pipeline

	Input - Readiness - Graph - Train - Output

Inference Pipeline

	Input - Readiness - Inference - Output

Input Asset

Reads files from the path specified by the user in experimental_plan.yaml and converts them into a dataframe.

Readiness Asset

Checks if the data provided by the user is suitable for GCR modeling.

Graph Asset

Transforms tabular data into a graph and performs graph embedding. The generated graph embeddings are used for training and inference in the respective assets.

Train Asset

Uses the graph embeddings from the Graph asset to train the prediction model.

Inference Asset

Generates prediction results and global/local XAI results using the graph embeddings from the Graph asset and the trained model from the Train asset.

Output Asset

Saves the outputs from the Train/Inference assets to the standard ALO location.



Usage Tips

Minimizing Memory Usage with Large Data (Avoiding Mellerikat Registration Failure Due to Memory Shortage)

GCR, based on graph-powered machine learning models, requires more memory than typical classification/regression models, which can cause memory shortage issues when processing large input data. To address this, GCR provides the following memory optimization techniques:

  1. Graph Asset: Reducing the Dimension of Graph Embeddings
  • The default dimension of graph embeddings (argument name: dimension) is 32. Reducing this value to 16 or 8 can significantly reduce memory usage, since each original data column is then embedded into a 16- or 8-dimensional vector instead of a 32-dimensional one. Dimensions below 8 are not recommended, however, as they weaken the discriminative power of the embeddings.
  2. Graph Asset: Embedding the Graph in Multiple Partitions
  • GCR can split the data into multiple partitions and embed them one at a time instead of embedding the entire dataset at once, reducing peak memory requirements. This is controlled by the num_partitions argument. The default value is 1, meaning the entire dataset is embedded at once; increasing it to 2, 4, 8, 16, 32, etc., divides the data into that many partitions.
  • Note that increasing the number of partitions lengthens the overall embedding time, and each additional partition yields a diminishing reduction in peak memory usage. It is therefore most efficient to use the smallest num_partitions value that avoids memory shortage. A configuration sketch follows this list.
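For example, both arguments can be set in experimental_plan.yaml. The snippet below is a minimal sketch: the argument names dimension and num_partitions are from this guide, but the surrounding structure (user_parameters / step / args) is assumed, so check your own experimental_plan.yaml for the exact layout.

    user_parameters:
      - train_pipeline:
          - step: graph
            args:
              - dimension: 8        # default 32; 16 or 8 cuts memory, below 8 is not recommended
                num_partitions: 16  # default 1; use the smallest value that avoids memory shortage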

As a reference, embedding a dataset with 35 columns and 91,844 rows (25MB) using the default settings (dimension 32, num_partitions 1) requires approximately 11GB of peak memory. On a local server with approximately 55GB of memory, a dataset with 27 columns and 986,406 rows (288MB) was successfully embedded in about an hour using dimension 8 and num_partitions 16.



Detailed Features

Train Pipeline: Input Asset

The Input asset reads all files from the user-specified data path in experimental_plan.yaml and combines them into a single dataframe. The user data path is specified in the load_train_data_path and load_inference_data_path entries in experimental_plan.yaml. The path should be a folder path excluding file names. In the Train pipeline, data from the load_train_data_path is read.

    external_path:
    - load_train_data_path: ./solution/sample_data/train
    - load_inference_data_path: ./solution/sample_data/test
    - save_train_artifacts_path:
    - save_inference_artifacts_path:
  • If there are subfolders within the specified path, data from these subfolders are also read and combined.
  • All files in the specified path must have identical columns.
  • For detailed descriptions of experimental_plan.yaml and input asset parameters, please refer to GCR Parameters.

Train Pipeline: Readiness Asset

The Readiness asset checks if the data used for training/inference is suitable for GCR modeling. As a graph-powered machine learning model, GCR can handle missing values and many categorical data without additional preprocessing, making data quality requirements relatively simple and lightweight. The Readiness asset performs necessary checks for both the train and inference pipelines, with specific items checked for each pipeline as detailed below.

Checklist

  1. Verify that column names specified in experimental_plan.yaml exist in the data

The Readiness asset checks whether the column names provided by the user exist in the dataframe. The following arguments are verified; detailed usage of each argument can be found in GCR Parameters. An example configuration follows this checklist.

  • x_columns: Column names to be used for training. If left blank, all columns except y_column are used.
  • drop_columns: Column names to be excluded from training. If left blank, no columns are excluded.
  • y_column: Label column

  2. Check for missing values in the label column (y_column)

Although GCR can handle missing values in x_columns without additional measures, the label column must not contain missing values because GCR is a supervised learning model.
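As an illustration, the column arguments might be set as below. The argument names x_columns, drop_columns, and y_column are from this guide; the surrounding structure and the column names themselves are hypothetical.

    - step: readiness
      args:
        - x_columns: []            # blank: all columns except y_column are used
          drop_columns: [temp_id]  # hypothetical column excluded from training
          y_column: label          # hypothetical label column; must contain no missing values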

Train Pipeline: Graph Asset

The Graph asset converts input data into a graph representation and performs graph representation learning (graph embedding), creating embedding vectors that include useful information hidden in the data. This asset is crucial for providing the unique advantages of the graph-powered machine learning model, GCR.

GCR's graph asset is designed to support practical graph data science with the following functionalities.

Graph Embedding Algorithms

  • GCR uses PyTorch-BigGraph (PBG) for graph embedding. Its fundamental aim is to make nodes that share an edge take similar vector values in the embedding space, while nodes that do not share an edge take dissimilar values.
  • PBG represents all nodes and edges as vectors and defines a scoring function that measures the similarity between source and destination node vectors. It optimizes the scoring function across all node pairs in the graph using a global loss function, sketched below.
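Schematically, following the default margin ranking loss of the PyTorch-BigGraph paper (not necessarily GCR's exact configuration), the objective can be written as:

    f(s, r, d) = \operatorname{sim}\big(g_r(\theta_s),\, \theta_d\big)

    \mathcal{L} = \sum_{e \in G} \sum_{e' \in S'_e} \max\big(0,\, \lambda - f(e) + f(e')\big)

Here \theta_s and \theta_d are the source and destination node vectors, g_r is a relation-specific operator, sim is a similarity comparator such as dot product or cosine, S'_e is a set of negative edges obtained by corrupting an observed edge e, and \lambda is the margin. Minimizing \mathcal{L} pulls connected nodes together in the vector space and pushes corrupted (non-adjacent) pairs apart.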

Graph Partitioning

  • Based on PBG, GCR can embed graphs by splitting them into multiple partitions, allowing the embedding of extremely large graphs with billions of nodes and trillions of edges under realistic memory constraints.
  • To align all nodes within a single vector space while embedding them in partitions, GCR 1) splits the source and destination nodes without overlap and 2) embeds the partitions in an 'inside-out' order.
  • Nodes that were pre-trained in earlier partitions are fine-tuned in later ones, so that samples not visible in the current partition are still reflected, enabling a more comprehensive node embedding.

Inductive Learning

  • Unlike general graph embedding methods that perform training and inference simultaneously, GCR applies an inductive learning method to increase inference speed and reduce resource usage.
  • Graph embedding is performed only once, in the train pipeline; downstream tasks generate sub-graph embeddings by connecting the learned node embeddings to new virtual nodes, eliminating the need to re-train the graph for each new inference dataset.

Train Pipeline: Train Asset

GCR's train/inference assets include two built-in models. The list of models and their parameter sets is as follows; a configuration sketch follows the list.

Built-in GCR Models

  • XGBoost: Supported GCR versions = v2.1.0, v3.1.0

    • num_boost_round: Determined by HPO, range 10~1000
    • eta: Determined by HPO, range 0.01~0.2
    • gamma: Determined by HPO, range 0~5
    • lambda: Determined by HPO, range 0~5
    • alpha: Determined by HPO, range 0~5
    • tree_method: hist
    • max_depth: 6
    • min_child_weight: 1
    • sampling_method: uniform
    • subsample: 1
    • colsample_bytree: 1
    • colsample_bylevel: 1
    • colsample_bynode: 1
    • scale_pos_weight: 1
    • base_score: 0.5
  • Flexible DNN: Supported GCR versions = v3.0.0, v3.1.0

    • dropout: 0.5
    • learning_rate: 0.0001
    • epochs: 10
    • batch_size: 64
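As a sketch, the Flexible DNN defaults above might be overridden in experimental_plan.yaml as shown below. The parameter names dropout, learning_rate, epochs, and batch_size are from the list above, but the model-selection key (model_type here) and the surrounding structure are hypothetical; see GCR Parameters for the actual argument names.

    - step: train
      args:
        - model_type: dnn        # hypothetical key for choosing between XGBoost and Flexible DNN
          dropout: 0.5
          learning_rate: 0.0001
          epochs: 10
          batch_size: 64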

HPO (Hyper-Parameter Optimization) Functionality

GCR's train asset uses the Optuna HPO library to perform stratified K-fold cross-validation (CV), determining the best model parameters before training the model on the entire train set with those parameters. The detailed HPO process is as follows, with an example configuration after the list. For more information on HPO control methods, please refer to GCR Parameters.

  1. Data splitting for HPO
    • Stratified cross-validation is used to divide the train/validation sets. The default is 3-fold cross-validation.
  2. Comparing candidate model performance based on evaluation metrics
    • The default evaluation metric is f1_score for classification and rmse for regression. Users can specify the evaluation metric using the eval_metric argument.
  3. Re-training the selected model with the determined parameter settings on the entire training data
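For instance, the evaluation metric might be set as below. The eval_metric argument is documented above; the fold-count key (num_cv_folds here) is hypothetical, so check GCR Parameters for the actual name.

    - step: train
      args:
        - eval_metric: f1_score  # default for classification; rmse for regression
          num_cv_folds: 3        # hypothetical key; stratified CV defaults to 3 folds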

Inference Pipeline: Input Asset

Functions the same as in the Train pipeline, except that the Inference pipeline reads data from load_inference_data_path.

Inference Pipeline: Readiness Asset

Checklist

  1. Verify that column names specified in experimental_plan.yaml exist in the data

The Readiness asset checks whether the column names provided by the user exist in the dataframe. The following arguments are verified. Unlike the Train pipeline, the Inference pipeline does not check for the existence of the label column (y_column) or for missing values in it. For detailed usage of Readiness asset arguments, please refer to GCR Parameters.

  • x_columns: Column names in the inference dataset. If left blank, all columns are used.
  • drop_columns: Column names to be excluded from the inference dataset. If left blank, no columns are excluded.

  2. Verify that the x columns in the inference set match the x columns in the train set

The Readiness asset checks whether the x columns in the inference set match the x columns in the train set used for model training. The order of the columns does not matter as long as all x columns are present.

Inference Pipeline: Inference Asset

The Inference asset uses the model trained in the Train pipeline to perform classification/regression inference on the inference dataset. The samples in the inference dataset are numerically represented using graph embeddings from the Graph asset, without additional graph embedding, and provided as input for model inference. Unlike typical graph data science ML models that require retraining the model for each new inference dataset, GCR is an inductive model that does not require retraining, enabling fast inference times.

XAI Functionality

GCR provides global XAI using the train set and local XAI for each sample in the inference set. These XAI functions can be enabled/disabled using the global_xai and local_xai arguments in the Inference asset.
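For example, both flags can be toggled in the Inference asset's arguments. The argument names global_xai and local_xai are from this guide; the surrounding structure is assumed.

    - step: inference
      args:
        - global_xai: True   # global feature importance computed from the train set
          local_xai: True    # per-sample explanations appended to output.csv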

When using the XGBoost model, both global and local XAI are provided through the LIME library, with a specially designed Graph XAI wrapper that reports LIME results under the original input data column names. Without this wrapper, LIME would require repeated cycles of graph embedding, model training, and inference for each input perturbation, demanding substantial time and resources. The Graph XAI wrapper aggregates graph embedding vectors by input data column, enabling XAI results with a single graph embedding cycle and reducing graph XAI execution time more than 100-fold.

When using the Flexible DNN model, local XAI is also provided through LIME with the same Graph XAI wrapper, offering a similar time reduction. Global XAI for the Flexible DNN does not use LIME; instead, it computes graph XAI during DNN model training using an attention-like method, offering even faster execution times.

Global XAI results are saved to global_feature_importance.csv under train_artifacts/models/train in the ALO working directory (the directory containing ALO's main.py). Local XAI results are merged as new columns into the inference output file output.csv in inference_artifacts/output. This functionality is currently available only for classification tasks.

The following example shows the results for binary classification on an inference dataset with columns X1~X9:

| Sample Index | Classification Result | Scores for Each Label Category | Top 5 Reasons (Column Names and Values) |
|---|---|---|---|
| 0 | 0 | 0.77, 0.23 | X1=0.1, X3=0.7, X4='A', X5='S', X9=0.02 |
| 1 | 0 | 0.65, 0.35 | X3=0.6, X2=0.2, X1=0.7, X4='B', X8=0.01 |
| 2 | 1 | 0.83, 0.17 | X4='B', X5='P', X9=0.07, X7='S', X1=0.3 |

GCR Version: 3.0.0