Graph-powered Classification/Regression (GCR)
What is Graph-powered Classification/Regression (GCR)?
GCR is an AI content for handling machine learning tasks such as classification and regression, widely used in data science. Unlike traditional classification/regression algorithms, GCR leverages graph representation learning to better learn the information contained in the given data, resulting in improved prediction performance.
Graph representation learning transforms data into a graph composed of nodes and edges. It then converts the data values into vectors that machine learning algorithms can learn from, based on how similarly the nodes are connected within the graph. Because the resulting vectors capture both the values represented by nodes and the relationships between values represented by edges, the algorithms have more useful information to learn from.
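As an illustration of the first step, the sketch below turns a small table into a graph. It is a minimal example under stated assumptions, not GCR's actual graph construction, which this page does not describe: the bipartite layout (sample nodes linked to column/value nodes), the function name table_to_graph, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def table_to_graph(rows):
    """Build an undirected bipartite graph from tabular rows.

    Each row becomes a sample node and each (column, value) pair becomes
    a factor node. A missing value (None) simply produces no edge.
    This layout is an assumption made for illustration only.
    """
    adjacency = defaultdict(set)
    for i, row in enumerate(rows):
        sample = ("sample", i)
        adjacency[sample]  # register the sample node even if it has no edges
        for column, value in row.items():
            if value is None:
                continue  # missing value: no node, no edge
            factor = (column, value)
            adjacency[sample].add(factor)
            adjacency[factor].add(sample)
    return adjacency


# Hypothetical rows with one missing value and mixed categorical/numeric data.
rows = [
    {"channel": "web", "region": "EU", "visits": 3},
    {"channel": "web", "region": None, "visits": 3},
    {"channel": "store", "region": "EU", "visits": 1},
]
graph = table_to_graph(rows)

# Samples 0 and 1 are linked through the shared factor node ("channel", "web").
print(sorted(graph[("channel", "web")]))
```

Samples that share many factor nodes end up connected in similar ways, which is the kind of structure an embedding method can then turn into vectors.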
Such graph representation learning, also known as graph embedding, improves the performance of machine learning algorithms, which are then called graph-powered machine learning algorithms. The data science leveraging these algorithms is referred to as graph data science. GCR, as LG's first graph data science AI content for classification/regression tasks, incorporates innovative technologies to enable easy, fast, and resource-efficient application of graph data science, as detailed in the 'key features' section.
When to use Graph-powered Classification/Regression (GCR)?
GCR is used for various classification and regression modeling tasks involving tabular data, similar to TCR (Tabular Classification/Regression) AI content, and can be applied even when there are missing values without additional preprocessing. Specific application areas include:
- Finance: Used for credit rating classification, company bankruptcy prediction, etc. For example, a model can be created to classify customer credit ratings using variables such as personal information, transaction history, and credit records. Alternatively, a regression model can be created to predict bankruptcy by analyzing a company's financial information and market trends.
- Healthcare: Used for classifying the presence of specific diseases (e.g., cancer, diabetes) using patient medical records, genetic information, and biometric signals. This aids in early disease detection and treatment.
- Marketing: Used for customer segment classification, customer churn prediction, and advertising effect prediction. For instance, when a group label exists for each customer, a model can be created to classify customer groups using variables like purchase history, website visit records, and personal information. This can be utilized for customer management and marketing strategy development.
- Public Sector: Used for crime prediction, traffic volume prediction, and election result prediction. For example, a model can be created to classify the likelihood of crime occurrence in a specific area using variables such as regional demographics, past crime records, and economic conditions.
A real-world application example of GCR is illustrated below:
MQL Index
MQL (Marketing Qualified Lead) refers to a potential customer who has shown interest in the content provided by a brand's marketing activities or is more likely to convert into a customer compared to other potential leads. In this case, the responses of visiting customers and the contract outcome recorded afterward were organized into a table and used as the features and labels for GCR training. Because customers may not respond to every item and the questions presented to them may change over time, the data contains missing values, and most responses are categorical. GCR could be applied without additional imputation of missing values or numerical conversion of categorical data, and its graph representation learning leveraged the relationships between data values for more accurate modeling.
Key Features
High Prediction Accuracy and Ease of Handling Missing/Categorical Data with Graph Data Science
GCR uses graph data science technology in which the connection patterns between factors within each sample reflect the sample's target (a class label or a numeric value). It leverages not only the values of the factors but also the relationship information between them, providing higher prediction accuracy than general ML models. Furthermore, because graph data science can compare different topologies (node and edge patterns) on an equal footing, samples containing missing or categorical values do not require additional preprocessing.
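To see why samples with different topologies remain comparable, consider the hedged sketch below. It uses Jaccard similarity over the factor nodes each sample connects to, which is a simple stand-in chosen for illustration; GCR relies on learned graph embeddings, and this page does not specify its comparison method. The sample data is hypothetical.

```python
def jaccard(a, b):
    """Similarity of two samples by the factor nodes they connect to."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Each sample is described by the set of (column, value) nodes it links to.
# Categorical values are used as-is; a missing value is simply absent.
complete = {("job", "engineer"), ("city", "Seoul"), ("plan", "premium")}
partial = {("job", "engineer"), ("plan", "premium")}  # "city" is missing
other = {("job", "teacher"), ("city", "Busan"), ("plan", "basic")}

print(jaccard(complete, partial))  # 2 shared factors out of 3 distinct ones
print(jaccard(complete, other))    # no shared factors
```

No imputation or numeric encoding is needed here: the sample with a missing value just has one edge fewer, and the comparison still works.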
Practical Graph Data Science Content for Limited Memory and CPU Environments
Applying graph data science requires storing and processing additional relationship information between factors, demanding more memory and CPU than general ML models. However, GCR operates smoothly even in memory and CPU-limited environments, thanks to technologies such as graph partitioning, inductive inference, and lightweight XAI.
Support for Graph Modeling Using Domain Knowledge
The effectiveness of graph data science depends on the topology applied to the given data. GCR provides arguments to specify additional relationships between factors, enabling users to customize the topology based on their domain knowledge for more accurate prediction performance.
Global and Local XAI Functions
GCR provides both global and local XAI functions. Global XAI shows the feature importance of all factors calculated across all training samples, while local XAI displays the top 5 factors and their values influencing the prediction for each inference sample. The XAI results of GCR are influenced by the topology applied in graph data science.
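The difference between the two views can be sketched as follows. This is an illustrative example only: how GCR computes factor contributions is not described here, so the per-sample contribution scores, the mean-absolute-value aggregation, and the function names are all assumptions.

```python
def global_importance(contributions):
    """Mean absolute contribution per factor across all samples."""
    if not contributions:
        return {}
    totals = {}
    for sample in contributions:
        for factor, score in sample.items():
            totals[factor] = totals.get(factor, 0.0) + abs(score)
    return {factor: total / len(contributions) for factor, total in totals.items()}


def local_top_factors(sample, k=5):
    """The k factors with the largest absolute contribution for one sample."""
    ranked = sorted(sample.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:k]


# Hypothetical contribution scores for two samples (made-up numbers).
contributions = [
    {"age": 0.30, "income": -0.50, "visits": 0.10, "region": 0.05, "plan": -0.20, "tenure": 0.02},
    {"age": -0.10, "income": 0.40, "visits": 0.30, "region": 0.00, "plan": 0.25, "tenure": -0.15},
]

print(global_importance(contributions))    # one score per factor, over all samples
print(local_top_factors(contributions[0])) # top 5 factors for the first sample
```

The global view summarizes every factor over the whole data set, while the local view ranks factors for a single prediction and keeps only the five strongest.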
Quick Start
Installation
- Install ALO. See details: Start ALO
- Use the following git address to install the content. See details: Use AI Contents (Lv.1)
- git url: https://github.com/mellerikat-aicontents/Graph-powered-Classification-Regression.git
Data Preparation
- Sample data (train set/inference set) is included when the content is installed.
Essential Parameter Settings
1. Modify the data paths in solution/experimental_plan.yaml as follows:

    external_path:
        - load_train_data_path: ./solution/sample_data/train/ # Change to user data path
        - load_inference_data_path: ./solution/sample_data/test/ # Change to user data path

2. Input the parameters suitable for the data in the readiness step of the train_pipeline:

    user_parameters:
        - train_pipeline:
            - step: readiness
              args:
                - x_columns: # Input the x column names of the user data as a list in the format ['A', 'B']. If left blank, all columns except y_column will be used.
                - drop_columns: # Input the x column names not to be used in the user data as a list in the format ['A', 'B']. If left blank, it will be ignored.
                - y_column: target # Input the y column name (i.e., label column) of the user data

3. Input the parameters suitable for the data in the readiness step of the inference_pipeline. For inference, the y_column is not required:

    user_parameters:
        - inference_pipeline:
            - step: readiness
              args:
                - x_columns: # Input the x column names of the user data as a list in the format ['A', 'B']. If left blank, all columns except y_column will be used.
                - drop_columns: # Input the x column names not to be used in the user data as a list in the format ['A', 'B']. If left blank, it will be ignored.
If you set only steps 1, 2, and 3 and run ALO, the remaining optional arguments use their default values.
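The column rules stated in the configuration comments above can be sketched in Python to check a data set before running ALO. This is a hypothetical helper, not part of GCR or ALO: the function name resolve_x_columns is invented, and applying drop_columns after x_columns when both are set is an assumption, since this page does not say how the two interact.

```python
def resolve_x_columns(all_columns, y_column=None, x_columns=None, drop_columns=None):
    """Pick feature columns following the rules in the config comments.

    - x_columns blank: use every column except y_column.
    - drop_columns blank: ignored.
    Assumption: when both are given, drop_columns is applied afterwards.
    """
    if x_columns:
        unknown = [c for c in x_columns if c not in all_columns]
        if unknown:
            raise ValueError(f"x_columns not found in data: {unknown}")
        selected = list(x_columns)
    else:
        selected = [c for c in all_columns if c != y_column]
    dropped = set(drop_columns or [])
    return [c for c in selected if c not in dropped]


header = ["A", "B", "C", "target"]  # hypothetical column names

# Training with everything left blank except y_column.
print(resolve_x_columns(header, y_column="target"))

# Training while excluding one column.
print(resolve_x_columns(header, y_column="target", drop_columns=["B"]))

# Inference: no y_column is required.
print(resolve_x_columns(["A", "B", "C"], x_columns=["A", "C"]))
```

Running a check like this against the CSV header makes a misspelled column name fail early instead of partway through the pipeline.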
Execution
- With the above settings, execute python main.py in the ALO installation directory to generate a Graph Classification model with default settings. For more detailed instructions on creating a model, refer to the following page. See details: GCR Parameters
GCR Version: 3.1.0, ALO Version: 2.3.1