TAD Features
Feature Overview
TAD's pipeline
An AI Contents pipeline is a combination of assets, each of which is a functional unit. Together, the Train pipeline and the Inference pipeline are built from five assets.
Train pipeline
Input - Readiness - Preprocess - Train
Inference pipeline
Input - Readiness - Preprocess - Inference
Each step is a separate asset.
Train Pipeline
1. Input Asset
It reads all files in the path specified by the user in experimental_plan.yaml and creates a single dataframe. The data path is specified in load_train_data_path, and all files in that folder are read and merged.
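The read-and-merge step could be sketched with pandas as follows. This is a minimal illustration, not TAD's actual implementation: the function name `load_train_data` and the assumption that the files are CSVs are hypothetical.

```python
import glob
import os

import pandas as pd


def load_train_data(data_dir: str) -> pd.DataFrame:
    """Read every CSV file in data_dir and merge them into one dataframe.

    Hypothetical helper illustrating the Input asset's behavior; the
    real asset's file-format handling is not specified in the docs.
    """
    paths = sorted(glob.glob(os.path.join(data_dir, "*.csv")))
    frames = [pd.read_csv(p) for p in paths]
    return pd.concat(frames, ignore_index=True)
```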
2. Readiness Asset
It checks whether the data is suitable for TAD modeling: it classifies the column types of the data, verifies that the minimum required amount of data is present, confirms that every column is properly classified as numeric or categorical, and checks that the proportion of missing values is not too high.
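These checks could be sketched along the following lines, assuming a pandas dataframe. The helper name `check_readiness` and the 30% missing-value threshold are illustrative assumptions, not TAD's actual API or defaults.

```python
import pandas as pd


def check_readiness(df: pd.DataFrame, max_missing_ratio: float = 0.3) -> dict:
    """Classify each column as numeric or categorical and flag columns
    whose missing-value ratio exceeds the allowed threshold.

    Hypothetical sketch of the Readiness asset's column checks.
    """
    report = {}
    for col in df.columns:
        kind = "numeric" if pd.api.types.is_numeric_dtype(df[col]) else "categorical"
        missing = float(df[col].isna().mean())
        report[col] = {
            "type": kind,
            "missing_ratio": missing,
            "ok": missing <= max_missing_ratio,
        }
    return report
```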
3. Preprocess Asset
It performs data preprocessing tasks, including handling missing values, encoding categorical columns, scaling numeric data, and removing outliers. By default, missing values in categorical columns are filled with the most frequent value, and missing values in numeric columns are filled with the median.
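The default imputation rule can be sketched as follows; `impute_defaults` is a hypothetical helper, not TAD's API.

```python
import pandas as pd


def impute_defaults(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median and categorical columns with
    the most frequent value (the documented default preprocessing)."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out
```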
4. Train Asset
TAD performs Hyperparameter Optimization (HPO) with 5 built-in models (Isolation Forest, KNN, Local Outlier Factor, One-Class SVM, DBSCAN), selects the optimal model, and trains it. HPO uses StratifiedKFold to divide the dataset into multiple folds, trains the model on each fold, and evaluates its performance; through this process, it finds the optimal parameters.
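The candidate model set could be assembled with scikit-learn along these lines. This is a sketch: `candidate_models` is a hypothetical helper, the default parameter values are illustrative, and the KNN-distance detector (which scikit-learn does not provide out of the box; pyod does) is noted in a comment rather than included.

```python
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM


def candidate_models(contamination: float = 0.05) -> dict:
    """Build the built-in detector candidates (sketch).

    A KNN-distance detector (e.g. pyod's KNN) would complete the
    documented set of five models.
    """
    return {
        "isolation_forest": IsolationForest(contamination=contamination, random_state=0),
        "lof": LocalOutlierFactor(contamination=contamination, novelty=True),
        "ocsvm": OneClassSVM(kernel="rbf", nu=contamination),
        "dbscan": DBSCAN(eps=0.5, min_samples=5),
    }
```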
HPO Feature Provision
TAD's HPO (Hyperparameter Optimization) feature can be turned ON/OFF through parameters. When performing HPO, the following settings are used to find the optimal parameters for each model:
- KNN: An anomaly detection technique that identifies outliers based on the information of the K nearest neighbors.
  - n_neighbors: number of neighbors, searched as an integer between 3 and 30.
- OCSVM: A technique that determines anomalies in high-dimensional space.
  - kernel: kernel type, chosen from 'rbf', 'poly', 'sigmoid', 'linear'.
  - degree: degree of the polynomial kernel, searched as an integer between 2 and 10.
- LOF: A technique that detects outliers using the ratio of distances between nearest neighbors.
  - n_neighbors: number of neighbors, searched as an integer between 2 and 30.
- Isolation Forest: A technique that detects outliers based on the length of isolation paths in the data.
  - n_estimators: number of trees, searched in increments of 50 between 50 and 500.
  - max_samples: sampling ratio for each tree, searched as a float between 0.5 and 1.0.
- DBSCAN: A density-based clustering technique that detects points in low-density regions as outliers.
  - eps: maximum distance between two samples, searched as a float between 0.1 and 10.
  - min_samples: minimum number of samples required to form a cluster, searched as an integer between 2 and 30.
- Hyperparameter optimization for each model is performed through cross-validation on the given dataset: StratifiedKFold divides the dataset into multiple folds, and the model is trained and evaluated on each fold. During this process, the model's anomaly detection score is calculated, and the hyperparameter search maximizes the outlier score.
- Finally, model performance is compared using the average IQR calculated from each fold to find the optimal parameters. After HPO finds the optimal parameters for each model, the final results of the 4 models are ensembled to provide the optimal value.
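The fold-wise IQR comparison might look like the following simplified sketch for a single model and a single hyperparameter. Assumptions are labeled: plain KFold stands in for StratifiedKFold (the stratification labels are not described here), and the function name and grid are hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import KFold


def search_n_estimators(X: np.ndarray, grid=(50, 100, 150)) -> int:
    """Pick the n_estimators value whose decision scores have the
    largest mean IQR across folds.

    Simplified stand-in for the documented fold-wise IQR comparison;
    TAD's actual selection metric may differ.
    """
    best_param, best_score = grid[0], -np.inf
    for n in grid:
        iqrs = []
        for train_idx, test_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(X):
            model = IsolationForest(n_estimators=n, random_state=0).fit(X[train_idx])
            scores = model.decision_function(X[test_idx])
            q75, q25 = np.percentile(scores, [75, 25])
            iqrs.append(q75 - q25)  # spread of anomaly scores on the held-out fold
        mean_iqr = float(np.mean(iqrs))
        if mean_iqr > best_score:
            best_param, best_score = n, mean_iqr
    return best_param
```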
Contamination Ratio Search Feature
To improve the performance of the anomaly detection model, TAD provides a feature that automatically searches for the ratio of outliers in the data. This feature helps maximize model performance even when users do not know the outlier ratio. It takes contamination as an argument to explore various ratios and find the optimal one.
- contamination: sets the range of outlier ratios. Enter a list for a search range, or a single float for a fixed ratio (e.g., [0.001, 0.2] or 0.01). Through this feature, you can experiment with various outlier ratios to find the optimal contamination value and improve the model's accuracy.
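The two accepted formats for contamination can be illustrated with a small helper; `resolve_contamination`, the number of trial points, and the even spacing of the grid are all illustrative assumptions.

```python
import numpy as np


def resolve_contamination(contamination, n_trials: int = 5) -> list:
    """Expand the documented parameter format: a single number is used
    as-is, while a [low, high] list becomes a grid of candidate ratios
    to try during the search (hypothetical expansion scheme)."""
    if isinstance(contamination, (int, float)):
        return [float(contamination)]
    low, high = contamination
    return [float(v) for v in np.linspace(low, high, n_trials)]
```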
Visualization Feature Provision
TAD provides a feature to visualize anomaly detection results. This feature helps to visually confirm detected outliers and easily understand the data distribution and location of outliers.
- It visualizes the distribution of actual data and predicted outliers.
- If there's a y_column, it also visualizes the actual outliers.
- visualization: enter a Boolean value (True or False).
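A sketch of such a plot with matplotlib; `plot_anomalies` is illustrative, not TAD's plotting code, and it shows only the first two features.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np


def plot_anomalies(X: np.ndarray, pred: np.ndarray, path: str) -> None:
    """Scatter the first two features, marking predicted outliers
    (pred == -1) against normal points (pred == 1)."""
    normal, outlier = X[pred == 1], X[pred == -1]
    plt.figure()
    plt.scatter(normal[:, 0], normal[:, 1], s=10, label="normal")
    plt.scatter(outlier[:, 0], outlier[:, 1], s=25, c="red", marker="x", label="predicted outlier")
    plt.legend()
    plt.savefig(path)
    plt.close()
```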
Inference Pipeline
1. Input Asset
The Inference pipeline also has a data input step, which loads files from the inference data path. The data path is specified in load_inference_data_path.
2. Readiness Asset
In the Inference pipeline, the readiness asset checks the suitability of the inference data. It checks whether the categorical columns contain values not seen during training, and if such new values exist, it raises an error so that users can recognize and handle them.
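The unseen-category check can be sketched as follows; `check_unseen_categories` is a hypothetical helper, and the exact error type TAD raises is not specified in the docs.

```python
import pandas as pd


def check_unseen_categories(train_df: pd.DataFrame, infer_df: pd.DataFrame, cat_cols) -> None:
    """Raise an error if the inference data contains categorical values
    that never appeared in the training data."""
    for col in cat_cols:
        unseen = set(infer_df[col].dropna()) - set(train_df[col].dropna())
        if unseen:
            raise ValueError(f"Column '{col}' has unseen values: {sorted(unseen)}")
```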
3. Preprocess Asset
It uses the preprocessing model created in the preprocess asset of the Train pipeline to preprocess the inference data. It applies the same preprocessing methods used during training to maintain consistency.
4. Inference Asset
The Inference asset loads the model trained in the train asset, processes the inference data, and returns the results. This allows for anomaly detection on new data.
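Loading the trained model and scoring new data could look like the following sketch using pickle and scikit-learn. TAD's actual serialization format and helper names are not specified here; `save_model` and `run_inference` are hypothetical.

```python
import pickle

import numpy as np


def save_model(model, path: str) -> None:
    """Persist a fitted detector (sketch; TAD's format may differ)."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def run_inference(path: str, X: np.ndarray) -> np.ndarray:
    """Load the trained detector and score new data
    (-1 = outlier, 1 = normal, following scikit-learn's convention)."""
    with open(path, "rb") as f:
        model = pickle.load(f)
    return model.predict(X)
```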
Visualization Feature Provision
TAD provides a feature to visualize anomaly detection results. This feature helps users visually confirm detected outliers and easily understand the data distribution and the location of outliers. It is applied automatically during inference when visualization is set to True in the Train pipeline.
Usage Tips
- After running TAD's auto mode, check the log file to review the entire process. The log file is saved in '*_artifacts/log/pipeline.log'.
TAD Version: 1.0.0