TAD Features
Feature Overview
TAD's pipeline
An AI Contents pipeline is a combination of assets, each of which is a functional unit. Together, the Train pipeline and the Inference pipeline are built from five assets.
Train pipeline
Input - Readiness - Preprocess - Train
Inference pipeline
Input - Readiness - Preprocess - Inference
Each step is a separate asset.
Train Pipeline
1. Input Asset
It reads all files in the path specified by the user in experimental_plan.yaml and creates a single dataframe. The data path is specified in load_train_data_path, and all files in that folder are read and merged.
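The read-and-merge step could be sketched with pandas as follows. This is a minimal illustration, not TAD's actual implementation: the function name `load_train_data` and the assumption that the files are CSVs are hypothetical.

```python
import glob
import os

import pandas as pd


def load_train_data(data_dir: str) -> pd.DataFrame:
    """Read every CSV file in data_dir and merge them into one dataframe.

    Hypothetical helper illustrating the Input asset's behavior; the
    real asset's file-format handling is not specified in the docs.
    """
    paths = sorted(glob.glob(os.path.join(data_dir, "*.csv")))
    frames = [pd.read_csv(p) for p in paths]
    return pd.concat(frames, ignore_index=True)
```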
2. Readiness Asset
It checks whether the data is suitable for TAD modeling: it classifies the column types of the data, verifies that the minimum required amount of data is present, confirms that every column is properly classified as numeric or categorical, and checks that the proportion of missing values is not too high.
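These checks could be sketched along the following lines, assuming a pandas dataframe. The helper name `check_readiness` and the 30% missing-value threshold are illustrative assumptions, not TAD's actual API or defaults.

```python
import pandas as pd


def check_readiness(df: pd.DataFrame, max_missing_ratio: float = 0.3) -> dict:
    """Classify each column as numeric or categorical and flag columns
    whose missing-value ratio exceeds the allowed threshold.

    Hypothetical sketch of the Readiness asset's column checks.
    """
    report = {}
    for col in df.columns:
        kind = "numeric" if pd.api.types.is_numeric_dtype(df[col]) else "categorical"
        missing = float(df[col].isna().mean())
        report[col] = {
            "type": kind,
            "missing_ratio": missing,
            "ok": missing <= max_missing_ratio,
        }
    return report
```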
3. Preprocess Asset
It performs data preprocessing tasks, including handling missing values, encoding categorical columns, scaling numeric data, and removing outliers. By default, missing values in categorical columns are filled with the most frequent value, and missing values in numeric columns are filled with the median.
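The default imputation rule can be sketched as follows; `impute_defaults` is a hypothetical helper, not TAD's API.

```python
import pandas as pd


def impute_defaults(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median and categorical columns with
    the most frequent value (the documented default preprocessing)."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out
```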
4. Train Asset
TAD performs Hyperparameter Optimization (HPO) with 5 built-in models (Isolation Forest, KNN, Local Outlier Factor, One-Class SVM, DBSCAN), selects the optimal model, and trains it. HPO uses StratifiedKFold to divide the dataset into multiple folds, trains the model on each fold, and evaluates its performance; through this process, it finds the optimal parameters.
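The candidate model set could be assembled with scikit-learn along these lines. This is a sketch: `candidate_models` is a hypothetical helper, the default parameter values are illustrative, and the KNN-distance detector (which scikit-learn does not provide out of the box; pyod does) is noted in a comment rather than included.

```python
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM


def candidate_models(contamination: float = 0.05) -> dict:
    """Build the built-in detector candidates (sketch).

    A KNN-distance detector (e.g. pyod's KNN) would complete the
    documented set of five models.
    """
    return {
        "isolation_forest": IsolationForest(contamination=contamination, random_state=0),
        "lof": LocalOutlierFactor(contamination=contamination, novelty=True),
        "ocsvm": OneClassSVM(kernel="rbf", nu=contamination),
        "dbscan": DBSCAN(eps=0.5, min_samples=5),
    }
```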
HPO Feature Provision
TAD's HPO (Hyperparameter Optimization) feature can be turned ON/OFF through parameters. When performing HPO, the following settings are used to find the optimal parameters for each model:
- KNN: An anomaly detection technique that identifies outliers based on the information of the K nearest neighbors.
  - n_neighbors: number of neighbors, searched as an integer between 3 and 30.
- OCSVM: A technique that determines anomalies in high-dimensional space.
  - kernel: kernel type, chosen from 'rbf', 'poly', 'sigmoid', 'linear'.
  - degree: degree of the polynomial kernel, searched as an integer between 2 and 10.
- LOF: A technique that detects outliers using the ratio of distances between nearest neighbors.
  - n_neighbors: number of neighbors, searched as an integer between 2 and 30.
- Isolation Forest: A technique that detects outliers based on the length of isolation paths in the data.
  - n_estimators: number of trees, searched in increments of 50 between 50 and 500.
  - max_samples: sampling ratio for each tree, searched as a float between 0.5 and 1.0.
- DBSCAN: A density-based clustering technique that detects points in low-density regions as outliers.
  - eps: maximum distance between two samples, searched as a float between 0.1 and 10.
  - min_samples: minimum number of samples required to form a cluster, searched as an integer between 2 and 30.
- Hyperparameter optimization for each model is performed through cross-validation on the given dataset: StratifiedKFold divides the dataset into multiple folds, and the model is trained and evaluated on each fold. During this process, the model's anomaly detection score is calculated, and the hyperparameter search maximizes the outlier score.
- Finally, model performance is compared using the average IQR calculated from each fold to find the optimal parameters. After HPO finds the optimal parameters for each model, the final results of the 4 models are ensembled to provide the optimal value.
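The fold-wise IQR comparison might look like the following simplified sketch for a single model and a single hyperparameter. Assumptions are labeled: plain KFold stands in for StratifiedKFold (the stratification labels are not described here), and the function name and grid are hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import KFold


def search_n_estimators(X: np.ndarray, grid=(50, 100, 150)) -> int:
    """Pick the n_estimators value whose decision scores have the
    largest mean IQR across folds.

    Simplified stand-in for the documented fold-wise IQR comparison;
    TAD's actual selection metric may differ.
    """
    best_param, best_score = grid[0], -np.inf
    for n in grid:
        iqrs = []
        for train_idx, test_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(X):
            model = IsolationForest(n_estimators=n, random_state=0).fit(X[train_idx])
            scores = model.decision_function(X[test_idx])
            q75, q25 = np.percentile(scores, [75, 25])
            iqrs.append(q75 - q25)  # spread of anomaly scores on the held-out fold
        mean_iqr = float(np.mean(iqrs))
        if mean_iqr > best_score:
            best_param, best_score = n, mean_iqr
    return best_param
```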
Contamination Ratio Search Feature
To improve the performance of the anomaly detection model, TAD provides a feature that automatically searches for the ratio of outliers in the data. This feature helps maximize model performance even when users do not know the outlier ratio. It takes contamination as an argument to explore various ratios and find the optimal one.
- contamination: sets the range of outlier ratios. Enter a list for a search range, or a single float for a fixed ratio (e.g., [0.001, 0.2] or 0.01). Through this feature, you can experiment with various outlier ratios to find the optimal contamination value and improve the model's accuracy.
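The two accepted formats for contamination can be illustrated with a small helper; `resolve_contamination`, the number of trial points, and the even spacing of the grid are all illustrative assumptions.

```python
import numpy as np


def resolve_contamination(contamination, n_trials: int = 5) -> list:
    """Expand the documented parameter format: a single number is used
    as-is, while a [low, high] list becomes a grid of candidate ratios
    to try during the search (hypothetical expansion scheme)."""
    if isinstance(contamination, (int, float)):
        return [float(contamination)]
    low, high = contamination
    return [float(v) for v in np.linspace(low, high, n_trials)]
```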
Visualization Feature Provision
TAD provides a feature to visualize anomaly detection results. This feature helps to visually confirm detected outliers and easily understand the data distribution and location of outliers.
- It visualizes the distribution of actual data and predicted outliers.
- If there's a y_column, it also visualizes the actual outliers.
- visualization: enter a Boolean value (True or False).
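A sketch of such a plot with matplotlib; `plot_anomalies` is illustrative, not TAD's plotting code, and it shows only the first two features.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np


def plot_anomalies(X: np.ndarray, pred: np.ndarray, path: str) -> None:
    """Scatter the first two features, marking predicted outliers
    (pred == -1) against normal points (pred == 1)."""
    normal, outlier = X[pred == 1], X[pred == -1]
    plt.figure()
    plt.scatter(normal[:, 0], normal[:, 1], s=10, label="normal")
    plt.scatter(outlier[:, 0], outlier[:, 1], s=25, c="red", marker="x", label="predicted outlier")
    plt.legend()
    plt.savefig(path)
    plt.close()
```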
Inference Pipeline
1. Input Asset
The Inference pipeline also has a data input step, which loads files from the inference data path. The data path is specified in load_inference_data_path.
2. Readiness Asset
In the Inference pipeline, the readiness asset checks the suitability of the inference data. It checks whether the categorical columns contain values not seen during training, and if such new values exist, it raises an error so that users can recognize and handle them.
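The unseen-category check can be sketched as follows; `check_unseen_categories` is a hypothetical helper, and the exact error type TAD raises is not specified in the docs.

```python
import pandas as pd


def check_unseen_categories(train_df: pd.DataFrame, infer_df: pd.DataFrame, cat_cols) -> None:
    """Raise an error if the inference data contains categorical values
    that never appeared in the training data."""
    for col in cat_cols:
        unseen = set(infer_df[col].dropna()) - set(train_df[col].dropna())
        if unseen:
            raise ValueError(f"Column '{col}' has unseen values: {sorted(unseen)}")
```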
3. Preprocess Asset
It uses the preprocessing model created in the preprocess asset of the Train pipeline to preprocess the inference data. It applies the same preprocessing methods used during training to maintain consistency.
4. Inference Asset
The Inference asset loads the model trained in the train asset, processes the inference data, and returns the results. This allows for anomaly detection on new data.
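Loading the trained model and scoring new data could look like the following sketch using pickle and scikit-learn. TAD's actual serialization format and helper names are not specified here; `save_model` and `run_inference` are hypothetical.

```python
import pickle

import numpy as np


def save_model(model, path: str) -> None:
    """Persist a fitted detector (sketch; TAD's format may differ)."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def run_inference(path: str, X: np.ndarray) -> np.ndarray:
    """Load the trained detector and score new data
    (-1 = outlier, 1 = normal, following scikit-learn's convention)."""
    with open(path, "rb") as f:
        model = pickle.load(f)
    return model.predict(X)
```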
Visualization Feature Provision
TAD provides a feature to visualize anomaly detection results. This feature helps users visually confirm detected outliers and easily understand the data distribution and the location of outliers. It is applied automatically during inference when visualization is set to True in the Train pipeline.
Usage Tips
- After running TAD's auto mode, check the log file to review the entire process. The log file is saved in '*_artifacts/log/pipeline.log'.
TAD Version: 1.0.0