
TAD Input Data and Output Manual

Updated 2024.07.06

Data Preparation

To use TAD (Tabular Anomaly Detection), prepare your data as described below. TAD detects anomalies with unsupervised learning, so a label column (y column) is not mandatory. However, including anomalous data amounting to about 5% of the normal data can help improve performance.

Preparing Training Data

  1. Prepare a CSV file that contains the data you want to check for anomalies, with one column per feature.
  2. The label column is optional.
  3. If you want to group the data, include a column that holds the group key.

Training Dataset Example

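| x_col_1   | x_col_2   | time_col (optional) | groupkey (optional) | y_col (optional) |
|-----------|-----------|---------------------|---------------------|------------------|
| value 1_1 | value 1_2 | time 1              | group1              | ok               |
| value 2_1 | value 2_2 | time 2              | group2              | ok               |
| value 3_1 | value 3_2 | time 3              | group1              | ng               |
| ...       | ...       | ...                 | ...                 | ...              |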
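
As a minimal sketch of the layout above, the following writes a small training CSV using only the Python standard library. The row values, the timestamp format, and the helper name `write_training_csv` are placeholders chosen for illustration; they are not prescribed by TAD.

```python
import csv
from pathlib import Path

# Column names follow the example table; the values are illustrative placeholders.
COLUMNS = ["x_col_1", "x_col_2", "time_col", "groupkey", "y_col"]
ROWS = [
    [0.12, 3.4, "2024-07-01 00:00:00", "group1", "ok"],
    [0.15, 3.1, "2024-07-01 00:01:00", "group2", "ok"],
    [9.80, 0.2, "2024-07-01 00:02:00", "group1", "ng"],
]


def write_training_csv(path):
    """Write the header and rows to `path`, creating parent folders as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(COLUMNS)
        writer.writerows(ROWS)
    return path


if __name__ == "__main__":
    write_training_csv("train_folder/train_data1.csv")
```

The time, group key, and label columns can be dropped from `COLUMNS` and `ROWS` if they are not needed, since all three are optional.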

Input Data Directory Structure Example

  • To use ALO, train and inference files must be separated. Please separate the data for training and inference as shown below.
  • All files under one folder are combined into a single dataframe in the input asset and used for modeling. (Files in subfolders under the path are also combined.)
  • The columns of data in one folder must all be the same.

Training Data Directory Structure Example:

./{train_folder}/
└ train_data1.csv
└ train_data2.csv
└ train_data3.csv

Inference Data Directory Structure Example:

./{inference_folder}/
└ inference_data1.csv
└ inference_data2.csv
└ inference_data3.csv

All .csv files within one folder are combined into a single dataframe, and this dataframe is used for model training.
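
The combining behavior described above can be approximated before running the pipeline, to catch column mismatches early. This is a sketch using only the standard library, not the input asset's actual implementation; `load_folder` is a hypothetical helper, and comparing column names as sets (ignoring order) is an assumption.

```python
import csv
from pathlib import Path


def load_folder(folder):
    """Read every .csv under `folder` (subfolders included) into one list of row dicts.

    Raises ValueError if a file's column names differ from the first file's.
    """
    files = sorted(Path(folder).rglob("*.csv"))
    if not files:
        raise FileNotFoundError(f"no .csv files under {folder}")
    header = None
    rows = []
    for file in files:
        with file.open(newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            columns = reader.fieldnames
            if header is None:
                header = columns
            elif set(columns or []) != set(header):
                raise ValueError(f"{file} has columns {columns}, expected {header}")
            rows.extend(reader)
    return header, rows
```

Calling `load_folder("./train_folder")` returns the shared header and all rows, or raises if any file breaks the same-columns rule.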

Data Requirements


Essential Requirements

  • Input data must satisfy the following conditions:
    • A y column (label) is not necessary.
    • The higher the proportion of normal data, the better. (It's fine to have only normal data.)

Additional Requirements

  • To ensure minimum performance, we recommend data where the proportion of normal data is greater than that of anomalous data.
  • If the data ratios are similar, we recommend proceeding after oversampling the normal data.
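
When a label column is available, the ratio check and the oversampling of normal data could look roughly like the sketch below. The label values `ok`/`ng` come from the example table; the default of 20 normal rows per anomalous row (about 5%) and the function names are assumptions for illustration, and random duplication is only one possible oversampling method.

```python
import random
from collections import Counter


def label_counts(rows, label_col="y_col"):
    """Count how many rows carry each label value."""
    return Counter(row[label_col] for row in rows)


def oversample_normal(rows, label_col="y_col", normal_label="ok",
                      normal_per_anomaly=20, seed=0):
    """Duplicate randomly chosen normal rows until there are at least
    `normal_per_anomaly` normal rows for every anomalous row."""
    normal = [r for r in rows if r[label_col] == normal_label]
    anomalous = [r for r in rows if r[label_col] != normal_label]
    shortfall = len(anomalous) * normal_per_anomaly - len(normal)
    if not normal or shortfall <= 0:
        return list(rows)  # nothing to sample from, or already normal-dominated
    rng = random.Random(seed)
    extra = [dict(rng.choice(normal)) for _ in range(shortfall)]
    return list(rows) + extra
```

Running `label_counts` before and after `oversample_normal` shows whether the normal share changed as intended.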

Outputs


When you run TAD's training or inference pipeline, the following outputs are generated.

Train pipeline

./train_artifacts/
├── extra_output/
│   └── train/
│       └── eval.json
├── log/
│   ├── experimental_history.json
│   ├── pipeline.log
│   └── process.log
├── models/
│   ├── input/
│   │   └── input_config.json
│   ├── readiness/
│   │   ├── train_config.pickle
│   │   └── report.csv
│   ├── preprocess/
│   │   ├── preprocess_config.json
│   │   └── scaler.pickle
│   └── train/
│       ├── isf_default.pkl
│       ├── knn_default.pkl
│       ├── lof_default.pkl
│       ├── ocsvm_default.pkl
│       ├── dbscan_default.pkl
│       └── params.json
├── output/
│   └── train_pred.csv
└── score/
    └── train_summary.yaml

Inference pipeline

./inference_artifacts/
├── extra_output/
│   ├── readiness/
│   │   └── report.csv
│   └── inference/
│       └── eval.json
├── log/
│   ├── experimental_history.json
│   ├── pipeline.log
│   └── process.log
├── output/
│   └── test_pred.csv
└── score/
    └── inference_summary.yaml
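
After a run, a quick way to confirm that the expected files were produced is to compare the artifact folder against a list of relative paths. The sketch below takes its paths from the trees above and covers only a subset of them; `missing_outputs` is a hypothetical helper and not part of TAD or ALO.

```python
from pathlib import Path

# Relative paths taken from the directory trees above (a subset, for brevity).
TRAIN_OUTPUTS = [
    "extra_output/train/eval.json",
    "log/experimental_history.json",
    "log/pipeline.log",
    "log/process.log",
    "models/input/input_config.json",
    "models/train/params.json",
    "output/train_pred.csv",
    "score/train_summary.yaml",
]
INFERENCE_OUTPUTS = [
    "extra_output/readiness/report.csv",
    "extra_output/inference/eval.json",
    "log/experimental_history.json",
    "log/pipeline.log",
    "log/process.log",
    "output/test_pred.csv",
    "score/inference_summary.yaml",
]


def missing_outputs(root, expected):
    """Return the expected relative paths that are not files under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    print("train:", missing_outputs("./train_artifacts", TRAIN_OUTPUTS))
    print("inference:", missing_outputs("./inference_artifacts", INFERENCE_OUTPUTS))
```

An empty list means every listed file is present; any path that is printed points to a missing output.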

Detailed Explanation of Outputs

log/

  • pipeline.log: The entire log of the pipeline is recorded. You can check detailed logs of data preprocessing, model training, and inference processes.
  • process.log: This is the log of process execution.
  • experimental_history.json: Contains information about the experimental history.

models/

  • input/input_config.json: Config for data input is saved.
  • readiness/: Config for data readiness and report on target data are saved.
    • train_config.pickle: Readiness config
    • report.csv: Readiness report on input data
  • preprocess/: Config and preprocessing objects used for data preprocessing are saved.
    • scaler.pickle: Scaler object of training data
    • preprocess_config.json: Preprocess config
  • train/: These are model files trained using various algorithms.
    • isf_default.pkl: Isolation Forest model
    • knn_default.pkl: K-Nearest Neighbors model
    • lof_default.pkl: Local Outlier Factor model
    • ocsvm_default.pkl: One-Class SVM model
    • dbscan_default.pkl: DBSCAN model
    • params.json: Information on used model parameters

output/

  • train_pred.csv: Prediction results for training data are saved.
  • test_pred.csv: Prediction results for inference data are saved.

extra_output/

  • eval.json: Evaluation metrics and results are saved.

score/

  • train_summary.yaml: This is a summary file of training results. It includes HPO (Hyperparameter Optimization) results and model performance metrics.
  • inference_summary.yaml: This is a summary file of inference results. It includes inference results and evaluation metrics using the trained model.


TAD Version: 1.0.0