Version: Next

TAD Input Data and Output Manual

Updated 2024.07.06

Data Preparation

To use TAD (Tabular Anomaly Detection), the following data preparation is necessary. TAD detects anomalies based on unsupervised learning, so a label column (y column) is not mandatory. However, preparing about 5% of anomalous data compared to the amount of normal data can help improve performance.

Preparing Training Data

Prepare a CSV file that contains the data in which you want to detect anomalies, with one column per feature.
The label column is optional.
If you want to detect anomalies per group, include a column to group by.

Training Dataset Example

x_col_1	x_col_2	time_col(optional)	groupkey(optional)	y_col(optional)
value 1_1	value 1_2	time 1	group1	ok
value 2_1	value 2_2	time 2	group2	ok
value 3_1	value 3_2	time 3	group1	ng
...	...	...	...	...

Input Data Directory Structure Example

To use ALO, training and inference files must be kept separate. Please separate the data for training and inference as shown below.
All files under one folder are combined into a single dataframe in the input asset and used for modeling. (Files in subfolders under the path are also combined.)
All files in one folder must have the same columns.

Training Data Directory Structure Example:

./{train_folder}/
    ├ train_data1.csv
    ├ train_data2.csv
    └ train_data3.csv

Inference Data Directory Structure Example:

./{inference_folder}/
    ├ inference_data1.csv
    ├ inference_data2.csv
    └ inference_data3.csv

All .csv files in the training folder are combined into a single dataframe, and this dataframe is used for model training.
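The folder-combining behavior described above can be approximated outside ALO to check your data before a run. The following is a minimal sketch using pandas, not the actual input asset; the function name `load_folder` is hypothetical, and the column check reflects the requirement that all files in one folder share the same columns.

```python
from pathlib import Path

import pandas as pd


def load_folder(folder: str) -> pd.DataFrame:
    """Combine every .csv under `folder` (subfolders included) into one dataframe."""
    files = sorted(Path(folder).rglob("*.csv"))
    if not files:
        raise FileNotFoundError(f"no .csv files found under {folder}")

    frames = []
    expected_columns = None
    for path in files:
        frame = pd.read_csv(path)
        columns = list(frame.columns)
        if expected_columns is None:
            expected_columns = columns
        elif columns != expected_columns:
            # The manual requires identical columns within one folder.
            raise ValueError(
                f"{path} has columns {columns}, expected {expected_columns}"
            )
        frames.append(frame)

    # Stack all files row-wise and renumber the index.
    return pd.concat(frames, ignore_index=True)
```

Calling `load_folder("./train_folder")` before training is one way to catch a file with mismatched columns early. The folder name here is only an example.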

Data Requirements

Essential Requirements

Input data must satisfy the following conditions:
- A y column (label) is not necessary.
- The higher the proportion of normal data, the better. (It's fine to have only normal data.)

Additional Requirements

To ensure a minimum level of performance, we recommend data in which normal samples outnumber anomalous samples.
If the two proportions are similar, we recommend oversampling the normal data before proceeding.
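If a label column is available, the oversampling recommendation can be sketched as follows. This is an illustration under assumptions and not part of TAD: the function name `oversample_normal` is hypothetical, the label values `ok` and `ng` are borrowed from the training dataset example, and the default target of about 5% anomalous data relative to normal data comes from the guideline in Data Preparation.

```python
import pandas as pd


def oversample_normal(df, label_col="y_col", normal_label="ok",
                      anomaly_to_normal=0.05, seed=0):
    """Resample normal rows with replacement until anomalous rows are
    roughly `anomaly_to_normal` times the number of normal rows."""
    normal = df[df[label_col] == normal_label]
    anomalous = df[df[label_col] != normal_label]
    if normal.empty or anomalous.empty:
        return df.copy()  # nothing to balance

    # Number of normal rows needed to reach the target ratio.
    needed = int(round(len(anomalous) / anomaly_to_normal))
    if needed <= len(normal):
        return df.copy()  # normal data already dominates enough

    # Draw the missing normal rows by sampling with replacement.
    extra = normal.sample(n=needed - len(normal), replace=True, random_state=seed)
    return pd.concat([df, extra], ignore_index=True)
```

Sampling with replacement only duplicates existing normal rows, so it changes the class proportions without adding new information.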

Outputs

When you run TAD training or inference, the following outputs are generated.

Train pipeline

./train_artifacts/
   ├── extra_output/
   │   └── train/
   │       └── eval.json
   ├── log/
   │   ├── experimental_history.json
   │   ├── pipeline.log
   │   └── process.log
   ├── models/
   │   ├── input/
   │   │   └── input_config.json
   │   ├── readiness/
   │   │   ├── train_config.pickle
   │   │   └── report.csv
   │   ├── preprocess/
   │   │   ├── preprocess_config.json
   │   │   └── scaler.pickle
   │   └── train/
   │       ├── isf_default.pkl
   │       ├── knn_default.pkl
   │       ├── lof_default.pkl
   │       ├── ocsvm_default.pkl
   │       ├── dbscan_default.pkl
   │       └── params.json
   ├── output/
   │   └── train_pred.csv
   └── score/
       └── train_summary.yaml

Inference pipeline

./inference_artifacts/
   ├── extra_output/
   │   ├── readiness/
   │   │   └── report.csv
   │   └── inference/
   │       └── eval.json
   ├── log/
   │   ├── experimental_history.json
   │   ├── pipeline.log
   │   └── process.log
   ├── output/
   │   └── test_pred.csv
   └── score/
       └── inference_summary.yaml
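After a run, the two directory trees above can serve as a checklist. The sketch below uses only the Python standard library and reports which expected files are missing. The relative paths are copied from the trees; the helper name `missing_artifacts` is hypothetical and not part of ALO or TAD.

```python
from pathlib import Path

# Relative paths taken from the train pipeline tree.
TRAIN_ARTIFACTS = [
    "extra_output/train/eval.json",
    "log/experimental_history.json",
    "log/pipeline.log",
    "log/process.log",
    "models/input/input_config.json",
    "models/readiness/train_config.pickle",
    "models/readiness/report.csv",
    "models/preprocess/preprocess_config.json",
    "models/preprocess/scaler.pickle",
    "models/train/isf_default.pkl",
    "models/train/knn_default.pkl",
    "models/train/lof_default.pkl",
    "models/train/ocsvm_default.pkl",
    "models/train/dbscan_default.pkl",
    "models/train/params.json",
    "output/train_pred.csv",
    "score/train_summary.yaml",
]

# Relative paths taken from the inference pipeline tree.
INFERENCE_ARTIFACTS = [
    "extra_output/readiness/report.csv",
    "extra_output/inference/eval.json",
    "log/experimental_history.json",
    "log/pipeline.log",
    "log/process.log",
    "output/test_pred.csv",
    "score/inference_summary.yaml",
]


def missing_artifacts(root, expected):
    """Return the expected relative paths that are not files under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]
```

For example, `missing_artifacts("./train_artifacts", TRAIN_ARTIFACTS)` returns an empty list when every file in the train tree is present.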

Detailed Explanation of Outputs

log/

pipeline.log: The full log of the pipeline. It contains detailed logs of the data preprocessing, model training, and inference steps.
process.log: This is the log of process execution.
experimental_history.json: Contains information about the experimental history.

models/

input/input_config.json: Config for data input is saved.
readiness/: Config for data readiness and report on target data are saved.
- train_config.pickle: Readiness config
- report.csv: Readiness report on input data
preprocess/: Config and preprocessing objects used for data preprocessing are saved.
- scaler.pickle: Scaler object of training data
- preprocess_config.json: Preprocess config
train/: Model files trained with each algorithm, along with the parameters used, are saved.
- isf_default.pkl: Isolation Forest model
- knn_default.pkl: K-Nearest Neighbors model
- lof_default.pkl: Local Outlier Factor model
- ocsvm_default.pkl: One-Class SVM model
- dbscan_default.pkl: DBSCAN model
- params.json: Information on used model parameters

output/

train_pred.csv: Prediction results for training data are saved.
test_pred.csv: Prediction results for inference data are saved.

extra_output/

eval.json: Evaluation metrics and results are saved.

score/

train_summary.yaml: This is a summary file of training results. It includes HPO (Hyperparameter Optimization) results and model performance metrics.
inference_summary.yaml: This is a summary file of inference results. It includes inference results and evaluation metrics using the trained model.

TAD Version: 1.0.0
