TAD Input Data and Output Manual
Updated 2024.07.06
Data Preparation
To use TAD (Tabular Anomaly Detection), prepare your data as described below. TAD detects anomalies with unsupervised learning, so a label column (y column) is not mandatory. However, including anomalous data amounting to about 5% of the normal data can help improve performance.
Preparing Training Data
- Prepare a CSV file that contains the records you want to check for anomalies, with one column per feature.
- The label column is optional.
- If you use grouping, the data must include a column to group by.
Training Dataset Example
x_col_1 | x_col_2 | time_col (optional) | groupkey (optional) | y_col (optional) |
---|---|---|---|---|
value 1_1 | value 1_2 | time 1 | group1 | ok |
value 2_1 | value 2_2 | time 2 | group2 | ok |
value 3_1 | value 3_2 | time 3 | group1 | ng |
... | ... | ... | ... | ... |
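As an illustration of the layout above, the following minimal sketch writes a small CSV with the same columns using only the Python standard library. The row values are hypothetical, and the `ok`/`ng` label values simply mirror the example table; TAD itself does not need to be installed to run it.

```python
import csv
import io

# Column names follow the example table; the optional columns may be omitted.
FIELDNAMES = ["x_col_1", "x_col_2", "time_col", "groupkey", "y_col"]

# Hypothetical rows, for illustration only.
ROWS = [
    {"x_col_1": 0.12, "x_col_2": 3.4, "time_col": "2024-07-01 00:00:00",
     "groupkey": "group1", "y_col": "ok"},
    {"x_col_1": 0.15, "x_col_2": 3.1, "time_col": "2024-07-01 00:01:00",
     "groupkey": "group2", "y_col": "ok"},
    {"x_col_1": 9.80, "x_col_2": 0.2, "time_col": "2024-07-01 00:02:00",
     "groupkey": "group1", "y_col": "ng"},
]


def write_training_csv(rows, fieldnames, stream):
    """Write a header line followed by one line per row."""
    writer = csv.DictWriter(stream, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Write to an in-memory buffer; pass an open file object to write to disk.
buffer = io.StringIO()
write_training_csv(ROWS, FIELDNAMES, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, open it with `open(path, "w", newline="")` and pass the file object in place of the buffer.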
Input Data Directory Structure Example
- ALO requires separate training and inference files. Store the training data and the inference data in separate folders, as shown below.
- All files under one folder are combined into a single dataframe in the input asset and used for modeling. (Files in subfolders under the path are also combined.)
- All files in one folder must have the same columns.
Training Data Directory Structure Example:
./{train_folder}/
├ train_data1.csv
├ train_data2.csv
└ train_data3.csv
Inference Data Directory Structure Example:
./{inference_folder}/
├ inference_data1.csv
├ inference_data2.csv
└ inference_data3.csv
All .csv files within the training folder are combined into a single dataframe, and this dataframe is used for model training.
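The combining behavior described above can be approximated as follows. This is a minimal sketch and not the actual input asset: `load_folder` is a hypothetical helper, it uses the standard library rather than a dataframe, and it assumes that "the same columns" means identical names in identical order.

```python
import csv
import tempfile
from pathlib import Path


def load_folder(folder):
    """Read every .csv under `folder` (subfolders included) into one row list."""
    combined, expected_columns = [], None
    for path in sorted(Path(folder).rglob("*.csv")):
        with path.open(newline="") as handle:
            reader = csv.DictReader(handle)
            columns = reader.fieldnames
            if expected_columns is None:
                expected_columns = columns
            elif columns != expected_columns:
                # Assumption: a column mismatch is treated as an error.
                raise ValueError(
                    f"{path} has columns {columns}, expected {expected_columns}"
                )
            combined.extend(reader)
    return expected_columns, combined


# Demo on a temporary folder with one file at the top level and one in a subfolder.
with tempfile.TemporaryDirectory() as train_folder:
    root = Path(train_folder)
    (root / "sub").mkdir()
    (root / "train_data1.csv").write_text("x_col_1,x_col_2\n1,2\n3,4\n")
    (root / "sub" / "train_data2.csv").write_text("x_col_1,x_col_2\n5,6\n")
    columns, rows = load_folder(root)

print(columns, len(rows))
```

Running a check like this before training can surface a mismatched file early, instead of during the pipeline run.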
Data Requirements
Essential Requirements
- Input data must satisfy the following conditions:
- A y column (label) is not necessary.
- The higher the proportion of normal data, the better. (It's fine to have only normal data.)
Additional Requirements
- To ensure minimum performance, we recommend data where the proportion of normal data is greater than that of anomalous data.
- If the proportions of normal and anomalous data are similar, we recommend oversampling the normal data before training.
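If your data has a label column, the proportion of normal data can be checked and raised before training. The sketch below is one possible approach, not part of TAD: `normal_ratio` and `oversample_normal` are hypothetical helpers, the `y_col` column and `ok` label come from the example table, and the oversampling is plain random duplication of normal rows.

```python
import math
import random


def normal_ratio(rows, label_key="y_col", normal_label="ok"):
    """Return the share of rows whose label equals `normal_label`."""
    if not rows:
        raise ValueError("rows must not be empty")
    normal = sum(1 for row in rows if row[label_key] == normal_label)
    return normal / len(rows)


def oversample_normal(rows, target_ratio=0.95, label_key="y_col",
                      normal_label="ok", seed=0):
    """Duplicate random normal rows until they make up `target_ratio` of the data."""
    if not 0 < target_ratio < 1:
        raise ValueError("target_ratio must be between 0 and 1")
    normal = [r for r in rows if r[label_key] == normal_label]
    anomalous = [r for r in rows if r[label_key] != normal_label]
    if not normal:
        raise ValueError("no normal rows to oversample")
    # Normal rows needed so that normal / (normal + anomalous) >= target_ratio.
    needed = math.ceil(target_ratio * len(anomalous) / (1 - target_ratio))
    extra = max(0, needed - len(normal))
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    return rows + rng.choices(normal, k=extra)


# Hypothetical data: 6 normal rows and 4 anomalous rows.
rows = ([{"x_col_1": i, "y_col": "ok"} for i in range(6)]
        + [{"x_col_1": i, "y_col": "ng"} for i in range(4)])
balanced = oversample_normal(rows, target_ratio=0.8)
print(normal_ratio(rows), normal_ratio(balanced))
```

The default `target_ratio=0.95` is an assumption loosely based on the 5% guideline in Data Preparation; adjust it to your data.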
Outputs
When you run TAD's training and inference pipelines, the following outputs are generated.
Train pipeline
./train_artifacts/
├── extra_output/
│   └── train/
│       └── eval.json
├── log/
│   ├── experimental_history.json
│   ├── pipeline.log
│   └── process.log
├── models/
│   ├── input/
│   │   └── input_config.json
│   ├── readiness/
│   │   ├── train_config.pickle
│   │   └── report.csv
│   ├── preprocess/
│   │   ├── preprocess_config.json
│   │   └── scaler.pickle
│   └── train/
│       ├── isf_default.pkl
│       ├── knn_default.pkl
│       ├── lof_default.pkl
│       ├── ocsvm_default.pkl
│       ├── dbscan_default.pkl
│       └── params.json
├── output/
│   └── train_pred.csv
└── score/
    └── train_summary.yaml
Inference pipeline
./inference_artifacts/
├── extra_output/
│   ├── readiness/
│   │   └── report.csv
│   └── inference/
│       └── eval.json
├── log/
│   ├── experimental_history.json
│   ├── pipeline.log
│   └── process.log
├── output/
│   └── test_pred.csv
└── score/
    └── inference_summary.yaml
Detailed Explanation of Outputs
log/
pipeline.log
: The entire log of the pipeline is recorded. You can check detailed logs of data preprocessing, model training, and inference processes.
process.log
: This is the log of process execution.
experimental_history.json
: Contains information about the experimental history.
models/
input/input_config.json
: Config for data input is saved.
readiness/
: Config for data readiness and the readiness report on the input data are saved.
train_config.pickle
: Readiness config
report.csv
: Readiness report on input data
preprocess/
: Config and preprocessing objects used for data preprocessing are saved.
scaler.pickle
: Scaler object fitted on the training data
preprocess_config.json
: Preprocess config
train/
: These are model files trained using various algorithms.
isf_default.pkl
: Isolation Forest model
knn_default.pkl
: K-Nearest Neighbors model
lof_default.pkl
: Local Outlier Factor model
ocsvm_default.pkl
: One-Class SVM model
dbscan_default.pkl
: DBSCAN model
params.json
: Information on the model parameters used
output/
train_pred.csv
: Prediction results for training data are saved.
test_pred.csv
: Prediction results for inference data are saved.
extra_output/
eval.json
: Evaluation metrics and results are saved.
score/
train_summary.yaml
: This is a summary file of training results. It includes HPO (Hyperparameter Optimization) results and model performance metrics.
inference_summary.yaml
: This is a summary file of inference results. It includes inference results and evaluation metrics using the trained model.
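After a training run, you may want to confirm that the files described above were actually produced. The following is a minimal sketch under stated assumptions: `missing_artifacts` is a hypothetical helper, the relative paths are taken from the train pipeline tree in this manual, and the demo builds a temporary folder instead of reading a real `./train_artifacts/` directory.

```python
import tempfile
from pathlib import Path

# A subset of the train pipeline outputs listed in this manual.
EXPECTED_TRAIN_ARTIFACTS = [
    "log/pipeline.log",
    "log/process.log",
    "models/train/params.json",
    "output/train_pred.csv",
    "score/train_summary.yaml",
]


def missing_artifacts(artifact_root, expected):
    """Return the expected relative paths that do not exist under `artifact_root`."""
    root = Path(artifact_root)
    return [rel for rel in expected if not (root / rel).is_file()]


# Demo: create every expected file except the last, then run the check.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "train_artifacts"
    for rel in EXPECTED_TRAIN_ARTIFACTS[:-1]:
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    missing = missing_artifacts(root, EXPECTED_TRAIN_ARTIFACTS)

print(missing)
```

For a real run, call `missing_artifacts("./train_artifacts", EXPECTED_TRAIN_ARTIFACTS)`; an empty list means every listed file is present.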
TAD Version: 1.0.0