AD Input and Artifacts
Data Preparation
Preparing Training Data
- Prepare a CSV file in which each variable you want to check for anomalies is a separate column (an x column).
- All CSV files must have a time column to identify each row.
- If the time column contains duplicates, you can configure AD to drop them. If they are not dropped, additional columns must be present to distinguish the rows.
- Label columns are optional. If they exist, they must be present for all x columns.
- To group data, include a grouping column (groupkey in the example below).
Example of Training Dataset
time_col | x_col_1 | x_col_2 | groupkey |
---|---|---|---|
time 1 | value 1_1 | value 1_2 | group1 |
time 2 | value 2_1 | value 2_2 | group2 |
time 3 | value 3_1 | value 3_2 | group1 |
... | ... | ... | ... |
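As a rough illustration of the layout above, the following sketch writes a small CSV with Python's standard `csv` module. The column names come from the example table; the timestamps, the numeric values, and the datetime string format are placeholders invented for this example, and the time formats AD accepts are not specified here.

```python
import csv
import io

# Column names follow the example table above.
FIELDNAMES = ["time_col", "x_col_1", "x_col_2", "groupkey"]

# Made-up placeholder rows: one time value, two float x columns, one group key.
ROWS = [
    {"time_col": "2024-01-01 00:00:00", "x_col_1": 0.51, "x_col_2": 12.0, "groupkey": "group1"},
    {"time_col": "2024-01-01 00:01:00", "x_col_1": 0.48, "x_col_2": 11.7, "groupkey": "group2"},
    {"time_col": "2024-01-01 00:02:00", "x_col_1": 0.95, "x_col_2": 30.2, "groupkey": "group1"},
]


def write_training_csv(stream, rows, fieldnames=FIELDNAMES):
    """Write rows to an open text stream as CSV, starting with a header line."""
    writer = csv.DictWriter(stream, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Write to an in-memory buffer so the sketch has no side effects.
buffer = io.StringIO()
write_training_csv(buffer, ROWS)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, open the target file with `open(path, "w", newline="")` and pass the handle as `stream`.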
Example of Input Data Directory Structure
- To use ALO, prepare the training data and the inference data as separate files in separate folders.
- All files in a folder will be combined into a single dataframe for modeling. (Files in subfolders will also be combined.)
- The columns of all data in a folder must be the same.
./{train_folder}/
└ train_data1.csv
└ train_data2.csv
└ train_data3.csv
./{inference_folder}/
└ inference_data1.csv
└ inference_data2.csv
└ inference_data3.csv
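The folder rules above (all files combined, subfolders included, identical columns) can be checked before a run with a short script. This is a minimal sketch using only the standard library, not ALO's actual loader; the function name `load_folder` is hypothetical, and treating "same columns" as "same set of column names" is an assumption.

```python
import csv
from pathlib import Path


def load_folder(folder):
    """Read every CSV under `folder` (subfolders included) into one list of row dicts.

    Raises ValueError if a file's column names differ from the first file's.
    """
    combined = []
    expected_columns = None
    for path in sorted(Path(folder).rglob("*.csv")):
        with path.open(newline="") as handle:
            reader = csv.DictReader(handle)
            columns = reader.fieldnames  # reads the header line
            if columns is None:
                continue  # skip completely empty files
            if expected_columns is None:
                expected_columns = columns
            elif set(columns) != set(expected_columns):
                raise ValueError(
                    f"{path} has columns {columns}, expected {expected_columns}"
                )
            combined.extend(reader)
    return combined
```

Calling `load_folder("./train_folder")` and `load_folder("./inference_folder")` separately mirrors the split between training and inference data; both folder names here are placeholders.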
Data Requirements
Mandatory Requirements
Input data must meet the following conditions.
index | item | spec. |
---|---|---|
1 | Number of x columns | At least 1 |
2 | Number of time columns | 1 |
3 | Data type of x columns | float |
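The three mandatory conditions can be verified on raw CSV rows before training. The sketch below is one possible pre-check, not part of AD itself; `check_mandatory` is a hypothetical helper, rows are assumed to be dictionaries as produced by `csv.DictReader`, and "float" is interpreted as "parses with Python's `float()`", which may be looser or stricter than AD's own check.

```python
def check_mandatory(rows, time_col, x_cols):
    """Return a list of problems; an empty list means all checks passed.

    rows     -- list of dicts mapping column name to string value
    time_col -- name of the single time column
    x_cols   -- names of the x columns
    """
    problems = []
    if len(x_cols) < 1:
        problems.append("at least one x column is required")
    if not rows:
        problems.append("no data rows")
        return problems

    header = rows[0].keys()
    if time_col not in header:
        problems.append(f"time column {time_col!r} is missing")

    for col in x_cols:
        if col not in header:
            problems.append(f"x column {col!r} is missing")
            continue
        for index, row in enumerate(rows):
            try:
                float(row[col])
            except (TypeError, ValueError):
                # Report only the first bad value per column.
                problems.append(f"row {index}: {col}={row[col]!r} is not a float")
                break
    return problems
```

Note that `float()` accepts strings such as `"nan"` but rejects empty strings, so how missing values are flagged here is a property of this sketch rather than of AD.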
Additional Requirements
These conditions ensure minimum performance. The algorithm will run even if these conditions are not met, but performance is not guaranteed.
index | item | spec. |
---|---|---|
1 | time column | The time column should contain as few duplicate values as possible. If AD is configured to drop duplicates, rows with duplicated time values may be deleted unintentionally. |
2 | NG data | Lack of NG data may impact performance. |
3 | group key | Each group should have sufficient data. Lack of data in groups may negatively affect performance. |
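Two of these recommendations, duplicate time values and thin groups, are easy to inspect ahead of time. The following sketch counts both with `collections.Counter`. The helper names and the `min_rows` threshold are hypothetical: the text above does not define how many rows count as "sufficient", so any threshold is the user's own choice.

```python
from collections import Counter


def duplicate_times(rows, time_col):
    """Map each time value that appears more than once to its row count."""
    counts = Counter(row[time_col] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


def small_groups(rows, group_col, min_rows):
    """Map each group with fewer than `min_rows` rows to its row count."""
    counts = Counter(row[group_col] for row in rows)
    return {group: n for group, n in counts.items() if n < min_rows}
```

Both functions expect rows as dictionaries, for example those produced by `csv.DictReader`. An empty result from each suggests the data meets the corresponding recommendation under the chosen threshold.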
Artifacts
Running training/inference generates the following artifacts.
Train Pipeline
./alo/train_artifacts/
└ models/preprocess/
    └ train_config.pkl
    └ train_pipeline_x.pkl
└ models/train/prep_{x column name}
    └ train_params.pkl
└ output/
    └ train_result.csv
└ extra_output/train/prep_{x column name}
    └ confusion_matrix.jpg (generated if y column is provided)
    └ plot_anomaly_model_best_model.jpg (generated if y column is provided)
    └ plot_anomaly_model_{selected model name}.jpg (generated if y column is provided)
    └ score_compare.jpg (generated if y column is provided)
Inference Pipeline
./alo/inference_artifacts/
└ output/inference/
    └ output.csv
└ extra_output/inference/prep_{x column name}
    └ confusion_matrix.jpg (generated if y column is provided)
    └ plot_anomaly_model_best_model.jpg (generated if y column is provided)
    └ plot_anomaly_model_{selected model name}.jpg (generated if y column is provided)
    └ score_compare.jpg (generated if y column is provided)
└ score/
    └ inference_summary.yaml
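After a run, a quick way to confirm that the always-generated artifacts exist is to glob for them. This is a sketch only: the relative paths assume each file sits directly inside the folder listed above it in the trees, `prep_*` stands in for `prep_{x column name}`, and `missing_artifacts` is a hypothetical helper. The JPG files are left out because they are generated only when a y column is provided.

```python
from pathlib import Path

# Assumed locations, read off the directory trees above.
TRAIN_PATTERNS = [
    "models/preprocess/train_config.pkl",
    "models/preprocess/train_pipeline_x.pkl",
    "models/train/prep_*/train_params.pkl",
    "output/train_result.csv",
]
INFERENCE_PATTERNS = [
    "output/inference/output.csv",
    "score/inference_summary.yaml",
]


def missing_artifacts(root, patterns):
    """Return the patterns that match no file under `root`."""
    root = Path(root)
    return [pattern for pattern in patterns if not any(root.glob(pattern))]
```

For example, `missing_artifacts("./alo/train_artifacts", TRAIN_PATTERNS)` returns an empty list when every expected training artifact is present.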
The detailed descriptions of each artifact are as follows.
train_config.pkl
A pickle file containing the arguments of the preprocess asset.
train_pipeline_x.pkl
A pickle file that stores the model for preprocessing x columns in the preprocess asset.
train_params.pkl
A pickle file that stores the model trained by the train asset.
train_result.csv
A CSV file that stores the results after the train pipeline is completed.
confusion_matrix.jpg
A JPG file plotting the confusion matrix of the model(s) using the train data, generated if the y column is provided.
plot_anomaly_model_best_model.jpg
A JPG file plotting the anomaly detection results of the best-performing model using the train data, generated if the y column is provided.
plot_anomaly_model_{selected model name}.jpg
A JPG file plotting the anomaly detection results of the selected model using the train data, generated if the y column is provided.
score_compare.jpg
A JPG file comparing the performance results of the models using the train data, generated if the y column is provided.
output.csv
A CSV file that stores the inference results.
inference_summary.yaml
A YAML file summarizing the inference results. It contains the information displayed in Mellerikat's edge viewer: date, file_path, note, probability, result, score, and version.
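The .pkl artifacts described above can be inspected with Python's standard `pickle` module. The sketch below only reports the type of the stored object, since the internal structure of these files is not documented here; `describe_pickle` is a hypothetical helper. Unpickling may require the classes used by AD to be importable in the current environment, and `pickle.load` can execute arbitrary code, so it should only be used on files you trust.

```python
import pickle
from pathlib import Path


def describe_pickle(path):
    """Load a pickle file and return the type name of the object it stores."""
    with Path(path).open("rb") as handle:
        obj = pickle.load(handle)  # only load files from a trusted source
    return type(obj).__name__
```

A call such as `describe_pickle("./alo/train_artifacts/models/preprocess/train_config.pkl")` would show what kind of object holds the preprocess arguments; that path assumes the file sits directly under `models/preprocess/`.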
AD Version: 2.0.1