AD Input and Artifacts

Updated 2024.05.17

Data Preparation

Preparing Training Data

  1. Prepare a CSV file in which each point you want to monitor for anomalies is a column (an x column).
  2. Every CSV file must have a time column that identifies each row.
  3. If the time column contains duplicate values, you can configure AD to drop the duplicated rows. If they are not dropped, additional columns are needed to distinguish the rows.
  4. Label columns are optional. If they are provided, a label column must be present for every x column.
  5. To group data, the file must contain a column that holds the group key.
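
Rule 3 can be illustrated with a small sketch that uses only the Python standard library. The function name `drop_duplicate_times`, the column names, and the policy of keeping the first row for each time value are assumptions made for illustration; AD's actual drop behavior may differ.

```python
def drop_duplicate_times(rows, time_col="time_col"):
    """Keep only the first row seen for each time value (assumed policy)."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = row[time_col]
        if key in seen:
            continue  # a later row with the same time value is discarded
        seen.add(key)
        unique_rows.append(row)
    return unique_rows


# Hypothetical rows: two of them share the time value "time 1".
rows = [
    {"time_col": "time 1", "x_col_1": "0.1", "groupkey": "group1"},
    {"time_col": "time 1", "x_col_1": "0.7", "groupkey": "group2"},
    {"time_col": "time 2", "x_col_1": "0.3", "groupkey": "group1"},
]
deduplicated = drop_duplicate_times(rows)
```

In this sketch the second row is discarded even though it belongs to a different group. This is the situation in which keeping the duplicates and distinguishing the rows by an additional column, such as the group key, is the safer choice.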

Example of Training Dataset

time_col | x_col_1   | x_col_2   | groupkey
-------- | --------- | --------- | --------
time 1   | value 1_1 | value 1_2 | group1
time 2   | value 2_1 | value 2_2 | group2
time 3   | value 3_1 | value 3_2 | group1
...      | ...       | ...       | ...

Example of Input Data Directory Structure

  • To use ALO, keep the training data and the inference data in separate folders.
  • All files in a folder are combined into a single dataframe for modeling. Files in subfolders are included.
  • All files in a folder must have the same columns.
./{train_folder}/
└ train_data1.csv
└ train_data2.csv
└ train_data3.csv
./{inference_folder}/
└ inference_data1.csv
└ inference_data2.csv
└ inference_data3.csv
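
The folder rules above can be sketched as follows. This is a minimal illustration that uses only the standard library and returns a list of row dictionaries instead of a dataframe; the function name `load_folder` and the strict comparison of column order are assumptions, not ALO's actual loader.

```python
import csv
from pathlib import Path


def load_folder(folder):
    """Read every CSV under `folder` (subfolders included) into one list of rows."""
    combined = []
    expected_columns = None
    for path in sorted(Path(folder).rglob("*.csv")):
        with path.open(newline="") as handle:
            reader = csv.DictReader(handle)
            columns = reader.fieldnames  # header row of this file
            if expected_columns is None:
                expected_columns = columns
            elif columns != expected_columns:
                # All files in a folder must share the same columns.
                raise ValueError(
                    f"{path} has columns {columns}, expected {expected_columns}"
                )
            combined.extend(reader)
    return combined
```

Calling `load_folder` once on the training folder and once on the inference folder before a run surfaces a column mismatch early.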


Data Requirements

Mandatory Requirements

Input data must meet the following conditions.

index | item                   | spec.
----- | ---------------------- | ----------
1     | Number of x columns    | At least 1
2     | Number of time columns | 1
3     | Data type of x columns | float
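
A pre-flight check for the three mandatory conditions might look like the sketch below. It works on rows read with `csv.DictReader`; the function name `check_mandatory` and the message wording are hypothetical, and AD's own validation may behave differently.

```python
def check_mandatory(rows, time_col, x_cols):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    if len(x_cols) < 1:
        problems.append("at least one x column is required")
    columns = list(rows[0].keys()) if rows else []
    if time_col not in columns:
        problems.append(f"time column {time_col!r} not found")
    for col in x_cols:
        for line_no, row in enumerate(rows, start=1):
            try:
                float(row[col])  # x values must be readable as float
            except (KeyError, TypeError, ValueError):
                problems.append(f"row {line_no}: {col!r} is missing or not a float")
                break  # report only the first offending row per column
    return problems


# Hypothetical rows: the second x value cannot be parsed as a float.
sample_rows = [
    {"time_col": "time 1", "x_col_1": "0.5"},
    {"time_col": "time 2", "x_col_1": "abc"},
]
sample_problems = check_mandatory(sample_rows, "time_col", ["x_col_1"])
```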

Additional Requirements

These conditions ensure minimum performance. The algorithm will run even if these conditions are not met, but performance is not guaranteed.

index | item        | spec.
----- | ----------- | -----
1     | time column | Time column values should have minimal duplicates. Duplicate time column values can result in unintended deletions if configured to drop them.
2     | NG data     | Lack of NG data may impact performance.
3     | group key   | Each group should have sufficient data. Lack of data in groups may negatively affect performance.
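
Two of these conditions can be inspected before training with a short sketch such as the one below. The function name `summarize` and the `min_rows_per_group` threshold are assumptions: this document does not state how many rows per group count as sufficient, so the default of 30 is a placeholder only.

```python
from collections import Counter


def summarize(rows, time_col, group_col=None, min_rows_per_group=30):
    """Report duplicated time values and groups with few rows."""
    time_counts = Counter(row[time_col] for row in rows)
    duplicated = {t: n for t, n in time_counts.items() if n > 1}
    small_groups = {}
    if group_col is not None:
        group_counts = Counter(row[group_col] for row in rows)
        # The threshold is a placeholder, not a value specified by AD.
        small_groups = {
            g: n for g, n in group_counts.items() if n < min_rows_per_group
        }
    return duplicated, small_groups


# Hypothetical rows: "t1" appears twice and group2 has a single row.
example_rows = [
    {"time_col": "t1", "groupkey": "group1"},
    {"time_col": "t1", "groupkey": "group2"},
    {"time_col": "t2", "groupkey": "group1"},
]
duplicated, small_groups = summarize(
    example_rows, "time_col", "groupkey", min_rows_per_group=2
)
```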


Artifacts

Running training/inference generates the following artifacts.

Train Pipeline

./alo/train_artifacts/
└ models/preprocess/
    └ train_config.pkl
    └ train_pipeline_x.pkl
└ models/train/prep_{x column name}
    └ train_params.pkl
└ output/
    └ train_result.csv
└ extra_output/train/prep_{x column name}
    └ confusion_matrix.jpg (generated if y column is provided)
    └ plot_anomaly_model_best_model.jpg (generated if y column is provided)
    └ plot_anomaly_model_{selected model name}.jpg (generated if y column is provided)
    └ score_compare.jpg (generated if y column is provided)
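
After a training run, the presence of the always-generated files can be checked with a sketch like the one below. It assumes that each file sits inside the folder listed directly above it in the tree, and that `prep_{x column name}` expands to `prep_` followed by the x column name; both are interpretations of the listing, and the function names are hypothetical.

```python
from pathlib import Path


def expected_train_artifacts(x_col, root="./alo/train_artifacts"):
    """Build the paths of the train artifacts that do not depend on a y column."""
    base = Path(root)
    return [
        base / "models" / "preprocess" / "train_config.pkl",
        base / "models" / "preprocess" / "train_pipeline_x.pkl",
        base / "models" / "train" / f"prep_{x_col}" / "train_params.pkl",
        base / "output" / "train_result.csv",
    ]


def missing_artifacts(paths):
    """Return the paths that do not exist on disk."""
    return [path for path in paths if not path.exists()]
```

The JPG files are left out of the list because they are generated only if a y column is provided.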

Inference Pipeline

./alo/inference_artifacts/
└ output/inference/
    └ output.csv
└ extra_output/inference/prep_{x column name}
    └ confusion_matrix.jpg (generated if y column is provided)
    └ plot_anomaly_model_best_model.jpg (generated if y column is provided)
    └ plot_anomaly_model_{selected model name}.jpg (generated if y column is provided)
    └ score_compare.jpg (generated if y column is provided)
└ score/
    └ inference_summary.yaml

The detailed descriptions of each artifact are as follows.

train_config.pkl

A pickle file containing the arguments of the preprocess asset.

train_pipeline_x.pkl

A pickle file that stores the model for preprocessing x columns in the preprocess asset.

train_params.pkl

A pickle file that stores the model trained by the train asset.

train_result.csv

A CSV file that stores the results after the train pipeline is completed.

confusion_matrix.jpg

A JPG file plotting the confusion matrix of the model(s) using the train data, generated if the y column is provided.

plot_anomaly_model_best_model.jpg

A JPG file plotting the anomaly detection results of the best-performing model using the train data, generated if the y column is provided.

plot_anomaly_model_{selected model name}.jpg

A JPG file plotting the anomaly detection results of the selected model using the train data, generated if the y column is provided.

score_compare.jpg

A JPG file comparing the performance results of the models using the train data, generated if the y column is provided.

output.csv

A CSV file that stores the inference results.

inference_summary.yaml

A YAML file summarizing the inference results. It contains the information displayed in Mellerikat's edge viewer: date, file_path, note, probability, result, score, and version.



AD Version: 2.0.1