Version: Next

AD Input and Artifacts

Updated 2024.05.17

Data Preparation

Preparing Training Data

Prepare a CSV file in which each point (variable) you want to check for anomalies is a column.
All CSV files must have a time column to identify each row.
If the time column contains duplicate values, you can configure them to be dropped. If duplicates are kept, additional columns are needed to distinguish the rows.
Label columns are optional. If labels are used, every x column must have a label column.
To model data by group, include a column that identifies the group of each row.

Example of Training Dataset

time_col	x_col_1	x_col_2	groupkey
time 1	value 1_1	value 1_2	group1
time 2	value 2_1	value 2_2	group2
time 3	value 3_1	value 3_2	group1
...	...	...	...
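
As a rough illustration of the layout above, the following sketch writes a small training CSV using only the Python standard library. The column names are taken from the example table; the timestamp format and all values are made up for illustration, and `build_training_csv` is a hypothetical helper, not part of AD or ALO.

```python
import csv
import io

# Column names copied from the example table; the values below are made up.
FIELDNAMES = ["time_col", "x_col_1", "x_col_2", "groupkey"]


def build_training_csv(rows):
    """Serialize a list of row dicts to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


sample_rows = [
    {"time_col": "2024-05-17 00:00:00", "x_col_1": 0.5, "x_col_2": 1.25, "groupkey": "group1"},
    {"time_col": "2024-05-17 00:01:00", "x_col_1": 0.75, "x_col_2": 1.5, "groupkey": "group2"},
    {"time_col": "2024-05-17 00:02:00", "x_col_1": 0.25, "x_col_2": 1.0, "groupkey": "group1"},
]
csv_text = build_training_csv(sample_rows)
print(csv_text)
```

Writing the text to a file such as `train_data1.csv` in the training folder would give one input file in the shape described above.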

Example of Input Data Directory Structure

ALO requires separate training and inference files, so prepare the two datasets in separate folders.
All files in a folder are combined into a single dataframe for modeling, including files in subfolders.
All files in a folder must have the same columns.

./{train_folder}/
    └ train_data1.csv
    └ train_data2.csv
    └ train_data3.csv
./{inference_folder}/
    └ inference_data1.csv
    └ inference_data2.csv
    └ inference_data3.csv
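
The combining behavior described above can be approximated before handing a folder to ALO, which is a cheap way to catch mismatched columns early. This is only a sketch under assumptions: `load_folder` is a hypothetical helper, it returns plain row dicts rather than a dataframe, and it treats "same columns" as the same set of column names regardless of order.

```python
import csv
import tempfile
from pathlib import Path


def load_folder(folder):
    """Combine every CSV under `folder`, subfolders included, into one row list."""
    header = None
    rows = []
    for path in sorted(Path(folder).rglob("*.csv")):
        with path.open(newline="") as handle:
            reader = csv.DictReader(handle)
            if header is None:
                header = reader.fieldnames
            elif set(reader.fieldnames or []) != set(header):
                raise ValueError(f"{path}: columns differ from {header}")
            rows.extend(reader)
    return header, rows


# Demo on a throwaway folder with one file at the top level and one in a subfolder.
with tempfile.TemporaryDirectory() as tmp:
    train_folder = Path(tmp)
    (train_folder / "sub").mkdir()
    (train_folder / "train_data1.csv").write_text("time_col,x_col_1\nt1,1.0\nt2,2.0\n")
    (train_folder / "sub" / "train_data2.csv").write_text("time_col,x_col_1\nt3,3.0\n")
    header, rows = load_folder(train_folder)

print(header, len(rows))
```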

Data Requirements

Mandatory Requirements

Input data must meet the following conditions.

index	item	spec.
1	Number of x columns	At least 1
2	Number of time columns	1
3	Data type of x columns	float
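
The three mandatory conditions lend themselves to a quick pre-flight check. The sketch below is one possible way to do it on rows read with `csv.DictReader`; `find_violations` and its messages are hypothetical, and treating "float" as "parseable by Python's `float()`" is an assumption (how AD handles empty or NaN cells is not stated here).

```python
def find_violations(header, rows, time_col, x_cols):
    """Return human-readable messages for each mandatory requirement that fails."""
    violations = []
    if len(x_cols) < 1:
        violations.append("at least one x column is required")
    if header.count(time_col) != 1:
        violations.append(f"time column {time_col!r} must appear exactly once")
    for col in x_cols:
        if col not in header:
            violations.append(f"x column {col!r} is missing")
            continue
        # Line 1 is the header, so the first data row is line 2.
        for line_no, row in enumerate(rows, start=2):
            try:
                float(row[col])
            except (TypeError, ValueError):
                violations.append(
                    f"x column {col!r} has a non-float value on line {line_no}: {row[col]!r}"
                )
                break  # one message per column is enough
    return violations


header = ["time_col", "x_col_1", "x_col_2"]
good_rows = [{"time_col": "t1", "x_col_1": "1.0", "x_col_2": "2.5"}]
bad_rows = good_rows + [{"time_col": "t2", "x_col_1": "abc", "x_col_2": "3.0"}]

good_result = find_violations(header, good_rows, "time_col", ["x_col_1", "x_col_2"])
bad_result = find_violations(header, bad_rows, "time_col", ["x_col_1", "x_col_2"])
print(good_result)
print(bad_result)
```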

Additional Requirements

These conditions are needed for a minimum level of performance. The algorithm still runs when they are not met, but performance is not guaranteed.

index	item	spec.
1	time column	Time column values should contain as few duplicates as possible. If duplicates are configured to be dropped, rows with duplicate time values can be deleted unintentionally.
2	NG data	A lack of NG data may degrade performance.
3	group key	Each group should have sufficient data. Groups with too little data may negatively affect performance.
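
Two of these recommendations can be checked mechanically: duplicate time values and thin groups. The sketch below counts both with `collections.Counter`. The helper names are hypothetical, and `min_rows` is an arbitrary illustrative threshold, since this page does not define how much data per group is "sufficient".

```python
from collections import Counter


def duplicate_times(rows, time_col):
    """Map each time value that occurs more than once to its count."""
    counts = Counter(row[time_col] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


def small_groups(rows, group_col, min_rows):
    """Map each group with fewer than `min_rows` rows to its row count."""
    counts = Counter(row[group_col] for row in rows)
    return {group: n for group, n in counts.items() if n < min_rows}


rows = [
    {"time_col": "t1", "groupkey": "group1"},
    {"time_col": "t1", "groupkey": "group1"},
    {"time_col": "t2", "groupkey": "group2"},
]
dupes = duplicate_times(rows, "time_col")
thin = small_groups(rows, "groupkey", min_rows=2)
print(dupes)
print(thin)
```

A non-empty `dupes` result is a hint to review the duplicate-dropping setting before training, because dropping would remove some of those rows.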

Artifacts

Running training/inference generates the following artifacts.

Train Pipeline

./alo/train_artifacts/
    └ models/preprocess/
        └ train_config.pkl
        └ train_pipeline_x.pkl
    └ models/train/prep_{x column name}
        └ train_params.pkl
    └ output/
        └ train_result.csv
    └ extra_output/train/prep_{x column name}
        └ confusion_matrix.jpg (generated if y column is provided)
        └ plot_anomaly_model_best_model.jpg (generated if y column is provided)
        └ plot_anomaly_model_{selected model name}.jpg (generated if y column is provided)
        └ score_compare.jpg (generated if y column is provided)
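
For scripts that collect these files after a run, the tree above can be turned into paths with `pathlib`. This is a sketch only: `train_artifact_paths` is a hypothetical helper, and it assumes the per-column folder is literally `prep_` followed by the x column name, as the listing suggests.

```python
from pathlib import Path


def train_artifact_paths(x_col, root="./alo/train_artifacts"):
    """Build the expected train artifact paths for one x column."""
    root = Path(root)
    prep = f"prep_{x_col}"  # assumed folder naming, taken from the tree above
    return {
        "train_config": root / "models" / "preprocess" / "train_config.pkl",
        "train_pipeline_x": root / "models" / "preprocess" / "train_pipeline_x.pkl",
        "train_params": root / "models" / "train" / prep / "train_params.pkl",
        "train_result": root / "output" / "train_result.csv",
        "extra_output_dir": root / "extra_output" / "train" / prep,
    }


paths = train_artifact_paths("x_col_1")
for name, path in paths.items():
    print(name, path.as_posix())
```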

Inference Pipeline

./alo/inference_artifacts/
    └ output/inference/
        └ output.csv
    └ extra_output/inference/prep_{x column name}
        └ confusion_matrix.jpg (generated if y column is provided)
        └ plot_anomaly_model_best_model.jpg (generated if y column is provided)
        └ plot_anomaly_model_{selected model name}.jpg (generated if y column is provided)
        └ score_compare.jpg (generated if y column is provided)
    └ score/
        └ inference_summary.yaml

The detailed descriptions of each artifact are as follows.

train_config.pkl

A pickle file containing the arguments of the preprocess asset.

train_pipeline_x.pkl

A pickle file that stores the model for preprocessing x columns in the preprocess asset.

train_params.pkl

A pickle file that stores the trained model produced by the train asset.
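
The three pickle artifacts can be inspected with Python's standard `pickle` module. The sketch below shows the loading pattern on a stand-in file, because the actual contents of the AD artifacts are not documented on this page; `load_artifact` is a hypothetical helper. Two general caveats apply: unpickling executes code, so only load files you trust, and loading a real artifact typically requires the classes it was saved with to be importable in the current environment.

```python
import pickle
import tempfile
from pathlib import Path


def load_artifact(path):
    """Unpickle one artifact file. Only use this on files you trust."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


# Round-trip demo with a stand-in dict; real artifact contents will differ.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "train_config.pkl"
    demo_path.write_bytes(pickle.dumps({"example_arg": 1}))
    loaded = load_artifact(demo_path)

print(loaded)
```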

train_result.csv

A CSV file that stores the results after the train pipeline is completed.

confusion_matrix.jpg

A JPG file plotting the confusion matrix of the model(s) using the train data, generated if the y column is provided.

plot_anomaly_model_best_model.jpg

A JPG file plotting the anomaly detection results of the best-performing model using the train data, generated if the y column is provided.

plot_anomaly_model_{selected model name}.jpg

A JPG file plotting the anomaly detection results of the selected model using the train data, generated if the y column is provided.

score_compare.jpg

A JPG file comparing the performance results of the models using the train data, generated if the y column is provided.

output.csv

A CSV file that stores the inference results.

inference_summary.yaml

A YAML file summarizing the inference results. It contains the information displayed in Mellerikat's edge viewer, including date, file_path, note, probability, result, score, and version.
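
Once the summary has been parsed into a mapping (for example with a third-party YAML parser such as PyYAML's `yaml.safe_load`), a small check can confirm that the fields named above are present. This sketch assumes the fields are top-level keys, which this page does not confirm; `missing_fields` is a hypothetical helper and the sample mapping uses placeholder values.

```python
# Field names listed in the description above; the real file may hold more.
DOCUMENTED_FIELDS = ("date", "file_path", "note", "probability", "result", "score", "version")


def missing_fields(summary):
    """Return the documented fields that are absent from a parsed summary mapping."""
    return [name for name in DOCUMENTED_FIELDS if name not in summary]


# Stand-in mapping with placeholder values; a real one would come from a YAML parser.
partial_summary = {"date": "<date>", "result": "<result>", "score": "<score>"}
absent = missing_fields(partial_summary)
print(absent)
```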

AD Version: 2.0.1
