Version: docs v25.02

AD Input and Artifacts

Updated 2024.05.17

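Data Preparation

Preparing training data

Prepare a csv file in which each point to be checked for anomalies is a column.
Every csv file must contain a time column that identifies each row.
If values in the time column are duplicated, you can configure the duplicates to be dropped. If you do not drop them, separate columns that uniquely identify each row must exist.
The label column is optional. If labels are provided, there must be one for every x column.
When grouping the data, a column to group by must exist.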

Training dataset example

time_col	x_col_1	x_col_2	groupkey
time 1	value 1_1	value 1_2	group1
time 2	value 2_1	value 2_2	group2
time 3	value 3_1	value 3_2	group1
...	...	...	...

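Input data directory structure example

To use ALO, the train and inference files must be separated. Separate the data used for training from the data used for inference as shown below.
All files under a single folder are collected by the input asset into one dataframe, which is then used for modeling. (Files in subfolders under the path are merged as well.)
All data in a single folder must have identical columns.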

./{train_folder}/
    └ train_data1.csv
    └ train_data2.csv
    └ train_data3.csv
./{inference_folder}/
    └ inference_data1.csv
    └ inference_data2.csv
    └ inference_data3.csv
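
The merging behaviour described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the actual input asset implementation. The function name `load_folder`, the strict comparison of column order, and the choice to raise an error on mismatched columns are all assumptions made for this example.

```python
import csv
from pathlib import Path


def load_folder(folder):
    """Collect rows from every CSV under `folder`, including subfolders.

    Hypothetical helper: the real input asset may behave differently.
    """
    rows, header = [], None
    # rglob also walks subdirectories; sorting gives a stable file order.
    for path in sorted(Path(folder).rglob("*.csv")):
        with path.open(newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            columns = reader.fieldnames
            if header is None:
                header = columns
            elif columns != header:
                # Assumption: mismatched columns are treated as an error.
                raise ValueError(f"{path} has columns {columns}, expected {header}")
            rows.extend(reader)
    return header, rows
```

Calling `load_folder("./{train_folder}")` with the real folder name would return the shared header and one dict per row across all files, which makes a column mismatch visible before the pipeline is run.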

Data Requirements

Mandatory requirements

Input data must satisfy the following conditions.

index	item	spec.
1	Number of x columns	1 or more
2	Number of time columns	1
3	Type of x columns	float
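
The three mandatory conditions can be checked before running the pipeline. The sketch below is a hedged example built on the standard library only; `check_required` is a hypothetical helper and not part of ALO or AD. It also assumes that an empty or non-numeric cell counts as "not a float", which the table above does not state.

```python
import csv
import io


def check_required(csv_text, time_col, x_cols):
    """Return a list of violations of the three mandatory requirements.

    Hypothetical helper; column names are supplied by the caller.
    """
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    if len(x_cols) < 1:
        problems.append("at least one x column is required")
    if header.count(time_col) != 1:
        problems.append(f"expected exactly one '{time_col}' column")
    missing = [c for c in x_cols if c not in header]
    if missing:
        problems.append(f"missing x columns: {missing}")
    # Data starts on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        for col in x_cols:
            if col in missing:
                continue
            try:
                float(row[col])
            except (TypeError, ValueError):
                # Assumption: empty or non-numeric cells violate the float rule.
                problems.append(f"line {line_no}: {col}={row[col]!r} is not a float")
    return problems
```

An empty list means the file passed all three checks under these assumptions.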

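Additional requirements

These conditions are intended to ensure a minimum level of performance. The algorithm still runs if they are not met, but its performance has not been verified in that case.

index	item	spec.
1	time column	Values in the time column should contain as few duplicates as possible. If duplicate time values exist and dropping duplicates is enabled, rows may be deleted unintentionally.
2	NG data	If there is no NG data at all, performance may be affected.
3	group key	Each group must contain enough data. Otherwise, performance may be degraded.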
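
To see how many rows are at risk from duplicate time values, and whether each group has enough data, a quick count helps. The sketch below is illustrative only: `summarize` is a hypothetical helper, and it assumes that dropping duplicates keeps one row per time value, which this document does not specify.

```python
from collections import Counter


def summarize(rows, time_col, group_col=None):
    """Count duplicated time values and rows per group.

    `rows` is a list of dicts (for example from csv.DictReader).
    Hypothetical helper for illustration only.
    """
    time_counts = Counter(row[time_col] for row in rows)
    # Assumption: dropping duplicates keeps one row per time value,
    # so every extra occurrence is a row that would be removed.
    duplicate_rows = sum(n - 1 for n in time_counts.values())
    group_sizes = Counter(row[group_col] for row in rows) if group_col else Counter()
    return duplicate_rows, dict(group_sizes)
```

What counts as "enough" data per group is not defined here, so the returned group sizes are meant for manual inspection rather than an automatic pass or fail.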

Artifacts

Running training/inference produces the following artifacts.

Train pipeline

./alo/train_artifacts/
    └ models/preprocess/
        └ train_config.pkl
        └ train_pipeline_x.pkl
    └ models/train/prep_{x column name}
        └ train_params.pkl
    └ output/
        └ train_result.csv
    └ extra_output/train/prep_{x column name}
        └ confusion_matrix.jpg (generated when a y column is provided)
        └ plot_anomaly_model_best_model.jpg (generated when a y column is provided)
        └ plot_anomaly_model_{selected model name}.jpg (generated when a y column is provided)
        └ score_compare.jpg (generated when a y column is provided)

Inference pipeline

./alo/inference_artifacts/
    └ output/inference/
        └ output.csv
    └ extra_output/inference/prep_{x column name}
        └ confusion_matrix.jpg (generated when a y column is provided)
        └ plot_anomaly_model_best_model.jpg (generated when a y column is provided)
        └ plot_anomaly_model_{selected model name}.jpg (generated when a y column is provided)
        └ score_compare.jpg (generated when a y column is provided)
    └ score/
        └ inference_summary.yaml
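
After a training run, it can be useful to confirm that the unconditional train artifacts exist. The sketch below builds the paths from the train tree above for a single x column. Both functions are hypothetical helpers, and the sketch assumes the `prep_{...}` folder name is literally `prep_` followed by the x column name.

```python
from pathlib import Path


def expected_train_artifacts(x_col, root="./alo/train_artifacts"):
    """Build the always-generated train artifact paths for one x column.

    Paths follow the train pipeline tree; the function itself is hypothetical.
    """
    base = Path(root)
    return [
        base / "models" / "preprocess" / "train_config.pkl",
        base / "models" / "preprocess" / "train_pipeline_x.pkl",
        # Assumption: the folder is named "prep_" + the x column name.
        base / "models" / "train" / f"prep_{x_col}" / "train_params.pkl",
        base / "output" / "train_result.csv",
    ]


def missing_artifacts(paths):
    """Return the paths that do not exist on disk."""
    return [p for p in paths if not p.exists()]
```

The jpg files are left out on purpose because they are only generated when a y column is provided.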

Each artifact is described in detail below.

train_config.pkl

A pickle file containing the arguments of the preprocess asset.

train_pipeline_x.pkl

A pickle file storing the model that preprocesses the x columns in the preprocess asset.

train_params.pkl

A pkl file storing the model after training is completed in the train asset.

train_result.csv

A csv file storing the results after the train pipeline finishes.

confusion_matrix.jpg

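When a y column is provided, a jpg file plotting the confusion matrix of the model(s) on the train data.

plot_anomaly_model_best_model.jpg

A jpg file plotting the anomaly detection results on the train data from the model with the highest score among the model(s).

plot_anomaly_model_{selected model name}.jpg

A jpg file plotting the anomaly detection results on the train data from the selected model among the model(s).

score_compare.jpg

When a y column is provided, a jpg file plotting a comparison of the models' performance on the train data.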

output.csv

A csv file storing the inference results.

inference_summary.yaml

Summary information about the inference result. This information is displayed in the edge viewer of Mellerikat. It consists of date, file_path, note, probability, result, score, and version.
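
The seven fields listed above can be verified once the summary has been loaded into a Python mapping. The sketch below assumes the YAML was already parsed into a dict, for example with PyYAML's `yaml.safe_load`. `check_summary` is a hypothetical helper, and treating the fields as top-level keys is an assumption.

```python
# Field names taken from the description of inference_summary.yaml.
SUMMARY_KEYS = {"date", "file_path", "note", "probability", "result", "score", "version"}


def check_summary(summary):
    """Return (missing, unexpected) key sets for a loaded summary mapping.

    Hypothetical helper; assumes the fields are top-level keys.
    """
    keys = set(summary)
    return SUMMARY_KEYS - keys, keys - SUMMARY_KEYS
```

Two empty sets mean the mapping contains exactly the documented fields.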

AD Version: 2.0.1
