FCST Input and Artifacts
Data Preparation
Preparing Training Data
- Prepare the train and inference datasets in .csv format with identical columns. The train set can be a single time series or a collection of multiple time series. If it is a collection of multiple time series, a groupkey_column is required to distinguish each time series.
- When do_validation is performed, Forecasting divides the train set into train and validation sets and performs cross-validation. Users should therefore prepare data long enough to cover input_chunk_length (length of the input time series used for prediction), forecast_periods (length of the time series to predict), and cv_numbers (number of cross-validations).
- The inference set can also be a single time series or a collection of multiple time series. The columns of the inference set must be identical to those of the train set used for model training.
- The prediction period is given by forecast_periods, and the prediction start point is generally the point immediately following the final time point in the inference set. To forecast from the end of the train set, use the train set as the inference set. To start predictions from the maximum time index in each group, use the global_padding_interpolation feature in the bizpreprocess asset to pad the data and then proceed.
- The default value for time_format is "%Y-%m-%d". If this is modified, it must be reflected in the time_column through the readiness asset. Refer to the format codes here.
- To run main.py or a Jupyter notebook, prepare the training data in the following paths:
  - Train set file path: ./sample_data/train_input_path/
  - Inference set file path: ./sample_data/inference_input_path/
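As a rough pre-flight check for the points above, the sketch below verifies that the train and inference files share the same columns and that each group is long enough for cross-validation. The minimum-length formula (input_chunk_length + forecast_periods * cv_numbers) is an assumption made for illustration, not the exact rule Forecasting applies, and the helper names are hypothetical.

```python
import csv
import io
from collections import Counter

def read_rows(csv_text):
    """Parse CSV text into (header, list of row dicts)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return reader.fieldnames, rows

def min_required_length(input_chunk_length, forecast_periods, cv_numbers):
    # Assumption: each CV fold holds out forecast_periods points and the
    # earliest fold still needs input_chunk_length points of history.
    return input_chunk_length + forecast_periods * cv_numbers

def short_groups(rows, groupkey_column, required):
    """Return the groups whose series has fewer rows than `required`."""
    counts = Counter(row[groupkey_column] for row in rows)
    return {group: n for group, n in counts.items() if n < required}

train_csv = """time,Sales,Franchise
2023-03-10,100,KyoChon
2023-03-11,150,KyoChon
2023-03-12,120,BHX
"""
inference_csv = """time,Sales,Franchise
2024-03-10,95,KyoChon
"""

train_cols, train_rows = read_rows(train_csv)
inference_cols, _ = read_rows(inference_csv)

# Same column names in the same order.
same_columns = train_cols == inference_cols
required = min_required_length(input_chunk_length=3, forecast_periods=3, cv_numbers=2)
too_short = short_groups(train_rows, "Franchise", required)
print(same_columns, required, too_short)
```

With real files, the same functions can be fed the contents of the .csv files placed under the train and inference input paths.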
Input Data Directory Structure
./train_folder/
└ train_data1.csv
└ train_data2.csv
└ train_data3.csv
./inference_folder/
└ data_a.csv
└ data_b.csv
Example train_sample.csv
Consider a model that forecasts sales based on average temperature and other information such as franchise and food type. Parameters are as follows, and the example shows a model that learns for 3 days (input_chunk_length) and predicts for 3 days (forecast_periods).
- y_column: Sales
- time_column: time
- time_format: "%Y-%m-%d"
- sample_frequency: daily
- groupkey_column: Franchise
- static_covariates: Food
- x_covariates: Average Temperature
- input_chunk_length: 3
- forecast_periods: 3
time | Sales | Franchise | Food | Average Temperature |
---|---|---|---|---|
2023-03-10 | 100 | KyoChon | Chicken | 0 |
2023-03-11 | 150 | KyoChon | Chicken | -2 |
2023-03-12 | 120 | BHX | Chicken | 5 |
... | ... | ... | ... | ... |
2024-03-12 | 80 | McDonald's | Burger | 4 |
Example inference_sample.csv
Since input_chunk_length is 3, at least 3 days of data are required. Because forecast_periods is 3, data will be generated for 3/13 - 3/15.
time | Sales | Franchise | Food | Average Temperature |
---|---|---|---|---|
2024-03-10 | 95 | KyoChon | Chicken | 5 |
2024-03-11 | 100 | KyoChon | Chicken | 10 |
2024-03-12 | 150 | KyoChon | Chicken | 8 |
... | ... | ... | ... | ... |
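To make the relationship between the last inference timestamp and the forecast window concrete, here is a minimal sketch that derives the forecast dates for the example above. It assumes a daily sample_frequency; the function name is hypothetical and is not part of Forecasting.

```python
from datetime import datetime, timedelta

def forecast_dates(last_time, time_format, forecast_periods):
    """Return the daily timestamps that immediately follow `last_time`."""
    last = datetime.strptime(last_time, time_format)
    return [
        (last + timedelta(days=step)).strftime(time_format)
        for step in range(1, forecast_periods + 1)
    ]

# Last row of the inference example is 2024-03-12 and forecast_periods is 3.
dates = forecast_dates("2024-03-12", "%Y-%m-%d", forecast_periods=3)
print(dates)  # ['2024-03-13', '2024-03-14', '2024-03-15']
```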
Data Requirements
Mandatory Requirements
The input data must meet the following conditions to avoid requiring reprocessing:
Index | Item | Specification |
---|---|---|
1 | Presence of Missing Values | There must be no missing values in any variable. The process will not function correctly with missing values. |
2 | Presence of Missing Times | There must be no missing times. While interpolation can auto-fill missing values, there should be no gaps for accurate prediction. |
3 | Presence of Duplicate Times | Each group must not have duplicate times. For example, if there are two records for 3/14 in Group A, one will be deleted. |
4 | New Group Keys or Categories | There must be no new group keys or category values at the inference stage. This could cause malfunctions with the Label Encoder. |
5 | Training Data Size | Training may not function correctly with data exceeding 1 GB. |
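The first four requirements can be checked before running the pipeline. The following is a minimal sketch using only the Python standard library, assuming daily data and rows already loaded as dictionaries (for example with csv.DictReader). The function names and the sample rows, including the group key NewBrand, are hypothetical and not part of Forecasting.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def check_mandatory(rows, time_column, groupkey_column, time_format="%Y-%m-%d"):
    """Check requirements 1-3 on a list of row dicts with daily timestamps."""
    problems = {"missing_values": [], "missing_times": [], "duplicate_times": []}
    times_by_group = defaultdict(list)
    for index, row in enumerate(rows):
        # Requirement 1: no empty cell in any variable.
        if any(value is None or str(value).strip() == "" for value in row.values()):
            problems["missing_values"].append(index)
        if row.get(time_column):
            parsed = datetime.strptime(row[time_column], time_format)
            times_by_group[row[groupkey_column]].append(parsed)
    for group, times in times_by_group.items():
        unique = sorted(set(times))
        # Requirement 3: no duplicate timestamps within a group.
        if len(unique) < len(times):
            problems["duplicate_times"].append(group)
        # Requirement 2: consecutive timestamps are exactly one day apart.
        if any(later - earlier != timedelta(days=1)
               for earlier, later in zip(unique, unique[1:])):
            problems["missing_times"].append(group)
    return problems

def unseen_groups(train_rows, inference_rows, groupkey_column):
    """Requirement 4: group keys in inference that never appeared in training."""
    seen = {row[groupkey_column] for row in train_rows}
    return sorted({row[groupkey_column] for row in inference_rows} - seen)

train = [
    {"time": "2024-03-10", "Sales": "95", "Franchise": "KyoChon"},
    {"time": "2024-03-12", "Sales": "150", "Franchise": "KyoChon"},  # 3/11 is missing
    {"time": "2024-03-10", "Sales": "", "Franchise": "BHX"},         # empty Sales
    {"time": "2024-03-10", "Sales": "80", "Franchise": "BHX"},       # duplicate time
]
inference = [{"time": "2024-03-13", "Sales": "70", "Franchise": "NewBrand"}]

problems = check_mandatory(train, "time", "Franchise")
new_groups = unseen_groups(train, inference, "Franchise")
print(problems)
print(new_groups)
```

The same unseen-key check can be applied to any categorical column, such as a static covariate, by passing that column name instead of the group key.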
Additional Requirements
The following conditions are ideal for stable performance. Meeting these conditions increases the likelihood of higher-quality predictions.
Index | Condition | Purpose |
---|---|---|
1 | Seasonal patterns should ideally repeat 2-3 times within the training data. | Prevents underfitting |
2 | The distribution of data used for prediction should be similar to the training data. | Avoids data shift |
3 | There should be no missing data points in the prediction dataset. | Prevents prediction bias due to interpolation |
4 | Nominal variables should not have too many unique values and should appear frequently enough. | Avoids curse of dimensionality |
5 | The dependent variable should ideally follow a normal distribution. | Assumption of linear regression |
6 | There should be minimal similarity between independent variables. | Avoids multicollinearity |
7 | Causal relationships between independent and dependent variables should be clear. | Causal information |
8 | Exclude unnecessary independent variables. | Avoids spurious correlations |
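Conditions 1 and 4 lend themselves to a quick numeric check. The sketch below counts how many full seasonal cycles the training data covers and lists categories that appear too rarely. The threshold min_count and the season length of 7 are illustrative assumptions, not values defined by Forecasting, and the function names are hypothetical.

```python
from collections import Counter

def seasonal_cycles(n_observations, season_length):
    """Number of full seasonal cycles covered by the training data."""
    return n_observations // season_length

def rare_categories(values, min_count):
    """Categories of a nominal variable that appear fewer than `min_count` times."""
    counts = Counter(values)
    return sorted(category for category, n in counts.items() if n < min_count)

# 30 daily observations with an assumed weekly season cover 4 full cycles.
cycles = seasonal_cycles(n_observations=30, season_length=7)

# "Burger" appears only once, so it is flagged with min_count=2.
food = ["Chicken", "Chicken", "Chicken", "Burger"]
rare = rare_categories(food, min_count=2)
print(cycles, rare)
```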
Artifacts
Executing training/inference generates the following artifacts.
Train Pipeline
./fcst/train_artifacts/
└ models/
└ input/
└ train_config.json
└ readiness/
└ readiness_config.json
└ bizpreprocess/
└ preprocess_config.json
└ preprocess_scaler.pkl
└ preprocess_categorical_encoder.pkl
└ train/
└ train_config.json
└ trained_nbeats.pt
└ trained_nbeats.pt.ckpt
└ output/
└ train_score.csv
└ extra_output/
└ CV{k}.csv
Inference Pipeline
./fcst/inference_artifacts/
└ output/
└ nbeats_prediction_result.csv
└ score/
└ inference_summary.yaml
Each artifact is described in detail below.
trained_nbeats.pt
The trained model.
trained_nbeats.pt.ckpt
The checkpoint of the trained model.
train_score.csv
The performance of cross-validation is saved in this file.
CV | train_period | valid_period | mape | Franchise |
---|---|---|---|---|
1 | 2022-01-01 ~ 2023-12-31 | 2024-01-01 ~ 2024-01-31 | 0.6 | KyoChon |
2 | 2022-01-01 ~ 2023-11-30 | 2023-12-01 ~ 2023-12-31 | 0.8 | KyoChon |
3 | 2022-01-01 ~ 2023-10-31 | 2023-11-01 ~ 2023-11-30 | 0.4 | KyoChon |
... | ... | ... | ... | ... |
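One way to summarize this file is to average the per-fold scores for each group. The sketch below assumes the file is comma-separated with the column names shown in the table; only the columns needed for the calculation are included in the sample text, and the function name is hypothetical.

```python
import csv
import io
from collections import defaultdict

def mean_mape_by_group(csv_text, group_column="Franchise"):
    """Average the mape column over all CV folds for each group."""
    scores = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        scores[row[group_column]].append(float(row["mape"]))
    return {group: sum(values) / len(values) for group, values in scores.items()}

# Subset of the columns from the table above (period columns omitted).
sample = """CV,mape,Franchise
1,0.6,KyoChon
2,0.8,KyoChon
3,0.4,KyoChon
"""

summary = mean_mape_by_group(sample)
print({group: round(value, 4) for group, value in summary.items()})
```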
CV{k}.csv
The actual and predicted values for each CV are stored in this file.
timestamp | Sales | predicted | Franchise |
---|---|---|---|
2024-02-29 | 95 | 92 | KyoChon |
2024-02-29 | 100 | 84 | KyoChon |
2024-02-29 | 150 | 149 | KyoChon |
... | ... | ... | ... |
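The actual and predicted columns are enough to recompute an error metric for a fold. The following sketch computes MAPE as a fraction (mean of |actual - predicted| / |actual|) from the three rows shown above. Whether Forecasting reports mape as a fraction or a percentage is not stated here, so the scale is an assumption, and the function name is hypothetical.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned as a fraction.

    Rows with an actual value of zero are skipped to avoid division by zero.
    """
    ratios = [
        abs(a - p) / abs(a)
        for a, p in zip(actual, predicted)
        if a != 0
    ]
    return sum(ratios) / len(ratios)

sales = [95, 100, 150]       # Sales column
predicted = [92, 84, 149]    # predicted column
fold_mape = mape(sales, predicted)
print(round(fold_mape, 4))
```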
nbeats_prediction_result.csv
The results predicted by the trained model.
Example of nbeats_prediction_result.csv
timestamp | targetdate | Sales | Franchise |
---|---|---|---|
2024-02-29 | 2024-03-01 | 95 | KyoChon |
2024-02-29 | 2024-03-02 | 100 | KyoChon |
2024-02-29 | 2024-03-03 | 150 | KyoChon |
... | ... | ... | ... |
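As a small reading aid for this file, the sketch below computes how many days ahead each prediction lies. It assumes that timestamp is the last observed time point (the forecast origin) and targetdate is the date being predicted, which is an interpretation of the example rows rather than a documented definition. The function name is hypothetical.

```python
from datetime import datetime

def horizons(rows, time_format="%Y-%m-%d"):
    """Return (Franchise, targetdate, days ahead, predicted Sales) tuples."""
    result = []
    for row in rows:
        origin = datetime.strptime(row["timestamp"], time_format)
        target = datetime.strptime(row["targetdate"], time_format)
        days_ahead = (target - origin).days
        result.append((row["Franchise"], row["targetdate"], days_ahead, float(row["Sales"])))
    return result

rows = [
    {"timestamp": "2024-02-29", "targetdate": "2024-03-01", "Sales": "95", "Franchise": "KyoChon"},
    {"timestamp": "2024-02-29", "targetdate": "2024-03-02", "Sales": "100", "Franchise": "KyoChon"},
    {"timestamp": "2024-02-29", "targetdate": "2024-03-03", "Sales": "150", "Franchise": "KyoChon"},
]

result = horizons(rows)
for franchise, targetdate, days_ahead, sales in result:
    print(franchise, targetdate, days_ahead, sales)
```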
inference_summary.yaml
This file contains a summary of the inference. Currently, it is saved with default values.
Forecasting Version: 2.1.0