Version: Next

FCST Input and Artifacts

Updated 2024.05.05

Data Preparation

Preparing Training Data

Prepare train and inference datasets in .csv format with identical columns. The train set can be a single time series or a collection of multiple time series. If it is a collection of multiple time series, a groupkey_column is required to distinguish each time series.
When do_validation is enabled, Forecasting splits the train set into training and validation subsets and performs cross-validation. The data must therefore be long enough to cover input_chunk_length (the length of the input time series used for prediction), forecast_periods (the length of the time series to predict), and cv_numbers (the number of cross-validation folds).
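As a rough planning aid, the length requirement can be sketched as below. The fold layout is an assumption made for illustration (non-overlapping validation windows of forecast_periods points each, plus one full training window); FCST's actual splitting logic may differ, and min_series_length is a hypothetical helper, not part of FCST.

```python
def min_series_length(input_chunk_length: int, forecast_periods: int, cv_numbers: int) -> int:
    """Rough lower bound on the number of time points needed per series.

    Assumption (not taken from FCST): each cross-validation fold holds out
    `forecast_periods` points, folds do not overlap, and the earliest fold
    still needs one full training window of
    `input_chunk_length + forecast_periods` points.
    """
    if min(input_chunk_length, forecast_periods, cv_numbers) < 1:
        raise ValueError("all arguments must be positive integers")
    training_window = input_chunk_length + forecast_periods
    validation_span = forecast_periods * cv_numbers
    return training_window + validation_span


required = min_series_length(input_chunk_length=3, forecast_periods=3, cv_numbers=3)
print(required)  # 3 + 3 + 3 * 3 = 15
```

Treat the result as a floor, not a target: longer series generally leave more room for each fold to train on.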
The inference set can also be a single time series or a collection of multiple time series. The columns of the inference set must be identical to those of the train set used for model training.
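A quick way to confirm that both files share the same columns before running the pipeline might look like the following. This is a standalone sketch using only the Python standard library; the helper names and the in-memory sample text are illustrative and not part of FCST.

```python
import csv
import io

# In-memory stand-ins for a train file and an inference file (illustrative data).
TRAIN_CSV = "time,Sales,Franchise,Food,Average Temperature\n2023-03-10,100,KyoChon,Chicken,0\n"
INFERENCE_CSV = "time,Sales,Franchise,Food\n2024-03-10,95,KyoChon,Chicken\n"


def read_header(csv_text):
    """Return the column names from the first row of CSV text."""
    return next(csv.reader(io.StringIO(csv_text)))


def column_mismatch(train_columns, inference_columns):
    """Return (missing, unexpected) inference columns relative to the train set."""
    missing = [c for c in train_columns if c not in inference_columns]
    unexpected = [c for c in inference_columns if c not in train_columns]
    return missing, unexpected


missing, unexpected = column_mismatch(read_header(TRAIN_CSV), read_header(INFERENCE_CSV))
print(missing, unexpected)  # ['Average Temperature'] []
```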
The prediction period is given by forecast_periods, and predictions generally start at the point immediately following the final time point in the inference set. To forecast from the end of the train set, use the train set as the inference set. To start predictions from the maximum time index in each group, first pad the data with the global_padding_interpolation feature of the bizpreprocess asset.
The default value for time_format is "%Y-%m-%d". If you change it, the change must be reflected in the time_column through the readiness asset. The format uses strftime-style format codes.
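To check in advance that every value in the time column matches the chosen time_format, a sketch along these lines may help. It relies on Python's datetime.strptime, which uses the same style of format codes; invalid_times is a hypothetical helper and not part of the readiness asset.

```python
from datetime import datetime


def invalid_times(values, time_format="%Y-%m-%d"):
    """Return the values that cannot be parsed with `time_format`."""
    bad = []
    for value in values:
        try:
            datetime.strptime(value, time_format)
        except ValueError:
            bad.append(value)
    return bad


print(invalid_times(["2023-03-10", "2023/03/11", "2023-03-12"]))  # ['2023/03/11']
print(invalid_times(["2023/03/11"], time_format="%Y/%m/%d"))      # []
```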

To run main.py or a Jupyter notebook, place the input data in the following paths:
- Train set file path: ./sample_data/train_input_path/
- Inference set file path: ./sample_data/inference_input_path/

Input Data Directory Structure
```
./train_folder/
  └ train_data1.csv
  └ train_data2.csv
  └ train_data3.csv
./inference_folder/
  └ data_a.csv
  └ data_b.csv
```
Example train_sample.csv

Consider a model that forecasts sales from average temperature and other information such as franchise and food type. The parameters are listed below; in this example the model takes 3 days of input (input_chunk_length) and predicts the next 3 days (forecast_periods).
- y_column: Sales
- time_column: time
- time_format: "%Y-%m-%d"
- sample_frequency: daily
- groupkey_column: Franchise
- static_covariates: Food
- x_covariates: Average Temperature
- input_chunk_length: 3
- forecast_periods: 3

time	Sales	Franchise	Food	Average Temperature
2023-03-10	100	KyoChon	Chicken	0
2023-03-11	150	KyoChon	Chicken	-2
2023-03-12	120	BHX	Chicken	5
...	...	...	...	...
2024-03-12	80	McDonald's	Burger	4

Example inference_sample.csv

Since input_chunk_length is 3, at least 3 days of data are required. Because forecast_periods is 3, predictions are generated for 2024-03-13 through 2024-03-15.

time	Sales	Franchise	Food	Average Temperature
2024-03-10	95	KyoChon	Chicken	5
2024-03-11	100	KyoChon	Chicken	10
2024-03-12	150	KyoChon	Chicken	8
...	...	...	...	...
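The forecast dates in this example follow directly from the last observed date and forecast_periods. A minimal sketch, assuming daily frequency as in the example (forecast_dates is an illustrative helper, not an FCST function):

```python
from datetime import date, timedelta


def forecast_dates(last_observed: date, forecast_periods: int):
    """Dates covered by the forecast, starting the day after `last_observed`."""
    return [last_observed + timedelta(days=step) for step in range(1, forecast_periods + 1)]


dates = forecast_dates(date(2024, 3, 12), forecast_periods=3)
print([d.isoformat() for d in dates])  # ['2024-03-13', '2024-03-14', '2024-03-15']
```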

Data Requirements

Mandatory Requirements

The input data must meet the following conditions; data that violates them must be reprocessed before use.

Index	Item	Specification
1	Presence of Missing Values	There must be no missing values in any variable. The process will not function correctly with missing values.
2	Presence of Missing Times	There must be no missing times. While interpolation can auto-fill missing values, there should be no gaps for accurate prediction.
3	Presence of Duplicate Times	Each group must not have duplicate times. For example, if there are two records for 3/14 in Group A, one will be deleted.
4	New Group Keys or Categories	There must be no new group keys or category values at the inference stage. This could cause malfunctions with the Label Encoder.
5	Training Data Size	Data exceeding 1GB may not function correctly.
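Conditions 1 to 4 can be checked before submitting data. The sketch below is one possible pre-check, written with the standard library only; it assumes daily frequency and ISO dates ("%Y-%m-%d"), and the function names and the "NewBrand" group are hypothetical. It does not reproduce how FCST itself handles these cases.

```python
from collections import Counter
from datetime import date, timedelta


def check_mandatory(rows, time_col="time", group_col="Franchise"):
    """Report missing values, duplicate times, and missing days per group."""
    report = {"missing_values": [], "duplicate_times": [], "missing_times": []}

    # Condition 1: any empty cell in a row counts as a missing value.
    for i, row in enumerate(rows):
        if any(v is None or str(v).strip() == "" for v in row.values()):
            report["missing_values"].append(i)

    # Condition 3: the same (group, time) pair appears more than once.
    counts = Counter((r[group_col], r[time_col]) for r in rows)
    report["duplicate_times"] = sorted(k for k, n in counts.items() if n > 1)

    # Condition 2: days absent between each group's first and last date.
    days_by_group = {}
    for r in rows:
        if r[time_col]:
            days_by_group.setdefault(r[group_col], set()).add(date.fromisoformat(r[time_col]))
    for group, days in days_by_group.items():
        day, last = min(days), max(days)
        while day < last:
            if day not in days:
                report["missing_times"].append((group, day.isoformat()))
            day += timedelta(days=1)
    return report


def new_group_keys(train_rows, inference_rows, group_col="Franchise"):
    """Condition 4: group keys that appear only at inference time."""
    seen = {r[group_col] for r in train_rows}
    return sorted({r[group_col] for r in inference_rows} - seen)


train_rows = [
    {"time": "2024-03-10", "Sales": "95", "Franchise": "KyoChon"},
    {"time": "2024-03-11", "Sales": "100", "Franchise": "KyoChon"},
    {"time": "2024-03-11", "Sales": "101", "Franchise": "KyoChon"},  # duplicate time
    {"time": "2024-03-13", "Sales": "", "Franchise": "KyoChon"},     # missing value; 03-12 is absent
    {"time": "2024-03-10", "Sales": "80", "Franchise": "BHX"},
]
inference_rows = [{"time": "2024-03-14", "Sales": "70", "Franchise": "NewBrand"}]

report = check_mandatory(train_rows)
print(report)
print(new_group_keys(train_rows, inference_rows))  # ['NewBrand']
```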

Additional Requirements

The following conditions are ideal for stable performance. Meeting these conditions increases the likelihood of higher-quality predictions.

Index	Condition	Purpose
1	Seasonal patterns should ideally repeat 2-3 times within the training data.	Prevents underfitting
2	The distribution of data used for prediction should be similar to the training data.	Avoids data shift
3	There should be no missing data points in the prediction dataset.	Prevents prediction bias due to interpolation
4	Nominal variables should not have too many unique values and should appear frequently enough.	Avoids curse of dimensionality
5	The dependent variable should ideally follow a normal distribution.	Assumption of linear regression
6	There should be minimal similarity between independent variables.	Avoids multicollinearity
7	Causal relationships between independent and dependent variables should be clear.	Causal information
8	Exclude unnecessary independent variables.	Avoids spurious correlations
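For condition 6, one simple screening step is to compute the pairwise Pearson correlation between independent variables and review pairs that are close to 1 or -1. The sketch below uses plain Python; the Fahrenheit column is a hypothetical second covariate added only to show a perfectly correlated pair, and no particular threshold is implied by FCST.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


temp_celsius = [0, -2, 5, 4]                                # Average Temperature from the example
temp_fahrenheit = [c * 9 / 5 + 32 for c in temp_celsius]    # hypothetical redundant covariate

r = pearson(temp_celsius, temp_fahrenheit)
print(round(r, 6))  # 1.0
```

A pair this strongly correlated carries the same information twice, so keeping only one of the two columns is usually enough.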

Artifacts

Executing training/inference generates the following artifacts.

Train Pipeline

./fcst/train_artifacts/
  └ models/
    └ input/
      └ train_config.json
    └ readiness/
      └ readiness_config.json
    └ bizpreprocess/
      └ preprocess_config.json
      └ preprocess_scaler.pkl
      └ preprocess_categorical_encoder.pkl
    └ train/
      └ train_config.json
      └ trained_nbeats.pt
      └ trained_nbeats.pt.ckpt
  └ output/
    └ train_score.csv
  └ extra_output/
    └ CV{k}.csv

Inference Pipeline

./fcst/inference_artifacts/
  └ output/
    └ nbeats_prediction_result.csv
  └ score/
    └ inference_summary.yaml

Each artifact is described in detail below.

trained_nbeats.pt

The trained model.

trained_nbeats.pt.ckpt

The checkpoint of the trained model.

train_score.csv

The performance of cross-validation is saved in this file.

CV	train_period	valid_period	mape	Franchise
1	2022-01-01 ~ 2023-12-31	2024-01-01 ~ 2024-01-31	0.6	KyoChon
2	2022-01-01 ~ 2023-11-30	2023-12-01 ~ 2023-12-31	0.8	KyoChon
3	2022-01-01 ~ 2023-10-31	2023-11-01 ~ 2023-11-30	0.4	KyoChon
...	...	...	...	...
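To summarize the cross-validation scores per group, the file can be read with the standard csv module. The sketch below assumes the file is comma-delimited with the column names shown above; the sample text is embedded so the example runs on its own, and mean_mape_by_group is an illustrative helper.

```python
import csv
import io
from collections import defaultdict

# Stand-in for the contents of train_score.csv (assumed comma-delimited).
TRAIN_SCORE_CSV = """CV,train_period,valid_period,mape,Franchise
1,2022-01-01 ~ 2023-12-31,2024-01-01 ~ 2024-01-31,0.6,KyoChon
2,2022-01-01 ~ 2023-11-30,2023-12-01 ~ 2023-12-31,0.8,KyoChon
3,2022-01-01 ~ 2023-10-31,2023-11-01 ~ 2023-11-30,0.4,KyoChon
"""


def mean_mape_by_group(csv_text, group_col="Franchise"):
    """Average the mape column over all CV folds for each group."""
    scores = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        scores[row[group_col]].append(float(row["mape"]))
    return {group: sum(values) / len(values) for group, values in scores.items()}


summary = mean_mape_by_group(TRAIN_SCORE_CSV)
print({group: round(value, 4) for group, value in summary.items()})  # {'KyoChon': 0.6}
```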

CV{k}.csv

The actual and predicted values for each CV are stored in this file.

timestamp	Sales	predicted	Franchise
2024-02-29	95	92	KyoChon
2024-02-29	100	84	KyoChon
2024-02-29	150	149	KyoChon
...	...	...	...
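The actual and predicted columns are enough to recompute an error metric by hand. The sketch below computes the mean absolute percentage error (MAPE) as a fraction for the three rows shown; whether FCST reports mape as a fraction or a percentage, and how it treats zero actuals, is not stated here, so both choices in the code are assumptions.

```python
def mape(actual, predicted):
    """Mean absolute percentage error as a fraction.

    Rows whose actual value is 0 are skipped to avoid division by zero
    (an assumption made for this sketch).
    """
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("no rows with a non-zero actual value")
    return sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


sales = [95, 100, 150]       # Sales column
predicted = [92, 84, 149]    # predicted column

print(round(mape(sales, predicted), 4))  # 0.0661
```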

nbeats_prediction_result.csv

The results predicted by the trained model.

Example of nbeats_prediction_result.csv

timestamp	targetdate	Sales	Franchise
2024-02-29	2024-03-01	95	KyoChon
2024-02-29	2024-03-02	100	KyoChon
2024-02-29	2024-03-03	150	KyoChon
...	...	...	...
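In the example rows, timestamp stays fixed while targetdate advances, which suggests that each row is one step of the forecast horizon. Reading timestamp as the forecast origin is an interpretation of the example rather than something stated above; under that assumption, the horizon of each row can be derived as follows (horizon_days is an illustrative helper).

```python
from datetime import date

# The three example rows from nbeats_prediction_result.csv.
rows = [
    {"timestamp": "2024-02-29", "targetdate": "2024-03-01", "Sales": 95, "Franchise": "KyoChon"},
    {"timestamp": "2024-02-29", "targetdate": "2024-03-02", "Sales": 100, "Franchise": "KyoChon"},
    {"timestamp": "2024-02-29", "targetdate": "2024-03-03", "Sales": 150, "Franchise": "KyoChon"},
]


def horizon_days(row):
    """Number of days between the assumed forecast origin and the target date."""
    origin = date.fromisoformat(row["timestamp"])
    target = date.fromisoformat(row["targetdate"])
    return (target - origin).days


print([horizon_days(row) for row in rows])  # [1, 2, 3]
```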

inference_summary.yaml

This file contains a summary of the inference. Currently, it is saved with default values.

Forecasting Version: 2.1.0
