Version: Next

FCST Input and Artifacts

Updated 2024.05.05

Data Preparation

Preparing Training Data

  1. Prepare train and inference datasets in .csv format with identical columns. The train set can be a single time series or a collection of multiple time series. If it is a collection of multiple time series, a groupkey_column is required to distinguish each time series.
  2. When do_validation is enabled, Forecasting splits the train set into train and validation sets and performs cross-validation. The data must therefore be long enough to cover input_chunk_length (the length of the input time series used for prediction), forecast_periods (the length of the time series to predict), and cv_numbers (the number of cross-validation folds).
  3. The inference set can also be a single time series or a collection of multiple time series. The columns of the inference set must be identical to those of the train set used for model training.
  4. The prediction period is given by forecast_periods, and the prediction start point is generally the point immediately following the final time point in the inference set. To forecast from the end of the train set, use the train set as the inference set. To start predictions from the maximum time index in each group, use the global_padding_interpolation feature in the bizpreprocess asset to pad and then proceed.
  5. The default value for time_format is "%Y-%m-%d". If this is modified, it must be reflected in the time_column through the readiness asset. The available format codes are the strftime/strptime directives (for example %Y, %m, %d).
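
Items 1, 2, and 5 above can be checked before the pipeline runs. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical and are not part of Forecasting. The length formula is an assumption (one input window plus one forecast horizon per cross-validation fold); the exact rule used by Forecasting may differ. The header comparison is also stricter than the text requires, because it checks column order as well.

```python
import csv
import io
from datetime import datetime


def columns_match(train_csv: str, inference_csv: str) -> bool:
    """Return True when both CSV texts share the same header, in the same order."""
    train_header = next(csv.reader(io.StringIO(train_csv)))
    inference_header = next(csv.reader(io.StringIO(inference_csv)))
    return train_header == inference_header


def parse_time(value: str, time_format: str = "%Y-%m-%d") -> datetime:
    """Parse one time_column value; raises ValueError if it does not match time_format."""
    return datetime.strptime(value, time_format)


def min_train_length(input_chunk_length: int, forecast_periods: int, cv_numbers: int) -> int:
    """Assumed lower bound on rows per time series (not the official formula)."""
    return input_chunk_length + forecast_periods * cv_numbers
```

For example, with input_chunk_length=3, forecast_periods=3, and cv_numbers=3, this assumed bound asks for at least 12 rows per time series.
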
  • To run main.py or a Jupyter notebook, prepare the training data in the following paths:

    • Train set file path: ./sample_data/train_input_path/
    • Inference set file path: ./sample_data/inference_input_path/

    Input Data Directory Structure

    ./train_folder/
    └ train_data1.csv
    └ train_data2.csv
    └ train_data3.csv
    ./inference_folder/
    └ data_a.csv
    └ data_b.csv

    Example train_sample.csv

    Consider a model that forecasts sales from the average temperature and other information such as franchise and food type. The parameters are listed below. In this example the model takes 3 days of input (input_chunk_length) and predicts the following 3 days (forecast_periods).

    • y_column: Sales
    • time_column: time
    • time_format: "%Y-%m-%d"
    • sample_frequency: daily
    • groupkey_column: Franchise
    • static_covariates: Food
    • x_covariates: Average Temperature
    • input_chunk_length: 3
    • forecast_periods: 3
    time       | Sales | Franchise  | Food    | Average Temperature
    2023-03-10 | 100   | KyoChon    | Chicken | 0
    2023-03-11 | 150   | KyoChon    | Chicken | -2
    2023-03-12 | 120   | BHX        | Chicken | 5
    ...        | ...   | ...        | ...     | ...
    2024-03-12 | 80    | McDonald's | Burger  | 4

    Example inference_sample.csv

    Since input_chunk_length is 3, at least 3 days of data are required. Because forecast_periods is 3, predictions are generated for 2024-03-13 through 2024-03-15.

    time       | Sales | Franchise | Food    | Average Temperature
    2024-03-10 | 95    | KyoChon   | Chicken | 5
    2024-03-11 | 100   | KyoChon   | Chicken | 10
    2024-03-12 | 150   | KyoChon   | Chicken | 8
    ...        | ...   | ...       | ...     | ...
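
The relationship between the inference example and the forecast window can be sketched as follows. This is an illustration only: the function names are hypothetical, and a daily sample_frequency is assumed, as in the example parameters.

```python
from datetime import datetime, timedelta


def has_enough_history(times: list[str], input_chunk_length: int) -> bool:
    """True when a series has at least input_chunk_length observations."""
    return len(times) >= input_chunk_length


def forecast_dates(last_time: str, forecast_periods: int,
                   time_format: str = "%Y-%m-%d") -> list[str]:
    """Dates immediately following the last inference time point (daily steps assumed)."""
    last = datetime.strptime(last_time, time_format)
    return [
        (last + timedelta(days=step)).strftime(time_format)
        for step in range(1, forecast_periods + 1)
    ]


inference_times = ["2024-03-10", "2024-03-11", "2024-03-12"]
print(has_enough_history(inference_times, input_chunk_length=3))  # True
print(forecast_dates(inference_times[-1], forecast_periods=3))
# ['2024-03-13', '2024-03-14', '2024-03-15']
```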

Data Requirements

Mandatory Requirements

The input data must meet the following conditions to avoid requiring reprocessing:

Index | Item | Specification
1 | Presence of Missing Values | There must be no missing values in any variable. The process will not function correctly with missing values.
2 | Presence of Missing Times | There must be no missing times. While interpolation can auto-fill missing values, there should be no gaps for accurate prediction.
3 | Presence of Duplicate Times | Each group must not have duplicate times. For example, if there are two records for 3/14 in Group A, one will be deleted.
4 | New Group Keys or Categories | There must be no new group keys or category values at the inference stage. This could cause malfunctions with the Label Encoder.
5 | Training Data Size | Data exceeding 1GB may not function correctly.
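
Requirements 1 through 4 can be screened before running the pipeline. The sketch below is not part of Forecasting: check_rows is a hypothetical helper, rows are assumed to be dictionaries such as those produced by csv.DictReader, and a daily sample_frequency is assumed when looking for gaps.

```python
from collections import Counter
from datetime import datetime, timedelta


def check_rows(rows, time_column, groupkey_column, known_groups,
               time_format="%Y-%m-%d"):
    """Return a list of human-readable problems found in rows (a list of dicts)."""
    problems = []

    # Requirement 1: no missing values in any variable.
    for index, row in enumerate(rows):
        for column, value in row.items():
            if value is None or str(value).strip() == "":
                problems.append(f"row {index}: missing value in {column}")

    # Collect the parsed time points of each group.
    times_by_group = {}
    for row in rows:
        if not row.get(time_column):
            continue  # already reported as a missing value
        parsed = datetime.strptime(row[time_column], time_format)
        times_by_group.setdefault(row[groupkey_column], []).append(parsed)

    for group, times in times_by_group.items():
        # Requirement 3: no duplicate times within a group.
        for time, count in Counter(times).items():
            if count > 1:
                problems.append(f"group {group}: duplicate time {time.date()}")

        # Requirement 2: no missing times (daily frequency assumed).
        unique_times = sorted(set(times))
        for previous, current in zip(unique_times, unique_times[1:]):
            if current - previous != timedelta(days=1):
                problems.append(
                    f"group {group}: gap between {previous.date()} and {current.date()}"
                )

        # Requirement 4: no group keys that were absent from training.
        if group not in known_groups:
            problems.append(f"group {group}: not seen in training data")

    return problems
```

An empty list means none of the four screened conditions was violated. The same membership test used for group keys can be applied to each categorical column.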

Additional Requirements

The following conditions are ideal for stable performance. Meeting these conditions increases the likelihood of higher-quality predictions.

Index | Item | Specification
1 | Seasonal patterns should ideally repeat 2-3 times within the training data. | Prevents underfitting
2 | The distribution of data used for prediction should be similar to the training data. | Avoids data shift
3 | There should be no missing data points in the prediction dataset. | Prevents prediction bias due to interpolation
4 | Nominal variables should not have too many unique values and should appear frequently enough. | Avoids curse of dimensionality
5 | The dependent variable should ideally follow a normal distribution. | Assumption of linear regression
6 | There should be minimal similarity between independent variables. | Avoids multicollinearity
7 | Causal relationships between independent and dependent variables should be clear. | Causal information
8 | Exclude unnecessary independent variables. | Avoids spurious correlations
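
Items 4 and 6 lend themselves to a quick numeric screen. The sketch below is illustrative only: the helper names and the thresholds max_unique and min_count are hypothetical, and Pearson correlation is used as one simple proxy for similarity between two independent variables.

```python
from collections import Counter
from math import sqrt


def category_report(values, max_unique=50, min_count=5):
    """Flag a nominal variable with many unique values or rarely seen values."""
    counts = Counter(values)
    return {
        "too_many_unique": len(counts) > max_unique,
        "rare_values": sorted(v for v, n in counts.items() if n < min_count),
    }


def pearson(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / sqrt(var_x * var_y)
```

A correlation close to 1 or -1 between two x_covariates suggests that one of them may be redundant.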

Artifacts

Executing training/inference generates the following artifacts.

Train Pipeline

./fcst/train_artifacts/
└ models/
    └ input/
        └ train_config.json
    └ readiness/
        └ readiness_config.json
    └ bizpreprocess/
        └ preprocess_config.json
        └ preprocess_scaler.pkl
        └ preprocess_categorical_encoder.pkl
    └ train/
        └ train_config.json
        └ trained_nbeats.pt
        └ trained_nbeats.pt.ckpt
└ output/
    └ train_score.csv
└ extra_output/
    └ CV{k}.csv

Inference Pipeline

./fcst/inference_artifacts/
└ output/
    └ nbeats_prediction_result.csv
└ score/
    └ inference_summary.yaml

Each artifact is described in detail below.

trained_nbeats.pt

The trained model.

trained_nbeats.pt.ckpt

The checkpoint of the trained model.

train_score.csv

The performance of cross-validation is saved in this file.

CV  | train_period            | valid_period            | mape | Franchise
1   | 2022-01-01 ~ 2023-12-31 | 2024-01-01 ~ 2024-01-31 | 0.6  | KyoChon
2   | 2022-01-01 ~ 2023-11-30 | 2023-12-01 ~ 2023-12-31 | 0.8  | KyoChon
3   | 2022-01-01 ~ 2023-10-31 | 2023-11-01 ~ 2023-11-30 | 0.4  | KyoChon
... | ...                     | ...                     | ...  | ...
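
To compare groups, the per-fold scores can be averaged. The sketch below assumes train_score.csv is comma-separated with the header shown above (CV, train_period, valid_period, mape, Franchise); mean_mape_by_group is a hypothetical helper, not part of Forecasting.

```python
import csv
import io
from collections import defaultdict


def mean_mape_by_group(csv_text: str, groupkey_column: str = "Franchise") -> dict:
    """Average the mape column over all CV folds, separately for each group."""
    scores = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        scores[row[groupkey_column]].append(float(row["mape"]))
    return {group: sum(values) / len(values) for group, values in scores.items()}
```

To use it on a real file, read the file contents first, for example with pathlib.Path(...).read_text(), and pass the text to the function.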

CV{k}.csv

The actual and predicted values for each CV are stored in this file.

timestamp  | Sales | predicted | Franchise
2024-02-29 | 95    | 92        | KyoChon
2024-02-29 | 100   | 84        | KyoChon
2024-02-29 | 150   | 149       | KyoChon
...        | ...   | ...       | ...
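
The actual and predicted columns can be turned back into a score. The sketch below uses the standard definition of mean absolute percentage error, expressed as a fraction. Whether Forecasting reports mape as a fraction or a percentage, and how it treats zero actual values, is not stated here, so both choices in the code are assumptions.

```python
def mape(actual, predicted):
    """Mean absolute percentage error as a fraction; zero actuals are skipped (assumption)."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when every actual value is zero")
    return sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


# Values taken from the three example rows above.
score = mape([95, 100, 150], [92, 84, 149])
print(round(score, 4))
```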

nbeats_prediction_result.csv

The results predicted by the trained model.

Example of nbeats_prediction_result.csv

timestamp  | targetdate | Sales | Franchise
2024-02-29 | 2024-03-01 | 95    | KyoChon
2024-02-29 | 2024-03-02 | 100   | KyoChon
2024-02-29 | 2024-03-03 | 150   | KyoChon
...        | ...        | ...   | ...

inference_summary.yaml

This file contains a summary of the inference. Currently, it is saved with default values.


Forecasting Version: 2.1.0