Neural Network (MLP) for Time Series Forecasting in Practice

A Practical Example for Feature Engineering and Constructing an MLP Model


Jul 8, 2024

Introduction

Time series analysis, and more specifically time series forecasting, is a well-known data science problem among professionals and business users alike.

Several forecasting methods exist; for a better overview they may be broadly grouped into statistical and machine learning methods. In fact, the demand for forecasting is so high that the available options are abundant.

Machine learning methods are considered the state-of-the-art approach in time series forecasting and are increasing in popularity, because they can capture complex non-linear relationships within the data and generally yield higher forecasting accuracy [1]. One popular machine learning field is the landscape of neural networks. Specifically for time series analysis, recurrent neural networks have been developed and applied to solve forecasting problems [2].

Data science enthusiasts might find the complexity behind such models intimidating, and being one of you, I can tell you that I share that feeling. However, this article aims to show that

despite the latest developments in machine learning methods, it is not necessarily worth pursuing the most complex application when looking for a solution for a particular problem. Well-established methods enhanced with powerful feature engineering techniques could still provide satisfactory results.

More specifically, I apply a Multi-Layer Perceptron model and share the code and results, so you can get hands-on experience engineering time series features and forecasting effectively.

Goal of the Article

More precisely, what I aim to provide for fellow self-taught professionals can be summarized in the following points:

  1. forecasting based on a real-world problem / dataset
  2. how to engineer time series features to capture temporal patterns
  3. building an MLP model capable of utilizing mixed variables: floats and integers (treated as categoricals via embedding)
  4. using the MLP for point forecasting
  5. using the MLP for multi-step forecasting
  6. assessing feature importance using the permutation feature importance method
  7. retraining the model on subsets of grouped features (one group at a time) to refine the feature importance of grouped features
  8. evaluating the model by comparing it to an UnobservedComponents model

Key Technical Terms

Please note that this article assumes prior knowledge of some key technical terms and does not intend to explain them in detail. Find those key terms below, with references that can be checked for clarity:

  1. Time Series [3]
  2. Prediction [4] — in this context it will be used to distinguish model outputs in the training period
  3. Forecast [4] — in this context it will be used to distinguish model outputs in the test period
  4. Feature Engineering [5]
  5. Autocorrelation [6]
  6. Partial Autocorrelation [6]
  7. MLP (Multi-Layer Perceptron) [7]
  8. Input Layer [7]
  9. Hidden Layer [7]
  10. Output Layer [7]
  11. Embedding [8]
  12. State Space Models [9]
  13. Unobserved Components Model [9]
  14. RMSE (Root Mean Squared Error) [10]
  15. Feature Importance [11]
  16. Permutation Feature Importance [11]

Data Exploration

The essential packages used during the analysis are numpy and pandas for data manipulation, plotly for interactive charts, statsmodels for statistics and state space modeling, and finally tensorflow for the MLP architecture.

Note: due to technical limitations, I will provide the code snippets for interactive plotting, but the figures presented here will be static.

import opendatasets as od
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import tensorflow as tf

from sklearn.preprocessing import StandardScaler
from sklearn.inspection import permutation_importance
import statsmodels.api as sm
from statsmodels.tsa.stattools import acf, pacf
import datetime

import warnings
warnings.filterwarnings('ignore')

The data is loaded automatically using opendatasets.

dataset_url = "https://www.kaggle.com/datasets/robikscube/hourly-energy-consumption/"
od.download(dataset_url)
df = pd.read_csv("./hourly-energy-consumption/AEP_hourly.csv", index_col=0)
df.sort_index(inplace=True)

Keep in mind that data cleaning was an essential first step in order to progress with the analysis. If you are interested in the details, and also in state space modeling, please refer to my previous article. In a nutshell, the following steps were conducted (see the sketch after the list):

  1. Identifying gaps where specific timestamps are missing (only single-step gaps were found)
  2. Performing imputation (using the mean of the previous and next records)
  3. Identifying and dropping duplicates
  4. Setting the timestamp column as the dataframe index
  5. Setting the dataframe index frequency to hourly, as this is a requirement for further processing
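
A minimal sketch of these cleaning steps, assuming the raw layout of the Kaggle CSV (a Datetime column and the AEP_MW value column); note that linear interpolation of a single missing hour is exactly the mean of the previous and next records:

# minimal cleaning sketch (assumes the raw Kaggle CSV layout)
df = pd.read_csv("./hourly-energy-consumption/AEP_hourly.csv")
df["Datetime"] = pd.to_datetime(df["Datetime"])
df = df.drop_duplicates(subset="Datetime").set_index("Datetime").sort_index()

df = df.asfreq("h")                               # exposes missing timestamps as NaN rows and sets the hourly frequency
df["AEP_MW"] = df["AEP_MW"].interpolate(limit=1)  # linear fill of single-step gaps = mean of neighbors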

After preparing the data, let’s explore it by drawing 5 random timestamp samples and comparing the time series at different scales.

fig = make_subplots(rows=5, cols=4, shared_yaxes=True, horizontal_spacing=0.01, vertical_spacing=0.04)

# drawing a random sample of 5 indices without repetition
sample = sorted([x for x in np.random.choice(range(0, len(df), 1), 5, replace=False)])

# zoom x scales for plotting
periods = [9000, 3000, 720, 240]

colors = ["#E56399", "#F0B67F", "#DE6E4B", "#7FD1B9", "#7A6563"]

# s for sample datetime start
for si, s in enumerate(sample):
    # p for period length
    for pi, p in enumerate(periods):
        cdf = df.iloc[s:(s+p+1), :].copy()
        fig.add_trace(go.Scatter(x=cdf.index,
                                 y=cdf.AEP_MW.values,
                                 marker=dict(color=colors[si])),
                      row=si+1, col=pi+1)

fig.update_layout(
    font=dict(family="Arial"),
    margin=dict(b=8, l=8, r=8, t=8),
    showlegend=False,
    height=1000,
    paper_bgcolor="#FFFFFF",
    plot_bgcolor="#FFFFFF")
fig.update_xaxes(griddash="dot", gridcolor="#808080")
fig.update_yaxes(griddash="dot", gridcolor="#808080")

Random sampling of dataset and visuals at different time scales. Source: author

State Space Modeling

Closely examining this simple yet effective plot, it is clearly visible that the analysis should address several seasonal effects:

  1. energy consumption generally peaks in mid-summer and mid-winter, regardless of the year selected
  2. a weekly minimum pattern seems to emerge on Mondays
  3. there is a daily minimum during the night and a maximum during the day

Further analysis would reveal that the yearly pattern of the dataset has 2 harmonics, as the winter and summer peaks have different levels. As a result, the following state space model has been considered, where the periods are measured in hours (see the model summary below as well):

# splitting time series into train and test subsets
y_train = df.iloc[:-8766, :].copy()
y_test = df.iloc[-8766:, :].copy()

# Unobserved Components model definition
model = sm.tsa.UnobservedComponents(y_train,
                                    level='dtrend',
                                    irregular=True,
                                    stochastic_level=False,
                                    stochastic_trend=False,
                                    stochastic_freq_seasonal=[False, False, False],
                                    freq_seasonal=[{'period': 24, 'harmonics': 1},
                                                   {'period': 168, 'harmonics': 1},
                                                   {'period': 8766, 'harmonics': 2}])
# fitting model to train data
model_results = model.fit()

# printing statsmodels summary for model
print(model_results.summary())

Value of `irregular` may be overridden when the trend component is specified using a model string.

                        Unobserved Components Results
====================================================================================
Dep. Variable:                       AEP_MW   No. Observations:               112530
Model:                  deterministic trend   Log Likelihood            -1002257.017
                     + freq_seasonal(24(1))   AIC                        2004516.033
                    + freq_seasonal(168(1))   BIC                        2004525.664
                   + freq_seasonal(8766(2))   HQIC                       2004518.941
Date:                      Tue, 25 Jun 2024
Time:                              08:13:35
Sample:                          10-01-2004
                               - 08-02-2017
Covariance Type:                        opg
====================================================================================
                       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------
sigma2.irregular  3.168e+06    1.3e+04    244.095      0.000    3.14e+06    3.19e+06
====================================================================================
Ljung-Box (L1) (Q):            104573.71   Jarque-Bera (JB):              2731.37
Prob(Q):                            0.00   Prob(JB):                         0.00
Heteroskedasticity (H):             1.04   Skew:                             0.35
Prob(H) (two-sided):                0.00   Kurtosis:                         3.30
====================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Without getting too far ahead of myself, let me note that this model approximates the total energy consumption of the last 365 days with an error of ~2%, which I believe is fairly accurate from a business perspective. The MLP model constructed below will be evaluated by comparing it to the above state space model.

Feature Engineering

Before constructing the MLP model, we should make the unique trend and seasonal effects available for the model to learn. This is achieved by adding new features to the dataset, derived from the original 1D time series data. Derived features for capturing already identified or yet unidentified patterns include:

  1. Lags
  2. Differences
  3. Rolling means
  4. Rolling standard deviations
  5. Hour of the day
  6. Day of week
  7. Labeling weekends

Such derived, numerical features can be considered over multiple intervals. In order to determine which intervals a model would benefit from, it is highly recommended to check the autocorrelation properties of the dataset.

dff = df.copy()
acorr = acf(dff.AEP_MW.values, nlags=2*366) # autocorrelation
pacorr = pacf(dff.AEP_MW.values, nlags=2*366) # partial autocorrelation

fig = make_subplots(rows=2, cols=1, shared_xaxes=True, vertical_spacing=0)
fig.add_trace(go.Scatter(
    x=np.arange(len(acorr)),  # one x position per lag
    y=acorr,
    name="Autocorrelation",
    marker=dict(color="rgb(180, 120, 80)")
), row=1, col=1)
fig.add_trace(go.Scatter(
    x=np.arange(len(pacorr)),
    y=pacorr,
    name="Partial Autocorrelation",
    marker=dict(color="rgb(80, 180, 120)")
), row=2, col=1)
fig.update_layout(
    font=dict(family="Arial"),
    margin=dict(b=4, l=4, r=4, t=4),
    showlegend=False,
    height=500,
    paper_bgcolor="#FFFFFF",
    plot_bgcolor="#FFFFFF")
fig.update_xaxes(griddash="dot", gridcolor="#808080", row=1, col=1)
fig.update_xaxes(griddash="dot", gridcolor="#808080", title_text="No. of lags", row=2, col=1)
fig.update_yaxes(griddash="dot", gridcolor="#808080", title_text="Autocorrelation", row=1, col=1)
fig.update_yaxes(griddash="dot", gridcolor="#808080", title_text="Partial Autocorrelation", row=2, col=1)

Autocorrelation and partial autocorrelation plots of time series. Source: author

The dataset is highly autocorrelated, which makes sense as the values vary mostly between 10K MW and 20K MW with a smooth transition from hour to hour. However, focusing on the partial autocorrelations shown on the plot above, a significant correlation seems to be present at multiples of 24 hours and in the last couple of hours. As a result, the derived features can be mainly classified as:

  1. Daily (multiples of 24 hours),
  2. Hourly (focusing on the last couple of hours) and
  3. Categorical features

dff = df.reset_index(drop=False)
dff["Datetime"] = pd.to_datetime(dff.Datetime.values)

# lags and differences of multiple days for capturing seasonal effects
for i in np.linspace(24, 15*24, 15, dtype=int):
    dff[f"lag_{i}"] = dff.AEP_MW.shift(i)
    dff[f"difference_{i}"] = dff.AEP_MW.diff(periods=i)

# rolling mean and standard deviation up to 3 days for capturing seasonal effects better
for i in np.linspace(24, 72, 3, dtype=int):
    dff[f"rolling_mean_{i}"] = dff.AEP_MW.rolling(window=i).mean()
    dff[f"rolling_std_{i}"] = dff.AEP_MW.rolling(window=i).std()

# lag, rolling mean, rolling standard deviation and difference up to 4 hours for capturing immediate effects
for i in range(2, 5, 1):
    dff[f"lag_{i}"] = dff.AEP_MW.shift(i)
    dff[f"rolling_mean_{i}"] = dff.AEP_MW.rolling(window=i).mean()
    dff[f"rolling_std_{i}"] = dff.AEP_MW.rolling(window=i).std()
    dff[f"difference_{i}"] = dff.AEP_MW.diff(periods=i)

# categorical features
dff["hour_of_day"] = dff.Datetime.dt.hour
dff["day_of_week"] = dff.Datetime.dt.day_of_week
dff["is_weekend"] = dff["day_of_week"].isin([5, 6]).astype(int)

# grouping derived features (by column-name length) for later use in feature importance analysis
daily_lags = [col for col in dff.columns if all(["lag_" in col, len(col) > 5])]
hourly_lags = [col for col in dff.columns if all(["lag_" in col, len(col) <= 5])]
daily_differences = [col for col in dff.columns if all(["difference_" in col, len(col) > 12])]
hourly_differences = [col for col in dff.columns if all(["difference_" in col, len(col) <= 12])]
daily_rolling_means = [col for col in dff.columns if all(["rolling_mean_" in col, len(col) > 14])]
hourly_rolling_means = [col for col in dff.columns if all(["rolling_mean_" in col, len(col) <= 14])]
daily_rolling_stds = [col for col in dff.columns if all(["rolling_std_" in col, len(col) > 13])]
hourly_rolling_stds = [col for col in dff.columns if all(["rolling_std_" in col, len(col) <= 13])]
categoricals = ["hour_of_day", "day_of_week", "is_weekend"]
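
Since the MLP code below slices the last three columns off as the integer inputs, the categoricals must be the final columns of the dataframe. A small defensive check (not in the original notebook) makes that assumption explicit:

# the model code slices the last 3 columns as integer inputs;
# this assertion guards that column-order assumption
assert list(dff.columns[-3:]) == categoricals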

Constructing the MLP Model

Having generated the features detailed above, the input shapes are known and the MLP model can be constructed. It is important to note that we are dealing with mixed datatypes: floats and integers. Please also note that while all features are numerical, the integer inputs are fundamentally categorical features and should be treated as such.

There is an option to encode the categories with e.g. the one-hot encoding technique, but that would significantly increase the number of features, as each categorical column would be expanded into as many columns as there are categories (minus one) [12]. I deliberately chose embedding instead to limit the number of features, at the expense of a more complex model input layer: the categoricals are first converted to vectors via embedding and then combined with the float inputs.
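
To make the trade-off concrete, here is a minimal, standalone sketch (not part of the original notebook) of how three integer columns become dense vectors. With drop-first one-hot encoding, hour_of_day, day_of_week and is_weekend would expand to (24-1) + (7-1) + (2-1) = 30 columns, while the embedding yields 3 × 4 = 12:

# minimal embedding sketch: 3 integer categoricals -> 3 learned vectors of length 4
# input_dim must cover the largest category value (hour_of_day spans 0-23)
emb = tf.keras.layers.Embedding(input_dim=24, output_dim=4)
batch = tf.constant([[13, 2, 0],   # 13:00, Wednesday, weekday
                     [23, 6, 1]])  # 23:00, Sunday, weekend
flat = tf.keras.layers.Flatten()(emb(batch))
print(flat.shape)  # (2, 12): 12 columns instead of 30 with drop-first one-hot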

Please see the graph after the code section for clarity. The architecture has been built using rules of thumb, as hyperparameter tuning is out of scope for this article. However, if you are interested in a general framework for how it can be done, please check my previous article (I tuned an XGBoost model with Optuna as a tool for Bayesian search of optimal hyperparameter values).

# segmenting the last year as test data
inputs = dff.dropna().iloc[:, 2:].columns
xs_train = dff.dropna().iloc[:-8766, 2:]
xs_test = dff.dropna().iloc[-8766:, 2:]
ys_train = dff.dropna().iloc[:-8766, 1]
ys_test = dff.dropna().iloc[-8766:, 1]
embedding_dim = 4 # potential hyperparameter

# defining baseline NN model
float_inputs = tf.keras.layers.Input(shape=(len(inputs)-3,), name="float_inputs") # floats can be directly used in model fitting
integer_inputs = tf.keras.layers.Input(shape=(3,), dtype="int32", name="integer_inputs") # integers should be treated as categoricals and get embedded
embedding_layer = tf.keras.layers.Embedding(input_dim=24, output_dim=embedding_dim) # input_dim must cover the largest category value (hour_of_day spans 0-23); embedding weights are learned during model fitting
embedded_integer_inputs = embedding_layer(integer_inputs)
flattened_embeddings = tf.keras.layers.Flatten()(embedded_integer_inputs)
preprocessing_layers = tf.keras.layers.concatenate([float_inputs, flattened_embeddings]) # float and embedded inputs are combined
hidden_layers = tf.keras.layers.Dense(units=64, activation="relu")(preprocessing_layers) # No. of hidden layers, No. of units and activation function are potential hyperparameters
hidden_layers = tf.keras.layers.Dense(units=32, activation="relu")(hidden_layers)
output = tf.keras.layers.Dense(units=1, activation="linear")(hidden_layers) # single unit for one-step-ahead, multiple units for multi-step prediction
model_NN_baseline = tf.keras.Model(inputs=[float_inputs, integer_inputs], outputs=output)

# compiling baseline NN model
model_NN_baseline.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.MeanSquaredError(),
    jit_compile=True)

# fitting baseline NN model
model_NN_baseline.fit(
    x=[xs_train.iloc[:, :-3], xs_train.iloc[:, -3:]],
    y=ys_train,
    validation_data=([xs_test.iloc[:, :-3], xs_test.iloc[:, -3:]], ys_test),
    epochs=128,
    batch_size=64,
    verbose=1
)

MLP architecture created by using Tensorflow/Keras. Source: author

As far as point forecasts go, the results are ridiculously accurate. This is a good sign that the applied feature engineering principles correctly capture the underlying patterns in the data and that the model was able to generalize them.

Baseline MLP model point-forecasts vs. test data. Source: author

The point forecasts overlap with the test set, and the two figure traces are indistinguishable from each other. More precisely, the RMSE of the predictions (training set) and the forecasts (test set) is approx. 19.3 and 18.9 MW respectively (in the ballpark of 0.1% in relative terms).
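
The reported scores can be reproduced as follows (a sketch reusing the objects defined above; the ~0.1% figure is the RMSE relative to the mean consumption):

# reproducing the reported RMSE scores on the train and test sets
rmse = tf.keras.metrics.RootMeanSquaredError()
for xs, ys, label in [(xs_train, ys_train, "train"), (xs_test, ys_test, "test")]:
    rmse.reset_state()
    preds = model_NN_baseline.predict([xs.iloc[:, :-3], xs.iloc[:, -3:]], verbose=0).flatten()
    rmse.update_state(ys.values, preds)
    print(f"{label} RMSE: {rmse.result().numpy():.1f} MW "
          f"({rmse.result().numpy() / ys.mean() * 100:.2f}% of the mean)")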

Feature Importance

What led the model to be accurate? Are all derived features equally important or is there a subset which has a greater weight in determining the outcome? These are valid questions for two distinct reasons:

  1. In real-world scenarios, and in the case of big data, the resources for training the model are limited, and the amount of data used could determine whether the model can be trained at all
  2. Without any explanation, the model works as a black box, which creates uncertainty regarding its performance. Neural networks are especially prone to being black box models, and interpreting them is a field of its own [11]

There is an abundance of techniques to interpret models, each with its pros and cons. I selected the permutation feature importance method to give some insight into model interpretation; however, a key takeaway from my analysis is that such

model interpretation techniques only interpret the model in scope, not necessarily the underlying process itself. Reality could be very different from what the feature importance analysis suggests, hence it should not be taken as ground truth of a causal relationship between the independent variables and the target variable.

Let me explain that with my analysis results. Permuting the features one at a time, recalculating the RMSE score and recording the relative change in RMSE compared to forecasts on the original data gives the relative importance of the features [13].

# permutation feature importance
features = xs_test.columns
permutation_importance_results = {}
rmse = tf.keras.metrics.RootMeanSquaredError()
rmse_permuted = tf.keras.metrics.RootMeanSquaredError()
rmse.update_state(ys_test.values, model_NN_baseline.predict([xs_test.iloc[:, :-3], xs_test.iloc[:, -3:]], verbose=0).flatten())

for feature in features:
    xs_test_permuted = xs_test.copy()
    xs_test_permuted.loc[:, feature] = xs_test.loc[:, feature].sample(frac=1, axis=0, replace=False, random_state=42).values

    rmse_permuted.reset_state()
    rmse_permuted.update_state(ys_test.values, model_NN_baseline.predict([xs_test_permuted.iloc[:, :-3], xs_test_permuted.iloc[:, -3:]], verbose=0).flatten())

    permutation_importance_results[feature] = rmse_permuted.result().numpy() / rmse.result().numpy()

pi_results_sorted_keys = sorted(permutation_importance_results, key=permutation_importance_results.get, reverse=True)

fig3 = make_subplots()
fig3.add_trace(go.Bar(
    x=pi_results_sorted_keys,
    y=[permutation_importance_results[key] for key in pi_results_sorted_keys]))
fig3.update_layout(
    title="<b>Permutation Feature Importance</b>",
    font=dict(family="Arial"),
    margin=dict(b=4, l=4, r=4, t=36),
    showlegend=False,
    height=500,
    paper_bgcolor="#FFFFFF",
    plot_bgcolor="#FFFFFF"
)
fig3.update_xaxes(griddash="dot", gridcolor="#808080", row=1, col=1)
fig3.update_yaxes(griddash="dot", gridcolor="#808080", row=1, col=1)

Permutation feature importance bar chart. Source: author

The hourly and daily lags and differences seem important, and maybe the hourly rolling means as well. However, the daily and hourly rolling standard deviations, as well as the categorical features, seem negligible relative to the aforementioned features. One caveat of permutation feature importance is that it does not take multicollinearity into account and may consequently give inaccurate results. Remember, the features have been derived from a dataset with high autocorrelation.

One possible way to handle the situation is to follow scikit-learn’s guidance:

perform hierarchical clustering on the Spearman rank-order correlations, pick a threshold, and keep a single feature from each cluster. [13]
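
A sketch of that remedy, assuming xs_train still holds the baseline training features (the clustering threshold t is a judgment call):

# cluster features on Spearman rank correlations, keep one feature per cluster [13]
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

corr, _ = spearmanr(xs_train.values)
corr = (corr + corr.T) / 2                    # enforce symmetry
np.fill_diagonal(corr, 1.0)
distance = 1 - np.abs(corr)                   # turn correlation into a distance
linkage = hierarchy.ward(squareform(distance, checks=False))
cluster_ids = hierarchy.fcluster(linkage, t=0.5, criterion="distance")
selected = [np.where(cluster_ids == c)[0][0] for c in np.unique(cluster_ids)]
print(xs_train.columns[selected])             # one representative feature per cluster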

However, I would like to focus on highlighting the inaccuracy, and on adding more insight into the dataset, by training alternative models on the grouped features, one group at a time. The same MLP architecture was used for this purpose, with adjustments applied only to the input layer to accommodate a subset of the data. The following groups were created in the feature engineering section and tested here (train/test RMSE results are reported respectively; see the sketch after the list):

  1. daily lags (942 and 994)
  2. daily differences (1792 and 1952)
  3. hourly lags (686 and 611)
  4. daily rolling means and standard deviations (1710 and 1663)
  5. hourly rolling means and standard deviations (84.4 and 75.5)
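
For illustration, a sketch of one such alternative model, trained only on the hourly rolling features (same architecture as the baseline, minus the embedding branch since no categoricals are included; the model name is illustrative):

# alternative model on the hourly rolling features only
subset = hourly_rolling_means + hourly_rolling_stds
xs_train_sub = dff.dropna().iloc[:-8766, :][subset]
xs_test_sub = dff.dropna().iloc[-8766:, :][subset]

sub_inputs = tf.keras.layers.Input(shape=(len(subset),))
x = tf.keras.layers.Dense(units=64, activation="relu")(sub_inputs)
x = tf.keras.layers.Dense(units=32, activation="relu")(x)
sub_output = tf.keras.layers.Dense(units=1, activation="linear")(x)
model_NN_hourly_rolling = tf.keras.Model(inputs=sub_inputs, outputs=sub_output)
model_NN_hourly_rolling.compile(optimizer=tf.keras.optimizers.Adam(),
                                loss=tf.keras.losses.MeanSquaredError())
model_NN_hourly_rolling.fit(xs_train_sub, ys_train,
                            validation_data=(xs_test_sub, ys_test),
                            epochs=128, batch_size=64, verbose=0)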

It is clear that the alternative models show results not anticipated from the simple permutation feature importance analysis without handling multicollinearity: e.g. the daily rolling features yielded better scores than the daily differences, and the model trained on the hourly rolling features performs best among the alternative models, close to the baseline model (relative RMSE of ~0.5% vs. ~0.1% respectively).

A Note on a Specific Anomaly in the Data

I would like to highlight a very specific anomaly observed at 14:00 on 20th October 2008. It is the highest value ever recorded, with no apparent cause and no similar data points before or after it in the dataset.

Yet the baseline model, powered by feature engineering, was able to predict that data point, so it cannot be considered a mere outlier!

Baseline MLP model point-predictions and the observed potential anomaly. Source: author

From which features was the model able to predict that point? Let’s use our alternative models for inference. The best alternative (hourly rolling features) seems extremely accurate in the vicinity, but can only explain the phenomenon partially:

Alternative MLP model (utilizing hourly rolling features) point-predictions and the observed potential anomaly. Source: author

The second best alternative is the one utilizing hourly lags, but it has absolutely no answer as to why that happened:

Alternative MLP model (utilizing hourly lag features) point-predictions and the observed potential anomaly. Source: author

To make a long story short, the daily differences might contain important information regarding the underlying patterns. Although using the daily differences group alone gives higher predictions, the baseline model seemingly found a good balance of weights across the features.

Alternative MLP model (utilizing daily difference features) point-predictions and the observed potential anomaly. Source: author

Multi-step Prediction Model

Finally, the model architecture has been modified to yield multi-step predictions. The forecasting period is one year, as suggested by the dataset publisher [14]. Given all the uncertainties in such a process, with special regard to weather conditions, it might not make sense to consider such a long forecasting period. However, it is an interesting exercise to evaluate the multi-step model’s performance against the state space model, which explicitly models the trend and seasonal effects observed across the year (see the next section).

The key points for implementing a multi-step model are as follows:

  1. the target was a series of vectors (the next 8766 hours, defined for each step)
  2. as a result, the prediction or forecast for the last row of inputs is the next 8766 hours (approx. one year)
  3. due to resource limitations, I had to limit the training data to the final years of the former training dataset
  4. the output layer was modified accordingly to give the desired vector output

first_index = -8766*5
last_index = -8766*2
final_index = -8766
inputs = dff.dropna().iloc[:, 2:].columns
xs_train = dff.dropna().iloc[first_index:last_index, 2:]
xs_train.iloc[:, :-3] = xs_train.iloc[:, :-3].astype(np.float32)
xs_test = dff.dropna().iloc[last_index:final_index, 2:]
xs_test.iloc[:, :-3] = xs_test.iloc[:, :-3].astype(np.float32)
ys_train = np.vstack([dff.dropna().iloc[i:i+8765, 1].astype(int).values for i in range(first_index, last_index, 1)])
ys_test = np.vstack([dff.dropna().iloc[i:i+8765, 1].astype(int).values for i in range(last_index, final_index, 1)])
embedding_dim = 4

# defining, compiling and training NN model for MULTIPLE STEP PREDICTIONS. Model architecture is the same, except the output layer
float_inputs = tf.keras.layers.Input(shape=(len(inputs)-3,), name="float_inputs")
integer_inputs = tf.keras.layers.Input(shape=(3,), dtype="int32", name="integer_inputs")
embedding_layer = tf.keras.layers.Embedding(input_dim=24, output_dim=embedding_dim)  # vocabulary again sized for hour_of_day (0-23)
embedded_integer_inputs = embedding_layer(integer_inputs)
flattened_embeddings = tf.keras.layers.Flatten()(embedded_integer_inputs)
preprocessing_layers = tf.keras.layers.concatenate([float_inputs, flattened_embeddings])
hidden_layers = tf.keras.layers.Dense(units=64, activation="relu")(preprocessing_layers)
hidden_layers = tf.keras.layers.Dense(units=32, activation="relu")(hidden_layers)
output = tf.keras.layers.Dense(units=np.abs(final_index)-1, activation="linear")(hidden_layers)  # 8765 units: one per forecast step

model_NN_multistep = tf.keras.Model(inputs=[float_inputs, integer_inputs], outputs=output)
model_NN_multistep.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.MeanSquaredError(),
    jit_compile=True)
model_NN_multistep.fit(
    x=[xs_train.iloc[:, :-3], xs_train.iloc[:, -3:]],
    y=ys_train,
    validation_data=([xs_test.iloc[:, :-3], xs_test.iloc[:, -3:]], ys_test),
    epochs=128,
    batch_size=64,
    verbose=1
)

For a visual evaluation, one can see that the model was trying to generalize the patterns:

Multistep MLP model predictions and forecasts vs. original data. Source: author

MLP vs. State Space Model

Due to the generalization of the data, the RMSE score increased significantly: 1982 and 2017 MW for the train and test dataset respectively. However, in order to properly evaluate the multi-step MLP, we should use another model for comparison. As mentioned in the previous section, state space models give fairly understandable approximations of the trend and seasonal effects observed across the year. This makes them relatively easy to interpret, unlike neural networks; the primary reason is that hidden layers have many connections, and understanding how they are activated is not a straightforward process [11].

In my previous article, I used a simplified, yet meaningful evaluation method: comparing the total energy consumption within the last year. Practically, that is the area under the curve of the energy consumption time series. The values for the original data and the model forecasts can be compared directly. For the UnobservedComponents model:

y_train = df.iloc[:-8766, 0].values
y_test = df.iloc[-8766:, 0].values
observed_integral = np.cumsum([y_test[x] + (y_test[x+1] - y_test[x]) / 2 for x in range(len(y_test)-1)])[-1]
forecast = model_results.forecast(steps=8766)
UC_integral = np.cumsum([forecast[x] + (forecast[x+1] - forecast[x]) / 2 for x in range(len(forecast)-1)])[-1]

# calculating absolute and percentage error of forecast integral compared to observed integral
fcast_integral_abserror = UC_integral - observed_integral
fcast_integral_perror4 = (UC_integral - observed_integral) * 100 / observed_integral

print(f"Observed yearly energy demand: {'%.3e' % observed_integral} MWh")
print(f"Forecast yearly energy demand: {'%.3e' % UC_integral} MWh")
print(f"Forecast error of yearly energy demand: {'%.3e' % fcast_integral_abserror} MWh or {'%.3f' % fcast_integral_perror4} %")

Observed yearly energy demand: 1.312e+08 MWh
Forecast yearly energy demand: 1.283e+08 MWh
Forecast error of yearly energy demand: -2.832e+06 MWh or -2.159 %

For the MLP model:

y_test = dff.dropna().iloc[-8766:-1, 1].values
observed_integral = np.cumsum([y_test[x] + (y_test[x+1] - y_test[x]) / 2 for x in range(len(y_test)-1)])[-1]
forecast = model_NN_multistep.predict([xs_test.iloc[-1:, :-3], xs_test.iloc[-1:, -3:]], verbose=0).flatten()
model_NN_multistep_integral = np.cumsum([forecast[x] + (forecast[x+1] - forecast[x]) / 2 for x in range(len(forecast)-1)])[-1]

# calculating absolute and percentage error of forecast integral compared to observed integral
fcast_integral_abserror = model_NN_multistep_integral - observed_integral
fcast_integral_perror4 = (model_NN_multistep_integral - observed_integral) * 100 / observed_integral

print(f"Observed yearly energy demand: {'%.3e' % observed_integral} MWh")
print(f"Forecast yearly energy demand: {'%.3e' % model_NN_multistep_integral} MWh")
print(f"Forecast error of yearly energy demand: {'%.3e' % fcast_integral_abserror} MWh or {'%.3f' % fcast_integral_perror4} %")

Observed yearly energy demand: 1.312e+08 MWh
Forecast yearly energy demand: 1.286e+08 MWh
Forecast error of yearly energy demand: -2.508e+06 MWh or -1.912 %
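
As a side note, the cumulative trapezoid expression used above can be written more compactly with NumPy (np.trapz, renamed np.trapezoid in NumPy 2.0):

# equivalent one-liner for the trapezoidal integral computed above
observed_integral = np.trapz(y_test)  # use np.trapezoid(y_test) on NumPy >= 2.0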

In short: it is -1.912% vs. -2.159% in favor of the MLP model. Please note that this has been achieved with an MLP architecture built on simple rules of thumb, without hyperparameter tuning or effective model training features such as reducing the learning rate when the evaluation metric reaches a plateau, or early stopping (see the sketch below).
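
For completeness, a sketch of those two training refinements as Keras callbacks (not used in this article; the parameter values are illustrative):

# illustrative callbacks: reduce learning rate on plateau, stop early on stagnation
callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True),
]
# e.g. model_NN_baseline.fit(..., callbacks=callbacks)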

The results should be fairly convincing that, by combining relatively simple neural network architectures with powerful feature engineering techniques, accurate forecasting tools are indeed within reach for data scientists early in their careers.

Resources

Data source:
https://www.kaggle.com/datasets/robikscube/hourly-energy-consumption/ (CC0)

Notebook (only code, without outputs): https://gist.github.com/danielandthelions/2e6f0edd30902113ad10fd9f20bda215

References

[1] https://preset.io/blog/time-series-forecasting-a-complete-guide/

[2] https://www.ibm.com/topics/recurrent-neural-networks

[3] https://www.timescale.com/blog/time-series-analysis-what-is-it-how-to-use-it/

[4] https://plat.ai/blog/difference-between-prediction-and-forecast/

[5] https://dotdata.com/blog/practical-guide-for-feature-engineering-of-time-series-data/

[6] https://statisticsbyjim.com/time-series/autocorrelation-partial-autocorrelation/

[7] https://www.sciencedirect.com/topics/computer-science/multilayer-perceptron

[8] https://jina.ai/news/embeddings-in-depth/

[9] Hyndman, R.J., & Athanasopoulos, G. (2021) Forecasting: principles and practice, 3rd edition, OTexts: Melbourne, Australia. OTexts.com/fpp3. Accessed on 7th July 2024

[10] https://statisticsbyjim.com/regression/root-mean-square-error-rmse/

[11] https://christophm.github.io/interpretable-ml-book/

[12] https://scikit-learn.org/stable/modules/preprocessing.html

[13] https://scikit-learn.org/stable/modules/permutation_importance.html#permutation-feature-importance

[14] https://www.kaggle.com/datasets/robikscube/hourly-energy-consumption/