**Contents**hide

## How to use time series analysis and forecasting to tackle climate change

This is Part 2 of the series *Time Series for Climate Change. *List of articles:

Solar power is an increasingly prevalent source of clean energy.

Sunlight is converted into electricity by photovoltaic devices. Since these devices are not pollutants, they are considered a source of clean energy. Besides environmental benefits, solar power is also appealing due to its low cost. The initial investment is large, but the low long-term costs are worthwhile.

The amount of energy produced is determined by the level of solar radiation. But, solar conditions can change rapidly. For example, a cloud may unexpectedly cover the sun and decrease the efficiency of photovoltaic devices.

So, solar power systems rely on forecasting models to predict solar conditions. Like in the case of wind power, accurate forecasts have a direct impact on the effectiveness of these systems.

## Beyond energy production

Forecasting solar irradiance has other applications besides energy, for example:

- Agriculture: Farmers can leverage forecasts to optimize crop production. Instances include estimating when to plant or harvest a crop, or optimizing irrigation systems;
- Civil engineering: Forecasting solar irradiance is also valuable for designing and constructing buildings. Predictions can be used to maximize solar radiation, thereby reducing heating/cooling costs. Forecasts can also be useful to configure air-conditioning systems. This contributes to the efficient use of energy within buildings.

## Challenges, and what’s next

Despite its importance, solar conditions are highly variable and difficult to predict. These depend on several meteorological factors, whose information is sometimes unavailable.

In the rest of this article, we’ll develop a model for solar irradiance forecasting. Among other things, you’ll learn how to:

- visualize a multivariate time series;
- transform a multivariate time series for supervised learning;
- do feature selection based on correlation and importance scores.

This tutorial is based on a dataset collected by the U.S. Department of Agriculture. You can check more details in reference [1]. The full code for this tutorial is available on Github:

The data is a multivariate time series: at each instant, an observation is composed of several variables. These include the following weather and hydrological variables:

- Solar irradiance (watts per square meter);
- Wind direction;
- Snow depth;
- Wind speed;
- Dew point temperature;
- Precipitation;
- Vapor pressure;
- Relative humidity;
- Air temperature.

The series spans from October 1, 2007, to October 1, 2013. It’s collected at an hourly frequency totaling 52.608 observations.

After downloading the data, we can read it using pandas:

`import re`

import pandas as pd

# src module available here: https://github.com/vcerqueira/tsa4climate/tree/main/src

from src.log import LogTransformation# a sample here: https://github.com/vcerqueira/tsa4climate/tree/main/content/part_2/assets

assets = 'path_to_data_directory'

DATE_TIME_COLS = ['month', 'day', 'calendar_year', 'hour']

# we'll focus on the data collected at particular station called smf1

STATION = 'smf1'

COLUMNS_PER_FILE = \

{'incoming_solar_final.csv': DATE_TIME_COLS + [f'{STATION}_sin_w/m2'],

'wind_dir_raw.csv': DATE_TIME_COLS + [f'{STATION}_wd_deg'],

'snow_depth_final.csv': DATE_TIME_COLS + [f'{STATION}_sd_mm'],

'wind_speed_final.csv': DATE_TIME_COLS + [f'{STATION}_ws_m/s'],

'dewpoint_final.csv': DATE_TIME_COLS + [f'{STATION}_dpt_C'],

'precipitation_final.csv': DATE_TIME_COLS + [f'{STATION}_ppt_mm'],

'vapor_pressure.csv': DATE_TIME_COLS + [f'{STATION}_vp_Pa'],

'relative_humidity_final.csv': DATE_TIME_COLS + [f'{STATION}_rh'],

'air_temp_final.csv': DATE_TIME_COLS + [f'{STATION}_ta_C'],

}

data_series = {}

for file in COLUMNS_PER_FILE:

file_data = pd.read_csv(f'{assets}/{file}')

var_df = file_data[COLUMNS_PER_FILE[file]]

var_df['datetime'] = \

pd.to_datetime([f'{year}/{month}/{day} {hour}:00'

for year, month, day, hour in zip(var_df['calendar_year'],

var_df['month'],

var_df['day'],

var_df['hour'])])

var_df = var_df.drop(DATE_TIME_COLS, axis=1)

var_df = var_df.set_index('datetime')

series = var_df.iloc[:, 0].sort_index()

data_series[file] = series

mv_series = pd.concat(data_series, axis=1)

mv_series.columns = [re.sub('_final.csv|_raw.csv|.csv', '', x) for x in mv_series.columns]

mv_series.columns = [re.sub('_', ' ', x) for x in mv_series.columns]

mv_series.columns = [x.title() for x in mv_series.columns]

mv_series = mv_series.astype(float)

This code leads to the following data set:

## Exploratory data analysis

The series plot suggests there’s a strong yearly seasonality. Radiation levels peak during summertime, and other variables show similar patterns. Apart from seasonal fluctuations, the level of the time series is stable over time.

We can also visualize the solar irradiance variable individually:

Besides the clear seasonality, we can also spot some downward spikes around the level of the series. These cases need to be predicted timely so that backup energy systems are used efficiently.

We can also analyze the correlation between each pair of variables:

Solar irradiance is correlated with some of the variables. For example, air temperature, relative humidity (negative correlation), or wind speed.

We’ve explored how to build a forecasting model with a univariate time series in a previous article. Yet, the correlation heatmap suggests that it may be valuable to include these variables in the model.

How can we do that?

## Primer on Auto-Regressive Distributed Lags modeling

Auto-regressive distributed lags (ARDL) is a modeling technique for multivariate time series.

ARDL is a useful approach to identifying the relationship between several variables over time. It works by extending the auto-regression technique to multivariate data. The future values of a given variable of the series are modeled based on its lags and the lags of other variables.

In this case, we want to forecast solar irradiance based on the lags of several factors such as air temperature or vapor pressure.

## Transforming the data for ARDL

Applying the ARDL method involves transforming the time series into a tabular format. This is done by applying time delay embedding to each variable, and then concatenating the results into a single matrix. The following function can be used to do this:

`import pandas as pd`def mts_to_tabular(data: pd.DataFrame,

n_lags: int,

horizon: int,

return_Xy: bool = False,

drop_na: bool = True):

"""

Time delay embedding with multivariate time series

Time series for supervised learning

:param data: multivariate time series as pd.DataFrame

:param n_lags: number of past values to used as explanatory variables

:param horizon: how many values to forecast

:param return_Xy: whether to return the lags split from future observations

:return: pd.DataFrame with reconstructed time series

"""

# applying time delay embedding to each variable

data_list = [time_delay_embedding(data[col], n_lags, horizon)

for col in data]

# concatenating the results in a single dataframe

df = pd.concat(data_list, axis=1)

if drop_na:

df = df.dropna()

if not return_Xy:

return df

is_future = df.columns.str.contains('\+')

X = df.iloc[:, ~is_future]

Y = df.iloc[:, is_future]

if Y.shape[1] == 1:

Y = Y.iloc[:, 0]

return X, Y

This function is applied to the data as follows:

`from sklearn.model_selection import train_test_split`# target variable

TARGET = 'Solar Irradiance'

# number of lags for each variable

N_LAGS = 24

# forecasting horizon for solar irradiance

HORIZON = 48

# leaving the last 30% of observations for testing

train, test = train_test_split(mv_series, test_size=0.3, shuffle=False)

# transforming the time series into a tabular format

X_train, Y_train_all = mts_to_tabular(train, N_LAGS, HORIZON, return_Xy=True)

X_test, Y_test_all = mts_to_tabular(train, N_LAGS, HORIZON, return_Xy=True)

# subsetting the target variable

target_columns = Y_train_all.columns.str.contains(TARGET)

Y_train = Y_train_all.iloc[:, target_columns]

Y_test = Y_test_all.iloc[:, target_columns]

We set the forecasting horizon to 48 hours. Predicting many steps in advance is valuable for the effective integration of several energy sources into the electricity grid.

It’s difficult to say a priori how many lags should be included. So, this value is set to 24 for each variable. This leads to a total of 216 lag-based features.

## Building a forecasting model

Before building a model, we extract 8 more features based on the date and time. These include data such as the day of the year or hour which are useful to model seasonality.

We reduce the number of explanatory variables with feature selection. First, we apply a correlation filter. This is used to remove any feature with a correlation greater than 95% with any other explanatory variable. Then, we also apply recursive feature elimination (RFE) based on the importance scores of a Random Forest. After feature engineering, we train a model using a Random Forest.

We leverage sklearn’s *Pipeline *and *RandomSearchCV *to optimize the parameters of the different steps:

`from sklearn.pipeline import Pipeline`

from sklearn.preprocessing import FunctionTransformer

from sklearn.feature_selection import RFE

from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import RandomizedSearchCV

from sktime.transformations.series.date import DateTimeFeaturesfrom src.holdout import Holdout

# including datetime information to model seasonality

hourly_feats = DateTimeFeatures(ts_freq='H',

keep_original_columns=True,

feature_scope='efficient')

# building a pipeline

pipeline = Pipeline([

# feature extraction based on datetime

('extraction', hourly_feats),

# removing correlated explanatory variables

('correlation_filter', FunctionTransformer(func=correlation_filter)),

# applying feature selection based on recursive feature elimination

('select', RFE(estimator=RandomForestRegressor(max_depth=5), step=3)),

# building a random forest model for forecasting

('model', RandomForestRegressor())]

)

# parameter grid for optimization

param_grid = {

'extraction': ['passthrough', hourly_feats],

'select__n_features_to_select': np.linspace(start=.1, stop=1, num=10),

'model__n_estimators': [100, 200]

}

# optimizing the pipeline with random search

model = RandomizedSearchCV(estimator=pipeline,

param_distributions=param_grid,

scoring='neg_mean_squared_error',

n_iter=25,

n_jobs=5,

refit=True,

verbose=2,

cv=Holdout(n=X_train.shape[0]),

random_state=123)

# running random search

model.fit(X_train, Y_train)

# checking the selected model

model.best_estimator_

# Pipeline(steps=[('extraction',

# DateTimeFeatures(feature_scope='efficient', ts_freq='H')),

# ('correlation_filter',

# FunctionTransformer(func=<function correlation_filter at 0x28cccfb50>)),

# ('select',

# RFE(estimator=RandomForestRegressor(max_depth=5),

# n_features_to_select=0.9, step=3)),

# ('model', RandomForestRegressor(n_estimators=200))])

## Evaluating the model

We selected a model using a random search coupled with a validation split. Now, we can evaluate its forecasting performance on the test set.

`# getting forecasts for the test set`

forecasts = model.predict(X_test)

forecasts = pd.DataFrame(forecasts, columns=Y_test.columns)

The selected model kept only 65 out of the original 224 explanatory variables. Here’s the importance of the top 20 features:

The features hour of the day and day of the year are among the top 4 features. This result highlights the strength of seasonal effects in the data. Besides those, the first lags of some of the variables are also useful to the model.