Time Series for Climate Change: Origin-Destination Demand Forecasting | by Vitor Cerqueira | Jun, 2023

In the rest of this article, we’ll forecast taxi passenger demand in San Francisco, USA. We’ll tackle this problem as an origin-destination (OD) flow count task.

The full code used in this tutorial is available on GitHub:

Data set

We will use a data set collected by a taxi fleet in San Francisco, California, USA. The data set contains GPS data from 536 taxis over a period of 21 days. In total, there are 121 million GPS traces split across 464045 trips. You can check reference [1] for more details.

Sample of the data set.

At each time step and for each taxi, we have information about its coordinates and whether a passenger occupies it.

Defining the problem

End location of a sample of taxi trips in San Francisco, USA. Image by author.

Our goal is to model where people are moving to given their origin. OD flow count estimation can be split into four sub-tasks:

  1. Spatial grid decomposition
  2. Selection of origin-destination pairs
  3. Temporal discretization
  4. Modeling and forecasting

Let’s dive into each problem in turn.

Spatial grid decomposition

Spatial decomposition is a common preprocessing step for OD flow count estimation. The idea is to split the map into grid cells, which represent a small part of the city. Then, we can count how many people traverse each possible pair of grid cells.

Two example grid cells in San Francisco. Image by author.
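To make the idea concrete, here is a minimal sketch of how a coordinate can be mapped to a grid cell. It assumes a rectangular bounding box cut into equally sized rows and columns. The `BoundingBox` class, the `cell_index` function, and the box coordinates are illustrative assumptions, not part of the article's code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float


def cell_index(lat: float, lon: float, box: BoundingBox, n_side: int) -> int:
    """Map a coordinate to the id of the grid cell that contains it.

    The box is cut into n_side rows and n_side columns, so there are
    n_side ** 2 cells in total, numbered row by row starting at 0.
    """
    inside_lat = box.lat_min <= lat <= box.lat_max
    inside_lon = box.lon_min <= lon <= box.lon_max
    if not (inside_lat and inside_lon):
        raise ValueError("coordinate falls outside the bounding box")

    # Fraction of the way across the box, in [0, 1]
    lat_frac = (lat - box.lat_min) / (box.lat_max - box.lat_min)
    lon_frac = (lon - box.lon_min) / (box.lon_max - box.lon_min)

    # Points on the upper edge are clamped into the last row/column
    row = min(int(lat_frac * n_side), n_side - 1)
    col = min(int(lon_frac * n_side), n_side - 1)
    return row * n_side + col


# Illustrative bounding box roughly around San Francisco (made-up values)
box = BoundingBox(lat_min=37.70, lat_max=37.82, lon_min=-122.52, lon_max=-122.36)

# 100 x 100 = 10000 cells, the same cell count used in this article
origin_cell = cell_index(37.7749, -122.4194, box, n_side=100)
lower_corner_cell = cell_index(37.70, -122.52, box, n_side=100)
upper_corner_cell = cell_index(37.82, -122.36, box, n_side=100)
```

Once every trip endpoint has a cell id, counting trips between pairs of cells becomes a simple grouping operation.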

In this case study, we split the city map into 10000 grid cells as follows:

import pandas as pd

from src.spatial import SpatialGridDecomposition, prune_coordinates

# reading the data set
trips_df = pd.read_csv('trips.csv', parse_dates=['time'])

# removing outliers from coordinates
trips_df = prune_coordinates(trips_df=trips_df, lhs_thr=0.01, rhs_thr=0.99)

# grid decomposition with 10000 cells
grid = SpatialGridDecomposition(n_cells=10000)
# setting bounding box
grid.set_bounding_box(lat=trips_df.latitude, lon=trips_df.longitude)
# grid decomposition

In the code above, we remove outlying locations. These can occur due to GPS malfunctions.
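The internals of `prune_coordinates` are not shown here. Its argument names suggest lower and upper quantile thresholds, so the sketch below illustrates one plausible way to prune coordinates under that assumption. The function `prune_by_quantile` and the toy data are hypothetical.

```python
import pandas as pd


def prune_by_quantile(df: pd.DataFrame, lower: float = 0.01, upper: float = 0.99) -> pd.DataFrame:
    """Keep rows whose latitude and longitude both lie within the given quantiles.

    Hypothetical helper: an assumption about what a coordinate-pruning step
    could look like, not the article's prune_coordinates implementation.
    """
    keep = pd.Series(True, index=df.index)
    for col in ("latitude", "longitude"):
        lo, hi = df[col].quantile([lower, upper])
        keep &= df[col].between(lo, hi)
    return df.loc[keep].reset_index(drop=True)


# Toy data: 98 plausible fixes plus two obviously broken latitudes
n = 100
toy = pd.DataFrame({
    "latitude": [0.0] + [37.70 + 0.001 * i for i in range(n - 2)] + [90.0],
    "longitude": [-122.50 + 0.001 * i for i in range(n)],
})

pruned = prune_by_quantile(toy)
```

Note that a quantile filter always discards the most extreme values, even when they are legitimate, so the thresholds trade a little data for robustness.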

Getting the most popular trips

After the spatial decomposition, we extract the origin and destination of each trip during which the taxi is occupied by a passenger.

from src.spatial import ODFlowCounts

# getting origin and destination coordinates for each trip
df_group = trips_df.groupby(['cab', 'cab_trip_id'])
trip_points = df_group.apply(lambda x: ODFlowCounts.get_od_coordinates(x))
trip_points.reset_index(drop=True, inplace=True)

The idea is to reconstruct the data set to contain the following information: origin, destination, and origin timestamp of each passenger trip. This data forms the basis for our origin-destination (OD) flow count model.
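As a rough illustration of this reconstruction, the sketch below takes the first and last GPS fix of each trip as its origin and destination. It is a sketch under assumptions, not the implementation of `ODFlowCounts.get_od_coordinates`: the output column names for the coordinates are invented for the example, and the toy GPS fixes are made up.

```python
import pandas as pd

# Toy GPS fixes for two trips (made-up values); note the unsorted first rows
gps = pd.DataFrame({
    "cab": ["a", "a", "a", "b", "b"],
    "cab_trip_id": [1, 1, 1, 1, 1],
    "time": pd.to_datetime([
        "2023-01-01 10:00:05", "2023-01-01 10:00:00", "2023-01-01 10:04:00",
        "2023-01-01 11:00:00", "2023-01-01 11:10:00",
    ]),
    "latitude": [37.751, 37.750, 37.760, 37.780, 37.790],
    "longitude": [-122.401, -122.400, -122.410, -122.420, -122.430],
})

# Sort chronologically within each trip so that "first" and "last"
# correspond to the start and the end of the trip
gps = gps.sort_values(["cab", "cab_trip_id", "time"])

trips = gps.groupby(["cab", "cab_trip_id"]).agg(
    time_start=("time", "first"),
    time_end=("time", "last"),
    origin_lat=("latitude", "first"),
    origin_lon=("longitude", "first"),
    dest_lat=("latitude", "last"),
    dest_lon=("longitude", "last"),
).reset_index()
```

Each row of `trips` then describes one passenger trip, which is the shape needed for counting flows between cells.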

This data allows us to count how many trips go from cell A to cell B:

# getting the origin and destination cell centroid
od_pairs = trip_points.apply(lambda x: ODFlowCounts.get_od_centroids(x, grid.centroid_df), axis=1)

For simplicity, we keep only the 50 OD grid cell pairs with the most trips. Taking this subset is optional. However, OD pairs with only a few trips show sparse demand over time, which is difficult to model. Besides, low-demand pairs may not be useful from a fleet management point of view.

flow_count = od_pairs.value_counts().reset_index()
flow_count = flow_count.rename({0: 'count'}, axis=1)

top_od_pairs = flow_count.head(50)
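The counting step can be illustrated on a toy example with nothing but the standard library. The cell ids below are made up, and the snippet only mirrors the logic of ranking OD pairs by number of trips.

```python
from collections import Counter

# Toy trips as (origin_cell, destination_cell) pairs; the ids are made up
toy_trips = [(12, 40), (12, 40), (7, 12), (12, 40), (7, 12), (3, 9)]

# Count how many trips were observed for each ordered pair
toy_flow_counts = Counter(toy_trips)

# most_common(k) returns the k pairs with the most trips, highest count first
toy_top_pairs = toy_flow_counts.most_common(2)
```

Pairs are ordered, so trips from cell 12 to cell 40 are counted separately from trips going from cell 40 to cell 12.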

Temporal discretization

After finding the top OD pairs in terms of demand, we discretize these over time by counting how many trips occur in each hour for each top pair:

# preparing data
trip_points = pd.concat([trip_points, od_pairs], axis=1)
trip_points = trip_points.sort_values('time_start')
trip_points.reset_index(drop=True, inplace=True)

# getting origin-destination cells for each trip, and origin start time
trip_starts = []
for i, pair in top_od_pairs.iterrows():

    origin_match = trip_points['origin'] == pair['origin']
    dest_match = trip_points['destination'] == pair['destination']

    # copy the selection so that adding a column does not touch trip_points
    od_trip_df = trip_points.loc[origin_match & dest_match, :].copy()
    od_trip_df['pair'] = i

    trip_starts.append(od_trip_df[['time_start', 'time_end', 'pair']])

trip_starts_df = pd.concat(trip_starts, axis=0).reset_index(drop=True)

import numpy as np

# more data processing
od_count_series = {}
for pair, data in trip_starts_df.groupby('pair'):

    new_index = pd.date_range(

    od_trip_counts = pd.Series(0, index=new_index)
    for _, r in data.iterrows():
        # time elapsed between each index timestamp and the trip start
        dt = r['time_start'] - new_index
        dt_secs = dt.total_seconds()

        # latest index timestamp at or before the trip start
        valid_idx = np.where(dt_secs >= 0)[0]
        idx = valid_idx[dt_secs[valid_idx].argmin()]

        od_trip_counts[new_index[idx]] += 1

    od_count_series[pair] = od_trip_counts.resample('H').mean()

od_df = pd.DataFrame(od_count_series)
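For intuition, hourly counting can also be sketched compactly with pandas resampling. This is an alternative illustration on made-up start times for a single OD pair, not a drop-in replacement for the code above.

```python
import pandas as pd

# Toy trip start times for one OD pair (made-up values)
time_start = pd.to_datetime([
    "2023-01-01 08:05",
    "2023-01-01 08:40",
    "2023-01-01 10:15",
])

# One event per trip, indexed by its start time
trip_events = pd.Series(1, index=time_start)

# Summing events within each hourly bin gives the number of trips per hour;
# hours without any trip get a count of 0
hourly_counts = trip_events.resample("h").sum()
```

Keeping the empty hours as explicit zeros matters for forecasting, because most time series models expect observations at regular intervals.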

This leads to a set of time series, one for each top OD pair. Here’s the time series plot for four example pairs:

Time series of flow counts for four example origin-destination pairs. Image by author.

The time series show a daily seasonality, which is mostly driven by rush hours.
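One simple way to check for daily seasonality numerically is to look at the autocorrelation at a lag of 24 hours. The sketch below uses a synthetic hourly series with a built-in daily cycle, since the actual OD series are not reproduced here.

```python
import numpy as np
import pandas as pd

# Synthetic hourly demand over two weeks with a 24-hour cycle (made-up data)
index = pd.date_range("2023-01-01", periods=24 * 14, freq="h")
hours = np.arange(len(index))
demand = pd.Series(10 + 5 * np.sin(2 * np.pi * hours / 24), index=index)

# Correlation of the series with itself shifted by one day and by half a day
acf_daily = demand.autocorr(lag=24)
acf_half_day = demand.autocorr(lag=12)
```

A value of `acf_daily` close to 1 indicates that demand at a given hour resembles demand at the same hour on the previous day. Real demand series are noisier, so the value would be lower, but a clear peak at lag 24 is still the signature of a daily pattern.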


Modeling and forecasting

The set of time series that results from the temporal discretization can be used for forecasting. We can build a model to forecast how many passengers will want to travel along a given OD pair.

Here’s how this can be done for an example OD pair:

from pmdarima.arima import auto_arima

# getting the first OD pair as example
series = od_df[0].dropna()

# fitting an ARIMA model
model = auto_arima(y=series, m=24)

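Above, we built a forecasting model based on ARIMA. Setting m=24 tells auto_arima to consider a seasonal period of 24 observations, which matches the daily seasonality of the hourly series. The model forecasts passenger demand in the next hour given the recent past demand. We use an ARIMA method for simplicity, but other approaches such as deep learning can be used.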
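Whatever model is chosen, it helps to compare it against a simple reference. The sketch below implements a seasonal naive baseline, a different and much simpler technique than ARIMA: it forecasts each hour with the value observed 24 hours earlier. The function name and the synthetic data are illustrative assumptions.

```python
import numpy as np


def seasonal_naive_forecast(history: np.ndarray, horizon: int, period: int = 24) -> np.ndarray:
    """Forecast each future step with the value observed one period earlier."""
    if len(history) < period:
        raise ValueError("need at least one full period of history")
    last_cycle = history[-period:]
    # Repeat the last observed cycle for as many steps as the horizon requires
    reps = int(np.ceil(horizon / period))
    return np.tile(last_cycle, reps)[:horizon]


# Synthetic hourly demand over one week with a daily cycle (made-up data)
hours = np.arange(24 * 7)
demand = 5 + 3 * np.sin(2 * np.pi * hours / 24)

# Hold out the last day for evaluation
train, test = demand[:-24], demand[-24:]

baseline_forecast = seasonal_naive_forecast(train, horizon=24)
baseline_mae = np.mean(np.abs(test - baseline_forecast))
```

On this perfectly periodic toy series the baseline error is essentially zero. On real demand data it will not be, and a forecasting model is only worth deploying if it beats this kind of baseline on held-out data.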
