In the rest of this article, we’ll forecast taxi passenger demand in San Francisco, USA. We’ll tackle this problem as an OD flow count task.

The full code used in this tutorial is available on GitHub:

## Data set

We will use a data set collected by a taxi fleet in San Francisco, California, USA. The data set contains GPS data from 536 taxis over a period of 21 days. In total, there are 121 million GPS traces split across 464,045 trips. You can check reference [1] for more details.

At each time step and for each taxi, we have information about its coordinates and whether a passenger occupies it.

## Defining the problem

Our goal is to model where people are moving to given their origin. OD flow count estimation can be split into four sub-tasks:

- Spatial grid decomposition
- Selection of origin-destination pairs
- Temporal discretization
- Modeling and forecasting

Let’s dive into each problem in turn.

## Spatial grid decomposition

Spatial decomposition is a common preprocessing step for OD flow count estimation. The idea is to split the map into grid cells, which represent a small part of the city. Then, we can count how many people traverse each possible pair of grid cells.
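To make the decomposition concrete, here's a minimal sketch of how a coordinate can be mapped to a cell in a regular grid. This is only an illustration of the idea, not the actual `SpatialGridDecomposition` code, and the bounding box values below are made up:

```python
def to_cell_id(lat, lon, bbox, n_rows=100, n_cols=100):
    """Map a (lat, lon) point to a cell index in an n_rows x n_cols grid.

    bbox is (lat_min, lat_max, lon_min, lon_max).
    """
    lat_min, lat_max, lon_min, lon_max = bbox

    # position of the point within the bounding box, scaled to grid units
    row = int((lat - lat_min) / (lat_max - lat_min) * n_rows)
    col = int((lon - lon_min) / (lon_max - lon_min) * n_cols)

    # points on the upper edge fall into the last row/column
    row = min(row, n_rows - 1)
    col = min(col, n_cols - 1)
    return row * n_cols + col


# a rough bounding box around San Francisco (illustrative values)
bbox = (37.70, 37.82, -122.52, -122.35)
cell = to_cell_id(37.7749, -122.4194, bbox)  # a point in downtown San Francisco
```

With a 100 × 100 grid this gives the 10,000 cells used below; counting trips then reduces to counting (origin cell, destination cell) pairs.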

In this case study, we split the city map into 10000 grid cells as follows:

```python
import pandas as pd

from src.spatial import SpatialGridDecomposition, prune_coordinates

# reading the data set
trips_df = pd.read_csv('trips.csv', parse_dates=['time'])

# removing outliers from coordinates
trips_df = prune_coordinates(trips_df=trips_df, lhs_thr=0.01, rhs_thr=0.99)

# grid decomposition with 10000 cells
grid = SpatialGridDecomposition(n_cells=10000)

# setting the bounding box based on the observed coordinates
grid.set_bounding_box(lat=trips_df.latitude, lon=trips_df.longitude)

# running the grid decomposition
grid.grid_decomposition()
```

In the code above, we remove outlying locations. These can occur due to GPS malfunctions.
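The `lhs_thr=0.01` and `rhs_thr=0.99` arguments suggest quantile-based clipping. Here's a minimal version of such a pruning step; this is an assumption about the behaviour of `prune_coordinates`, not its actual code:

```python
import numpy as np
import pandas as pd


def prune_by_quantile(df, cols=('latitude', 'longitude'),
                      lhs_thr=0.01, rhs_thr=0.99):
    """Keep rows whose coordinates fall within the given quantile range."""
    mask = pd.Series(True, index=df.index)
    for col in cols:
        lo, hi = df[col].quantile([lhs_thr, rhs_thr])
        mask &= df[col].between(lo, hi)
    return df[mask]


# synthetic example: 200 plausible points plus one GPS glitch at latitude 90
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'latitude': np.r_[rng.normal(37.77, 0.01, 200), 90.0],
    'longitude': np.r_[rng.normal(-122.42, 0.01, 200), -122.42],
})
clean = prune_by_quantile(df)  # the glitch is dropped
```

Quantile clipping also drops a small fraction of legitimate extreme points, which is an acceptable trade-off for removing GPS glitches.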

## Getting the most popular trips

After the spatial decomposition, we get the origin and destination of each taxi trip taken while occupied by a passenger.

```python
from src.spatial import ODFlowCounts

# getting origin and destination coordinates for each trip
df_group = trips_df.groupby(['cab', 'cab_trip_id'])
trip_points = df_group.apply(lambda x: ODFlowCounts.get_od_coordinates(x))
trip_points.reset_index(drop=True, inplace=True)
```

The idea is to reconstruct the data set to contain the following information: origin, destination, and origin timestamp of each passenger trip. This data forms the basis for our origin-destination (OD) flow count model.
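For illustration, extracting a trip's origin and destination might look like the sketch below. This is a guess at what `ODFlowCounts.get_od_coordinates` does (taking the first and last GPS points of an occupied trip), not its actual implementation:

```python
import pandas as pd


def get_od(trip: pd.DataFrame) -> pd.Series:
    """Return the first and last GPS points of a trip plus its start/end times."""
    trip = trip.sort_values('time')
    first, last = trip.iloc[0], trip.iloc[-1]
    return pd.Series({
        'origin_lat': first['latitude'], 'origin_lon': first['longitude'],
        'dest_lat': last['latitude'], 'dest_lon': last['longitude'],
        'time_start': first['time'], 'time_end': last['time'],
    })


# a toy occupied trip with three GPS pings
trip = pd.DataFrame({
    'time': pd.to_datetime(['2024-01-01 10:00', '2024-01-01 10:05',
                            '2024-01-01 10:12']),
    'latitude': [37.77, 37.78, 37.79],
    'longitude': [-122.42, -122.41, -122.40],
})
od = get_od(trip)
```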

This data allows us to count how many trips go from cell A to cell B:

```python
# getting the origin and destination cell centroid of each trip
od_pairs = trip_points.apply(lambda x: ODFlowCounts.get_od_centroids(x, grid.centroid_df), axis=1)
```

For simplicity, we keep the top 50 OD grid cell pairs with the most trips. Taking this subset is optional. Yet, OD pairs with only a few trips show sparse demand over time, which is difficult to model. Besides, OD pairs with low demand may not be useful from a fleet management point of view.

```python
# counting trips per OD pair and keeping the top 50
flow_count = od_pairs.value_counts().reset_index()
flow_count = flow_count.rename({0: 'count'}, axis=1)
top_od_pairs = flow_count.head(50)
```

## Temporal discretization

After finding the top OD pairs in terms of demand, we discretize them over time by counting how many trips occur in each hour for each pair:

```python
import numpy as np

# preparing the data
trip_points = pd.concat([trip_points, od_pairs], axis=1)
trip_points = trip_points.sort_values('time_start')
trip_points.reset_index(drop=True, inplace=True)

# getting origin-destination cells for each trip, and origin start time
trip_starts = []
for i, pair in top_od_pairs.iterrows():
    origin_match = trip_points['origin'] == pair['origin']
    dest_match = trip_points['destination'] == pair['destination']

    od_trip_df = trip_points.loc[origin_match & dest_match, :].copy()
    od_trip_df.loc[:, 'pair'] = i

    trip_starts.append(od_trip_df[['time_start', 'time_end', 'pair']])

trip_starts_df = pd.concat(trip_starts, axis=0).reset_index(drop=True)

# counting how many trips start in each hour, for each OD pair
od_count_series = {}
for pair, data in trip_starts_df.groupby('pair'):
    new_index = pd.date_range(
        start=data.time_start.values[0],
        end=data.time_end.values[-1],
        freq='H',
        tz='UTC',
    )

    od_trip_counts = pd.Series(0, index=new_index)
    for _, r in data.iterrows():
        dt = r['time_start'] - new_index
        dt_secs = dt.total_seconds()

        # assign the trip to the most recent hourly timestamp
        valid_idx = np.where(dt_secs >= 0)[0]
        idx = valid_idx[dt_secs[valid_idx].argmin()]

        od_trip_counts[new_index[idx]] += 1

    od_count_series[pair] = od_trip_counts.resample('H').mean()

od_df = pd.DataFrame(od_count_series)
```
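Since only each trip's start time matters for the hourly count, the assignment loop can also be expressed in vectorized form by flooring start times to the hour. Here's a sketch of that idea with made-up timestamps:

```python
import pandas as pd

# hypothetical start times of trips for a single OD pair
starts = pd.Series(pd.to_datetime([
    '2024-01-01 08:10', '2024-01-01 08:40',
    '2024-01-01 09:05', '2024-01-01 11:55',
]))

# floor each start to its hour and count occurrences, then
# reindex to a full hourly range so empty hours become zeros
hourly = starts.dt.floor('H').value_counts().sort_index()
full_index = pd.date_range(hourly.index.min(), hourly.index.max(), freq='H')
hourly = hourly.reindex(full_index, fill_value=0)  # counts: [2, 1, 0, 1]
```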

This leads to a set of time series, one for each top OD pair. Here’s the time series plot for four example pairs:

The time series show a daily seasonality, which is mostly driven by rush hours.
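One quick way to verify a daily cycle is to check the autocorrelation at a 24-hour lag, which should be high for series like these. Here's a sketch with synthetic hourly demand:

```python
import numpy as np
import pandas as pd

# synthetic hourly demand with a 24-hour cycle plus noise
rng = np.random.default_rng(1)
hours = np.arange(24 * 14)  # two weeks of hourly data
demand = 10 + 5 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 0.5, hours.size)
series = pd.Series(demand)

# strong daily seasonality shows up as high autocorrelation at lag 24
acf_24 = series.autocorr(lag=24)
```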

## Forecasting

The set of time series that results from the temporal discretization can be used for forecasting. We can build a model to forecast how many passengers want to make the trip relative to a given OD pair.

Here’s how this can be done for an example OD pair:

```python
from pmdarima.arima import auto_arima

# getting the first OD pair as an example
series = od_df[0].dropna()

# fitting an ARIMA model
model = auto_arima(y=series, m=24)
```

Above, we built a forecasting model based on ARIMA. The model forecasts passenger demand in the next hour given the recent past demand. We use an ARIMA method for simplicity, but other approaches such as deep learning can be used.