Window Frame
Time to pull in a new dataset, introducing Chicago taxies! This is another public dataset (CC0 license) you can use if you want to experiment. (link)
bigquery-public-data.chicago_taxi_trips.taxi_trips
The public dataset is large at 75GB, quickly eating into your free 100GB monthly query allowance. Therefore, I’ve created a new table that only holds data from 2023, so we can play around with the data without racking up a hefty bill.
I’ve made this table public, so I recommend you try my dataset for testing.
spreadsheep-case-studies-2023.chicago_taxies_2023.trip_data
Anyway, back to the subject…what is a window frame? This clause allows us to define what rows or ranges we need to use inside the partition. A popular use case for this is to create moving averages.
SELECT
date(trip_start_timestamp) as trip_start_date,
round(sum(trip_total),2) as trip_total_revenue
FROM
`spreadsheep-case-studies-2023.chicago_taxies_2023.trip_data`
WHERE
date(trip_start_timestamp) between "2023-05-01" and "2023-06-30"
GROUP BY
trip_start_date
ORDER BY
trip_start_date DESC
This query provides us with revenue by date between May and June 2023.
Moving averages are very common with time series data, as it allows you to easily compare the performance of a specific day or event month to what result you typically see for a given period.
First, let’s create a simple moving average and to save repeating date conversions and revenue rounding, I’ve put our initial query inside a CTE.
WITH daily_data as(
SELECT
date(trip_start_timestamp) as trip_start_date,
round(sum(trip_total),2) as trip_total_revenue
FROM
`spreadsheep-case-studies-2023.chicago_taxies_2023.trip_data`
WHERE
date(trip_start_timestamp) between "2023-05-01" and "2023-06-30"
GROUP BY
trip_start_date
ORDER BY
trip_start_date DESC
)SELECT
trip_start_date,
trip_total_revenue,
avg(trip_total_revenue) over (order by trip_start_date asc) as moving_average
FROM
daily_data
If we look at the first five rows, we can see that the first average equals trip_total_revenue. This is because it’s the start of the window as we’ve ordered our data by trip_start_date in asc order. Therefore there isn’t anything to average against yet.
However, we now have a daily average between rows 1 and 2 for the second row. We have a daily average for the third row between rows 1, 2 and 3.
It’s a good start which shows us that our moving average is working, but let’s take it a step further. Let’s create a moving average which only includes the last seven days of revenue, and if the window doesn’t contain seven days, then show a null value.
To specify your window range, there are three keywords you need to remember:
- current row
- preceding
- following
You then construct your window starting with either rows or ranges (I’ll explain the difference between the two later), followed by between <<start>> and <<end>>.
rows between 7 preceding and one preceding
The example above is the window frame we need for our problem. We’ve specified the window to begin seven rows before the current row and end one before the current row.
Here’s a simple example of how this window framework with a sum aggregate.
select
numbers,
sum(numbers) over
(
order by numbers asc
rows between 7 preceding and one preceding
)
as moving_sum_seven
from
test_data
As you can see, when we reach the 8th row, the value of the moving sum reaches 7, where the window now contains seven rows of data. If you switch the window to 6 preceding and current rows, you’ll see the window has shifted to include the current row.
At the end of this section, I’ll provide some use case examples to highlight how they can be used, but back to the task at hand for now!
Let’s put that window range into our moving average.
with daily_data as (
SELECT
date(trip_start_timestamp) as trip_start_date,
round(sum(trip_total),2) as trip_total_revenue
FROM
`spreadsheep-case-studies-2023.chicago_taxies_2023.trip_data`
WHERE
date(trip_start_timestamp) between "2023-05-01" and "2023-06-30"
GROUP BY
trip_start_date
ORDER BY
trip_start_date DESC
)SELECT
trip_start_date,
trip_total_revenue,
avg(trip_total_revenue) over (order by trip_start_date asc rows between 7 preceding and one preceding) as moving_average
FROM
daily_data
ORDER BY
trip_start_date DESC
Now we have one final challenge, how do we make the value null if the window contains less than seven rows of data? Well, we can use an IF statement to check.
if
(
COUNT(*) OVER (ORDER BY trip_start_date ASC ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING) = 7,
AVG(trip_total_revenue) OVER (ORDER BY trip_start_date ASC ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING),
NULL
) AS moving_average
We’ve brought in a second window function that counts how many rows exist in the window frame, which, if equal to 7, will provide the moving average result.