## Window Frame

Time to pull in a new dataset, introducing **Chicago taxies**! This is another public dataset (CC0 license) you can use if you want to experiment. (link)

`bigquery-public-data.chicago_taxi_trips.taxi_trips`

The public dataset is large at 75GB, quickly eating into your free 100GB monthly query allowance. Therefore, I’ve created a new table that only holds data from 2023, so we can play around with the data without racking up a hefty bill.

I’ve made this table public, so I recommend you try my dataset for testing.

`spreadsheep-case-studies-2023.chicago_taxies_2023.trip_data`

Anyway, back to the subject…what is a window frame? This clause allows us to define what rows or ranges we need to use inside the partition. A popular use case for this is to create moving averages.

`SELECT`

date(trip_start_timestamp) as trip_start_date,

round(sum(trip_total),2) as trip_total_revenue

FROM

`spreadsheep-case-studies-2023.chicago_taxies_2023.trip_data`

WHERE

date(trip_start_timestamp) between "2023-05-01" and "2023-06-30"

GROUP BY

trip_start_date

ORDER BY

trip_start_date DESC

This query provides us with revenue by date between May and June 2023.

Moving averages are very common with time series data, as it allows you to easily compare the performance of a specific day or event month to what result you typically see for a given period.

First, let’s create a simple moving average and to save repeating date conversions and revenue rounding, I’ve put our initial query inside a CTE.

`WITH daily_data as(`

SELECT

date(trip_start_timestamp) as trip_start_date,

round(sum(trip_total),2) as trip_total_revenue

FROM

`spreadsheep-case-studies-2023.chicago_taxies_2023.trip_data`

WHERE

date(trip_start_timestamp) between "2023-05-01" and "2023-06-30"

GROUP BY

trip_start_date

ORDER BY

trip_start_date DESC

)SELECT

trip_start_date,

trip_total_revenue,

avg(trip_total_revenue) over (order by trip_start_date asc) as moving_average

FROM

daily_data

If we look at the first five rows, we can see that the first average equals trip_total_revenue. This is because it’s the start of the window as we’ve ordered our data by trip_start_date in asc order. Therefore there isn’t anything to average against yet.

However, we now have a daily average between rows 1 and 2 for the second row. We have a daily average for the third row between rows 1, 2 and 3.

It’s a good start which shows us that our moving average is working, but let’s take it a step further. Let’s create a moving average which only includes the last seven days of revenue, and if the window doesn’t contain seven days, then show a null value.

To specify your window range, there are three keywords you need to remember:

- current row
- preceding
- following

You then construct your window starting with either rows or ranges (I’ll explain the difference between the two later), followed by between <<start>> and <<end>>.

`rows between 7 preceding and one preceding`

The example above is the window frame we need for our problem. We’ve specified the window to begin seven rows before the current row and end one before the current row.

Here’s a simple example of how this window framework with a sum aggregate.

`select`

numbers,

sum(numbers) over

(

order by numbers asc

rows between 7 preceding and one preceding

)

as moving_sum_seven

from

test_data

As you can see, when we reach the 8th row, the value of the moving sum reaches 7, where the window now contains seven rows of data. If you switch the window to 6 preceding and current rows, you’ll see the window has shifted to include the current row.

At the end of this section, I’ll provide some use case examples to highlight how they can be used, but back to the task at hand for now!

Let’s put that window range into our moving average.

`with daily_data as (`

SELECT

date(trip_start_timestamp) as trip_start_date,

round(sum(trip_total),2) as trip_total_revenue

FROM

`spreadsheep-case-studies-2023.chicago_taxies_2023.trip_data`

WHERE

date(trip_start_timestamp) between "2023-05-01" and "2023-06-30"

GROUP BY

trip_start_date

ORDER BY

trip_start_date DESC

)SELECT

trip_start_date,

trip_total_revenue,

avg(trip_total_revenue) over (order by trip_start_date asc rows between 7 preceding and one preceding) as moving_average

FROM

daily_data

ORDER BY

trip_start_date DESC

Now we have one final challenge, how do we make the value null if the window contains less than seven rows of data? Well, we can use an IF statement to check.

` if`

(

COUNT(*) OVER (ORDER BY trip_start_date ASC ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING) = 7,

AVG(trip_total_revenue) OVER (ORDER BY trip_start_date ASC ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING),

NULL

) AS moving_average

We’ve brought in a second window function that counts how many rows exist in the window frame, which, if equal to 7, will provide the moving average result.