Analytic functions provide a powerful yet easy-to-implement way to process and analyze data. This post shows you how to incorporate analytic functions into your SQL statements.
As an analytics professional, you will likely find yourself needing to query data for your analysis. Quite often, data is pulled from a SQL database and then processed in a programming language like Python using libraries such as Pandas or NumPy. This is a perfectly fine pipeline for working with data; however, the heavy lifting is done mainly by your local machine. With small datasets this is not an issue, but with larger datasets you may struggle to do all the heavy processing in your PC’s local memory.
You may think that this is not a common problem. Let me give an everyday example to show otherwise:
Imagine you work at a manufacturing company and are interested in collecting machine sensor data. This data is often collected at high frequency and can be quite noisy. To better understand what’s going on with your machine, you need to smooth and pre-process it, and densely collected data (e.g. measurements taken several times per second) quickly results in immense dataset sizes! Let’s assume we have 150 sensors placed along a machine, each taking 4 measurements per second. A single day would then produce
4 × 60 × 60 × 24 × 150 = 51,840,000 ≈ 52 million (readings per second × seconds per minute × minutes per hour × hours per day × sensors)
data points. As a rule of thumb, we usually look at one week of data at minimum, and we might also increase the reading frequency or even the number of sensors… you see where this is going.
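The estimate above can be reproduced in a few lines of Python. This is only a back-of-the-envelope sketch using the numbers assumed in the example (150 sensors, 4 readings per second); the variable names are illustrative.

```python
# Back-of-the-envelope estimate using the figures from the example above.
readings_per_second = 4
sensors = 150
seconds_per_day = 60 * 60 * 24

records_per_day = readings_per_second * seconds_per_day * sensors
records_per_week = records_per_day * 7

print(f"per day:  {records_per_day:,}")
print(f"per week: {records_per_week:,}")
```

Changing the reading frequency or the number of sensors scales the result linearly, which is why the dataset grows so quickly.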
For this reason, you might be better off shifting your computationally expensive aggregations to the source database. In particular, analytic (or window) functions are straightforward in their syntax, yet they are a powerful tool to read, transform, and extract data at a more aggregated level.
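To make this more concrete, here is a minimal sketch of what pushing such a calculation into the database can look like. It uses Python's built-in `sqlite3` module with an in-memory database purely for illustration. The table `sensor_readings`, its columns, and the sample values are all hypothetical, and the example assumes your Python build ships SQLite 3.25 or newer, the first version with window function support. The query computes a 3-row moving average and a rank per sensor.

```python
import sqlite3

# Hypothetical table and values, made up for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sensor_readings (sensor_id INTEGER, ts INTEGER, value REAL)"
)
rows = [
    (1, 1, 10.0), (1, 2, 12.0), (1, 3, 11.0), (1, 4, 15.0),
    (2, 1, 20.0), (2, 2, 18.0), (2, 3, 22.0),
]
conn.executemany("INSERT INTO sensor_readings VALUES (?, ?, ?)", rows)

# Both window functions are evaluated inside the database:
#  - moving_avg: average over the current row and the two preceding rows,
#    computed separately for each sensor (PARTITION BY sensor_id)
#  - value_rank: rank of each reading within its sensor, highest value first
query = """
SELECT
    sensor_id,
    ts,
    value,
    AVG(value) OVER (
        PARTITION BY sensor_id
        ORDER BY ts
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS moving_avg,
    RANK() OVER (
        PARTITION BY sensor_id
        ORDER BY value DESC
    ) AS value_rank
FROM sensor_readings
ORDER BY sensor_id, ts
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Only the already smoothed and ranked rows come back to Python. In a real setup, the same `OVER (PARTITION BY … ORDER BY …)` pattern would run on the source database instead of an in-memory SQLite instance, though the exact syntax support depends on the database you use.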
The key takeaway:
Whenever you see the need for a rolling / moving window or for calculations within a logical partition (e.g. a continuous ranking, or the lowest or highest value within a certain group of sensors), it is certainly worthwhile…