When I was researching the Ultimate Guide to dbt, I was shocked by the lack of material around actually building models from scratch. Not the exact steps to take in the tool — that is all covered in innumerable blogs and tutorials. I mean how do you know the right design? How do you make sure your stakeholders will use that model? How can you make sure it will be trusted and understood?
When we deploy new models without taking these steps, there can be significant consequences:
- we face a deluge of questions and follow-up requests from stakeholders
- we get suggestions for code improvements from other Data Engineers or Analytics Engineers
- we have to go back to answer all the questions, incorporate the suggestions, and add the missing features before our work is done
If we repeat this process over and over, trust between data and business teams begins to deteriorate as each side grows progressively more exhausted by the feedback frenzy. That trust can be very challenging to build back up.
This underscores the importance of thinking carefully about how we design models, not just on our own in dbt, but collectively with all of our stakeholders, so that the model is accurate and effective and we don’t waste our time building each model 4–5 times before it’s useful.
This article is the result of research and experiments into how best to design and implement a dbt model. It won’t have any commands to execute in dbt, but it will talk through how to think about your model, and how to structure your workflow to make sure you’re not wasting your time.
Lucky for me, I’m not the first to think about this problem. Many other fields have faced similar challenges and have created their own frameworks and processes that I can leverage when thinking about how to approach data modeling. For example:
Agile principles discourage software engineers from a waterfall development approach, which is antithetical to an environment of rapidly changing requirements [1]. Instead, Agile embraces rapid iteration and acknowledges the competitive advantage of being able to respond to changing requirements quickly.
Design principles similarly acknowledge the need to be deliberate about how you work with multiple stakeholders on a design project [2]. The framework prioritizes people and encourages feedback at each stage of development so the best solution can be found as quickly as possible.
Even the data modeling godfather Ralph Kimball nods to the importance of getting quality input from stakeholders early in the modeling process in his 4-step approach to data modeling [3], the first step of which is to learn as much about the business process as you can before you even think about building a model.
However, the most influential source I found when thinking about this problem was the Systems Engineering Heuristics — a set of truisms about working on a complex problem with many stakeholders [4]:
- Don’t assume that the original statement of the problem is necessarily the best, or even the right one.
- In the early stages of a project, unknowns are a bigger issue than known problems.
- Model before build, wherever possible.
- Most of the serious mistakes are made early on.
These sources shaped my thinking: I wanted to build a process that was true to those principles, that was repeatable, and that would actually ensure my models were built well the first time.
Here’s what I came up with: