How CTA Uses dbt
By Emily Cogsdill
When it is done well, Data Engineering is a bit like ballet — it appears effortless, as though the data never had any choice but to transform, as if by alchemy, into normalized and sensibly data-typed tables in a data warehouse. And, much like ballet, the reality is challenging in ways that (ideally) never meet the eye. So join us as we peel back the curtain and show you some of the sweat that goes into our beautiful — and never effortless — machinery.
At Community Tech Alliance, our daily dance takes the form of grabbing lots of data from lots of places and delivering it to our users in PAD in a useful and usable format.
Those final six words — “in a useful and usable format” — are essential to our work, and fulfilling them in a reliable and scalable fashion is one of the core technical challenges we grapple with on a daily basis. In order to pull off this feat, we make extensive use of Data Build Tool, or dbt.
dbt is a tool that allows you to create complex data models in a modular fashion, solving the problem of the absurdly long SQL files that many of us have used at some point to run our data pipelines. We’ve all seen (and written) these files — “facebook.sql,” for example, might be 1000 lines of code in which we stage this data, stage this other data, join these things, and use a window function to take only the most recent data for each primary key. These files have always been the stuff of nightmares, and they are challenging to write, maintain, and collaborate on.
dbt allows us to construct those complex pipelines from SQL files that act like building blocks: each file defines one model, and together they build the final table within a file structure that a human can readily read and understand.
For example, think about one of those hideously long SQL files — let’s say it starts with 5 CTEs, 3 staging tables, and 1 final table. dbt lets us construct that same pipeline using separate files — in this example, we would have 5 SQL files in a folder called `ctes`, 3 files in `staging` that query from the models in `ctes`, and 1 file in `final` that runs a query against the tables created by `staging`. Modeling data with this approach makes it extremely easy to develop and debug any component of the overall pipeline, rather than scrolling around and trying to remind yourself what each chunk of code is supposed to do (we’ve all been there!).
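To make this concrete, here is a minimal sketch of what two of those model files might look like (the model, table, and column names are hypothetical, not our actual models). dbt's `ref()` function is what wires the files into a dependency graph, so each model can simply select from the one before it:

```sql
-- models/staging/stg_facebook_ads.sql
-- Hypothetical staging model: deduplicate the raw data produced by a
-- CTE-layer model, keeping only the most recent row per primary key.
select *
from {{ ref('cte_facebook_ads_raw') }}
qualify row_number() over (
    partition by ad_id
    order by loaded_at desc
) = 1
```

```sql
-- models/final/facebook_ads.sql
-- Hypothetical final model: join two staging models into the table
-- that users actually query.
select
    ads.ad_id,
    ads.campaign_name,
    spend.amount_spent
from {{ ref('stg_facebook_ads') }} as ads
left join {{ ref('stg_facebook_spend') }} as spend
    using (ad_id)
```

Because each file declares its upstream models via `ref()`, dbt can build the whole pipeline in the right order — and you can run or debug any single piece with `dbt run --select stg_facebook_ads`.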
dbt is straightforward to run locally and on a small scale. At CTA, however, our data pipelines run in Google Cloud Platform, and they need to be built to accommodate any number of pipelines we develop, for any number of users we deliver data to. In other words: not local, and definitely not small-scale. How do we handle this? CTA solves this scalability problem by running our entire ELT pipeline — in which dbt performs both the loading and transformation of the data — in DAGs deployed to Cloud Composer (GCP's managed Airflow service).
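As an illustration (this is a hedged sketch, not our actual DAG code — the `dag_id`, schedule, and project path are all placeholder assumptions), an Airflow DAG that runs dbt in Composer might look something like:

```python
# Hypothetical Airflow DAG sketch: running dbt inside a Composer environment.
# The dag_id, schedule, and --project-dir path are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="partner_sync_dbt",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Build all models in dependency order.
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /home/airflow/gcs/dags/dbt",
    )
    # Validate the results with dbt's built-in tests.
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="dbt test --project-dir /home/airflow/gcs/dags/dbt",
    )
    dbt_run >> dbt_test
```

Airflow handles scheduling, retries, and monitoring at scale, while dbt handles the ordering and execution of the SQL itself.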
One great feature of dbt is that much of the configuration can be abstracted away using variables. This means that we can run the same dbt models for any number of partners by simply passing in different variables. It also means that we can keep those values in a private repository and share the dbt used in all vendor syncs in a public repository. Anything you see in that public repo is a direct representation of the actual dbt we are running for all our syncs. So if you are a PAD user and want to know how each of your tables comes into being, you can go and see for yourself! As the number of PAD users grows, and as engineers in the space continue to gain experience with dbt, we hope others will collaborate on the repository by filing issues, requesting features, and even contributing code. Having our dbt models in a public repository also makes our work available for the benefit of anyone who wants to run them, whether or not they are PAD users or affiliated with CTA in any way.
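As a hedged sketch of what that abstraction looks like (the variable and table names here are hypothetical, not our real configuration), a model can reference values via dbt's `var()` function instead of hard-coding them:

```sql
-- models/staging/stg_vendor_contacts.sql
-- Hypothetical model parameterized per partner: the GCP project and
-- dataset are injected as dbt variables rather than hard-coded.
select
    contact_id,
    email,
    updated_at
from {{ var('partner_project') }}.{{ var('partner_dataset') }}.raw_contacts
```

At run time, the partner-specific values are supplied from the private side, e.g. `dbt run --vars '{"partner_project": "some-gcp-project", "partner_dataset": "vendor_raw"}'` — so the public SQL stays identical for every sync.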
At CTA, we strive to make Data Engineering seem effortless, yet the reality is far from it. We grapple daily with the challenge of delivering data in a useful and usable format, and that’s why we rely on dbt to construct complex pipelines with ease. Our scalable approach with dbt allows us to run our entire ELT pipeline in DAGs deployed to Cloud Composer. And with much of the configuration abstracted away using variables, we can share our dbt models in a public repository for anyone to benefit from. So come dance with us and see for yourself how our beautiful — and never effortless — machinery works!