So, what is PAD?
PAD, the Progressive Action Database, is CTA’s core data warehouse and integrations platform. Think of PAD (yes, like where a frog lives) as the jumping-off point for all things data.
PAD brings together syncs from tools like EveryAction and OpenField, as well as voter data from Catalist and TargetSmart. We also integrate geocodes, digital ad analytics, and so much more.
When we set out to build PAD we had three core goals in mind (and have stayed true to these):
- Make data accessible
- Allow for sustainable growth as technology changes and evolves
- Leverage cutting-edge tooling.
That’s why so much of our stack is built on open-source tech and the powerful Google Cloud Platform. Rather than build from scratch, we leverage already-existing tooling for PAD, leaning hard on Google’s data infrastructure cloud offerings. Our goal is to pull together tools and features to provide our partners with all the same tools we use internally to develop data pipelines and manage data infrastructure.
Here’s a deeper look inside what’s new (and still true) in PAD:
PAD has the ability to offer large storage for large amounts of frequently changing data, at very little cost and latency.
Storing and querying data based in PAD is easier and cheaper than in other data warehouses because of features BigQuery makes available — and that CTA uses extensively!
Depending on the use case, CTA recommends one of several different materializations for your data. BigQuery offers caching, scheduled queries, views, and materialized views. Additionally, tables can be copied (or cloned) with zero latency.
For non-technical users of PAD, we leverage Google’s BigQuery to Connected Sheets functionality to provide direct data access to perform basic analytical functions, without requiring technical expertise or knowledge of querying languages like SQL.
BigQuery offers out-of-the-box scheduling for SQL queries. You can schedule queries through the Google Cloud Console, BigQuery CLI tool, or through Google’s SDKs. Scheduled Queries use BigQuery’s native parameterization, so you can include date strings and other values in your queries. Additionally, you can schedule queries as a service account instead of as yourself, so that queries run with a single user that has access to all of the resources it needs.
Protip: If you’re a PAD user, we provision a service account for this use-case! Email firstname.lastname@example.org for more info.
Google Workflows is one of the (many) services offered in Google Cloud Platform. It enables the user (like you!) to easily orchestrate processes using a simple YAML file. It’s a serverless solution that you can deploy from the Google Console, CLI or API.
It is particularly useful when you are performing tasks that use the Google API and Google Cloud Services. Rather than needing to write and deploy custom code to handle different APIs, Google Workflows makes it simple to configure tasks and set up dependencies between them.
We use Google Workflows for many of our orchestration and pipeline creation use cases.
Read more about how CTA implemented Workflows to move data between S3 and Google Drive into BigQuery.
Advanced PAD users can leverage Google’s BigQuery CLI and other Google SDKs. CTA uses the Python and Node.JS SDKs internally to support everything we do, from automation of onboarding users, scheduling workflows and SQL, and managing authentication, authorization, and more.
Parameterized Queries allows for easy-to-use SQL injection. You can easily leverage query parameterization via the BQ CLI. Injecting parameters via the BQ CLI tools is as easy as using variables and using standard shell interpolation. Here’s an example looking at name popularity using BigQuery’s public data:
$ bq query \
--parameter='states:ARRAY<STRING>:["WA", "WI", "WV", "WY"]' \
SUM(number) AS count
gender = @gender
AND state IN UNNEST(@states)
If you want to learn more about PAD or think it’s the data management system for you, email us at email@example.com and we’ll chat!