Seemingly easy, but in reality hard
Running a single dbt command in production from start to finish is satisfyingly simple. But if you’ve ever managed your own orchestration infrastructure, you know all too well the hard part is in the maintenance of the system, not in the proof of feasibility.
The decision to build in house is often one of the riskiest a company can make. At the outset, the journey looks easy: “How hard can it be? We already have the orchestrator up and running.” The real operational tax, though, is less understood and always looks different months after the decision, because delivering consistent reliability and performance at scale is hard, and software breaks in fabulously frustrating ways.
How do I know this? Because my team builds dbt Cloud’s Scheduler, and we execute over 10M runs every month for our customers, keeping pace with a run volume that grows incredibly quickly. By serving the dbt scheduling needs for 8k+ accounts, we’ve nearly seen it all. We have a team actively investing in our scheduling and logs infrastructure, so that you don’t have to, and we’re committed to building tailor-made experiences, suited for the dbt deployment workflow.
We’ve introduced some meaningful improvements to our Scheduler and logs in the past couple of months, and we want to share the wins with you:
- Faster run starts: dbt Cloud jobs at the top of the hour begin execution 75% faster compared to the start of the year.
- Unlimited job concurrency (enterprise, multi-tenant plan only): Want to run 50 jobs at the same time? Go wild.
- Smarter cancellation for over-scheduled jobs: Have a job that’s scheduled faster than it can complete? dbt Cloud will automatically cancel unnecessary runs (and let you know).
- More usable run logs: Discover errors and begin remediation faster.
dbt Cloud’s Scheduler handles the edge cases, so you don’t have to. With retries on connection to the warehouse, repository cloning, and pod memory scaling, we keep your dbt Cloud jobs running on time.
Introducing faster run starts
The Scheduler takes care of preparing each dbt Cloud job to run in your cloud data platform. This prep involves readying a Kubernetes pod with the right version of dbt installed, setting environment variables, and loading data platform credentials and Git provider authorization, amongst other environment-setting tasks. Only after the environment is set up can dbt execution begin. We display this time to the user in dbt Cloud as “prep time”.
For all its strengths, Kubernetes comes with a lot of challenges, and spinning up pods and spinning down pods were a meaningful drag on the total run execution time. We’ve rebuilt our scheduler to now have a ready pool of pods to execute customers’ jobs. You’ll no longer see long prep times at the top of the hour, and we’re determined to keep runs starting near instantaneously. Don’t just take our word for it; here’s some data.
Jobs scheduled at the top of the hour have the longest prep time because of the volume of runs the scheduler has to process. The scheduler used to take more than 106 seconds to prep the slowest 1% of all runs. Even though the volume of runs has grown significantly since the beginning of the year, the scheduler now preps runs, at worst, in 27 seconds. That’s roughly a 75% reduction in prep time for runs at peak traffic times!
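If you want to check the math, here is a quick sketch using only the two figures quoted above. The function name `percent_reduction` is ours, purely for illustration.

```python
def percent_reduction(before_s: float, after_s: float) -> float:
    """Return the percentage drop going from before_s to after_s."""
    return (before_s - after_s) / before_s * 100


# Figures from this post: prep time for the slowest 1% of runs, in seconds.
reduction = percent_reduction(106, 27)
print(f"{reduction:.1f}% less prep time")
```

The result rounds to the 75% quoted above.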
Introducing unlimited job concurrency for our Enterprise plan
Our scheduler is more durable than ever, and we want our customers to lean into running as many jobs as they need to for their business. On our enterprise, multi-tenant plan, customers can now benefit from unlimited job concurrency. Previously, enterprises had access to a fixed number of run slots, but now they are unconstrained. Team plan customers will continue to have only 2 run slots. Each running job occupies a run slot for the duration of the run, and if a run slot is not available, the job will wait in a run queue.
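To make the run slot idea concrete, here is a toy model of the behavior described above. It is a simplified sketch, not how the dbt Cloud Scheduler is implemented: the `RunSlots` class, its method names, and the first-in, first-out queue order are all assumptions made for illustration.

```python
from collections import deque
from typing import Optional


class RunSlots:
    """Toy model: a fixed number of run slots plus a run queue (hypothetical)."""

    def __init__(self, slots: Optional[int]):
        self.slots = slots      # None models unlimited job concurrency
        self.running = set()    # run ids currently occupying a slot
        self.queue = deque()    # run ids waiting for a slot (FIFO is an assumption)

    def submit(self, run_id: str) -> str:
        """Start the run if a slot is free, otherwise put it in the queue."""
        if self.slots is None or len(self.running) < self.slots:
            self.running.add(run_id)
            return "running"
        self.queue.append(run_id)
        return "queued"

    def finish(self, run_id: str) -> Optional[str]:
        """Free the slot and start the oldest queued run, if there is one."""
        self.running.remove(run_id)
        if self.queue:
            next_run = self.queue.popleft()
            self.running.add(next_run)
            return next_run
        return None


# Team plan: 2 run slots, so the third run has to wait.
team = RunSlots(slots=2)
states = [team.submit(run_id) for run_id in ("a", "b", "c")]
print(states)            # ['running', 'running', 'queued']
print(team.finish("a"))  # c

# Enterprise, multi-tenant plan: no slot limit, so nothing queues.
enterprise = RunSlots(slots=None)
all_running = all(enterprise.submit(str(i)) == "running" for i in range(50))
```

With two slots, the third run waits until one of the first two finishes. With no limit, all 50 runs start immediately.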
Introducing smarter cancellation for over-scheduled jobs
The dbt Cloud Scheduler now prevents queue clog by canceling unnecessary runs of over-scheduled jobs.
The duration of a job run tends to grow over time, usually because of growing amounts of data in the warehouse. If the run duration becomes longer than the interval between the job’s scheduled runs, the queue will grow faster than the scheduler can process the job’s runs, leading to a runaway queue with runs that don’t need to be processed.
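A rough back-of-the-envelope model shows how quickly this gets out of hand. The sketch below assumes that only one run of a given job executes at a time and that runs start back-to-back; the function `queue_depth` and the example numbers (a 15-minute job scheduled every 10 minutes) are hypothetical.

```python
def queue_depth(interval_min: int, duration_min: int, horizon_min: int) -> int:
    """Count runs still waiting at horizon_min.

    Assumes one run of the job executes at a time, and that a waiting
    run starts as soon as the previous run finishes.
    """
    # Runs are triggered at t = 0, interval, 2 * interval, ...
    triggered = horizon_min // interval_min + 1
    # While a backlog exists, runs start at t = 0, duration, 2 * duration, ...
    # A run can never start before it has been triggered, hence the min().
    started = min(triggered, horizon_min // duration_min + 1)
    return triggered - started


# A job that takes 15 minutes but is scheduled every 10 minutes, after one day.
waiting_after_a_day = queue_depth(interval_min=10, duration_min=15, horizon_min=24 * 60)
print(waiting_after_a_day)

# A job that finishes within its schedule interval never builds a backlog.
print(queue_depth(interval_min=10, duration_min=5, horizon_min=24 * 60))
```

Under these assumptions, the backlog grows by one run every 30 minutes and never shrinks on its own.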
Previously, when a job was in this over-scheduled state, the scheduler would stop queuing runs after 50 were already in the queue. This led to a poor user experience in which the scheduler canceled runs indiscriminately.
Now, the dbt Cloud scheduler detects when a scheduled job is set to run too frequently and appropriately cancels runs that don’t need to be processed. Specifically, scheduled jobs can only ever have one run of the job in the queue, and if a more recent run gets queued, the earlier queued run will get canceled with a helpful error message.
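The rule above can be sketched in a few lines. This is an illustration of the "one queued run per scheduled job" behavior only, not the Scheduler's actual code: the `JobQueue` class, its fields, and the wording of the cancellation message are all made up for the example.

```python
class JobQueue:
    """Toy queue that keeps at most one queued run per scheduled job (hypothetical)."""

    def __init__(self):
        self.queued = {}     # job_id -> run_id of the single queued run
        self.canceled = []   # (run_id, message) for every canceled run

    def enqueue(self, job_id: str, run_id: int) -> None:
        """Queue a run, canceling any earlier run of the same job that is still queued."""
        earlier = self.queued.get(job_id)
        if earlier is not None:
            # Message text is illustrative, not the real dbt Cloud wording.
            message = f"Canceled: a more recent run ({run_id}) of this job was queued."
            self.canceled.append((earlier, message))
        self.queued[job_id] = run_id


queue = JobQueue()
for run_id in (1, 2, 3):
    queue.enqueue("hourly_build", run_id)

print(queue.queued)                                 # {'hourly_build': 3}
print([run_id for run_id, _ in queue.canceled])     # [1, 2]
```

Only the most recent run stays in the queue, so the backlog for an over-scheduled job can never exceed one run.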
Introducing more usable run logs
Triaging errors in logs is a big benefit of using dbt Cloud’s job and scheduler functionality. We want to make the process of finding the root cause of errors even easier. Recently we shipped some changes to aid with that goal, including:
- surfacing a warn state on a run step,
- allowing users to search in logs,
- easier navigation of errors and warnings in logs, and
- exposing hyperlinks in the logs to the compiled SQL artifacts.
This is just the beginning, and we’ll continue to make improvements so you can do your best investigative work.
The journey ahead
We’re committed to making dbt Cloud the easiest way to run your dbt project in production. We’re continuing to build experiences that our customers care about. We want to help you keep production data fresh on a timely basis, ensure your production pipelines are efficient, and identify the root cause of failures in deployment environments more easily.
To that end, we’re working on bringing new capabilities to the dbt Cloud scheduler and jobs. A couple of experiences that we’re excited to deliver by Coalesce in October 2023:
- the ability to retry a run from the point of failure and
- the ability to chain jobs together, so one job runs when another one completes.
We LOVE hearing from our users. If you have orchestration needs that we can better support in dbt Cloud, shoot me a note at email@example.com. No feedback is too small.
Last modified on: Nov 22, 2023