February 2022: Update on recent dbt Cloud incidents
We want to address a couple of challenges we know you’ve run up against recently with dbt Cloud. Since last Monday, we’ve had an ongoing incident related to slowness starting up the dbt Cloud IDE. Last Thursday, we had a second, related incident where the IDE was temporarily unavailable. Some of you have been feeling pain from related availability incidents earlier in the year.
So first, sincere apologies from all of us at dbt Labs. We understand how critical the IDE is to your ability to get work done day-to-day. You expect this product to be up and available for all of you when you need it, whether it be a few minutes to get a quick PR out, or eight hours a day getting work done with dbt.
Second, to all of you who have written to us and shared your experience — a sincere thank you. I have been reading through all of your feedback and sharing it widely with the Product team and the Engineers building dbt. We are listening, and we are working on this.
You might have seen David Harting and me share details of our recent incidents and future plans in the dbt Community Slack’s #dbt-cloud channel. Today, I’d like to give you a more detailed overview of what happened, why it is happening, what we plan to do to improve things, and how you can continue to share your dbt Cloud experience with us to make sure we don’t miss anything!
An incident timeline
I want to rewind back to January of this year to replay major incidents and pull out common themes, and then lay out what we’re planning to do about them.
- Jan 14 — a code change in the IDE intended to improve performance caused us to perform git actions too quickly, resulting in contention and errors. To address this, we made changes to enable us to roll IDE code versions forward and back more seamlessly.
- Jan 16 — an outage in the CDN that powers the dbt Package Hub index triggered job failures in dbt Cloud. To address this, we have improved dbt deps retry handling and are investigating hosting options with provably better uptime and monitoring.
- Jan 17 through Jan 26 — we released a change to IDE health checks that added two minutes to IDE startup time. We reverted the change and prioritized monitoring and alerting of IDE service degradation on our roadmap.
- Feb 7 — we deployed a malformed configuration change that damaged availability for customers using GitHub hosted repositories. To address this, we have implemented new policies for safe configuration changes and added monitoring to proactively notify customers of integration impacts.
- Feb 14 — we released a new endpoint that negatively affected dbt Cloud API and scheduler performance. We reimplemented the endpoint, and are incorporating load testing and better database partitioning into our roadmap to prevent similar cascading impact.
- Feb 14 to today — we’ve experienced instability from our underlying infrastructure provider, resulting in periodic, infrequent, but very impactful slowdowns, particularly in IDE startup time. On Feb 17, we applied an aggressive mitigation for this issue, which resulted in major customer impact due to backed-up load on our infrastructure cluster. We are in active communication with our provider to address this, and we will make dbt Cloud resilient to these forms of failure in the future.
While there isn’t one cause to these incidents, there is a common theme: we need to change modes on our engineering team.
Reorienting our roadmap around scale and performance
Our roadmap is full of exciting initiatives. We already had a large portion of our engineering organization working to scale up dbt Cloud (we’ve seen a 3x increase in models built year over year, and 2.5x increase in IDE usage in the last six months!). But we ship hundreds of changes a month unrelated to scaling: bugfixes, UX improvements, security patches, new feature rollouts.
We’ve been moving forward without enough emphasis on safety over the last 2-3 months. We’ve made progress, but it’s come at too steep a cost in your trust in our platform. When we look back as a team on those incidents, we’re not happy with the tradeoff we’ve made here.
So, over the next few months, we’re re-orienting our roadmap around improved scaling, performance, and safety. We’ve prioritized major cross-team initiatives that will:
- Empower our engineering teams to release code and configuration changes more safely,
- Enable us to notify you when underlying infrastructure has an issue, before it causes a business-critical impact, and
- Allow us to scale up more seamlessly as we navigate this incredible growth.
This will be our focus for the next 2-3 months. And, when we resume “normal” engineering operations, we’ll do so with more safeguards in place, more visibility into underlying infrastructure, and better ability to scale.
Improving the architecture of dbt Cloud
We’re not stopping there. We understand that your dbt Cloud experience needs to not only be stable enough for you to be able to trust us with your most important workflows, but it also needs to be quick enough to keep up with your fast pace of development.
I’m happy to share that we have already been working on improvements to the fundamental architecture of dbt Cloud to enable it to meet you wherever you are:
- Today, the IDE requires a dedicated bootup server that contributes to slower IDE loading times. Since last year, we have been working on an always-on filesystem service that will enable us to give you “instant” access to your development environment.
- Today, the IDE loads all the files you need into the browser in order to enable you to work and re-syncs them when you make changes. This can cause delays when you need speed the most, and we are planning to eliminate the need to re-sync entirely.
- We are wrapping up improvements to the performance and reliability of scheduled runs — we are running an internal beta as I’m writing this post!
- Finally, we are actively looking to hire more humans, especially full stack, distributed systems, and staff-level engineers, who can help us tackle these big, important milestones and improve the service for all of you.
Hopefully this gives you a sense of where our attention is as a team: improving the performance and reliability of the product experiences in dbt Cloud, not by 1% or 5%, but by leaps and bounds.
We want to hear from you
Your experience with dbt Cloud matters to us. We’ve received an incredible amount of feedback, words of encouragement, and direct asks for us to improve. Some of these things are hard to hear, but all of these are things we want to hear because it means we are building a thing that people care about.
A few ways you can talk to us — we love them all!
- If you are an async communicator: reach out to firstname.lastname@example.org and describe the issue you’re experiencing. We like screenshots, Loom videos, sketches on an iPad, or whatever you need to express the problem quickly and directly. We also hang out in the #dbt-cloud channel in the dbt Community Slack, but we prefer to get your feedback directly in our support team’s inbox.
- If you prefer to jump on a call: reach out to the humans at dbt Labs who are paying particular attention to your experience with dbt Cloud today: Anna (Head of Community) & Allison (Customer Success Strategy).
Last modified on: Nov 29, 2023