Hi 👋! It's Friday, and thank you for signing up.
"Code with passion, debug with patience, and deploy with confidence—every challenge is an opportunity to grow."
The First Strike Ever‼️
On November 12, 2024, the design world was plunged into darkness.
For millions of Canva users, a beloved, indispensable tool became inaccessible.
It set off a chaotic series of events that revealed cracks in an otherwise powerful infrastructure.
It began like any other morning.
At 8:47 AM UTC, Canva deployed a new version of its editor, eager to deliver improvements and new features to users.
But to the team's surprise, the deployment set in motion a chain of events that would send shockwaves through the system.
As users around the world began loading the editor, their requests started to take longer than usual.
In Southeast Asia, a network glitch between Cloudflare's data centers in Singapore (SIN) and Ashburn (IAD) created a bottleneck.
For users in this region, something as simple as fetching a JavaScript file became a monumental task. That file, which controlled the critical object panel of Canva's editor, had fallen victim to the network's hiccup, forcing the panel into an eternal state of loading.
The problem grew worse as more than 270,000 requests accumulated behind a single cache stream, all waiting for that elusive chunk of code.
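Canva's write-up doesn't include code, but the pile-up is easy to picture as a "single-flight" cache fetch: every request for the same asset waits on one in-flight origin fetch, and if that fetch stalls, the waiters simply accumulate. Here is a minimal Python sketch of the pattern, assuming an illustrative fetch function and timeout; it is not Canva's or Cloudflare's actual implementation.

```python
import threading
from concurrent.futures import Future

# Hypothetical single-flight cache: concurrent requests for the same asset
# share one in-flight origin fetch instead of each hitting the origin.
class SingleFlightCache:
    def __init__(self, fetch_fn, wait_timeout=5.0):
        self._fetch_fn = fetch_fn          # e.g. an HTTP call to the origin
        self._wait_timeout = wait_timeout  # how long a waiter is willing to queue
        self._lock = threading.Lock()
        self._cache = {}                   # key -> bytes
        self._inflight = {}                # key -> Future

    def get(self, key):
        with self._lock:
            if key in self._cache:
                return self._cache[key]
            fut = self._inflight.get(key)
            if fut is None:
                # First request for this key: start the (single) origin fetch.
                fut = Future()
                self._inflight[key] = fut
                threading.Thread(target=self._fill, args=(key, fut)).start()
        # Every later request waits here. If the origin fetch stalls, waiters
        # pile up behind it -- unless they give up after a bounded timeout
        # (result() raises TimeoutError, letting the caller serve an error).
        return fut.result(timeout=self._wait_timeout)

    def _fill(self, key, fut):
        try:
            data = self._fetch_fn(key)     # slow or stalled during the incident
            with self._lock:
                self._cache[key] = data
                del self._inflight[key]
            fut.set_result(data)
        except Exception as exc:
            with self._lock:
                del self._inflight[key]
            fut.set_exception(exc)
```

With no bound on how long waiters may queue, a roughly 20-minute stall is enough to strand hundreds of thousands of requests behind a single fetch; a timeout (or serving a stale copy) caps the blast radius.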
It was the perfect storm—overloaded servers, frustrated users, and no solution in sight.
The stakes couldn't have been higher: for many, Canva was their creative lifeline, a tool for work, passion, and business.
The JavaScript file was finally retrieved at 9:07 AM UTC, but the damage was done. More than a million requests hit the system within seconds, and Canva's API Gateway, the critical entry point for all user requests, began to buckle under the pressure.
At first, new API Gateway tasks spun up in response to the surge.
But they were soon consumed by the crushing load. The autoscaling mechanisms, designed to keep the system from failing, only accelerated the inevitable collapse.
The API Gateway tasks started failing one by one, causing a cascade of terminations that spiraled rapidly out of control.
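One common defence here, sketched below purely as an illustration (not Canva's gateway code), is to shed load before it buries a task: each task admits only a fixed number of in-flight requests and rejects the rest immediately with a 503, so a freshly started task gets a chance to come up instead of inheriting the full backlog.

```python
import threading

# Illustrative load-shedding guard for a gateway worker (not Canva's code).
class LoadShedder:
    def __init__(self, max_in_flight=200):
        # Each task admits at most max_in_flight concurrent requests.
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, request, downstream):
        if not self._slots.acquire(blocking=False):
            # Reject fast: cheap to serve, and it signals clients and the CDN
            # to back off instead of piling onto an overloaded task.
            return 503, "overloaded, retry later"
        try:
            return 200, downstream(request)
        finally:
            self._slots.release()
```

Without something like this, every replacement task the autoscaler launches inherits the entire backlog at once and dies the same way its predecessor did.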
But sheer traffic volume wasn't the only thing that brought Canva down.
A lurking vulnerability, an unnoticed bug in the telemetry code, had introduced a subtle performance regression.
The bug was like a crack in the foundation that no one noticed until, under strain, it widened enough to undermine the entire structure.
As the system struggled to cope, the Linux Out-Of-Memory (OOM) Killer stepped in, shutting down containers to prevent total system failure.
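The kernel's OOM Killer is a blunt last resort; a process can watch its own memory and start refusing work well before it reaches that cliff. The snippet below is only a sketch of the idea with made-up thresholds (it isn't how Canva's services are instrumented): it reads the resident set size from /proc/self/status and flips a shedding flag once a soft limit is crossed.

```python
import re

# Illustrative soft memory guard (Linux): refuse new work before the kernel
# OOM Killer has to terminate the whole container. Thresholds are made up.
SOFT_LIMIT_BYTES = 1_500_000_000   # e.g. 1.5 GiB inside a 2 GiB container

def resident_bytes() -> int:
    # VmRSS is reported in kB in /proc/self/status.
    with open("/proc/self/status") as f:
        match = re.search(r"VmRSS:\s+(\d+)\s+kB", f.read())
    return int(match.group(1)) * 1024 if match else 0

def should_shed_load() -> bool:
    return resident_bytes() > SOFT_LIMIT_BYTES

# A request loop would check should_shed_load() and answer 503 instead of
# allocating more, keeping the task alive (and draining) rather than letting
# the OOM Killer kill it abruptly.
```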
It was a chaotic plunge.
Nearly 40 minutes passed with no end to the downtime in sight.
By 9:29 AM UTC, the Canva team decided enough was enough: block all incoming traffic. Canva went fully offline, showing a temporary "traffic block" page instead, small comfort to anxious users wondering when they would get back to their projects.
The clock was ticking, and the weight of responsibility was crushing. Every single second counted.
The team tried to stabilize the system by launching more and more API Gateway tasks, but the overwhelming load terminated them almost instantly.
To buy time, the team turned to Cloudflare, temporarily blocking traffic at the CDN level.
It was a rather bold move—a necessary evil. It bought precious minutes.
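Conceptually, that CDN-level block is a kill switch evaluated at the edge, ahead of the origin: while it is on, every request gets a static "traffic block" page and nothing reaches the struggling gateway. The handler below mimics that behaviour in plain Python; it is not Cloudflare's rule syntax or API, just the shape of the idea.

```python
# Illustrative edge kill switch (not Cloudflare's actual rule syntax or API).
BLOCK_ALL_TRAFFIC = True   # flipped on at 9:29 AM UTC, in spirit

BLOCK_PAGE = (
    "<html><body><h1>We'll be back soon</h1>"
    "<p>Canva is temporarily unavailable while we recover.</p></body></html>"
)

def edge_handler(request, forward_to_origin):
    if BLOCK_ALL_TRAFFIC:
        # Answered entirely at the edge: zero load reaches the API Gateway,
        # and Retry-After hints clients to pause before trying again.
        return 503, {"Retry-After": "120"}, BLOCK_PAGE
    return forward_to_origin(request)
```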
Through a careful, measured restoration process, the team slowly brought the system back online, restoring service first for Australian users, who had been the first to feel the storm.
They monitored traffic closely and brought more regions back online as stability returned.
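That region-by-region ramp is essentially a small control loop: admit one region, watch the error rate for a while, then widen. Here's a compact sketch with made-up region names, thresholds, and placeholder hooks for monitoring and for lifting the edge block; it isn't Canva's actual runbook.

```python
import time

# Illustrative staged-restoration loop; regions, thresholds, and the hooks
# below are assumptions for the sketch, not Canva's runbook.
RESTORE_ORDER = ["australia", "southeast-asia", "europe", "americas"]
ERROR_RATE_CEILING = 0.01          # pause the ramp if >1% of requests fail
SOAK_SECONDS = 300                 # watch each region for 5 minutes

def current_error_rate(region: str) -> float:
    return 0.0                     # placeholder: query the monitoring system

def unblock_region(region: str) -> None:
    print(f"traffic restored for {region}")   # placeholder: update the edge rule

def staged_restore() -> None:
    for region in RESTORE_ORDER:
        unblock_region(region)
        deadline = time.time() + SOAK_SECONDS
        while time.time() < deadline:
            if current_error_rate(region) > ERROR_RATE_CEILING:
                raise RuntimeError(f"instability in {region}, pausing ramp")
            time.sleep(10)
```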
The lesson was learned: Canva's infrastructure had to adapt to unforeseen storms.
It was a perfect storm of technical failures, but also of human oversights.
The bug in the telemetry code could have been caught.
The surge in traffic could have been anticipated. And the vulnerability in the network could have been addressed before it wreaked havoc.
In the aftermath, the team went into full recovery mode. Improvements were implemented at every level of the system, from the API Gateway to the CDN.
The telemetry bug was fixed, and new safeguards were added to prevent another performance regression. The incident response process was refined, and Canva's resilience was fortified.
But it wasn't just about fixing what was broken—it was about learning. The entire team, from engineers to product managers, took this incident as a hard-earned lesson.
Transparency became a priority, with Canva's first-ever publicly shared incident report. The team promised to do better, not just for themselves, but for the millions of users who trusted the platform every day.
It was a tough blow, but one that would ultimately forge a stronger, more resilient Canva, better equipped for the future.
Read More: Canva incident report: API Gateway outage
Hope you enjoyed reading this article.
If you found it valuable, hit a like ❤️ and consider subscribing for more such content every week.
If you have any questions or suggestions, leave a comment.