How Cloudflare's 'Code Orange: Fail Small' Project Strengthened Network Resilience

Cloudflare recently completed a major engineering initiative called "Code Orange: Fail Small," aimed at making its infrastructure more resilient, secure, and reliable. Over two and a half quarters, the team focused on preventing future outages like those on November 18 and December 5, 2025, by improving configuration safety, reducing failure impact, and enhancing incident response. This project introduced key tools and processes that now protect customer traffic more effectively. Below, we answer common questions about what was done, how it works, and what it means for you.

What is Code Orange: Fail Small?

Code Orange: Fail Small is an internal project name for a comprehensive effort to make Cloudflare's network more resilient. The project targeted weaknesses exposed by two global outages in late 2025. Instead of merely patching those specific issues, the team redesigned how configuration changes are deployed, how failures are contained, and how the company communicates during incidents. The core philosophy is to ensure that any failure affects only a small portion of the network, not the entire system. This was achieved through new tools and stricter procedures that now apply across all product teams. The project is considered complete, but resilience remains an ongoing priority integrated into Cloudflare's development lifecycle.

Source: blog.cloudflare.com

Why was this project necessary?

Cloudflare experienced two global outages in late 2025, on November 18 and December 5. Both incidents were traced back to configuration changes that had unintended, widespread effects. The deployment methods in use at the time lacked gradual rollout and real-time health checks, so a single mistake could cascade across the entire network. Code Orange was launched to prevent such scenarios by redesigning how changes are introduced. The goal was to catch problems early, limit their blast radius, and roll back automatically before customers are impacted. This proactive approach reduces the risk of future outages and builds long-term reliability for the millions of websites and services that rely on Cloudflare.

What are the key areas of improvement?

The project focused on four main areas: safer configuration changes, reducing the impact of failures, revising break-glass procedures, and improving incident management. Additionally, the team introduced measures to prevent drift and regressions over time, and strengthened customer communication during outages. Safer configuration changes involve health-mediated deployment, which uses real-time monitoring to automatically roll back problematic updates. Reducing failure impact includes regional isolation techniques so a problem in one data center doesn't affect others. Break-glass procedures were updated to ensure emergency access is both fast and safe. Finally, incident management now has clearer roles, better documentation, and faster communication channels.

How does Cloudflare make configuration changes safer?

Previously, many internal configuration changes were deployed instantly across the entire network. With Code Orange, Cloudflare now uses a health-mediated deployment methodology for all high-risk configuration updates. Changes are rolled out progressively, with real-time health monitoring at each step. If a problem is detected, the system automatically reverts the change before it affects customer traffic. The key enabler is a new internal component called Snapstone, which provides a unified platform for gradual releases. Teams can now define any configuration unit, such as a data file or control flag, and deploy it safely. The approach mirrors how software releases are already managed, and it closes a critical gap in configuration management.
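The progressive rollout described above can be illustrated with a minimal sketch. This is not Cloudflare's implementation: the stage fractions, the function name `progressive_rollout`, and the three callbacks are all hypothetical, chosen only to show the idea of widening exposure step by step and reverting on the first failed health check.

```python
from typing import Callable, Dict, Sequence

# Hypothetical stages: the fraction of the network that has the change.
STAGES: Sequence[float] = (0.01, 0.10, 0.50, 1.00)


def progressive_rollout(
    apply_to_fraction: Callable[[float], None],
    is_healthy: Callable[[], bool],
    revert: Callable[[], None],
    stages: Sequence[float] = STAGES,
) -> Dict[str, object]:
    """Widen exposure one stage at a time; revert on the first failed check."""
    reached = 0.0
    for fraction in stages:
        apply_to_fraction(fraction)
        reached = fraction
        if not is_healthy():
            revert()
            return {"status": "reverted", "reached": reached}
    return {"status": "completed", "reached": reached}


# Simulated network: health degrades once half of it has the change.
state = {"fraction": 0.0}
result = progressive_rollout(
    apply_to_fraction=lambda f: state.update(fraction=f),
    is_healthy=lambda: state["fraction"] < 0.5,
    revert=lambda: state.update(fraction=0.0),
)
print(result)
```

In this simulation the change passes the first two stages, fails the health check at the third, and is reverted, so it never reaches the full network.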

What is Snapstone and how does it work?

Snapstone is a custom system built by Cloudflare to bring health-mediated deployment to configuration changes. It works by bundling configuration updates into packages and releasing them gradually across the network. After each incremental step, the system monitors health metrics such as error rates, latency, and CPU usage. If any metric crosses a predefined threshold, Snapstone automatically initiates a rollback to the previous version. What makes Snapstone powerful is its flexibility: it can manage any type of configuration, from data files that define routing rules to control flags that enable new features. Previously, applying health mediation to configuration required custom work by each team and was done inconsistently. Snapstone standardizes this process, making safe deployments the default for all product teams.
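The threshold-and-rollback behavior can be sketched as follows. Snapstone's actual interface is not public in this article, so everything here is an assumption made for illustration: the threshold values, the `ConfigUnit` class, and the `breached_metrics` helper. Only the three metric types come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical limits; the article names the metrics but not their thresholds.
THRESHOLDS: Dict[str, float] = {
    "error_rate": 0.01,       # fraction of failed requests
    "latency_ms": 250.0,      # latency in milliseconds
    "cpu_utilization": 0.85,  # fraction of CPU in use
}


def breached_metrics(
    metrics: Dict[str, float], thresholds: Dict[str, float] = THRESHOLDS
) -> List[str]:
    """Return the sorted names of metrics above their threshold.

    A metric missing from `metrics` is treated as 0.0, i.e. not breached.
    """
    return sorted(
        name for name, limit in thresholds.items()
        if metrics.get(name, 0.0) > limit
    )


@dataclass
class ConfigUnit:
    """A versioned configuration unit, such as a data file or control flag."""
    name: str
    versions: List[str] = field(default_factory=list)

    @property
    def active(self) -> str:
        return self.versions[-1]

    def release(self, new_version: str, observed: Dict[str, float]) -> str:
        """Activate new_version, then roll back if any metric is breached.

        Assumes at least one earlier version exists to roll back to.
        """
        self.versions.append(new_version)
        if breached_metrics(observed):
            self.versions.pop()  # restore the previous version
        return self.active


flag = ConfigUnit("feature_flag", ["v1"])
healthy = {"error_rate": 0.002, "latency_ms": 120.0, "cpu_utilization": 0.40}
degraded = {"error_rate": 0.05, "latency_ms": 120.0, "cpu_utilization": 0.40}
after_v2 = flag.release("v2", healthy)   # v2 stays active
after_v3 = flag.release("v3", degraded)  # v3 is rolled back to v2
print(after_v2, after_v3, breached_metrics(degraded))
```

A real system would evaluate metrics repeatedly during each rollout step rather than once per release; the single check here keeps the rollback logic easy to follow.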

How does this affect customers?

For most Cloudflare customers, these changes improve reliability without requiring any action. The most visible effect is that internal configuration updates no longer reach the network instantly. Instead, they are rolled out incrementally with automated health checks, so a change that would cause problems is caught and reverted before customer traffic is impacted. Over time, customers should experience fewer unplanned outages and more stable performance. Additionally, Cloudflare has strengthened its communication during incidents, providing clearer and more frequent updates. While no network can be 100% immune to failures, Code Orange significantly reduces the likelihood and scope of major disruptions, giving customers greater confidence in the platform.

How does Cloudflare prevent regressions over time?

To ensure that the improvements from Code Orange remain effective, Cloudflare introduced measures to prevent drift and regressions. This includes periodic audits of configuration pipelines, automated testing of break-glass procedures, and continuous monitoring of deployment health. All product teams are now required to use Snapstone for any configuration change that could affect customer traffic. The company also updated its incident management playbooks to include post-mortem reviews that check whether new safeguards were bypassed. By embedding resilience into the development lifecycle, Cloudflare aims to maintain the high bar set by this project and avoid slipping back into old habits. These measures are part of a broader culture of reliability that now influences how every team plans and deploys changes.
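One way to picture an audit for bypassed safeguards is a check over change records. The record format below is invented for illustration: the `ChangeRecord` fields and the `"snapstone"` and `"direct"` labels are assumptions, not a documented Cloudflare schema.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ChangeRecord:
    """Hypothetical audit record for one configuration change."""
    change_id: str
    affects_customer_traffic: bool
    deployed_via: str  # assumed labels, e.g. "snapstone" or "direct"


def find_bypasses(records: Iterable[ChangeRecord]) -> List[str]:
    """Return IDs of traffic-affecting changes that skipped gradual rollout."""
    return [
        r.change_id
        for r in records
        if r.affects_customer_traffic and r.deployed_via != "snapstone"
    ]


records = [
    ChangeRecord("chg-1", True, "snapstone"),
    ChangeRecord("chg-2", True, "direct"),
    ChangeRecord("chg-3", False, "direct"),
]
bypasses = find_bypasses(records)
print(bypasses)
```

Only the second record is flagged: it affects customer traffic and did not go through the gradual pipeline. The third is ignored because it does not affect customer traffic.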
