When Safety Nets Become Traps: Rethinking Scale Defenses
<p>Defense systems are essential for keeping large platforms healthy, but they can become liabilities if they are not regularly reviewed. This article explores a real incident in which outdated protections produced false positives that blocked legitimate users, and what we learned about maintaining observability and regularly pruning emergency measures.</p>
<h2 id="what-happened">What exactly went wrong with the protection systems?</h2>
<p>Our platform relies on multiple layers of rate limits, traffic controls, and other defenses to prevent abuse and keep the service responsive. Recently, some of these protections, originally added during emergency responses to attacks, were left in place long after the threat subsided. These rules were designed to block abusive traffic patterns but started matching legitimate, low-volume requests. Users began reporting <strong>“Too many requests”</strong> errors during normal browsing—for instance, when clicking a link from another app or simply navigating without any suspicious activity. The error should not have appeared for these users, as their request volume was well within normal parameters. This disruption highlighted a critical oversight: protections can quietly outlive their purpose and start harming the very users they were meant to protect.</p><figure style="margin:20px 0"><img src="https://github.blog/wp-content/uploads/2024/06/AI-DarkMode-4.png?resize=800%2C425" alt="When Safety Nets Become Traps: Rethinking Scale Defenses" style="width:100%;height:auto;border-radius:8px" loading="lazy"><figcaption style="font-size:12px;color:#666;margin-top:5px">Source: github.blog</figcaption></figure>
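<p>To make the failure mode concrete, here is a minimal sketch in Python. The pattern, header name, and session label are entirely hypothetical; the point is that a rule which blocks on a pattern match rather than on request volume can return a 429 on a user's very first request:</p>
<pre><code class="language-python"># Hypothetical sketch, not production code: the rule matches a pattern,
# never volume, so one normal request is enough to be blocked.
BLOCKED_PATTERNS = {("legacy-client-hint", "logged-out")}  # invented emergency-era pattern

def check_request(headers: dict, session_state: str) -> int:
    """Return an HTTP status code for an incoming request."""
    signature = (headers.get("x-client-hint", ""), session_state)
    if signature in BLOCKED_PATTERNS:
        # No counter is consulted, so even a first request gets "Too many requests".
        return 429
    return 200

# A single, perfectly normal logged-out request is still blocked:
print(check_request({"x-client-hint": "legacy-client-hint"}, "logged-out"))  # 429
</code></pre>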
<h2 id="user-reports">How did users report the issue and what did they experience?</h2>
<p>Users took to social media to share their frustration. They described receiving <strong>“too many requests”</strong> error messages during routine, low-volume browsing. For example, someone following a GitHub link from a blog post, or just exploring a repository without any automation, would suddenly hit a rate limit. These were not power users scraping data or running scripts; they were making a handful of normal HTTP requests. The errors disrupted their workflow and caused confusion. Some reported it happened repeatedly over several days. Although the total number of affected users was small relative to our overall traffic—roughly <strong>0.003-0.004%</strong> of all requests—for those individuals the impact was significant. We apologize sincerely for the inconvenience and acknowledge that any incorrect blocking is unacceptable, even if statistically rare.</p>
<h2 id="root-cause">What was the root cause of these false positives?</h2>
<p>Investigating the reports led us to the root cause: old protection rules that were added during previous abuse incidents had not been removed. These rules combined <em>industry-standard fingerprinting techniques</em> with <em>platform-specific business logic</em>. Fingerprinting helps identify characteristics commonly associated with abusive clients (e.g., certain headers, request patterns, or device signatures). The business logic then determined which of those fingerprint matches should actually be blocked. During an incident, these rules were crafted to be broad enough to stop the attack quickly. However, once the attack ended, the rules remained active. Over time, the same fingerprints that flagged abusive traffic began matching some legitimate, logged-out users. Specifically, about <strong>0.5-0.9%</strong> of the requests that matched the fingerprint patterns also matched the business-logic rules, and those requests were blocked every time. This composite approach, while generally effective, produced false positives because the emergency-era patterns were too aggressive for normal conditions.</p>
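<p>The sketch below illustrates the over-breadth problem. The attribute names and values are invented for illustration and are not our real signals; the point is that a coarse fingerprint chosen under incident pressure can match an ordinary logged-out visitor just as easily as it matched the attack wave:</p>
<pre><code class="language-python">def fingerprint(request: dict) -> frozenset:
    # Coarse attributes, chosen quickly during an incident (illustrative only).
    return frozenset({
        request.get("tls_version", ""),
        request.get("accept_language", ""),
        "logged_out" if not request.get("session") else "logged_in",
    })

# Emergency-era blocklist entry that matched the attack traffic.
EMERGENCY_BLOCKLIST = {frozenset({"TLSv1.3", "en-US", "logged_out"})}

attack_bot     = {"tls_version": "TLSv1.3", "accept_language": "en-US"}
normal_visitor = {"tls_version": "TLSv1.3", "accept_language": "en-US"}  # same coarse shape

print(fingerprint(attack_bot) in EMERGENCY_BLOCKLIST)      # True
print(fingerprint(normal_visitor) in EMERGENCY_BLOCKLIST)  # True -- a false positive
</code></pre>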
<h2 id="fingerprinting-role">How did the composite fingerprinting actually work?</h2>
<p>We use a multi-signal approach to distinguish legitimate users from abusers. The first layer consists of <strong>fingerprinting techniques</strong>—these are combinations of attributes like HTTP headers, TLS handshake parameters, and other client-side characteristics that are often unique to automated or malicious clients. The second layer is <strong>platform-specific business logic</strong>, which looks at behavior like unusually fast navigation, repeated access to private endpoints, or other patterns common in abuse. When a request matches certain fingerprints, it is flagged as suspicious. Then the business-logic rules evaluate that flagged request and decide whether to apply a rate limit or block. In the incident, about <strong>99.1-99.5%</strong> of flagged requests were allowed through, meaning they only matched the fingerprints but not the business logic. However, the remaining <strong>0.5-0.9%</strong> that did match both criteria were completely blocked. Unfortunately, some of those were legitimate users whose request patterns happened to align with the emergency-era composite signals. This taught us that while composite signals are powerful, they must be continuously validated against current traffic.</p><figure style="margin:20px 0"><img src="https://github.blog/wp-content/uploads/2024/05/Enterprise-DarkMode-3.png?resize=800%2C425" alt="When Safety Nets Become Traps: Rethinking Scale Defenses" style="width:100%;height:auto;border-radius:8px" loading="lazy"><figcaption style="font-size:12px;color:#666;margin-top:5px">Source: github.blog</figcaption></figure>
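<p>Here is one way to picture how the two layers compose. This is a sketch under assumed signals (a placeholder <code>client_signature</code> check and a placeholder navigation-speed check), not the actual implementation; only the shape of the decision is the point:</p>
<pre><code class="language-python">def matches_fingerprint(request: dict) -> bool:
    # Layer 1: headers, TLS parameters, other client characteristics (placeholder check).
    return request.get("client_signature") == "suspicious-combo"

def matches_business_logic(request: dict) -> bool:
    # Layer 2: platform-specific behavior, e.g. unusually fast navigation (placeholder check).
    return request.get("requests_per_minute", 0) > 120 or request.get("hits_private_endpoints", False)

def decide(request: dict) -> str:
    if not matches_fingerprint(request):
        return "allow"   # never flagged
    if not matches_business_logic(request):
        return "allow"   # flagged but allowed: roughly 99.1-99.5% of flagged requests
    return "block"       # flagged and matched: the remaining 0.5-0.9%, blocked every time
</code></pre>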
<h2 id="impact-scale">What was the scale of the impact and why is it unacceptable?</h2>
<p>The false-positive rate was extremely low when viewed against total traffic—approximately <strong>0.003-0.004%</strong> of all requests to the platform. In raw numbers, that meant only a tiny fraction of users were affected. However, for those users, the experience was entirely broken. They could not perform normal actions like viewing a repository or following a link, and the error gave no indication of why they were blocked or how to resolve it. From a metrics perspective, the incident was a statistical blip, but from a user experience perspective, it was a complete failure. We set high standards for availability and accessibility, and even one incorrectly blocked legitimate request is too many. The lesson is that defense systems must be audited regularly, especially those added under time pressure. The cost of a false positive can be disproportionate to its frequency, eroding trust and causing real frustration. We are committed to doing better.</p>
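<p>A back-of-envelope check ties the two quoted ranges together. The implied share of traffic that matched the fingerprints is derived here purely for illustration; it is not a published figure:</p>
<pre><code class="language-python"># Derived, not published: overall rate = (share of traffic matching fingerprints)
#                                        x (share of matches blocked by business logic)
blocked_among_flagged = (0.005, 0.009)    # 0.5-0.9% of fingerprint-flagged requests were blocked
overall_blocked = (0.00003, 0.00004)      # 0.003-0.004% of all requests were blocked

implied_low = overall_blocked[0] / blocked_among_flagged[1]    # about 0.33% of all requests
implied_high = overall_blocked[1] / blocked_among_flagged[0]   # about 0.8% of all requests
print(f"implied fingerprint-match rate: {implied_low:.2%} to {implied_high:.2%} of all traffic")
</code></pre>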
<h2 id="lessons-learned">What lessons did we learn about managing defense systems?</h2>
<p>This incident reinforced several key principles for operating defense at scale. First, <strong>observability must extend to protective systems</strong>, not just features. We need real-time dashboards and alerts that show false-positive rates for every rule, especially those deployed in emergencies. Second, all emergency measures should have <strong>automatic expiration dates</strong> or require periodic review. A rule added during an attack should be re-evaluated after a set period (e.g., 30 days) to determine if it still makes sense. Third, user feedback is invaluable: reports of unusual blocks must be investigated promptly and with a bias toward action. We are implementing a process where any protection rule that triggers a false positive is immediately flagged and reviewed. Finally, we learned that <strong>composite signals need careful tuning</strong>; business-logic rules should be as narrow as possible to minimize side effects. Moving forward, we will adopt a more systematic approach to defense lifecycle management—protections should expire if not continuously validated. We apologize again for the disruption and thank users for their patience and reports.</p>
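<p>One possible shape for that lifecycle rule, sketched in Python: every emergency rule carries a review timestamp and stops being enforced once the review window lapses. The 30-day window, type names, and registry behavior are illustrative assumptions, not our production tooling:</p>
<pre><code class="language-python">from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

REVIEW_WINDOW = timedelta(days=30)  # assumed window: re-evaluate every emergency rule after 30 days

@dataclass
class ProtectionRule:
    name: str
    added_during_incident: str
    last_reviewed: datetime

    def is_active(self, now: datetime) -> bool:
        # A rule stays active only while it has been reviewed recently.
        return (now - self.last_reviewed) <= REVIEW_WINDOW

def enforce(rules: list[ProtectionRule], now: datetime) -> list[ProtectionRule]:
    """Return only the rules that are still valid; surface the rest for review."""
    active, expired = [], []
    for rule in rules:
        (active if rule.is_active(now) else expired).append(rule)
    for rule in expired:
        # In practice this would emit a metric and page the owning team, not just print.
        print(f"rule {rule.name!r} from {rule.added_during_incident} expired without review")
    return active
</code></pre>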