Kubernetes v1.36: Smarter Controller Caches and Real-Time Insight

Welcome to our deep dive into Kubernetes v1.36! This release tackles a subtle problem in controllers called staleness. A stale controller acts on outdated information, which leads to missed actions or incorrect decisions. v1.36 brings improvements to client-go (the library controllers are built on) and to the most contended controllers in kube-controller-manager. These changes reduce the risk of acting on stale data and give you better visibility into what your controllers are doing. The questions below walk through the updates.

What is staleness in Kubernetes controllers?

Staleness describes a situation where a controller has an outdated view of the cluster. Controllers keep a local cache so they can respond quickly without querying the API server on every action. The cache is populated by watching the API server for object changes. If the cache falls behind, for example after a controller restart or an API server outage, the controller sees old data. It may then take an incorrect action (such as scaling a Deployment up when it should scale down) or do nothing when it should act. Staleness is often invisible until something breaks in production, which makes it a subtle but dangerous issue.

How does staleness affect controller behavior?

Staleness can cause three main problems:

Incorrect actions: The controller acts on stale data and makes a wrong decision, such as scaling a Deployment to zero when it is still needed.
Missing actions: The controller fails to react because its cache says everything is fine, even though the real cluster state has changed.
Slow responses: Even if the controller eventually corrects itself, the delay can hurt user experience or cause cascading issues.

These problems often stem from assumptions developers make about cache consistency – assumptions that break when events come out of order or the cache rebuilds slowly.
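To make the first failure mode concrete, here is a minimal Go sketch. It is an illustration only, not code from kube-controller-manager: the function `replicaDelta` and the replica counts are hypothetical. It shows how the same reconcile logic produces a different decision depending on whether it reads the live state or a lagging cache.

```go
package main

import "fmt"

// replicaDelta returns how many pods a controller would create (positive)
// or delete (negative) to move from the observed count to the desired count.
// Hypothetical helper for illustration only.
func replicaDelta(desired, observed int) int {
	return desired - observed
}

func main() {
	desired := 3

	// What the API server holds right now.
	actual := 3
	// What a lagging cache still reports.
	cached := 1

	// Live state: nothing to do.
	fmt.Println("delta from live state:", replicaDelta(desired, actual)) // 0

	// Stale cache: the controller would create two pods that already exist.
	fmt.Println("delta from stale cache:", replicaDelta(desired, cached)) // 2
}
```

The logic itself is correct in both calls. Only the input differs, which is why staleness bugs are hard to spot in code review.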

What causes a controller's cache to become stale?

Several scenarios can leave a controller's cache outdated:

Controller restarts: The cache must be rebuilt from scratch by listing and watching objects. During this window, the cache is empty or partially filled.
API server outages: If the API server is unreachable, the watch connection breaks and the cache stops updating. Old data remains until the connection resumes.
Out-of-order events: When the informer receives events in a different order than they happened, the cache can temporarily reflect an impossible state (e.g., a pod appearing before its namespace).
High churn: In busy clusters, a high rate of changes can overwhelm the watch stream. The informer falls behind, and the cache stays stale until it catches up or relists.
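The restart scenario above can be sketched in Go. This is a toy model, not client-go code: the `cache` type, its `synced` flag, and `podPhase` are hypothetical stand-ins for an informer's store. The point is that a lookup should distinguish "the object does not exist" from "the cache does not know yet". Controllers built on client-go typically get the same guarantee by waiting on `cache.WaitForCacheSync` before starting their workers.

```go
package main

import (
	"errors"
	"fmt"
)

var errCacheNotSynced = errors.New("cache not synced yet")

// cache is a hypothetical stand-in for an informer's local store.
type cache struct {
	synced bool
	pods   map[string]string // pod name -> phase
}

// podPhase refuses to answer until the initial list has been applied, so
// callers cannot mistake an empty, still-filling cache for a missing pod.
func (c *cache) podPhase(name string) (phase string, found bool, err error) {
	if !c.synced {
		return "", false, errCacheNotSynced
	}
	phase, found = c.pods[name]
	return phase, found, nil
}

func main() {
	c := &cache{pods: map[string]string{}}

	// Right after a restart the cache is empty because it has not synced,
	// not because the pod is gone.
	if _, _, err := c.podPhase("web-0"); err != nil {
		fmt.Println("skip reconcile:", err)
	}

	// Once the initial list has been applied, lookups can be trusted.
	c.pods["web-0"] = "Running"
	c.synced = true
	phase, found, _ := c.podPhase("web-0")
	fmt.Println(phase, found)
}
```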

What improvements does v1.36 bring?

Kubernetes v1.36 focuses on two areas:

client-go enhancements: The core library that all controllers use now includes an atomic FIFO feature (behind the AtomicFIFO feature gate). This ensures that when events arrive in batches (e.g., the initial list that populates the cache), they are processed as an atomic unit. This prevents the queue from getting into inconsistent intermediate states.
Controllers using these improvements: The most contended controllers in kube-controller-manager (such as the Deployment and ReplicaSet controllers) have been updated to use the new atomic FIFO. This makes them more resilient to stale caches.

What is atomic FIFO and how does it help?

Atomic FIFO is a new mode for the queue that processes a batch of events as a single, indivisible operation. The existing FIFO queue allowed events to be added individually as they arrived. If a logical batch arrived as separate events (for example, the items of the initial list), the queue might process part of the set before the rest arrived. This could cause the controller to see an inconsistent cache snapshot.

With atomic FIFO, all events from a batch (like the initial list or a set of related updates) are held until the batch is complete, then applied atomically. After that, the cache reflects a consistent, up-to-date state. This reduces the chance that a controller will act on stale or partial information. The feature gate is named AtomicFIFO and is opt-in initially; you can enable it to get these benefits now.
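The idea of applying a batch atomically can be sketched in a few lines of Go. This is a toy store written for this article, not client-go's implementation: the `event` and `store` types and the `applyBatch` and `snapshot` methods are all hypothetical. It only demonstrates the property described above, namely that a reader sees either none of a batch or all of it.

```go
package main

import (
	"fmt"
	"sync"
)

// event is a hypothetical, simplified change notification.
type event struct {
	key   string
	value string
}

// store is a toy cache, not client-go's implementation.
type store struct {
	mu    sync.RWMutex
	items map[string]string
}

func newStore() *store {
	return &store{items: map[string]string{}}
}

// applyBatch applies every event in the batch while holding the write lock,
// so concurrent readers never observe a partially applied batch.
func (s *store) applyBatch(batch []event) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, e := range batch {
		s.items[e.key] = e.value
	}
}

// snapshot returns a copy of the current contents under the read lock.
func (s *store) snapshot() map[string]string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make(map[string]string, len(s.items))
	for k, v := range s.items {
		out[k] = v
	}
	return out
}

func main() {
	s := newStore()

	// The initial list arrives as one batch and becomes visible all at once.
	s.applyBatch([]event{
		{key: "default/a", value: "Running"},
		{key: "default/b", value: "Pending"},
		{key: "default/c", value: "Running"},
	})
	fmt.Println("objects visible:", len(s.snapshot())) // 3
}
```

Adding the same three events one at a time, with the lock released between them, would let a reader observe one or two objects, which is the partial state the batch approach is meant to rule out.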

How can I take advantage of these improvements?

To use the new atomic FIFO in your own controllers:

Update your client-go dependency: Make sure you're using a version of client-go that includes the AtomicFIFO feature (available in Kubernetes v1.36).
Enable the feature gate: When initializing your informer factory or queue, set the AtomicFIFO feature gate to true. This can be done via environment variable or controller flags.
Monitor your controller: With atomic FIFO, you can also introspect the cache to see the latest resource version processed, giving you better observability into cache freshness.

For operators, simply upgrading kube-controller-manager to v1.36 will bring these benefits to built-in controllers. No configuration changes are required unless you want to opt in for custom controllers.

How does v1.36 improve observability for controllers?

Alongside staleness mitigation, v1.36 adds better observability hooks. With atomic FIFO, the queue exposes metrics and logging that let you see when batches are processed and how far behind the cache might be. You can query the latest resource version applied to the cache and compare it to the current state of the API server. This makes it easier to detect staleness before it causes harm. Previously, you had to rely on indirect signals or wait for a bug report. Now, operations teams can set up alerts for cache lag, helping them catch issues early.
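As a rough sketch of the comparison described above, the Go function below computes how far a cache trails the API server. It does not use any client-go API: `cacheLag` and the sample values are hypothetical, and how you obtain the two resource versions is left open. It also assumes resource versions parse as unsigned integers. Kubernetes API conventions have traditionally treated them as opaque strings, so check that this assumption holds for your cluster before alerting on it.

```go
package main

import (
	"fmt"
	"strconv"
)

// cacheLag returns how many revisions the cache trails the server by.
// ASSUMPTION: both resource versions parse as unsigned integers.
func cacheLag(cacheRV, serverRV string) (uint64, error) {
	c, err := strconv.ParseUint(cacheRV, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("parse cache resource version %q: %w", cacheRV, err)
	}
	s, err := strconv.ParseUint(serverRV, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("parse server resource version %q: %w", serverRV, err)
	}
	if c >= s {
		// The cache is at or ahead of the sampled server version: no lag.
		return 0, nil
	}
	return s - c, nil
}

func main() {
	lag, err := cacheLag("100", "142")
	if err != nil {
		fmt.Println("cannot compute lag:", err)
		return
	}
	fmt.Println("cache lag:", lag) // 42
}
```

An alert could fire when the lag stays above a chosen threshold for some time. The right threshold depends on how much churn the cluster normally sees.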

These observability improvements complement the correctness fixes, giving you both a safer and more transparent controller ecosystem.
