How to Use Pressure Stall Information (PSI) Metrics in Kubernetes v1.36+

Introduction

Monitoring resource contention in Kubernetes has long relied on simple utilization metrics—CPU percentage, memory usage, I/O rates. But these numbers alone can mask the real pain: tasks waiting in line. Pressure Stall Information (PSI), introduced in the Linux kernel in 2018 and now graduated to General Availability (GA) in Kubernetes v1.36, provides precise signals of resource saturation before it turns into an outage. PSI tells you the percentage of time that tasks are stalled on CPU, memory, or I/O resources, capturing both cumulative totals and moving averages (10s, 60s, 300s windows). This guide walks you through enabling, accessing, and interpreting PSI metrics in your Kubernetes cluster so you can detect and diagnose resource pressure early.

Source: kubernetes.io

What You Need

- Linux nodes running kernel 4.20 or newer (PSI first shipped in 4.20)
- Kubernetes v1.36, or an earlier version where the KubeletPSI feature gate is available
- Shell access to the nodes and kubectl access to the cluster
- Optionally, Prometheus and Grafana for the monitoring setup in Step 5

Step 1: Verify Linux Kernel PSI Is Active

PSI must be enabled at the kernel level. On each node you plan to monitor, run:

cat /proc/pressure/cpu
cat /proc/pressure/memory
cat /proc/pressure/io

If you see output like some avg10=0.00 avg60=0.00 avg300=0.00 total=0, PSI is working. If the files don't exist, PSI is either not compiled into the kernel or disabled at boot. Check the kernel config with zgrep CONFIG_PSI /proc/config.gz. If it's missing, you need a kernel rebuild or a different distribution.
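The /proc/pressure files share a simple line-oriented format, which makes them easy to read from scripts. A minimal Python sketch of a parser (the parse_psi helper is illustrative, not part of any Kubernetes tooling):

```python
def parse_psi(text):
    """Parse /proc/pressure/* output into {line_type: {field: value}}.

    Example input line:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    """
    result = {}
    for line in text.strip().splitlines():
        # First token is 'some' or 'full'; the rest are key=value pairs.
        kind, *fields = line.split()
        result[kind] = {}
        for field in fields:
            key, value = field.split("=")
            result[kind][key] = float(value)
    return result

sample = ("some avg10=1.53 avg60=0.87 avg300=0.12 total=123456\n"
          "full avg10=0.00 avg60=0.00 avg300=0.00 total=789")
parsed = parse_psi(sample)
print(parsed["some"]["avg10"])  # 1.53
```

In practice you would pass this the contents of /proc/pressure/cpu, /proc/pressure/memory, or /proc/pressure/io read on the node.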

Step 2: Enable the KubeletPSI Feature Gate (if not already)

In Kubernetes v1.36, the KubeletPSI feature gate is enabled by default. But if you are on an earlier version or have custom kubelet configuration, ensure it's turned on. Edit the kubelet configuration file (often /var/lib/kubelet/config.yaml) and add:

featureGates:
  KubeletPSI: true

Then restart the kubelet:

systemctl restart kubelet

To confirm the feature is active, check the kubelet logs for a line like PSI metrics collection enabled.

Step 3: Access PSI Metrics from the Kubelet

The kubelet exposes PSI metrics on its metrics endpoint (default port 10250). Query node-level pressure:

kubectl get --raw /api/v1/nodes/<node-name>/proxy/metrics | grep psi

You'll see metrics like the following (values vary with node load; names follow the node_psi_* convention used later in this guide):

node_psi_cpu_pressure_avg10 0.42
node_psi_memory_pressure_avg10 0.00
node_psi_io_pressure_avg10 1.87

To drill down to pod and container levels, use the container_psi_* series. For example:

container_psi_cpu_pressure_avg10{container="my-app",namespace="default"}
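If you want to pull out just the PSI series programmatically rather than with grep, you can filter the Prometheus text format the kubelet returns. A small sketch, assuming metric names containing "_psi_" as in the examples above (the exact series your kubelet exposes may differ):

```python
import re

# One sample per line in Prometheus text exposition format:
#   metric_name{label="value",...} 1.23
METRIC_RE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
                       r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$')

def psi_series(text):
    """Return (name, labels, value) for every *_psi_* sample line."""
    out = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        m = METRIC_RE.match(line)
        if m and "_psi_" in m.group("name"):
            out.append((m.group("name"),
                        m.group("labels") or "",
                        float(m.group("value"))))
    return out

sample = '''# HELP container_psi_cpu_pressure_avg10 CPU pressure avg10
container_psi_cpu_pressure_avg10{container="my-app",namespace="default"} 15.3
container_cpu_usage_seconds_total{container="my-app"} 42.0'''
print(psi_series(sample))
```

Feed it the body of the /metrics response and it discards everything except the pressure series.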

Step 4: Interpret PSI Metrics Correctly

Understanding what PSI numbers mean is key. There are two types of pressure per resource:

some: the percentage of time at least one task was stalled on the resource.
full: the percentage of time all non-idle tasks were stalled on the resource at once. (At the system level, full is not meaningful for CPU and the kernel reports it as zero.)

Each metric comes with moving averages over 10, 60, and 300 seconds. A high 10-second average indicates a transient spike; a consistently high 300-second average signals sustained contention. The total field is the cumulative stalled time in microseconds since boot.

Example: If node_psi_cpu_pressure_avg10 is 15.3, it means over the last 10 seconds, an average of 15.3% of time tasks were waiting for CPU. This is far more informative than raw CPU utilization, which might show 80% but hide scheduling delays that cause PSI to spike.
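Because total is cumulative stall time in microseconds, you can also derive average pressure over any custom window from two samples of it. A small sketch (the sample values below are illustrative):

```python
def pressure_pct(total_start_us, total_end_us, interval_s):
    """Average stall percentage over a sampling interval.

    total_start_us / total_end_us are the cumulative 'total' values
    (microseconds) from two reads of a PSI file, interval_s seconds apart.
    """
    stalled_us = total_end_us - total_start_us
    return 100.0 * stalled_us / (interval_s * 1_000_000)

# Two reads 10 s apart; tasks were stalled for 1.53 s of those 10 s.
print(pressure_pct(1_000_000, 2_530_000, 10))  # 15.3
```

This matches the avg10 example above: 1.53 seconds stalled out of a 10-second window is 15.3% pressure.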

Step 5: Set Up Continuous Monitoring

Prometheus can scrape these metrics. Add a scrape target in your prometheus.yml:

scrape_configs:
  - job_name: 'kubelet-psi'
    metrics_path: /metrics
    scheme: https
    tls_config:
      insecure_skip_verify: true
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
      - role: node

Then create Grafana dashboards using the PSI metrics. Alert on thresholds: for example, if avg60 exceeds 50 for memory, it may indicate imminent OOM.
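As a starting point, a Prometheus alerting rule along those lines might look like this. This is a sketch assuming the node_psi_* metric names used in this guide; tune the threshold and for duration to your workloads:

```yaml
groups:
  - name: psi-alerts
    rules:
      - alert: MemoryPressureHigh
        # Sustained memory stall: avg60 above 50% for five minutes straight.
        expr: node_psi_memory_pressure_avg60 > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory pressure high on {{ $labels.instance }}"
          description: "avg60 memory PSI above 50% for 5m; possible imminent OOM."
```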

Step 6: (Optional) Tune for Production Scale

SIG Node's performance testing (detailed in the Kubernetes v1.36 release blog) showed that enabling PSI collection adds negligible overhead—less than 2.5% of CPU on a 4-core node with 80+ pods. The kubelet's polling logic is lightweight and blends into standard housekeeping. No extra tuning is required, but if you run on very small nodes (<2 cores), you can disable the feature gate if needed (not recommended).
