The Art of Kubernetes Request Sizing

Stop guessing your CPU and RAM requests. A data-driven guide to right-sizing your pods without killing performance.

Jesus Paz
2 min read

In my previous post about CPU limits, I argued that limits are often dangerous. But if you remove limits, your Requests become the most critical configuration in your cluster.

Requests are the contract you sign with the Kubernetes scheduler. Set them too high, and you waste money (the “Slack Tax”). Set them too low, and your nodes get oversubscribed, leading to CPU contention and OOM kills.

Most teams guess. They look at a graph, squint, and pick a number.

There is a better way.

The “Burstable” Reality

Kubernetes workloads are rarely flat lines. They are spiky.

If you size for the peak, you waste 80% of your money. If you size for the average, you crash during the peak.

The art of sizing is finding the percentile that balances risk and cost.

Step 1: Install VPA (Even if you don’t use it)

The Vertical Pod Autoscaler (VPA) has a recommendation-only mode: set its update mode to “Off” and it watches your pods and tells you what it would set the requests to, without ever touching them.

Install it. Let it run for a week. It gives you a baseline grounded in reality, not guesswork.
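A minimal recommendation-only VPA might look like this (my-app is a placeholder Deployment name; the important line is updateMode: "Off", which makes the VPA recommend without evicting or resizing anything):

  apiVersion: autoscaling.k8s.io/v1
  kind: VerticalPodAutoscaler
  metadata:
    name: my-app
  spec:
    targetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: my-app
    updatePolicy:
      updateMode: "Off"   # recommend only; never act on the recommendation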

Then ask it what it thinks:
kubectl get vpa my-app -o yaml

Look at the target recommendations. That is your new baseline.
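The status section will look roughly like this (values are illustrative; target is the number you care about):

  status:
    recommendation:
      containerRecommendations:
        - containerName: my-app
          lowerBound:
            cpu: 80m
            memory: 192Mi
          target:
            cpu: 150m
            memory: 320Mi
          upperBound:
            cpu: 400m
            memory: 512Mi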

Step 2: The 95th Percentile Rule

For customer-facing services, I recommend setting CPU requests to the 95th percentile of usage during your busiest hour.

Why not 100%? Because CPU is compressible. If you burst above your request for a few seconds, it’s fine: you borrow idle cycles from your neighbors, and the worst case is a little extra latency, not a crash.

For memory, however, you must size for the 100th percentile plus a buffer. RAM is not compressible. If you run out, you get OOM-killed.
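If you want to pull these numbers yourself from Prometheus rather than the VPA, queries along these lines work (assuming standard cAdvisor metrics and a pod name pattern like my-app-.*; point the window at your busiest hour):

  # 95th percentile of CPU usage (in cores) over the last hour, per container
  quantile_over_time(0.95,
    rate(container_cpu_usage_seconds_total{pod=~"my-app-.*", container!=""}[5m])[1h:1m]
  )

  # Peak working-set memory over the last hour, per container
  max_over_time(container_memory_working_set_bytes{pod=~"my-app-.*", container!=""}[1h])

The first number (plus nothing) becomes your CPU request; the second (plus a buffer) becomes your memory request.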

Step 3: Load Test Your Assumptions

Don’t just trust production traffic. Run a load test.

  1. Deploy your app with your new calculated requests.
  2. Hit it with 2x your peak traffic using k6 or Locust.
  3. Watch the throttling metrics (if you still have limits) or CPU usage.

If latency stays flat, your requests are good. If latency spikes, your requests are too low (the node is too crowded).
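One way to watch for throttling during the test, assuming cAdvisor metrics are scraped and the container still has a CPU limit, is the throttled-period ratio:

  # Fraction of CFS periods in which the container was throttled
  rate(container_cpu_cfs_throttled_periods_total{pod=~"my-app-.*"}[5m])
    /
  rate(container_cpu_cfs_periods_total{pod=~"my-app-.*"}[5m])

If this sits consistently above a few percent while latency climbs, the container is being squeezed.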

Conclusion

Right-sizing is not a one-time task. It’s a loop.

  1. Measure (VPA/Prometheus).
  2. Adjust (GitOps).
  3. Verify (Cost/Performance).
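The “Adjust” step usually ends up as a small diff in your Deployment manifest, something like this (numbers are illustrative, taken from the measurements above):

  resources:
    requests:
      cpu: 150m        # ~p95 of observed usage during the busiest hour
      memory: 400Mi    # observed peak plus a ~25% buffer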

In the next post, we’ll talk about how to automate this loop so you never have to edit a YAML file again.

