Kubernetes Cost Optimization: How to Cut Your K8s Bill by 40%
A practitioner's guide to cutting Kubernetes costs by 40%—resource requests, right-sizing pods, spot nodes, cluster autoscaler, namespace quotas, idle workload detection, and cost monitoring tools.
Kubernetes makes it easy to run workloads. It also makes it remarkably easy to waste money at scale.
The resource model that gives you flexibility—request what you need, get scheduled accordingly—becomes a cost liability when those requests are set incorrectly. A pod requesting 2 CPU and 4GB RAM that actually uses 0.3 CPU and 600MB RAM is consuming capacity on a node that cost real money to provision, even though most of that capacity sits idle.
Multiply that pattern across dozens of services, add in nodes that were never right-sized, a cluster autoscaler that scales up aggressively but scales down slowly, and no tooling to surface what anything actually costs—and you have the typical Kubernetes environment we see when we start working with a new client.
The 40% figure in this title isn't aspirational. It's the average we see across environments with $20,000–$200,000/month in K8s spend when all the levers in this guide are applied. Some environments come in higher (50–60%) when the baseline is especially undisciplined. Some come in at 25–30% when the team has already done partial optimization work.
This post covers every lever worth pulling, in order of typical impact.
Why Kubernetes Cost Is Hard to Control
Before getting into fixes, it helps to understand why K8s costs get out of hand even in well-run engineering organizations.
Resource requests are set once and never revisited. A deployment manifest has resource requests defined when the service is first written. Those requests reflect the engineer's best guess at the time, usually with conservative headroom. Six months later, the service has been optimized, traffic patterns are understood, and the original requests are 3x what the workload actually needs. Nobody goes back to tune them.
Costs are invisible at the workload level. Your AWS bill tells you what you spent on EC2 and EKS. It does not tell you which namespace, deployment, or team spent what. Without cost attribution, there's no incentive for individual teams to optimize their resource footprint.
Cluster autoscaler behavior is asymmetric by default. Nodes scale up quickly (good for reliability) and scale down slowly (bad for cost). The default scale-down delay is 10 minutes, but conservative configurations often push this to 30–60 minutes. A node that becomes underutilized at 2am may not be reclaimed until morning.
Spot/preemptible nodes are underused. Most teams know spot instances exist. Most teams haven't gone through the work of identifying which workloads can tolerate interruption and configuring the node pool separation required to use them safely.
Each of these has a concrete fix. Here's how to approach them systematically.
1. Fix Resource Requests and Limits First
Typical savings: 20–35% of compute costs
This is the highest-leverage item in most environments and the one that unlocks everything else. Cluster autoscaler cannot right-size your nodes if your pods are requesting 4x what they actually use—the scheduler sees demand that doesn't reflect reality, and the cluster stays larger than it needs to be.
Requests vs. Limits: The Distinction That Matters
Requests are what the Kubernetes scheduler uses to place pods on nodes. A pod requesting 1 CPU will only be scheduled on a node that has 1 CPU available. Requests directly determine how much node capacity you need—and therefore how much you spend.
Limits cap how much a pod can consume. A pod with a CPU limit of 2 CPU can burst up to 2 CPU if the node has capacity available, but it is guaranteed only what it requested. A pod with a memory limit that it exceeds gets OOMKilled.
The common failure modes:
- Requests too high, limits too high: Pods reserve capacity they don't use. Nodes appear full but are actually underutilized. New pods can't schedule, triggering premature scale-up.
- Requests too low, no limits: Pods compete for resources unpredictably, causing noisy-neighbor problems and intermittent performance degradation.
- No requests or limits set: the pod lands in the BestEffort QoS class (or Burstable if only some containers set requests), is first to be evicted under node pressure, and behaves unpredictably under load.
How to Right-Size Requests
Pull actual usage data before changing anything. For each deployment, collect P99 CPU and P99 memory usage over 7–14 days using Prometheus queries:
# P99 CPU usage by container over 7 days
quantile_over_time(0.99,
  rate(container_cpu_usage_seconds_total{container!=""}[5m])[7d:5m]
)

# P99 memory usage by container over 7 days
quantile_over_time(0.99,
  container_memory_working_set_bytes{container!=""}[7d]
)
Set requests at P99 usage × 1.2 (20% headroom). Set memory limits at P99 × 1.5 to absorb spikes without OOMKill. Set CPU limits generously or not at all—CPU throttling is a common, silent performance problem that's worse than the cost of leaving CPU limits unset.
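As a concrete illustration of that formula, a service whose observed P99 usage is 0.3 CPU and 600Mi of memory would end up with roughly the following resources block (numbers are illustrative, not a recommendation for your workload):

# Illustrative sketch for a workload with P99 usage of ~300m CPU / ~600Mi memory
resources:
  requests:
    cpu: "360m"       # P99 (300m) x 1.2 headroom
    memory: "720Mi"   # P99 (600Mi) x 1.2 headroom
  limits:
    memory: "900Mi"   # P99 x 1.5 to absorb spikes without OOMKill
    # CPU limit intentionally omitted to avoid throttling, per the guidance above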
VPA as an Automation Layer
The Kubernetes Vertical Pod Autoscaler (VPA) can automate request recommendations. Run VPA in Recommendation mode first—it analyzes usage and publishes recommendations without applying them. Review the recommendations against your P99 data, then apply selectively.
VPA in Auto mode applies recommendations automatically with pod restarts. Use with caution for stateful workloads; it's well-suited for stateless services with low restart sensitivity.
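A minimal sketch of a VPA object in recommendation-only mode (`updateMode: "Off"`), targeting a hypothetical api-service deployment; recommendations appear in the object's status without any pods being restarted:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  updatePolicy:
    updateMode: "Off"   # recommend only; never evict or restart pods

Read the recommendations with kubectl describe vpa api-service-vpa and compare them against your P99 data before applying anything.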
2. Right-Size Your Node Types
Typical savings: 10–20% of node costs
Node type selection has a compounding effect: if your pods are right-sized, your node pool configuration determines how efficiently those pods pack onto nodes. A cluster running all m5.4xlarge nodes (16 vCPU / 64GB) will waste more capacity than a cluster running a mix of m5.2xlarge and m5.xlarge nodes—because pods that need 2 CPU / 4GB can't fill an m5.4xlarge without a lot of company.
Bin packing analysis: For each node pool, look at the actual allocatable CPU and memory, and compare to the aggregate requests of the pods scheduled there. If nodes are consistently 30–40% allocated but "full" (no more pods fitting), you have a bin-packing problem—your pod sizes don't fit efficiently into your node sizes.
Tools for this analysis:
- Karpenter (AWS-native, recommended over cluster-autoscaler for new environments) — selects node types dynamically based on pending pod requirements, choosing the cheapest instance type that satisfies the workload
- Kubecost → Cluster Efficiency page → shows allocatable vs. requested vs. actual utilization per node pool
- kubectl-resource-capacity — CLI plugin, shows per-node allocation vs. utilization at a glance
3. Spot and Preemptible Nodes for Interruptible Workloads
Typical savings: 60–80% on covered compute (spot discount vs. on-demand)
Spot instances (AWS) and preemptible VMs (GCP) offer 60–80% discounts in exchange for the possibility of interruption with 2 minutes' notice. For the right workloads, this is the highest raw discount available in cloud compute.
The key is workload classification. Not everything can run on spot—but more can than most teams assume.
Good candidates for spot nodes:
- Batch and ETL jobs (can checkpoint and retry)
- CI/CD build runners (ephemeral by design)
- Stateless microservices with multiple replicas (one interruption doesn't cause an outage)
- Dev and staging environments (tolerance for occasional disruption)
- Data processing pipelines with restart capability
Should stay on on-demand:
- Stateful services (databases, message queues, distributed storage)
- Single-replica deployments (one interruption = outage)
- Services with strict latency SLAs where pod restart is costly
- Control plane components
Implementation Pattern
Use node pool taints and pod tolerations to separate workloads cleanly:
# Spot node pool taint (apply to all spot nodes)
taints:
  - key: "spot"
    value: "true"
    effect: "NoSchedule"

# Pod toleration for spot-eligible workloads
tolerations:
  - key: "spot"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"

# Node affinity preference (prefers spot, falls back to on-demand)
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: "node.kubernetes.io/lifecycle"
              operator: In
              values: ["spot"]
For AWS, configure your node groups to use a diversified instance type list across multiple families and sizes—this dramatically reduces interruption frequency by sourcing capacity from multiple spot pools.
Karpenter handles this more elegantly than managed node groups: you define a NodePool with a list of compatible instance families, and Karpenter selects the cheapest available type at scheduling time. Combined with spot prioritization, Karpenter consistently delivers 50–70% node cost reduction versus static on-demand node groups.
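A sketch of what that looks like, assuming Karpenter's v1 API: a NodePool that allows both spot and on-demand capacity across several instance families. The `default` EC2NodeClass reference and the CPU limit are placeholders for your own configuration.

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # Karpenter prefers cheaper spot capacity when available
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m5", "m6i", "c5", "c6i", "r5"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                     # placeholder EC2NodeClass
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized   # reclaim underutilized nodes automatically
  limits:
    cpu: "500"   # cap the total capacity this pool can provision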
4. Cluster Autoscaler Configuration
Typical savings: 10–20% by reducing idle node time
The cluster autoscaler adds nodes when pods can't schedule and removes nodes when they become underutilized. The default configuration is conservative on scale-down—which is safe but expensive.
Key Parameters to Tune
# cluster-autoscaler deployment args
- --scale-down-delay-after-add=5m # default: 10m — how long after scale-up before scale-down is considered
- --scale-down-unneeded-time=5m # default: 10m — how long a node must be underutilized before removal
- --scale-down-utilization-threshold=0.5 # default: 0.5 — node is "unneeded" if total requests < 50% of allocatable
- --max-node-provision-time=15m # fail fast on node provisioning issues
- --skip-nodes-with-local-storage=false # default: true; set false so pods using emptyDir/hostPath scratch volumes don't block scale-down
Reducing scale-down-unneeded-time from 10 minutes to 5 minutes has minimal reliability impact (a node that's been underutilized for 5 minutes is very unlikely to fill up again) but materially reduces idle node time during traffic troughs. For environments with predictable diurnal traffic patterns, this is a quick win.
Expander strategy: The least-waste expander selects the node group that leaves the least unused CPU/memory after scheduling. It tends to produce better bin-packing than the default random expander. Set --expander=least-waste.
Scale-Down Blockers
Check what's preventing scale-down in your cluster. Common blockers:
- Pods restricted by a PodDisruptionBudget that eviction would violate
- kube-system pods with no PodDisruptionBudget, plus DaemonSet and static pods that can't be moved
- Pods with local storage attached (emptyDir or hostPath)
- Pods stuck in Terminating state
The autoscaler publishes its per-node scale-down assessment in the cluster-autoscaler-status ConfigMap: run kubectl -n kube-system describe configmap cluster-autoscaler-status to see which nodes are scale-down candidates, and check the autoscaler's events and logs for the reason a specific node is being kept.
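When evictions are the blocker, a PodDisruptionBudget gives the autoscaler explicit permission to drain nodes while protecting availability. A minimal sketch for a hypothetical three-replica service:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb
spec:
  minAvailable: 2                  # keep at least 2 replicas running during voluntary disruptions
  selector:
    matchLabels:
      app.kubernetes.io/name: api-service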
5. Namespace Resource Quotas
Typical savings: indirect — prevents cost surprises and enforces team accountability
ResourceQuotas impose hard limits on the total CPU and memory that can be requested within a namespace. Without them, a single deployment with misconfigured requests can consume the entire cluster's allocatable capacity and trigger unnecessary node scale-up.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-payments
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "50"
LimitRanges complement quotas by enforcing per-pod defaults and bounds—if a pod doesn't specify requests, LimitRange applies defaults; if a pod exceeds the max, it's rejected.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-payments
spec:
  limits:
    - default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      max:
        cpu: "4"
        memory: "8Gi"
      type: Container
The cost impact is indirect: quotas prevent uncontrolled resource consumption and make teams responsible for their own footprint. Paired with cost attribution tooling, quotas create the accountability structure that sustains long-term optimization.
6. Detect and Eliminate Idle Workloads
Typical savings: $500–$5,000/month depending on environment size
Idle workloads are deployments that are running and consuming resources but serving no meaningful traffic. Common causes: feature flags that are permanently disabled, migrated services that weren't decommissioned, staging deployments that outlasted their sprint, and shadow services that were never promoted.
Detection Approach
Combine two signals: low CPU utilization and low request rate over a 30-day window.
# Pods with average CPU < 1% of requested CPU over 30 days
sum by (namespace, pod) (
  avg_over_time(rate(container_cpu_usage_seconds_total{container!=""}[5m])[30d:5m])
)
/
sum by (namespace, pod) (
  kube_pod_container_resource_requests{resource="cpu"}
)
< 0.01
# Services with near-zero HTTP request rate over 30 days
sum(rate(http_requests_total[30d])) by (service) < 0.001
Any service matching both criteria is a decommission candidate. Build a list, assign owners, confirm whether the service is genuinely unused, and terminate. Document the process—you'll repeat it quarterly.
Kubernetes Goldilocks (from Fairwinds) automates the detection step, scanning every namespace and flagging over-provisioned and under-utilized workloads with specific remediation recommendations. It integrates with Vertical Pod Autoscaler recommendations to show the delta between current requests and optimal requests.
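Goldilocks works on an opt-in basis: it watches labeled namespaces and creates VPA objects in recommendation mode for the workloads it finds there. A minimal opt-in sketch (the namespace name is a placeholder, and the VPA components must already be installed):

apiVersion: v1
kind: Namespace
metadata:
  name: team-payments
  labels:
    goldilocks.fairwinds.com/enabled: "true"   # tells Goldilocks to generate recommendations for this namespace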
7. Horizontal Pod Autoscaler Optimization
Typical savings: 15–25% for traffic-variable workloads
HPA scales pod replicas based on CPU utilization, memory, or custom metrics. The default configuration—scale on CPU at 50% target utilization—often results in over-provisioning at steady state: teams set minimum replicas conservatively, and the cluster runs more pods than traffic requires during low-traffic periods.
Tuning HPA for Cost Efficiency
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 2    # Lower bound — confirm this handles your minimum traffic
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # Raise from default 50% to 70% — allows higher utilization before scaling
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 min of sustained low utilization before scaling down
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
For services with predictable traffic patterns, KEDA (Kubernetes Event-Driven Autoscaling) scales on custom metrics—queue depth, event rate, request latency—with the ability to scale to zero during idle periods. A batch processor that handles jobs from an SQS queue and has zero jobs queued should have zero pods running. KEDA makes this straightforward.
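A sketch of that SQS pattern with a KEDA ScaledObject; the queue URL, region, and deployment name are placeholders, and KEDA needs AWS credentials configured separately (for example via IRSA):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: batch-processor-scaler
spec:
  scaleTargetRef:
    name: batch-processor          # the Deployment to scale
  minReplicaCount: 0               # scale to zero when the queue is empty
  maxReplicaCount: 20
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/jobs   # placeholder queue
        queueLength: "5"           # target messages per replica
        awsRegion: us-east-1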
8. Cost Monitoring and Attribution
Impact: enables and sustains all other optimizations
Every optimization above produces a one-time improvement. Cost monitoring converts that improvement into a continuous practice. Without visibility at the namespace/service/team level, cost regressions are invisible until they show up on the monthly cloud bill.
Kubecost
The industry standard for Kubernetes cost monitoring. Kubecost allocates cloud costs to Kubernetes objects—namespaces, deployments, pods, labels—by combining node pricing data with actual resource usage. It produces per-namespace, per-team, and per-deployment cost breakdowns updated in near-real time.
Key features for cost optimization:
- Cost allocation by namespace, label, annotation, or cluster
- Efficiency score per workload (requested vs. actual usage ratio)
- Savings recommendations with estimated impact
- Budget alerts when namespaces exceed spend thresholds
- Request right-sizing recommendations based on observed usage
Kubecost open-source is free for a single cluster. The commercial tier adds multi-cluster support and advanced reporting. For most SMB environments, the open-source tier is sufficient.
OpenCost
OpenCost is the CNCF open-source project that underpins Kubecost's cost allocation model. If you want the core allocation engine without Kubecost's UI and opinionated features, OpenCost is the right choice. It exposes a REST API and Prometheus metrics that integrate cleanly into existing observability stacks.
| Tool | Best For | Cost |
|---|---|---|
| Kubecost (OSS) | Single cluster, full UI | Free |
| Kubecost (Commercial) | Multi-cluster, enterprise reporting | $500–$2,000/month |
| OpenCost | Custom integration, Prometheus-native | Free |
| AWS Cost Explorer + tags | Rough service-level breakdown | Free |
| Spot.io (Spot by NetApp) | Spot optimization + cost visibility | % of savings |
Tagging Strategy
Cost attribution only works with consistent labeling. Enforce these labels on all workloads:
metadata:
  labels:
    app.kubernetes.io/name: api-service
    app.kubernetes.io/component: backend
    team: payments
    environment: production
    cost-center: eng-platform
Use an OPA/Gatekeeper admission policy to reject deployments missing required labels. This enforces attribution hygiene at the point of deployment rather than retroactively.
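A sketch of that policy, assuming the `K8sRequiredLabels` ConstraintTemplate from the Gatekeeper policy library is already installed; the constraint below rejects Deployments missing the team and cost-center labels:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-cost-attribution-labels
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment"]
  parameters:
    labels:
      - key: team          # required on every Deployment
      - key: cost-center   # required on every Deployment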
What a 40% Reduction Actually Looks Like
Here's a representative before/after from a client environment: a 30-person engineering team running a SaaS application on EKS, spending approximately $45,000/month on Kubernetes compute before the engagement.
| Optimization | Monthly Savings | Implementation Effort |
|---|---|---|
| Resource request right-sizing (VPA recommendations) | $8,200 | 2 days |
| Spot node pools for stateless services (65% of workload) | $7,500 | 3 days |
| gp2 → gp3 EBS migration for PVCs | $800 | 2 hours |
| Cluster autoscaler tuning | $2,100 | 2 hours |
| Decommission 8 idle deployments | $1,400 | 1 day |
| HPA min-replica reduction + CPU target increase | $2,800 | 1 day |
| Reserved Instances for on-demand node pool baseline | $3,200 | 1 hour (commitment) |
| Total savings | $26,000/month | ~2 weeks |
$26,000/month on a $45,000 starting point is a 58% reduction. This particular environment had significant right-sizing debt from a period of rapid growth—resource requests had been set generously and never revisited.
More conservative environments with some prior optimization typically land at 25–35%.
Prioritization: Where to Start
If you're approaching this for the first time, run these in order:
- Deploy Kubecost — get visibility before touching anything. You cannot prioritize without data.
- Audit resource requests — pull P99 CPU and memory, compare to current requests, identify the largest deltas.
- Right-size the top 10 deployments by wasted resource cost — 80% of the waste is usually in 20% of the deployments.
- Configure spot node pools — identify interruptible workloads and move them to spot.
- Tune cluster autoscaler scale-down parameters — a one-hour config change.
- Deploy VPA in recommendation mode — let it run for 7 days and review recommendations before applying.
- Set ResourceQuotas per namespace — prevent future regressions.
This sequence can be completed in 2–3 weeks by one experienced engineer. The ongoing discipline is the harder part: reviewing Kubecost recommendations monthly, revisiting HPA and VPA settings when traffic patterns change, and decommissioning idle workloads quarterly.
Getting Help
The technical steps above are well-documented. The hard part is execution: identifying which recommendations are safe to apply, validating that right-sizing doesn't impact tail latency for latency-sensitive services, configuring Karpenter correctly for your specific workload mix, and sustaining the practice over time.
Our Kubernetes management service includes a cost optimization audit as a standard engagement component—covering resource right-sizing analysis, spot node pool configuration, autoscaler tuning, and Kubecost deployment with attribution reporting.
Book a free infrastructure assessment and we'll audit your current K8s spend, identify your top five savings opportunities by dollar impact, and provide a remediation plan with implementation effort estimates.
Running into a specific Kubernetes cost problem, such as unusual NAT gateway charges, Fargate vs. EC2 cost questions, or PVC storage optimization? Contact us and we'll take a look.