
Kubernetes Cost Optimization: How to Cut Your K8s Bill by 40%

A practitioner's guide to cutting Kubernetes costs by 40%—resource requests, right-sizing pods, spot nodes, cluster autoscaler, namespace quotas, idle workload detection, and cost monitoring tools.

PlatOps Team
February 21, 2026
13 min read

Kubernetes makes it easy to run workloads. It also makes it remarkably easy to waste money at scale.

The resource model that gives you flexibility—request what you need, get scheduled accordingly—becomes a cost liability when those requests are set incorrectly. A pod requesting 2 CPU and 4GB RAM that actually uses 0.3 CPU and 600MB RAM is consuming capacity on a node that cost real money to provision, even though most of that capacity sits idle.

Multiply that pattern across dozens of services, add in nodes that were never right-sized, a cluster autoscaler that scales up aggressively but scales down slowly, and no tooling to surface what anything actually costs—and you have the typical Kubernetes environment we see when we start working with a new client.

The 40% figure in this title isn't aspirational. It's the average we see across environments with $20,000–$200,000/month in K8s spend when all the levers in this guide are applied. Some environments come in higher (50–60%) when the baseline is especially undisciplined. Some come in at 25–30% when the team has already done partial optimization work.

This post covers every lever worth pulling, in order of typical impact.


Why Kubernetes Cost Is Hard to Control

Before getting into fixes, it helps to understand why K8s costs get out of hand even in well-run engineering organizations.

Resource requests are set once and never revisited. A deployment manifest has resource requests defined when the service is first written. Those requests reflect the engineer's best guess at the time, usually with conservative headroom. Six months later, the service has been optimized, traffic patterns are understood, and the original requests are 3x what the workload actually needs. Nobody goes back to tune them.

Costs are invisible at the workload level. Your AWS bill tells you what you spent on EC2 and EKS. It does not tell you which namespace, deployment, or team spent what. Without cost attribution, there's no incentive for individual teams to optimize their resource footprint.

Cluster autoscaler behavior is asymmetric by default. Nodes scale up quickly (good for reliability) and scale down slowly (bad for cost). The default scale-down delay is 10 minutes, but conservative configurations often push this to 30–60 minutes. A node that becomes underutilized at 2am may not be reclaimed until morning.

Spot/preemptible nodes are underused. Most teams know spot instances exist. Most teams haven't gone through the work of identifying which workloads can tolerate interruption and configuring the node pool separation required to use them safely.

Each of these has a concrete fix. Here's how to approach them systematically.


1. Fix Resource Requests and Limits First

Typical savings: 20–35% of compute costs

This is the highest-leverage item in most environments and the one that unlocks everything else. Cluster autoscaler cannot right-size your nodes if your pods are requesting 4x what they actually use—the scheduler sees demand that doesn't reflect reality, and the cluster stays larger than it needs to be.

Requests vs. Limits: The Distinction That Matters

Requests are what the Kubernetes scheduler uses to place pods on nodes. A pod requesting 1 CPU will only be scheduled on a node that has 1 CPU available. Requests directly determine how much node capacity you need—and therefore how much you spend.

Limits cap how much a pod can consume. A pod with a CPU limit of 2 can burst up to 2 CPU if the node has capacity available, but it is guaranteed only what it requested. A pod that exceeds its memory limit gets OOMKilled.

The common failure modes:

  • Requests too high, limits too high: Pods reserve capacity they don't use. Nodes appear full but are actually underutilized. New pods can't schedule, triggering premature scale-up.
  • Requests too low, no limits: Pods compete for resources unpredictably, causing noisy-neighbor problems and intermittent performance degradation.
  • No requests or limits set: Kubernetes assigns the BestEffort QoS class (Burstable if only some containers set requests); workloads become unpredictable under load and are first in line for eviction under node pressure.

How to Right-Size Requests

Pull actual usage data before changing anything. For each deployment, collect P99 CPU and P99 memory usage over 7–14 days using Prometheus queries:

# P99 CPU usage by container over 7 days
quantile_over_time(0.99,
  rate(container_cpu_usage_seconds_total{container!=""}[5m])[7d:5m]
)

# P99 memory usage by container over 7 days
quantile_over_time(0.99,
  container_memory_working_set_bytes{container!=""}[7d]
)

Set requests at P99 usage × 1.2 (20% headroom). Set memory limits at P99 × 1.5 to absorb spikes without OOMKill. Set CPU limits generously or not at all—CPU throttling is a common, silent performance problem that's worse than the cost of leaving CPU limits unset.
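
As a concrete sketch, take the hypothetical pod from the introduction with P99 usage around 0.3 CPU and 600Mi of memory; applying these multipliers gives a resources block like this (values are illustrative, not a recommendation):

# Illustrative values derived from hypothetical P99 figures of 300m CPU / 600Mi memory
resources:
  requests:
    cpu: "360m"       # P99 (300m) x 1.2
    memory: "720Mi"   # P99 (600Mi) x 1.2
  limits:
    memory: "900Mi"   # P99 (600Mi) x 1.5, absorbs spikes without OOMKill
    # CPU limit intentionally omitted to avoid throttling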

VPA as an Automation Layer

The Kubernetes Vertical Pod Autoscaler (VPA) can automate request recommendations. Run VPA in Recommendation mode first—it analyzes usage and publishes recommendations without applying them. Review the recommendations against your P99 data, then apply selectively.

VPA in Auto mode applies recommendations automatically with pod restarts. Use with caution for stateful workloads; it's well-suited for stateless services with low restart sensitivity.
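
A minimal recommendation-only VPA looks like this (the target Deployment name is a placeholder):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service          # placeholder Deployment name
  updatePolicy:
    updateMode: "Off"          # recommend only; never evicts or restarts pods

Read the output with kubectl describe vpa api-service-vpa and compare the target values against your own P99 data before editing any manifests.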


2. Right-Size Your Node Types

Typical savings: 10–20% of node costs

Node type selection has a compounding effect: if your pods are right-sized, your node pool configuration determines how efficiently those pods pack onto nodes. A cluster running all m5.4xlarge nodes (16 vCPU / 64GB) will waste more capacity than a cluster running a mix of m5.2xlarge and m5.xlarge nodes, because pods that need 2 CPU / 4GB can't fill an m5.4xlarge unless many of them land on the same node.

Bin packing analysis: For each node pool, look at the actual allocatable CPU and memory, and compare to the aggregate requests of the pods scheduled there. If nodes are consistently 30–40% allocated but "full" (no more pods fitting), you have a bin-packing problem—your pod sizes don't fit efficiently into your node sizes.
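
If Prometheus and kube-state-metrics are already running, a rough version of this comparison is a single query per resource (metric and label names assume kube-state-metrics defaults):

# Requested CPU as a fraction of allocatable, per node
sum by (node) (kube_pod_container_resource_requests{resource="cpu"})
  / sum by (node) (kube_node_status_allocatable{resource="cpu"})

# Same comparison for memory
sum by (node) (kube_pod_container_resource_requests{resource="memory"})
  / sum by (node) (kube_node_status_allocatable{resource="memory"})

Nodes that sit at a low ratio yet refuse new pods are the bin-packing mismatch described above.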

Tools for this analysis:

  • Karpenter (AWS-native, recommended over cluster-autoscaler for new environments) — selects node types dynamically based on pending pod requirements, choosing the cheapest instance type that satisfies the workload
  • Kubecost → Cluster Efficiency page → shows allocatable vs. requested vs. actual utilization per node pool
  • kubectl-resource-capacity — CLI plugin, shows per-node allocation vs. utilization at a glance

3. Spot and Preemptible Nodes for Interruptible Workloads

Typical savings: 60–80% on covered compute (spot discount vs. on-demand)

Spot instances (AWS) and preemptible or Spot VMs (GCP) offer 60–80% discounts in exchange for the possibility of interruption on short notice (two minutes on AWS, 30 seconds on GCP). For the right workloads, this is the highest raw discount available in cloud compute.

The key is workload classification. Not everything can run on spot—but more can than most teams assume.

Good candidates for spot nodes:

  • Batch and ETL jobs (can checkpoint and retry)
  • CI/CD build runners (ephemeral by design)
  • Stateless microservices with multiple replicas (one interruption doesn't cause an outage)
  • Dev and staging environments (tolerance for occasional disruption)
  • Data processing pipelines with restart capability

Should stay on on-demand:

  • Stateful services (databases, message queues, distributed storage)
  • Single-replica deployments (one interruption = outage)
  • Services with strict latency SLAs where pod restart is costly
  • Control plane components

Implementation Pattern

Use node pool taints and pod tolerations to separate workloads cleanly:

# Spot node pool taint (apply to all spot nodes)
taints:
  - key: "spot"
    value: "true"
    effect: "NoSchedule"

# Pod toleration for spot-eligible workloads
tolerations:
  - key: "spot"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"

# Node affinity preference (prefers spot, falls back to on-demand).
# The label key/value must match how your spot nodes are actually labeled,
# e.g. eks.amazonaws.com/capacityType=SPOT on EKS managed node groups or
# karpenter.sh/capacity-type=spot with Karpenter.
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: "node.kubernetes.io/lifecycle"
              operator: In
              values: ["spot"]

For AWS, configure your node groups to use a diversified instance type list across multiple families and sizes—this dramatically reduces interruption frequency by sourcing capacity from multiple spot pools.

Karpenter handles this more elegantly than managed node groups: you define a NodePool with a list of compatible instance families, and Karpenter selects the cheapest available type at scheduling time. Combined with spot prioritization, Karpenter consistently delivers 50–70% node cost reduction versus static on-demand node groups.
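
A minimal sketch of that setup with the Karpenter v1 NodePool API follows; field names have shifted between Karpenter releases, and the EC2NodeClass referenced here is assumed to already exist, so treat this as illustrative rather than copy-paste:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-spot
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                        # assumed pre-existing EC2NodeClass
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]      # spot preferred, on-demand as fallback
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]            # diversify across families to reduce interruptions
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge", "2xlarge"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
  limits:
    cpu: "200"                               # cap on total CPU this pool may provision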


4. Cluster Autoscaler Configuration

Typical savings: 10–20% by reducing idle node time

The cluster autoscaler adds nodes when pods can't schedule and removes nodes when they become underutilized. The default configuration is conservative on scale-down—which is safe but expensive.

Key Parameters to Tune

# cluster-autoscaler deployment args
- --scale-down-delay-after-add=5m         # default: 10m — how long after scale-up before scale-down is considered
- --scale-down-unneeded-time=5m           # default: 10m — how long a node must be underutilized before removal
- --scale-down-utilization-threshold=0.5  # default: 0.5 — node is "unneeded" if total requests < 50% of allocatable
- --max-node-provision-time=15m           # give up on node groups whose nodes fail to provision within this window
- --skip-nodes-with-local-storage=false   # default: true; false allows scale-down of nodes running pods with emptyDir/hostPath volumes (only if those pods tolerate eviction)

Reducing scale-down-unneeded-time from 10 minutes to 5 minutes has minimal reliability impact (a node that's been underutilized for 5 minutes is very unlikely to fill up again) but materially reduces idle node time during traffic troughs. For environments with predictable diurnal traffic patterns, this is a quick win.

Expander strategy: The least-waste expander selects the node group that leaves the least unused CPU/memory after scheduling. It tends to produce better bin-packing than the default random expander. Set --expander=least-waste.

Scale-Down Blockers

Check what's preventing scale-down in your cluster. Common blockers:

  • Pods with restrictive PodDisruptionBudgets (and kube-system pods without any PDB), which the autoscaler won't evict
  • kube-system pods that can't be moved (DaemonSets, static pods)
  • Pods with local storage attached
  • Pods stuck in Terminating state
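
For pods that block scale-down only because of soft constraints like local storage, the cluster autoscaler honors an explicit per-pod opt-in annotation; set it only on pods that genuinely tolerate eviction:

# Pod template annotation telling the cluster autoscaler this pod may be evicted
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"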

Check the cluster-autoscaler logs and its status ConfigMap (kubectl -n kube-system describe configmap cluster-autoscaler-status) to see which nodes are scale-down candidates and why others are blocked.


5. Namespace Resource Quotas

Typical savings: indirect — prevents cost surprises and enforces team accountability

ResourceQuotas impose hard limits on the total CPU and memory that can be requested within a namespace. Without them, a single deployment with misconfigured requests can consume the entire cluster's allocatable capacity and trigger unnecessary node scale-up.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-payments
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "50"

LimitRanges complement quotas by enforcing per-pod defaults and bounds—if a pod doesn't specify requests, LimitRange applies defaults; if a pod exceeds the max, it's rejected.

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-payments
spec:
  limits:
    - default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      max:
        cpu: "4"
        memory: "8Gi"
      type: Container

The cost impact is indirect: quotas prevent uncontrolled resource consumption and make teams responsible for their own footprint. Paired with cost attribution tooling, quotas create the accountability structure that sustains long-term optimization.


6. Detect and Eliminate Idle Workloads

Typical savings: $500–$5,000/month depending on environment size

Idle workloads are deployments that are running and consuming resources but serving no meaningful traffic. Common causes: feature flags that are permanently disabled, migrated services that weren't decommissioned, staging deployments that outlasted their sprint, and shadow services that were never promoted.

Detection Approach

Combine two signals: low CPU utilization and low request rate over a 30-day window.

# Containers averaging < 1% of their CPU request over 30 days
sum by (namespace, pod, container) (
  avg_over_time(rate(container_cpu_usage_seconds_total{container!=""}[5m])[30d:5m])
) / on (namespace, pod, container)
sum by (namespace, pod, container) (kube_pod_container_resource_requests{resource="cpu"})
< 0.01

# Services with near-zero HTTP request rate over 30 days
sum(rate(http_requests_total[30d])) by (service) < 0.001

Any service matching both criteria is a decommission candidate. Build a list, assign owners, confirm whether the service is genuinely unused, and terminate. Document the process—you'll repeat it quarterly.

Goldilocks (from Fairwinds) helps automate this kind of detection: it creates a VPA in recommendation mode for each workload in enabled namespaces and surfaces the delta between current requests and recommended requests in a dashboard, flagging over-provisioned and barely-utilized workloads.
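
Installation is a Helm chart plus an opt-in namespace label; the commands below follow the Fairwinds documentation at the time of writing, so verify against the current docs:

helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install goldilocks fairwinds-stable/goldilocks --namespace goldilocks --create-namespace

# Opt a namespace in to Goldilocks recommendations
kubectl label namespace team-payments goldilocks.fairwinds.com/enabled=true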


7. Horizontal Pod Autoscaler Optimization

Typical savings: 15–25% for traffic-variable workloads

HPA scales pod replicas based on CPU utilization, memory, or custom metrics. The common starting configuration, scaling on CPU at a 50% target utilization, often results in over-provisioning at steady state: teams set minimum replicas conservatively, and the cluster runs more pods than traffic requires during low-traffic periods.

Tuning HPA for Cost Efficiency

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 2          # Lower bound — confirm this handles your minimum traffic
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # Raise from default 50% to 70% — allows higher utilization before scaling
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 min of sustained low utilization before scaling down
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60

For services with predictable traffic patterns, KEDA (Kubernetes Event-Driven Autoscaling) scales on custom metrics—queue depth, event rate, request latency—with the ability to scale to zero during idle periods. A batch processor that handles jobs from an SQS queue and has zero jobs queued should have zero pods running. KEDA makes this straightforward.
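
A sketch of that pattern for an SQS-backed worker is below; the Deployment name, queue URL, and TriggerAuthentication are placeholders, and the IAM setup is omitted:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: batch-processor-scaler
spec:
  scaleTargetRef:
    name: batch-processor              # placeholder Deployment name
  minReplicaCount: 0                   # scale to zero when the queue is empty
  maxReplicaCount: 20
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/jobs   # placeholder
        queueLength: "10"              # target messages per replica
        awsRegion: "us-east-1"
      authenticationRef:
        name: keda-aws-credentials     # placeholder TriggerAuthentication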


8. Cost Monitoring and Attribution

Impact: enables and sustains all other optimizations

Every optimization above produces a one-time improvement. Cost monitoring converts that improvement into a continuous practice. Without visibility at the namespace/service/team level, cost regressions are invisible until they show up on the monthly cloud bill.

Kubecost

The industry standard for Kubernetes cost monitoring. Kubecost allocates cloud costs to Kubernetes objects—namespaces, deployments, pods, labels—by combining node pricing data with actual resource usage. It produces per-namespace, per-team, and per-deployment cost breakdowns updated in near-real time.

Key features for cost optimization:

  • Cost allocation by namespace, label, annotation, or cluster
  • Efficiency score per workload (requested vs. actual usage ratio)
  • Savings recommendations with estimated impact
  • Budget alerts when namespaces exceed spend thresholds
  • Request right-sizing recommendations based on observed usage

Kubecost open-source is free for a single cluster. The commercial tier adds multi-cluster support and advanced reporting. For most SMB environments, the open-source tier is sufficient.
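
Deployment is a single Helm install; the commands below follow the Kubecost documentation at the time of writing:

helm install kubecost cost-analyzer \
  --repo https://kubecost.github.io/cost-analyzer/ \
  --namespace kubecost --create-namespace

# Reach the UI locally
kubectl port-forward -n kubecost deployment/kubecost-cost-analyzer 9090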

OpenCost

OpenCost is the CNCF open-source project that underpins Kubecost's cost allocation model. If you want the core allocation engine without Kubecost's UI and opinionated features, OpenCost is the right choice. It exposes a REST API and Prometheus metrics that integrate cleanly into existing observability stacks.

Tool                      | Best For                               | Cost
Kubecost (OSS)            | Single cluster, full UI                | Free
Kubecost (Commercial)     | Multi-cluster, enterprise reporting    | $500–$2,000/month
OpenCost                  | Custom integration, Prometheus-native  | Free
AWS Cost Explorer + tags  | Rough service-level breakdown          | Free
Spot.io (Spot by NetApp)  | Spot optimization + cost visibility    | % of savings

Tagging Strategy

Cost attribution only works with consistent labeling. Enforce these labels on all workloads:

metadata:
  labels:
    app.kubernetes.io/name: api-service
    app.kubernetes.io/component: backend
    team: payments
    environment: production
    cost-center: eng-platform

Use an OPA/Gatekeeper admission policy to reject deployments missing required labels. This enforces attribution hygiene at the point of deployment rather than retroactively.
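
A sketch of such a policy using the K8sRequiredLabels constraint from the Gatekeeper policy library (assumes that ConstraintTemplate is already installed in the cluster):

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-cost-attribution-labels
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment", "StatefulSet"]
  parameters:
    labels:
      - key: team
      - key: environment
      - key: cost-center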


What a 40% Reduction Actually Looks Like

Here's a representative before/after from a client environment: a 30-person engineering team running a SaaS application on EKS, spending approximately $45,000/month on Kubernetes compute before the engagement.

Optimization                                               | Monthly Savings | Implementation Effort
Resource request right-sizing (VPA recommendations)        | $8,200          | 2 days
Spot node pools for stateless services (65% of workload)   | $7,500          | 3 days
gp2 → gp3 EBS migration for PVCs                           | $800            | 2 hours
Cluster autoscaler tuning                                  | $2,100          | 2 hours
Decommission 8 idle deployments                            | $1,400          | 1 day
HPA min-replica reduction + CPU target increase            | $2,800          | 1 day
Reserved Instances for on-demand node pool baseline        | $3,200          | 1 hour (commitment)
Total savings                                              | $26,000/month   | ~2 weeks

$26,000/month on a $45,000 starting point is a 58% reduction. This particular environment had significant right-sizing debt from a period of rapid growth—resource requests had been set generously and never revisited.

More conservative environments with some prior optimization typically land at 25–35%.


Prioritization: Where to Start

If you're approaching this for the first time, run these in order:

  1. Deploy Kubecost — get visibility before touching anything. You cannot prioritize without data.
  2. Audit resource requests — pull P99 CPU and memory, compare to current requests, identify the largest deltas.
  3. Right-size the top 10 deployments by wasted resource cost — 80% of the waste is usually in 20% of the deployments.
  4. Configure spot node pools — identify interruptible workloads and move them to spot.
  5. Tune cluster autoscaler scale-down parameters — a one-hour config change.
  6. Deploy VPA in recommendation mode — let it run for 7 days and review recommendations before applying.
  7. Set ResourceQuotas per namespace — prevent future regressions.

This sequence can be completed in 2–3 weeks by one experienced engineer. The ongoing discipline is the harder part: reviewing Kubecost recommendations monthly, revisiting HPA and VPA settings when traffic patterns change, and decommissioning idle workloads quarterly.


Getting Help

The technical steps above are well-documented. The hard part is execution: identifying which recommendations are safe to apply, validating that right-sizing doesn't impact tail latency for latency-sensitive services, configuring Karpenter correctly for your specific workload mix, and sustaining the practice over time.

Our Kubernetes management service includes a cost optimization audit as a standard engagement component—covering resource right-sizing analysis, spot node pool configuration, autoscaler tuning, and Kubecost deployment with attribution reporting.

Book a free infrastructure assessment and we'll audit your current K8s spend, identify your top five savings opportunities by dollar impact, and provide a remediation plan with implementation effort estimates.


Running into a specific Kubernetes cost problem—unusual NAT gateway charges, Fargate vs. EC2 cost questions, PVC storage optimization? Contact us and we'll take a look.

Tags: kubernetes, k8s, cloud-cost, kubernetes-cost-optimization, finops, devops
