Kubernetes is the most powerful platform for running cloud-native applications — and one of the easiest platforms to waste enormous amounts of money on. The flexibility that makes K8s great (any workload, any scale, any configuration) also makes it trivially easy to provision vastly more compute than your workloads actually need. Industry analysis from CNCF and Kubecost consistently shows that the average Kubernetes cluster has 40–60% of its allocated resources sitting idle — provisioned but unused.

Figma saved $2.1M/year by right-sizing their node pools. Datadog reduced their EKS bill by 38% through aggressive pod resource optimization. These are not outliers — they're the expected outcome of systematic K8s cost optimization in over-provisioned environments. This guide shows you exactly how to get there.


Why Kubernetes Waste Happens: The Over-Provisioning Problem

In Kubernetes, every container declares its resource requests (what the scheduler uses to place the pod on a node) and limits (the maximum the container can use). The node must have enough capacity to satisfy the sum of all pod requests on it — regardless of whether pods actually use those resources.

The classic over-provisioning pattern: a developer sets requests: cpu: 1000m, memory: 2Gi and limits: cpu: 2000m, memory: 4Gi without measuring actual usage. The application uses 100m CPU and 200Mi memory at peak, so the pod reserves roughly 10x the node capacity it needs. Multiply this across 50 services and 200 pods and you're paying for 5–10x the compute you actually consume.

The fundamental K8s cost equation: You pay for node compute capacity (EC2 instances), not pod utilization. If your nodes are 30% utilized on average, you're paying 3x what you need to pay. Every optimization strategy below targets either increasing utilization (packing more pods per node) or reducing node size (matching capacity to actual workload needs).
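To make the equation concrete, a quick sketch (illustrative numbers, not measurements from any particular cluster):

```python
def effective_cost_multiplier(avg_utilization: float) -> float:
    """Dollars paid per dollar of compute actually consumed.

    avg_utilization: fraction of node capacity used by real workload,
    e.g. 0.30 for nodes that sit at 30% utilization on average.
    """
    if not 0 < avg_utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return 1 / avg_utilization

# 30% average utilization means paying ~3.3x for the compute you use;
# pushing utilization to 75% cuts that to ~1.3x.
print(effective_cost_multiplier(0.30))
print(effective_cost_multiplier(0.75))
```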

Step 1: Measure Actual Resource Utilization

You can't rightsize what you can't measure. Start with Metrics Server and kubectl top:

# Install Metrics Server if not present
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Check actual CPU and memory usage per pod
kubectl top pods --all-namespaces --sort-by=cpu

NAMESPACE    NAME                          CPU(cores)  MEMORY(bytes)
production   api-deployment-7d9b4-xkp2n    45m         187Mi
production   worker-deployment-6c8f3-jkl9  12m         94Mi
staging      api-deployment-5c7a1-pqr8     8m          156Mi
staging      worker-deployment-9d2e4-mno3  3m          78Mi

# Compare actual usage to requests
kubectl get pods --all-namespaces -o json | \
  jq '.items[] | {
    name: .metadata.name,
    namespace: .metadata.namespace,
    cpu_request: .spec.containers[].resources.requests.cpu,
    mem_request: .spec.containers[].resources.requests.memory
  }'
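Turning that raw output into a waste report means comparing quantities like 45m and 187Mi against requests like 1000m and 2Gi. A minimal Python parser for the suffixes shown above — a sketch that covers only these common units, not the full Kubernetes quantity grammar:

```python
def parse_cpu(q: str) -> float:
    """CPU quantity to millicores: '45m' -> 45.0, '2' -> 2000.0."""
    return float(q[:-1]) if q.endswith("m") else float(q) * 1000

def parse_mem(q: str) -> int:
    """Memory quantity to bytes, for the Ki/Mi/Gi suffixes only."""
    for suffix, factor in (("Ki", 1024), ("Mi", 1024**2), ("Gi", 1024**3)):
        if q.endswith(suffix):
            return int(float(q[: -len(suffix)]) * factor)
    return int(q)  # no suffix: plain bytes

# The api pod above uses 45m CPU and 187Mi memory; assume it requested
# 1000m / 2Gi, as in the over-provisioning example earlier:
print(f"CPU:    {parse_cpu('45m') / parse_cpu('1000m'):.1%} of request used")
print(f"Memory: {parse_mem('187Mi') / parse_mem('2Gi'):.1%} of request used")
```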

For systematic analysis, Kubecost (free community edition) gives you a cluster-wide view of request vs actual utilization, broken down by namespace, deployment, and label. Install it with:

helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm repo update
helm install kubecost kubecost/cost-analyzer \
  --namespace kubecost --create-namespace \
  --set kubecostToken="your-token" \
  --set prometheus.nodeExporter.tolerations[0].operator="Exists"

Step 2: Implement Vertical Pod Autoscaler (VPA)

The Vertical Pod Autoscaler analyzes pod resource usage over time and recommends (or, in Auto mode, automatically applies) optimal requests values. With updateMode: "Off", VPA publishes recommendations without changing anything — a safe way to see the opportunity before committing.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: api-deployment
  updatePolicy:
    updateMode: "Off"  # "Off" = recommendations only, no auto-update
    # Change to "Auto" once you trust the recommendations
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: 50m
        memory: 128Mi
      maxAllowed:
        cpu: 2000m
        memory: 4Gi

After 24–72 hours, check VPA recommendations:

kubectl describe vpa api-vpa -n production

# Output shows:
# Container Recommendations:
#   Container Name:  api
#   Lower Bound:
#     Cpu:     45m
#     Memory:  192Mi
#   Target:
#     Cpu:     125m      ← VPA recommendation
#     Memory:  256Mi     ← vs your current 2Gi request
#   Upper Bound:
#     Cpu:     500m
#     Memory:  512Mi

In the example above, VPA recommends 256Mi memory against a current 2Gi request — an 87% reduction in memory allocation for this pod. Across a cluster, VPA recommendations typically reduce total requested resources by 30–60%, directly enabling you to run the same workloads on fewer or smaller nodes.
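The 87% figure follows directly from the arithmetic (87.5%, precisely):

```python
Gi, Mi = 1024**3, 1024**2

def request_reduction(current: int, recommended: int) -> float:
    """Fraction of a resource request freed by adopting a recommendation."""
    return 1 - recommended / current

# 2Gi current request vs. the 256Mi VPA target from the output above:
print(f"{request_reduction(2 * Gi, 256 * Mi):.1%}")  # 87.5%
```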

Step 3: Configure Horizontal Pod Autoscaler (HPA)

HPA scales the number of pod replicas based on CPU, memory, or custom metrics. The most common K8s cost mistake is running a static replica count (e.g., replicas: 10) that's sized for peak traffic 24/7. If your peak lasts 3 hours a day, you're paying for peak capacity around the clock — most of it sitting idle the other 21 hours.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-deployment
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60  # Target 60% CPU utilization per pod
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
      - type: Percent
        value: 25
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
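To estimate what this HPA buys you versus a static replicas: 10 deployment, here's a simplified replica-hour model (it ignores scale-up ramp time, assumes a flat 3-hour peak, and uses a baseline matching the minReplicas: 2 above):

```python
def daily_replica_hours(peak_replicas: int, baseline_replicas: int, peak_hours: int):
    """Replica-hours per day: static peak-sized fleet vs. autoscaled fleet."""
    static = peak_replicas * 24
    autoscaled = peak_replicas * peak_hours + baseline_replicas * (24 - peak_hours)
    return static, autoscaled

static, autoscaled = daily_replica_hours(peak_replicas=10, baseline_replicas=2, peak_hours=3)
print(f"static: {static} replica-hours/day, autoscaled: {autoscaled}")
print(f"savings: {1 - autoscaled / static:.0%}")  # 70%
```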

HPA + VPA compatibility: Running HPA and VPA on the same deployment causes conflicts when both react to the same metrics (CPU or memory). The safe pattern: use HPA for replica scaling (horizontal) and VPA in "Off" mode (recommendation only) for resource right-sizing guidance. If you want both active at once, drive HPA from custom or external metrics instead of CPU/memory — for example via KEDA (Kubernetes Event-Driven Autoscaling), which builds on HPA and scales on signals like queue depth or request rate — so HPA and VPA never fight over the same resource metrics.

Step 4: Karpenter for Node Group Optimization

Karpenter (the open-source node autoscaler created by AWS, with an Azure provider also available) replaces the Cluster Autoscaler with a smarter approach: instead of scaling predefined node groups, Karpenter provisions exactly the right instance type for each pending pod's requirements. This flexibility dramatically reduces the gap between allocated and available capacity.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]  # Allow Graviton for 20% savings
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]  # Prefer Spot
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]  # Compute, memory, general purpose
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["2"]
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default
  limits:
    cpu: 1000
    memory: 2000Gi
  disruption:
    consolidationPolicy: WhenUnderutilized
    # Karpenter continuously repacks pods and removes underutilized nodes.
    # (In the v1beta1 API, consolidateAfter may only be set with WhenEmpty.)

The consolidationPolicy: WhenUnderutilized setting is key — Karpenter continuously evaluates whether pods can be bin-packed more efficiently and terminates underutilized nodes, replacing them with fewer, better-utilized nodes. This automatic consolidation typically saves an additional 10–20% beyond what manual rightsizing achieves.
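A toy illustration of the consolidation idea — first-fit-decreasing bin packing over CPU requests. Karpenter's real algorithm also weighs memory, instance price, and disruption cost, so treat this purely as intuition:

```python
def pack(cpu_requests_m: list[int], node_capacity_m: int) -> list[list[int]]:
    """First-fit-decreasing bin packing: assign pod CPU requests to nodes."""
    nodes: list[list[int]] = []
    for req in sorted(cpu_requests_m, reverse=True):
        for node in nodes:
            if sum(node) + req <= node_capacity_m:
                node.append(req)
                break
        else:
            nodes.append([req])  # no existing node fits; provision a new one
    return nodes

# Twelve pods (7000m of requests total) that might be scattered across
# several half-empty 4-vCPU nodes fit on just two when repacked:
pods_m = [500, 500, 1500, 250, 250, 250, 1000, 750, 750, 250, 500, 500]
print(len(pack(pods_m, node_capacity_m=4000)))  # 2
```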

Step 5: Address Unused PVCs and Idle Namespaces

Persistent Volume Claims (PVCs) in Kubernetes are backed by EBS volumes (on EKS) or equivalent cloud storage. PVCs created for StatefulSets, databases, and jobs often persist after the workload is deleted. Finding and cleaning them up:

# PVCs that are not Bound (Pending, Lost, or otherwise orphaned)
kubectl get pvc --all-namespaces | grep -v Bound

# Bound PVCs that no pod actually mounts: collect every claim name
# referenced by a pod, then list claims that aren't in that set
# (claim names here aren't namespace-qualified — verify before deleting)
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[].spec.volumes[]? | select(.persistentVolumeClaim) |
         .persistentVolumeClaim.claimName' | sort -u > /tmp/mounted-claims.txt

kubectl get pvc --all-namespaces -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) \(.spec.resources.requests.storage)"' | \
  grep -v -F -f /tmp/mounted-claims.txt

Also check for entire idle namespaces — development or feature-branch namespaces that were never cleaned up:

# Find namespaces that currently have zero pods (likely abandoned)
kubectl get ns -o json | jq -r '.items[].metadata.name' | while read ns; do
  pod_count=$(kubectl get pods -n "$ns" --no-headers 2>/dev/null | wc -l)
  if [ "$pod_count" -eq 0 ]; then
    echo "IDLE NAMESPACE: $ns"
  fi
done

Real-World Results: What These Optimizations Deliver

Here's a realistic savings scenario for a team running a 20-node EKS cluster on m5.2xlarge instances in us-east-1:
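A back-of-the-envelope version of that scenario, assuming us-east-1 on-demand Linux pricing of roughly $0.384/hour for m5.2xlarge (verify against current rates) and assuming rightsizing plus consolidation lets the same workloads run on 12 nodes — both figures are assumptions for illustration:

```python
NODE_HOURLY_USD = 0.384   # m5.2xlarge on-demand, us-east-1 (assumed; check current pricing)
HOURS_PER_MONTH = 730

def monthly_cost(node_count: int) -> float:
    """On-demand EC2 cost of the cluster's node fleet per month."""
    return node_count * NODE_HOURLY_USD * HOURS_PER_MONTH

baseline = monthly_cost(20)     # the 20-node cluster as-is
optimized = monthly_cost(12)    # hypothetical post-optimization node count
print(f"baseline:  ${baseline:,.0f}/month")
print(f"optimized: ${optimized:,.0f}/month")
print(f"annual savings: ${(baseline - optimized) * 12:,.0f}")
```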

Optimize Your Entire AWS + Kubernetes Bill

Hero Savings covers your AWS infrastructure costs — EC2, RDS, S3, networking — with the same rigor you'd apply to K8s optimization. Get a complete picture of where your cloud money is going and exactly how to reduce it.

Start Free AWS Audit →

Frequently Asked Questions

Should I use VPA in Auto mode or just Recommendation mode?
Start with Recommendation mode (updateMode: "Off"). Review recommendations for 1–2 weeks to build confidence that VPA's suggestions fit your workloads, then move low-risk workloads (batch jobs, internal tools) to Auto mode first. Be cautious with Auto mode for anything stateful or single-replica — VPA evicts pods to resize them, which causes downtime when there's no second replica to serve traffic. For production stateful workloads, manual right-sizing informed by VPA recommendations is often safer.
What's the difference between Karpenter and Cluster Autoscaler?
Cluster Autoscaler scales predefined node groups up or down — it can only add/remove nodes of types you've pre-configured. Karpenter provisions any instance type that satisfies pending pod requirements, choosing the most cost-efficient option in real-time. Karpenter also consolidates underutilized nodes proactively (Cluster Autoscaler is more conservative). The practical result: Karpenter typically achieves 15–30% better node utilization than Cluster Autoscaler through smarter bin-packing and consolidation.
How do I right-size Kubernetes resources without causing OOMKilled errors?
Set memory requests equal to memory limits for containers with unpredictable memory usage — memory is not compressible, so a pod that bursts above its request on a memory-tight node is a prime OOMKill candidate. Treat VPA recommendations as a starting point, not a hard ceiling: set requests at the 95th percentile of observed usage rather than the average, and add a 20–30% buffer above measured peak usage for memory limits. Collect container-level memory metrics (kubelet/cAdvisor via Prometheus, or the CloudWatch agent with Container Insights) to get accurate data before reducing anything.
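The percentile-based sizing rule can be sketched with Python's statistics module (the usage samples here are invented for illustration):

```python
import statistics

def size_memory(samples_mib: list[int], limit_buffer: float = 1.3):
    """Request at ~p95 of observed usage; limit at observed peak + 30% buffer."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile
    p95 = statistics.quantiles(samples_mib, n=20, method="inclusive")[-1]
    limit = max(samples_mib) * limit_buffer
    return round(p95), round(limit)

# Hypothetical memory samples (MiB) scraped over a day, with one 400Mi spike:
usage = [180, 190, 185, 210, 195, 400, 205, 188, 192, 220]
request, limit = size_memory(usage)
print(f"memory request: {request}Mi, memory limit: {limit}Mi")
```

Note how the request lands well below the limit: the p95 discounts the one-off spike, while the limit keeps headroom above it.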
Is Spot instance usage safe for Kubernetes workloads?
Yes, with the right architecture. Spot instances can be interrupted with 2-minute notice, so workloads on Spot nodes must be able to handle graceful termination. Best practices: (1) Use Karpenter's mixed Spot + On-Demand capacity type to ensure critical workloads always have On-Demand fallback; (2) Set terminationGracePeriodSeconds appropriately for your application; (3) Use pod disruption budgets to ensure minimum availability during Spot interruptions; (4) Avoid stateful workloads (databases, ZooKeeper) on Spot nodes. For stateless API servers, workers, and batch jobs, Spot is safe and delivers 60–80% cost savings.
What's the fastest way to identify the biggest K8s cost waste in my cluster?
Install Kubecost (free community edition) — it provides a cluster-wide cost breakdown by namespace, deployment, and label within minutes of installation. The "Efficiency" view shows the ratio of actual resource usage to requests, sorted by waste amount. Focus on the top 10 deployments by waste amount — they typically account for 80%+ of total over-provisioning. Fix those with VPA recommendations before worrying about the long tail. Kubecost + VPA recommendations together give you a complete picture and implementation path.