Kubernetes Pod Scaling: A Comprehensive Guide
Overview
Kubernetes provides multiple mechanisms for scaling pods to meet varying workload demands. This document explores the different scaling approaches, their components, and how they work together to maintain application availability and performance.
Types of Pod Scaling
1. Manual Scaling
Direct manipulation of replica counts through kubectl or API calls.
kubectl scale deployment my-app --replicas=5
2. Horizontal Pod Autoscaling (HPA)
Automatically scales the number of pod replicas based on metrics.
3. Vertical Pod Autoscaling (VPA)
Adjusts CPU and memory requests/limits for containers.
4. Cluster Autoscaling
Scales the number of nodes in the cluster based on pod resource requirements.
Horizontal Pod Autoscaler (HPA) Architecture
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Metrics API   │    │ HPA Controller  │    │   Deployment    │
│                 │    │                 │    │                 │
│ ┌─────────────┐ │    │ ┌─────────────┐ │    │ ┌─────────────┐ │
│ │   CPU/Mem   │ │◄───┤ │  Decision   │ │────┤ │ Replica Set │ │
│ │   Metrics   │ │    │ │   Engine    │ │    │ │             │ │
│ └─────────────┘ │    │ └─────────────┘ │    │ └─────────────┘ │
│                 │    │                 │    │                 │
│ ┌─────────────┐ │    │ ┌─────────────┐ │    │ ┌─────────────┐ │
│ │   Custom    │ │    │ │    Scale    │ │    │ │    Pods     │ │
│ │   Metrics   │ │    │ │ Calculator  │ │    │ │  (N pods)   │ │
│ └─────────────┘ │    │ └─────────────┘ │    │ └─────────────┘ │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         ▲                      │                      │
         │                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│ Metrics Server  │    │ Scaling Events  │    │  Pod Lifecycle  │
│                 │    │                 │    │                 │
│ ┌─────────────┐ │    │ • Scale Up      │    │ • Creating      │
│ │   kubelet   │ │    │ • Scale Down    │    │ • Running       │
│ │  cAdvisor   │ │    │ • Stabilization │    │ • Terminating   │
│ └─────────────┘ │    └─────────────────┘    └─────────────────┘
└─────────────────┘
HPA Scaling Algorithm
Target Replica Calculation
desiredReplicas = ceil[currentReplicas * (currentMetricValue / targetMetricValue)]
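A minimal Python sketch of this formula (the function name and sample values are illustrative; the real controller additionally applies a tolerance band and min/max replica clamping):

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric):
    # desiredReplicas = ceil[currentReplicas * (currentMetricValue / targetMetricValue)]
    return math.ceil(current_replicas * (current_metric / target_metric))

# 4 replicas averaging 90% CPU against a 60% target -> 6 replicas
print(desired_replicas(4, 90, 60))   # 6
# 5 replicas at 30% against a 60% target -> shrink to 3
print(desired_replicas(5, 30, 60))   # 3
```

Note that the ceiling rounds up, so the algorithm prefers slight over-provisioning to under-provisioning.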
Scaling Decision Flow
┌─────────────────┐
│ Collect Metrics │
└─────────┬───────┘
          │
          ▼
┌─────────────────┐      No      ┌─────────────────┐
│ Metrics Valid?  │──────────────┤  Skip Scaling   │
└─────────┬───────┘              └─────────────────┘
          │ Yes
          ▼
┌─────────────────┐
│    Calculate    │
│ Target Replicas │
└─────────┬───────┘
          │
          ▼
┌─────────────────┐     Yes      ┌─────────────────┐
│     Within      │──────────────┤    No Action    │
│   Tolerance?    │              │    Required     │
└─────────┬───────┘              └─────────────────┘
          │ No
          ▼
┌─────────────────┐      No      ┌─────────────────┐
│    Cooldown     │──────────────┤    Wait for     │
│  Period Over?   │              │    Cooldown     │
└─────────┬───────┘              └─────────────────┘
          │ Yes
          ▼
┌─────────────────┐
│ Apply Scaling   │
│    Decision     │
└─────────────────┘
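The flow above can be condensed into one function. This is a deliberate simplification with illustrative names (the real controller evaluates each metric separately and uses stabilization windows rather than a single cooldown flag), using the default 10% tolerance:

```python
import math

TOLERANCE = 0.10  # default value of --horizontal-pod-autoscaler-tolerance

def scaling_decision(current_replicas, current_metric, target_metric,
                     metrics_valid=True, cooldown_over=True):
    # Step 1: invalid or missing metrics -> skip scaling entirely
    if not metrics_valid:
        return current_replicas, "skip scaling"
    ratio = current_metric / target_metric
    # Step 2: within tolerance of the target -> no action required
    if abs(ratio - 1.0) <= TOLERANCE:
        return current_replicas, "no action required"
    # Step 3: still inside a cooldown window -> wait
    if not cooldown_over:
        return current_replicas, "wait for cooldown"
    # Step 4: apply the replica calculation
    return math.ceil(current_replicas * ratio), "apply scaling"

print(scaling_decision(4, 90, 60))  # (6, 'apply scaling')
print(scaling_decision(4, 63, 60))  # within tolerance -> (4, 'no action required')
```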
HPA Configuration Examples
Basic CPU-based HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
Multi-metric HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: advanced-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 4
        periodSeconds: 60
      selectPolicy: Max
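With several metrics configured, the controller computes a desired replica count for each metric independently and acts on the largest result. A sketch of that combination rule (values hypothetical):

```python
import math

def desired_for_metric(current_replicas, current, target):
    # Per-metric application of the standard HPA formula
    return math.ceil(current_replicas * current / target)

def combine_metrics(current_replicas, metrics):
    # metrics: list of (current_value, target_value) pairs;
    # the highest per-metric result wins
    return max(desired_for_metric(current_replicas, c, t) for c, t in metrics)

# CPU at 90% vs 60% target, memory at 70% vs 80%, RPS at 120 vs 100
print(combine_metrics(3, [(90, 60), (70, 80), (120, 100)]))  # 5 (driven by CPU)
```

Taking the maximum means any single saturated metric can trigger a scale-up, while scale-down requires every metric to be below target.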
Vertical Pod Autoscaler (VPA) Architecture
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│ VPA Recommender │    │   VPA Updater   │    │  VPA Admission  │
│                 │    │                 │    │   Controller    │
│ ┌─────────────┐ │    │ ┌─────────────┐ │    │ ┌─────────────┐ │
│ │  Resource   │ │    │ │     Pod     │ │    │ │   Webhook   │ │
│ │  Analysis   │ │    │ │  Eviction   │ │    │ │  Injection  │ │
│ └─────────────┘ │    │ └─────────────┘ │    │ └─────────────┘ │
│                 │    │                 │    │                 │
│ ┌─────────────┐ │    │ ┌─────────────┐ │    │ ┌─────────────┐ │
│ │   History   │ │    │ │  Resource   │ │    │ │  Request/   │ │
│ │  Tracking   │ │    │ │   Updates   │ │    │ │ Limit Patch │ │
│ └─────────────┘ │    │ └─────────────┘ │    │ └─────────────┘ │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
┌───────────────────────────────────────────────────────────────┐
│                     Pod Resource Updates                      │
│                                                               │
│    ┌─────────────┐     ┌─────────────┐     ┌─────────────┐    │
│    │     Pod     │     │     Pod     │     │     Pod     │    │
│    │  CPU: 100m  │     │  CPU: 200m  │     │  CPU: 150m  │    │
│    │  Mem: 128Mi │     │  Mem: 256Mi │     │  Mem: 192Mi │    │
│    └─────────────┘     └─────────────┘     └─────────────┘    │
└───────────────────────────────────────────────────────────────┘
VPA Configuration Example
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Auto"  # Off, Initial, Auto
  resourcePolicy:
    containerPolicies:
    - containerName: web-container
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2000m
        memory: 2Gi
      controlledResources: ["cpu", "memory"]
      controlledValues: RequestsAndLimits
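The minAllowed/maxAllowed bounds act as a clamp on whatever the recommender proposes. A sketch of that behavior (function name hypothetical, values in CPU millicores matching the policy above):

```python
def clamp_recommendation(recommended, min_allowed, max_allowed):
    # VPA keeps the applied request between the policy bounds
    return max(min_allowed, min(recommended, max_allowed))

# recommendations against the 100m..2000m bounds from the policy above
print(clamp_recommendation(50, 100, 2000))    # 100  -> raised to minAllowed
print(clamp_recommendation(3500, 100, 2000))  # 2000 -> capped at maxAllowed
print(clamp_recommendation(500, 100, 2000))   # 500  -> within bounds, unchanged
```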
Cluster Autoscaler Architecture
┌───────────────────────────────────────────────────────────────┐
│                      Cluster Autoscaler                       │
│                                                               │
│    ┌─────────────┐     ┌─────────────┐     ┌─────────────┐    │
│    │  Scale Up   │     │ Scale Down  │     │ Node Group  │    │
│    │  Decision   │     │  Decision   │     │ Management  │    │
│    │             │     │             │     │             │    │
│    │ • Pending   │     │ • Under-    │     │ • AWS ASG   │    │
│    │   Pods      │     │   utilized  │     │ • GCP MIG   │    │
│    │ • Resource  │     │   Nodes     │     │ • Azure     │    │
│    │   Requests  │     │ • Grace     │     │   VMSS      │    │
│    │             │     │   Period    │     │             │    │
│    └─────────────┘     └─────────────┘     └─────────────┘    │
└───────────────────────────────────────────────────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Add Nodes    │    │  Remove Nodes   │    │ Cloud Provider  │
│                 │    │                 │    │       API       │
│ ┌─────────────┐ │    │ ┌─────────────┐ │    │ ┌─────────────┐ │
│ │    Node     │ │    │ │   Drain &   │ │    │ │  Instance   │ │
│ │ Provisioning│ │    │ │  Terminate  │ │    │ │ Management  │ │
│ └─────────────┘ │    │ └─────────────┘ │    │ └─────────────┘ │
└─────────────────┘    └─────────────────┘    └─────────────────┘
Scaling Event Timeline
Time:  0s          30s         60s         90s         120s        150s        180s
       │           │           │           │           │           │           │
       ▼           ▼           ▼           ▼           ▼           ▼           ▼
       ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
       │Load     │ │Metrics  │ │HPA      │ │Pods     │ │Metrics  │ │Scale    │
       │Increase │ │Exceed   │ │Triggers │ │Starting │ │Stabilize│ │Complete │
       │         │ │Target   │ │Scale Up │ │         │ │         │ │         │
       └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘
        Phase 1:    Phase 2:    Phase 3:    Phase 4:      Phase 5:
        Detection   Decision    Execution   Provisioning  Stabilization
Scaling Policies and Behaviors
Scale-Up Policies
scaleUp:
  stabilizationWindowSeconds: 0
  policies:
  - type: Percent
    value: 100          # Double the replicas
    periodSeconds: 15   # Every 15 seconds
  - type: Pods
    value: 4            # Add max 4 pods
    periodSeconds: 60   # Every minute
  selectPolicy: Max     # Choose the larger increase
Scale-Down Policies
scaleDown:
  stabilizationWindowSeconds: 300  # 5-minute stabilization
  policies:
  - type: Percent
    value: 50           # Reduce by 50%
    periodSeconds: 60   # Every minute
  - type: Pods
    value: 2            # Remove max 2 pods
    periodSeconds: 60   # Every minute
  selectPolicy: Min     # Choose the smaller reduction
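Each policy defines the maximum change allowed per period, and selectPolicy arbitrates between them. A sketch of the scale-down case (simplified: it ignores the rolling event history the controller actually tracks across periodSeconds):

```python
import math

def max_scale_down(current_replicas, policies, select_policy="Min"):
    """policies: list of ('Percent'|'Pods', value) pairs.
    Returns the lowest replica count the policies allow in one period."""
    floors = []
    for kind, value in policies:
        if kind == "Percent":
            floors.append(current_replicas - math.floor(current_replicas * value / 100))
        else:  # 'Pods'
            floors.append(current_replicas - value)
    # Min = most conservative change (highest remaining count),
    # Max = most aggressive change (lowest remaining count)
    return max(floors) if select_policy == "Min" else min(floors)

# 10 replicas, 50%/min and 2 pods/min policies, selectPolicy: Min
print(max_scale_down(10, [("Percent", 50), ("Pods", 2)]))  # 8 (remove only 2)
```

With selectPolicy: Max the same policies would allow dropping straight to 5 replicas in one period, so Min is the safer default for scale-down.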
Resource Metrics Flow
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    cAdvisor     │    │ Metrics Server  │    │     HPA/VPA     │
│                 │    │                 │    │                 │
│ ┌─────────────┐ │    │ ┌─────────────┐ │    │ ┌─────────────┐ │
│ │  Container  │ │────┤ │ Aggregation │ │────┤ │  Decision   │ │
│ │   Metrics   │ │    │ │  & Storage  │ │    │ │   Making    │ │
│ └─────────────┘ │    │ └─────────────┘ │    │ └─────────────┘ │
│                 │    │                 │    │                 │
│ • CPU Usage     │    │ • 15s samples   │    │ • Scale Up/Down │
│ • Memory Usage  │    │ • Pod averages  │    │ • Resource Adj  │
│ • Network I/O   │    │ • API exposure  │    │ • Policy Apply  │
│ • Disk I/O      │    │                 │    │                 │
└─────────────────┘    └─────────────────┘    └─────────────────┘
Custom Metrics Integration
External Metrics (e.g., Queue Length)
metrics:
- type: External
  external:
    metric:
      name: sqs_messages_visible
      selector:
        matchLabels:
          queue: "worker-queue"
    target:
      type: AverageValue
      averageValue: "30"
Object Metrics (e.g., Ingress RPS)
metrics:
- type: Object
  object:
    metric:
      name: requests_per_second
    describedObject:
      apiVersion: networking.k8s.io/v1
      kind: Ingress
      name: main-route
    target:
      type: Value
      value: "1000"
Best Practices for Pod Scaling
1. Resource Requests and Limits
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi
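Note that HPA's Utilization targets are measured against the request, not the limit: with the 100m request above, a 70% target corresponds to only 70m of actual CPU use. A quick check (values hypothetical):

```python
def utilization_percent(usage_millicores, request_millicores):
    # HPA's averageUtilization compares observed usage to the pod's request
    return 100 * usage_millicores / request_millicores

# 70m of usage against the 100m request above is 70% utilization,
# even though the limit (500m) is far from exhausted
print(utilization_percent(70, 100))  # 70.0
```

This is why missing resource requests break the HPA entirely: without a request, utilization is undefined.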
2. Readiness and Liveness Probes
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 15
periodSeconds: 20
3. Pod Disruption Budgets
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-app
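During scale-downs and node drains, the PDB caps voluntary evictions so that at least minAvailable pods stay healthy; the arithmetic is simply (helper name hypothetical):

```python
def allowed_disruptions(current_healthy, min_available):
    # A PDB permits a voluntary eviction only while more than
    # minAvailable pods would remain healthy afterwards
    return max(0, current_healthy - min_available)

# 5 healthy pods with minAvailable: 2 -> up to 3 voluntary evictions
print(allowed_disruptions(5, 2))  # 3
# at exactly minAvailable, further voluntary evictions are blocked
print(allowed_disruptions(2, 2))  # 0
```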
Troubleshooting Scaling Issues
Common HPA Problems
- Missing Resource Requests
  # Check if pods have resource requests
  kubectl describe deployment myapp
- Metrics Server Issues
  # Verify metrics server is running
  kubectl get pods -n kube-system -l k8s-app=metrics-server
  # Check if metrics are available
  kubectl top nodes
  kubectl top pods
- HPA Status Check
  # Check HPA status and events
  kubectl describe hpa myapp-hpa
  kubectl get hpa myapp-hpa -o yaml
Scaling Decision Logs
# View metrics server logs (a frequent source of HPA metric failures)
kubectl logs -n kube-system deployment/metrics-server
# Note: the HPA controller itself runs inside kube-controller-manager;
# its logs live on the control plane and may be inaccessible on managed clusters
# Check scaling events
kubectl get events --sort-by=.metadata.creationTimestamp
Performance Considerations
HPA Performance Tuning
- Metric Collection Interval: Default 15 seconds
- Sync Period: The controller re-evaluates targets every 15 seconds by default (--horizontal-pod-autoscaler-sync-period); scale-down is further damped by a default 5-minute stabilization window
- Tolerance: Default ±10% to prevent thrashing
- Stabilization Windows: Prevent rapid fluctuations
VPA Performance Impact
- Recommendation Quality: Improves with history; the recommender keeps roughly 8 days of usage samples by default
- Resource Overhead: Additional memory usage for tracking container histories
- Pod Restarts: Applying new requests/limits evicts and recreates pods (in-place resize is available only in newer Kubernetes releases)
Cluster Autoscaler Timing
- Scale-up Decision: 10-30 seconds
- Node Provisioning: 2-5 minutes (cloud provider dependent)
- Scale-down Grace Period: 10 minutes default
- Node Removal: 30-60 seconds
Monitoring and Observability
Key Metrics to Monitor
- HPA Metrics
- Target vs Current replica count
- Scaling events frequency
- Metric collection latency
- Resource Utilization
- CPU/Memory usage patterns
- Request vs limit ratios
- Node resource availability
- Application Performance
- Response time during scaling
- Error rates during pod transitions
- Queue lengths and processing delays
Grafana Dashboard Queries
# Current vs Desired Replicas
kube_horizontalpodautoscaler_status_current_replicas{hpa="myapp-hpa"}
kube_horizontalpodautoscaler_status_desired_replicas{hpa="myapp-hpa"}
# Pod CPU/Memory Usage
rate(container_cpu_usage_seconds_total[5m])
container_memory_working_set_bytes
This comprehensive guide covers the essential aspects of Kubernetes pod scaling, providing both theoretical understanding and practical implementation details for effective autoscaling in production environments.