Cloud AI Campus

🧪 Hands-on lab · 45 min

Horizontal Pod Autoscaling

  1. Resource requests: the prerequisite
  2. Create the HorizontalPodAutoscaler
  3. Wait for metrics-server to report
  4. Generate load and watch it scale up
  5. Remove load and watch it scale down

Resource requests: the prerequisite

The Horizontal Pod Autoscaler (HPA) adjusts a Deployment's replica count based on observed metrics, typically CPU. To compute CPU utilization, the HPA needs every container in the target pods to declare how much CPU it expects to use. That number is resources.requests.cpu.

Look at the starter web Deployment:

cat autoscaling/web.yaml

Note the resource block:

resources:
  requests:
    cpu: "100m"
  limits:
    cpu: "200m"

100m is 1/10 of a CPU core. The HPA will compare actual CPU usage against this number, not against the limit.

Apply it:

kubectl apply -f autoscaling/web.yaml

Watch the single replica come up:

kubectl get deployment web

Wait until READY shows 1/1. Click Verify step.
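To make the request-versus-limit distinction concrete, here is a minimal sketch of the percentage the HPA works with: usage divided by the request. The helper names to_millicores and cpu_utilization_pct are hypothetical, and the usage figures are made-up inputs, not readings from this lab.

```shell
#!/bin/sh
# Convert a Kubernetes CPU quantity ("100m", "0.5", "2") to millicores.
# Only the plain and "m"-suffixed forms are handled in this sketch.
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v cores="$1" 'BEGIN { printf "%d\n", int(cores * 1000 + 0.5) }' ;;
  esac
}

# Utilization as a percentage of the request (integer division, rounds down).
# Arguments: USAGE REQUEST, both as CPU quantities.
cpu_utilization_pct() {
  usage=$(to_millicores "$1")
  request=$(to_millicores "$2")
  echo $(( usage * 100 / request ))
}

cpu_utilization_pct 80m 100m    # prints 80
cpu_utilization_pct 150m 100m   # prints 150: above the request, still under the 200m limit
```

Because the denominator is the request, utilization can exceed 100% whenever the limit is higher than the request, as it is in web.yaml.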

Hint

`kubectl apply -f autoscaling/web.yaml` deploys a single replica with `cpu: 100m` requests.

Create the HorizontalPodAutoscaler

The HPA is a separate object that points at a target Deployment.

Create one imperatively from the CLI:

kubectl autoscale deployment web --cpu-percent=50 --min=1 --max=4

This says: keep web between 1 and 4 replicas, aiming for an average CPU utilization of 50% of the requested CPU across its pods.
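The same autoscaler can also be expressed as a manifest, which is easier to keep in version control. The sketch below writes what should be an equivalent autoscaling/v2 object to a file; the filename web-hpa.yaml is hypothetical and is not one of the lab's files.

```shell
#!/bin/sh
# Write a manifest that mirrors:
#   kubectl autoscale deployment web --cpu-percent=50 --min=1 --max=4
# web-hpa.yaml is an assumed filename, not part of the lab repository.
cat > web-hpa.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
EOF
```

Running `kubectl apply -f web-hpa.yaml` in place of the kubectl autoscale command would create the HPA from this file. The lab's Verify step is written around the CLI form, so treat this as a reference rather than a required step.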

See it:

kubectl get hpa

The columns to watch:

  • REFERENCE: the Deployment it controls.
  • TARGETS: <current>/<target> CPU; <unknown>/50% if metrics haven't arrived yet.
  • MINPODS / MAXPODS / REPLICAS: the bounds and the current count.

For the full picture:

kubectl describe hpa web

While metrics-server is still warming up, the Events section may report that the HPA was unable to compute the replica count. The next step deals with that.

Click Verify step.

Hint

`kubectl autoscale deployment web --cpu-percent=50 --min=1 --max=4`.

Wait for metrics-server to report

The HPA reads CPU usage from metrics-server. This lab's k3s cluster starts without one; the scenario installer applied the upstream manifest in the background when the lab started.

Check it's running:

kubectl get deployment metrics-server -n kube-system

READY should be 1/1. If it isn't yet, give it a minute: pulling the image takes ~15-30 seconds the first time.

Once the deployment is Ready, ask for live pod metrics:

kubectl top pods

If you see "No resources found." or "metrics not available yet", wait ~30 seconds and try again: metrics-server polls every 15s and needs a couple of cycles before it has numbers.
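Rather than re-running the command by hand, the wait can be scripted. The following is a minimal sketch of a generic retry helper; wait_for is a hypothetical name, and it relies on the assumption that the command being retried exits non-zero until it has something to report.

```shell
#!/bin/sh
# wait_for ATTEMPTS DELAY CMD...
# Run CMD until it exits 0, sleeping DELAY seconds between tries.
# Returns 0 on the first success, 1 if every attempt fails.
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # No need to sleep after the final attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, `wait_for 8 15 kubectl top pods` would retry for up to about two minutes, matching the 15s polling interval mentioned above.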

Once kubectl top pods returns CPU/memory rows, look at the HPA again:

kubectl get hpa

TARGETS no longer shows <unknown>/50%; it shows the live percentage. That means the HPA is now functional.

Click Verify step.

Hint

`kubectl top pods`: once it works, the HPA's TARGETS column stops showing `<unknown>`.

Generate load and watch it scale up

The web Deployment expects to be reached via a Service called web. Create it:

kubectl expose deployment web --port=8080 --target-port=8080

Now apply the load generator, a busybox pod that hammers web with wget in a tight loop:

kubectl apply -f autoscaling/loadgen.yaml

One load pod isn't enough to drive web's CPU over 50%. Pile on more:

kubectl scale deployment loadgen --replicas=5

In another tab, or in this same one, watch the HPA reach for more replicas:

kubectl get hpa web -w

The progression takes 1-3 minutes. You should see:

  1. TARGETS climb past 50%.
  2. REPLICAS step up, usually from 1 to 2, then 3, then 4 (capped by max).
  3. kubectl get pods -l run=web shows the new pods.

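By default the HPA does not delay scale-up: there is no stabilization window in that direction, so replicas are added as soon as the measured utilization calls for them. The default behaviour limits each 15-second period to doubling the replica count or adding 4 pods, whichever is greater, capped by maxReplicas.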
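The replica counts you see follow the formula in the Kubernetes HPA documentation: desired = ceil(current × currentUtilization / targetUtilization), skipped when the ratio is within a 10% tolerance of 1 and then clamped to the min/max bounds. The sketch below reproduces that arithmetic; desired_replicas is a hypothetical helper, the utilization figures are invented inputs, and the real controller also accounts for pod readiness and missing metrics, which are left out here.

```shell
#!/bin/sh
# desired_replicas CURRENT UTIL TARGET MIN MAX
# CURRENT: current replica count; UTIL/TARGET: average and target utilization in percent.
desired_replicas() {
  current=$1; util=$2; target=$3; min=$4; max=$5
  # Within the default 10% tolerance of the target: keep the current count.
  if [ $((util * 10)) -ge $((target * 9)) ] && [ $((util * 10)) -le $((target * 11)) ]; then
    echo "$current"
    return 0
  fi
  # ceil(current * util / target) using integer arithmetic.
  desired=$(( (current * util + target - 1) / target ))
  # Clamp to the configured bounds.
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}

desired_replicas 1 120 50 1 4   # ceil(2.4) -> prints 3
desired_replicas 2 180 50 1 4   # ceil(7.2) = 8, capped -> prints 4
desired_replicas 4 10 50 1 4    # ceil(0.8) -> prints 1
desired_replicas 2 52 50 1 4    # within tolerance -> prints 2
```

As replicas are added, the same load is spread over more pods and the average utilization drops, which is why the count tends to climb in steps rather than jump straight to the maximum.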

Click Verify step once kubectl get deployment web shows at least 2 ready replicas.

Hint

`kubectl apply -f autoscaling/loadgen.yaml; kubectl scale deployment/loadgen --replicas=5` drives `web`'s CPU up; replicas climb toward 4.

Remove load and watch it scale down

Remove the load:

kubectl scale deployment loadgen --replicas=0

Confirm no more load pods:

kubectl get pods -l app=loadgen

The HPA scales down more cautiously than it scales up: to avoid flapping, it waits 5 minutes by default before reducing replicas. You can watch the wait:

kubectl describe hpa web | tail -20

The Events: section will show New size: <N> lines as the HPA makes decisions, and the Conditions: section will indicate ScaleDownStabilized while it's waiting.
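The wait comes from how the stabilization window works: for scale-down, the controller looks at every replica recommendation it computed during the window and acts on the highest one. A minimal sketch of that rule, with stabilized_replicas as a hypothetical helper and the recommendation lists as invented inputs:

```shell
#!/bin/sh
# stabilized_replicas REC...
# Given the recommendations computed during the window (oldest first),
# print the highest one, which is the count a scale-down may not go below.
stabilized_replicas() {
  highest=$1
  for rec in "$@"; do
    if [ "$rec" -gt "$highest" ]; then
      highest=$rec
    fi
  done
  echo "$highest"
}

stabilized_replicas 4 3 1 1   # prints 4: the early "4" is still inside the window
stabilized_replicas 1 1 1 1   # prints 1: the window has fully turned over
```

Only once the last high recommendation ages out of the window does the replica count actually drop.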

After ~5 minutes (or earlier, if you just want to see the intermediate state), check again:

kubectl get hpa web

REPLICAS settles back to 1.

Production tip: tune behavior.scaleDown.stabilizationWindowSeconds on the HPA spec to trade scale-down responsiveness for stability. Spiky workloads often want a longer window; predictable workloads can shrink it.
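As an illustration of that tip, the sketch below writes a small patch that shortens the scale-down window. The filename hpa-scaledown-patch.yaml and the value of 60 seconds are assumptions chosen for the example, not recommendations from the lab.

```shell
#!/bin/sh
# Write a merge patch that sets the scale-down stabilization window.
# hpa-scaledown-patch.yaml is an assumed filename; 60 is an example value
# (the default is 300 seconds, the 5 minutes described above).
cat > hpa-scaledown-patch.yaml <<'EOF'
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60
EOF
```

It could then be applied with `kubectl patch hpa web --type merge --patch-file hpa-scaledown-patch.yaml`, after which `kubectl describe hpa web` should list the new window under Behavior.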

For this lab, it is enough that the load has been removed and the HPA has registered the drop. Click Verify step.

Hint

`kubectl scale deployment loadgen --replicas=0`: within ~5 minutes the HPA scales `web` back down.

© 2026 Cloud AI Campus