Handle Ops stuff like a developer would

Everything in version control…

…because YAML is text

Use branches for stages (e.g. dev, qa, live)

Pipeline to deploy to stages

Integrate changes using pull/merge requests

Add automated tests to pipeline

Changes are pushed into Kubernetes cluster
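A minimal sketch of such a deploy stage, assuming GitLab CI; the job name, image and manifest path are illustrative, not taken from the original:

deploy-dev:
  stage: deploy
  image: bitnami/kubectl:latest
  script:
    - kubectl apply -f manifests/
  rules:
    - if: '$CI_COMMIT_BRANCH == "dev"'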

Cluster access 1/

Different approaches to access the cluster from a pipeline

Inside cluster

Pipeline runs inside the target cluster

Direct API access with RBAC
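A sketch of the in-cluster RBAC setup, assuming a dedicated ServiceAccount for the pipeline bound to the built-in edit ClusterRole in a dev namespace:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: ci-deployer
  namespace: dev
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-deployer
  namespace: dev
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
- kind: ServiceAccount
  name: ci-deployer
  namespace: dev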

Next to cluster

Pipeline runs somewhere else…

…or does not have direct access to Kubernetes API

Pipeline fetches (encrypted) kubeconfig
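One possible shape of that step, assuming the kubeconfig is stored encrypted with sops and the decryption key is provided as a CI secret:

# decryption key comes from a CI secret, e.g. SOPS_AGE_KEY
sops --decrypt kubeconfig.enc.yaml > kubeconfig.yaml
export KUBECONFIG="$PWD/kubeconfig.yaml"
kubectl get nodes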

Useful tools

Validate YAML using yamllint

helm template my-ntpd ../helm/ntpd/ >ntpd.yaml
yamllint ntpd.yaml
cat <<EOF >.yamllint
extends: default
rules:
  indentation:
    indent-sequences: consistent
EOF
yamllint ntpd.yaml

Validate against the official schemas using kubeval:

kubeval ntpd.yaml

Static analysis using kube-linter

kube-linter lint ntpd.yaml
kube-linter lint ../helm/ntpd/
kube-linter checks list

Horizontal pod autoscaler (HPA) 1/2

apiVersion: autoscaling/v2   # the metrics block below requires the v2 API
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50

Manually scaling pods is time consuming

HPA changes replicas automagically

Supports CPU and memory usage


Deploy nginx and HPA

Create load and watch the HPA scale nginx
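A possible way to run this demo, assuming an nginx Deployment and Service named my-nginx (with a CPU request set) are already deployed; the load generator is a throwaway busybox pod:

kubectl autoscale deployment my-nginx --cpu-percent=50 --min=1 --max=10
kubectl run load-generator --image=busybox:1.36 --restart=Never -- \
    /bin/sh -c 'while true; do wget -q -O- http://my-nginx; done'
kubectl get hpa --watch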

Horizontal pod autoscaler (HPA) 2/2


Prerequisite: metrics-server must be installed in the cluster

Checks metrics every 15 seconds (default sync period)

Calculates the required number of replicas:

desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]
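For example, 2 current replicas at 80 % average CPU utilization with a 50 % target give ceil[2 * (80 / 50)] = ceil[3.2] = 4 replicas.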

Configurable scaling behaviour (spec.behavior), as sketched below
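A sketch of such a behaviour block; the values are illustrative, not taken from the original slides:

spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 60
    scaleUp:
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15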

Scheduling 1/2

Control where pods are placed


Resource requests are important for scheduling

Limits are important for eviction

You want requests == limits (Guaranteed QoS class)

Pods will not be evicted…

…because resource consumption is known at all times
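A minimal sketch of a container spec with requests equal to limits; image and values are illustrative:

containers:
- name: app
  image: nginx:1.25
  resources:
    requests:
      cpu: 250m
      memory: 256Mi
    limits:
      cpu: 250m
      memory: 256Mi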

Scheduling 2/2

Control where pods are placed

Node selector

Force pods onto specific nodes


Pod affinity / anti-affinity

Force pods onto the same node or onto different nodes

Taints / tolerations

Reserve nodes for specific pods (taints)

Pods must accept taints (tolerations)
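A sketch combining these mechanisms in a pod spec; the node label disktype=ssd, the taint dedicated=batch:NoSchedule and the app label are made-up examples:

spec:
  nodeSelector:
    disktype: ssd
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            app: my-app
  tolerations:
  - key: dedicated
    operator: Equal
    value: batch
    effect: NoSchedule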

Lessons Learnt 1/

Avoid kubectl create <resource>

kubectl create is not idempotent

Next pipeline run will fail because resource already exists

Instead, generate the resource definition on the fly and pipe it to kubectl apply:

kubectl create secret generic foo \
    --from-literal=bar=baz \
    --dry-run=client \
    --output=yaml \
| kubectl apply -f -

Lessons Learnt 2/

Wait for reconciliation

Reconciliation takes time

Do not use sleep after apply, scale, delete

Let kubectl do the waiting:

helm upgrade --install my-nginx bitnami/nginx \
    --set service.type=ClusterIP
kubectl rollout status deployment my-nginx --timeout=15m
# pods of the release carry the label app.kubernetes.io/instance=my-nginx
kubectl wait pods \
    --for=condition=ready \
    --selector app.kubernetes.io/instance=my-nginx \
    --timeout=15m

Works for jobs as well:

kubectl wait --for=condition=complete job/baz

Lessons Learnt 3/

Avoid hardcoded names

Finding the pod name manually is error-prone

Filter by label:

helm upgrade --install my-nginx bitnami/nginx \
    --set service.type=ClusterIP \
    --set replicaCount=2
kubectl delete pod --selector app.kubernetes.io/instance=my-nginx

Show logs of the first pod of a deployment:

kubectl logs deployment/my-nginx

Show logs of multiple pods at once with stern:

stern --selector app.kubernetes.io/instance=my-nginx

Lessons Learnt 4/

Troubleshooting individual pods

When a pod is broken, it can be investigated

Remove a label to exclude it from ReplicaSet, Deployment, Service

helm upgrade --install my-nginx bitnami/nginx \
    --set service.type=ClusterIP \
    --set replicaCount=2
kubectl get pods -l app.kubernetes.io/instance=my-nginx -o name \
| head -n 1 \
| xargs -I{} kubectl label {} app.kubernetes.io/instance-

ReplicaSet replaces missing pod

Remove the orphaned pod after troubleshooting:

kubectl logs --selector '!app.kubernetes.io/instance'
kubectl delete pod \
    -l 'app.kubernetes.io/name=nginx,!app.kubernetes.io/instance'

Lessons Learnt 5/

Use plaintext in Secrets (stringData instead of base64-encoded data)

Templating becomes easier when inserting plaintext

stringData:
  foo: bar

Do not store the rendered resource definitions after templating:

cat secret.yaml \
| envsubst \
| kubectl apply -f -
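A sketch of what such a secret.yaml could look like; the Secret name, the key and the ${BAR} variable are hypothetical and get filled in by envsubst from the pipeline environment:

apiVersion: v1
kind: Secret
metadata:
  name: foo
stringData:
  bar: ${BAR}

With BAR exported in the pipeline, the rendered Secret goes straight into kubectl apply and is never written to disk or committed.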

Lessons Learnt 6/

Update dependencies

Outdated Ops dependencies are also a (security) risk

Tools will be missing useful features

Services can contain vulnerabilities

Renovate/Dependabot FTW

Let bots do the work for you

Doing updates regularly is easier than catching up after falling behind

Automerge for patches can help stay on top of things

Automated tests help decide whether an update is safe
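A minimal Renovate configuration sketch along those lines; the preset and rule are assumptions, automerging only patch-level updates:

{
  "extends": ["config:recommended"],
  "packageRules": [
    {
      "matchUpdateTypes": ["patch"],
      "automerge": true
    }
  ]
}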