Making sense of requests for CPU resources in Kubernetes

Kubernetes allows a container to request several resource types:

apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: my-app
    image: images.example/my-app
    resources:
      requests:
        cpu: "100m"
        memory: "64Mi"
      limits:
        cpu: "500m"
        memory: "128Mi"

The resource type I found particularly confusing was cpu. For example, in the manifest above, the my-app container declares a request for “100m” of CPU. What does that mean?

The Kubernetes documentation on managing container resources describes it as follows:

Limits and requests for CPU resources are measured in cpu units. One cpu, in Kubernetes, is equivalent to 1 vCPU/Core for cloud providers and 1 hyperthread on bare-metal Intel processors. CPU is always requested as an absolute quantity, never as a relative quantity; 0.1 is the same amount of CPU on a single-core, dual-core, or 48-core machine.
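To make the units concrete, here is a minimal sketch that converts a CPU quantity into millicores. The helper name `cpu_to_millicores` is made up for this note, and it only understands the two forms used here: a plain number of CPUs and the "m" (milli) suffix.

```python
from decimal import Decimal


def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "100m" or "0.5" to millicores.

    Hypothetical helper for illustration; it handles only plain decimal
    numbers and the "m" suffix.
    """
    quantity = quantity.strip()
    if quantity.endswith("m"):
        # "100m" means 100 millicores, i.e. one tenth of one CPU.
        return int(quantity[:-1])
    # A plain number is a count of whole or fractional CPUs.
    return int(Decimal(quantity) * 1000)


for q in ("100m", "0.1", "500m", "2"):
    print(q, "->", cpu_to_millicores(q), "millicores")
```

So the “100m” request in the manifest is the same amount as 0.1 CPU, regardless of how many cores the node has.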

A misleading (but fairly common) mental model among developers is that the application inside the Pod’s container is “boxed” by the container runtime, as if the application received a dedicated slice of the machine’s resources. That’s not exactly how it works, at least not when Kubernetes uses Docker as the underlying container runtime.

I find it easier to think about the requested resources as a way for an application to “hint” to the Kubernetes scheduler how much of each resource the application thinks it needs.

The Kubernetes scheduler keeps the “accounting” of how much of the total resources the tenants of the cluster have requested and how much the cluster’s machines have to offer. It’s important to stress that the “accounting” is based only on the resource requests. That is, the scheduler doesn’t check whether a container actually uses the resources it requested. If a container requests more than half of the node’s CPU resources, e.g. “1100m” on a 2 vCPU node, Kubernetes won’t place more than one replica of this container on that node, no matter whether the application inside the container is idle.

Each node has a maximum capacity for each of the resource types: the amount of CPU and memory it can provide for Pods. The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled containers is less than the capacity of the node. Note that although actual memory or CPU resource usage on nodes is very low, the scheduler still refuses to place a Pod on a node if the capacity check fails.

What has been requested, has been booked.
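The bookkeeping described above can be sketched in a few lines. This is a toy model, not the scheduler’s actual code: the `Node` class and its methods are invented for illustration, and treating a request that exactly fills the node as fitting is an assumption of the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy model of a node's CPU accounting (hypothetical, for illustration)."""

    capacity_millicores: int
    # CPU requests of the containers already placed on this node.
    booked: list[int] = field(default_factory=list)

    def fits(self, request_millicores: int) -> bool:
        # Only declared requests are summed; actual usage is never consulted.
        return sum(self.booked) + request_millicores <= self.capacity_millicores

    def schedule(self, request_millicores: int) -> bool:
        if not self.fits(request_millicores):
            return False
        self.booked.append(request_millicores)
        return True


node = Node(capacity_millicores=2000)  # a 2 vCPU node
first = node.schedule(1100)   # 1100m fits into the empty node
second = node.schedule(1100)  # 2200m in total exceeds 2000m, so it is refused
print(first, second)
```

The second replica is refused even if the first one sits completely idle, because the sketch, like the scheduler, looks only at what was requested.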

Of course, “accounting” for the existing resources is only half of the story. The same documentation on managing container resources covers the implementation details of what happens after a container gets scheduled to a node, referring to Docker’s documentation on the CPU share constraint. The details of how Docker itself juggles CPU shares between running containers can be even more confusing, but the interesting part to remember is the following:

The proportion [of CPU cycles] will only apply when CPU-intensive processes are running. When tasks in one container are idle, other containers can use the left-over CPU time. [..] On a multi-core system, the shares of CPU time are distributed over all CPU cores. Even if a container is limited to less than 100% of CPU time, it can use 100% of each individual CPU core.
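A rough way to picture the quoted behaviour is the sketch below. It assumes the conversion the Kubernetes documentation describes for Docker (the request in cores multiplied by 1024, with a floor of 2, becomes the CPU shares value). The function names are hypothetical, and the model is deliberately simplified: it ignores CPU limits and assumes every busy container can use as many cores as it is given.

```python
def cpu_shares(request_millicores: int) -> int:
    # Request in cores multiplied by 1024, never lower than 2.
    return max(2, request_millicores * 1024 // 1000)


def split_under_contention(node_millicores, shares_by_container, busy):
    """Split the node's CPU time among the busy containers by their shares.

    Simplified model: idle containers take nothing, and their time is
    left over for the busy ones.
    """
    if not busy:
        return {}
    total = sum(shares_by_container[name] for name in busy)
    return {
        name: node_millicores * shares_by_container[name] / total
        for name in busy
    }


shares = {"a": cpu_shares(100), "b": cpu_shares(500)}

# Both containers are CPU-hungry: the node is split in proportion to shares.
both = split_under_contention(2000, shares, busy=["a", "b"])

# Container "b" is idle: "a" can use the whole node despite requesting 100m.
alone = split_under_contention(2000, shares, busy=["a"])

print(shares, both, alone)
```

In this model the shares only matter while containers compete; a lone busy container is not held to its request.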

The important difference between the ideas of a “box” and a “hint” is that a “hint”, i.e. a request by itself, doesn’t prevent the application from consuming all of the node’s CPU resources even though it requested only half of them. Containers are not VMs, after all. Using the same example of a cluster node with 2 vCPUs, a multi-threaded application inside a container still sees two CPU cores. If there are no other tenants on the node, the application has no one to compete with for its share of CPU resources.
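One way to check what the application sees is to ask the runtime from inside the container. The snippet below is only a sketch and assumes Python is available in the container image; `visible_cpus` is a name made up for this note. A CPU request does not change what `os.cpu_count()` reports.

```python
import os


def visible_cpus() -> int:
    # os.cpu_count() reports the logical CPUs of the machine; it may
    # return None when the count cannot be determined.
    count = os.cpu_count()
    return count if count is not None else 1


print("CPUs visible to this process:", visible_cpus())
```

On the 2 vCPU node from the example, this would be expected to report two CPUs even for a container that requested only “100m”.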

As mentioned earlier, this applies to the Docker container runtime. Don’t forget to consult the documentation provided by your particular Kubernetes distribution when making decisions that involve fine-tuning cluster resources. Docker is still the runtime AWS EKS uses for Kubernetes up to version 1.21, but things might work differently in your cluster as Kubernetes providers switch to alternative container runtimes.

