Kubernetes Node Internals — Part 3: Pod Lifecycle
Part 3 of a 5-part series: the full birth sequence of a Pod, from scheduler decision and pause container to CNI wiring, overlayfs, and a running process.
"You ran
kubectl apply -f pod.yaml. Somewhere around 300ms later, a Linux process is running on a node. Here is every single thing that happened in between."
This is the part of Kubernetes most people use every day, but few people trace end to end.
You declare a Pod. The scheduler picks a node. Then, very quickly, something on that machine turns YAML into:
- a network namespace
- a sandbox
- one or more cgroups
- a writable container filesystem
- one or more Linux processes
That sequence is the birth of a Pod.
And once you see the sequence clearly, container startup stops feeling mysterious.
Series roadmap
- Part 1 — The anatomy of a node
- Part 2 — Bootstrap and the secret handshake
- Part 3 — A pod is born
- Part 4 — Keeping the node alive
- Part 5 — CSI, volumes, and mounts on the node
How kubelet watches for new Pods
By the time this story starts, the scheduler has already chosen a node.
That means the Pod now has a .spec.nodeName, and the kubelet running on that node is responsible for it.
The kubelet continuously watches the API server for changes relevant to that node.
The watch loop on the API server
Conceptually, kubelet maintains a watch-driven loop that keeps asking:
"Which Pods assigned to me should exist right now?"
When a new Pod appears for the node, kubelet does not instantly do everything in one giant blocking function. Instead, it turns the desired work into internal tasks and reconciliation steps.
Pod spec lands in kubelet's work queue
Once kubelet notices the new Pod, it adds that Pod to its internal work pipeline.
From there, kubelet drives a sequence roughly like this:
- prepare volumes and sandbox requirements
- ask the CRI runtime to create the Pod sandbox
- ensure networking is ready
- start init containers if present
- start app containers
- update status back to the API server
Kubernetes loves reconciliation loops, and Pod startup is no exception.
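The pipeline above can be sketched as a toy, watch-driven reconciliation loop. This is a pure simulation in Python; the real kubelet is Go, uses client-go informers, and runs a worker per Pod. All names here are illustrative:

```python
import queue

# Desired state arrives via watch events and lands in a work queue;
# a sync function then drives the Pod toward that state step by step.
work = queue.Queue()

def on_watch_event(pod_name: str) -> None:
    work.put(pod_name)  # a new Pod assigned to this node

def sync_pod(pod_name: str) -> list[str]:
    # Each step is idempotent: re-running a sync is always safe.
    return [
        f"{pod_name}: prepare volumes",
        f"{pod_name}: create sandbox via CRI",
        f"{pod_name}: ensure networking",
        f"{pod_name}: start init containers",
        f"{pod_name}: start app containers",
        f"{pod_name}: report status",
    ]

on_watch_event("web-0")
steps = sync_pod(work.get())
print(steps[1])  # → web-0: create sandbox via CRI
```

The key property to internalize is the queue-then-reconcile shape: events enqueue work, and a sync function converges on desired state rather than executing one giant blocking call.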
The pause container — the unsung hero
One of the least celebrated pieces in Kubernetes is the pause container.
Beginners often assume a Pod is just a group of containers launched together. But before the app containers start, Kubernetes usually creates a tiny Pod sandbox first.
That sandbox is commonly represented by the pause container.
Why it exists
The pause container exists mainly to provide a stable anchor for the Pod's shared environment.
That includes, most importantly:
- the Pod's network namespace
- and, in configurations that share it, the Pod's PID namespace
Think of it as the object that says:
"This Pod now has a place where shared namespaces live. Other containers can join this place."
Without that anchor, it would be much harder to give multiple containers in the same Pod one shared network identity.
Namespace anchor, created first, dies last
The pause container is created first because other containers need something to join.
It dies last because tearing it down too early would collapse the shared Pod sandbox underneath the app containers.
That is why the Pod sandbox and the app containers are not the same conceptual thing.
PID 1 and zombie reaping
You will often hear that the pause container is "PID 1." The precise details depend on whether the Pod shares a PID namespace, but the intuition is still useful:
- there is a first process in the sandbox context
- that process anchors the shared environment
- if PID namespace sharing is enabled, it can also play the familiar PID 1 role of reaping zombie processes
The main takeaway is not the exact host PID value. It is that the sandbox must have a stable lifetime independent of any single app container.
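Zombie reaping itself is ordinary POSIX behavior, and you can see it with a few lines of Python on Linux. This is a minimal illustration of what a PID-1-like process must do, not pause's actual C source:

```python
import os
import time

# A parent acting as PID 1 must wait() on dead children; until it
# does, an exited child lingers in the process table as a zombie.
pid = os.fork()
if pid == 0:
    os._exit(7)  # child exits immediately; it is a zombie until reaped

time.sleep(0.1)  # child has exited but has not been reaped yet
reaped, status = os.waitpid(pid, 0)  # the "reap" step
print(reaped == pid, os.waitstatus_to_exitcode(status))
```

pause's main loop is essentially "block forever, reap children on SIGCHLD" — which only matters inside the Pod when PID namespace sharing makes it the ancestor of the app processes.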
Network namespace setup
Once kubelet asks the runtime to create the Pod sandbox, networking becomes the first major setup task.
To be slightly more precise, kubelet initiates sandbox creation through the CRI runtime, and the runtime is the component that actually creates the sandbox and invokes the CNI plugin.
kubelet initiates the netns creation, then CNI is called
From a high-level operator perspective, the flow looks like this:
- kubelet says, "Create the Pod sandbox"
- the runtime creates the sandbox and its network namespace
- the runtime invokes the CNI plugin
- the CNI plugin wires host and Pod networking together
CNI plugin wires a veth pair into the bridge
The common mental model is:
- one end of a veth pair goes inside the Pod namespace as eth0
- the other end remains on the host side
- that host-side end is attached to a bridge or equivalent datapath
This gives the Pod a real Linux network interface inside its namespace.
IP address assigned from the node's PodCIDR
The Pod then receives an IP address, typically from the node's allocated Pod CIDR range.
That means each Pod gets its own IP, which is why Kubernetes networking feels different from classic Docker port-mapping mental models.
In Kubernetes, the default assumption is:
Pods are first-class IP endpoints.
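A sequential IPAM sketch makes the PodCIDR idea concrete. The CIDR below is a hypothetical value; real ranges come from the controller-manager's node CIDR allocation or the CNI IPAM plugin:

```python
import ipaddress

# Hypothetical PodCIDR assigned to one node.
pod_cidr = ipaddress.ip_network("10.244.1.0/24")

# Trivial sequential allocation: the first usable address is often
# reserved for the bridge/gateway, then each new Pod sandbox takes
# the next free IP. Real IPAM plugins also track releases and leases.
hosts = pod_cidr.hosts()
gateway = next(hosts)
first_pod_ip = next(hosts)
second_pod_ip = next(hosts)
print(gateway, first_pod_ip, second_pod_ip)
# → 10.244.1.1 10.244.1.2 10.244.1.3
```

Because every Pod gets a routable address from a node-scoped range, the rest of the cluster can reach it without port mapping.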
DNS: how /etc/resolv.conf is written
The sandbox also gets DNS configuration.
That is why, inside a Pod, /etc/resolv.conf usually points at cluster DNS and includes the Pod's search domains, allowing names like:
my-service, my-service.my-namespace, and my-service.my-namespace.svc.cluster.local
to resolve the way Kubernetes expects.
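The search-domain expansion that makes short names work can be simulated directly. The domains and the ndots:5 threshold below mirror a typical Pod resolv.conf; the function is a sketch of resolver behavior, not the resolver itself:

```python
# Typical Pod search list for a Pod in namespace "my-namespace".
search_domains = [
    "my-namespace.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
]
ndots = 5  # common value written by kubelet

def candidates(name: str) -> list[str]:
    # Names with >= ndots dots (or a trailing dot) are tried as-is;
    # shorter names try each search domain first.
    if name.count(".") >= ndots or name.endswith("."):
        return [name]
    return [f"{name}.{d}" for d in search_domains] + [name]

print(candidates("my-service")[0])
# → my-service.my-namespace.svc.cluster.local
```

This is also why short names resolve fastest for Services in the Pod's own namespace: the first candidate already hits.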
CNI call diagram
Init containers, if any
If the Pod defines init containers, Kubernetes runs them before the main app containers.
They execute sequentially and must complete successfully before the next phase begins.
This is useful for setup work like:
- waiting for dependencies
- rendering config files
- database migrations
- downloading assets
Conceptually, init containers are part of the Pod birth sequence because the Pod is not considered fully started until that preparation work has finished.
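The sequencing rule is simple enough to state as code. This is a toy model of the semantics (run in order, stop at first failure, app starts only after all succeed), not kubelet's implementation:

```python
# Each init "container" is modeled as a callable returning True on
# success (exit code 0). App containers start only after all pass.
def run_pod(init_steps, start_app) -> str:
    for step in init_steps:
        if not step():
            return "Init:Error"  # Pod stays in an init phase
    start_app()
    return "Running"

state = run_pod([lambda: True, lambda: True], lambda: None)
print(state)  # → Running

failed = run_pod([lambda: True, lambda: False], lambda: None)
print(failed)  # → Init:Error
```

A failed init container is retried according to the Pod's restartPolicy, which is why a stuck dependency shows up as a Pod parked in an Init status.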
App container startup
Now we get to the part most people think of as "starting the container." In reality, even this is several sub-steps.
1. Image pull
If the image is not already on the node, the runtime pulls it from a registry.
That includes:
- resolving the image reference
- authenticating if needed
- downloading image layers
2. Layer extraction
Container images are not one giant tarball of a root filesystem. They are usually layered.
The runtime unpacks those layers so they can become the read-only lower layers of the container filesystem.
3. overlay filesystem (overlayfs)
Most Linux container runtimes use overlayfs or an equivalent union filesystem idea.
This allows the container to see:
- shared read-only image layers underneath
- plus a thin writable layer on top
That is why multiple containers using the same image do not each need their own full copy of every file.
overlayfs layer diagram
The container experiences this as one merged root filesystem, even though underneath it is composed of multiple layers.
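The merge semantics are "first match wins, writable layer on top", which maps cleanly onto a chained lookup. This is a conceptual model of overlayfs path resolution, not the kernel code:

```python
from collections import ChainMap

# Read-only image layers (lower) plus one writable layer (upper).
# Paths and contents are illustrative.
lower1 = {"/bin/sh": "image-layer-1", "/etc/os-release": "image-layer-1"}
lower2 = {"/etc/os-release": "image-layer-2"}  # newer layer shadows older
upper = {"/app/data.txt": "container-write",
         "/etc/os-release": "modified-by-container"}

# Lookup order mirrors overlayfs: upperdir first, then lowers
# top-to-bottom. The container sees one merged tree.
merged = ChainMap(upper, lower2, lower1)
print(merged["/bin/sh"], merged["/etc/os-release"])
# → image-layer-1 modified-by-container
```

Ten containers from the same image share the lower dicts; only the small upper map is per-container. That is the whole space-saving trick.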
4. cgroups applied
Before or during container creation, the runtime ensures the appropriate cgroups exist for the Pod and container.
This is where Kubernetes resource policy turns into actual kernel-enforced controls.
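The translation from Pod resources to cgroup v2 values is mostly arithmetic. The functions below are a simplified sketch of that math (the real logic lives in kubelet's container manager and handles bursting, shares, and QoS tiers):

```python
# 500m CPU means 500 millicores: the quota is the fraction of each
# scheduling period the cgroup may consume.
def cpu_max(cpu_limit_millicores: int, period_us: int = 100_000) -> str:
    quota_us = cpu_limit_millicores * period_us // 1000
    return f"{quota_us} {period_us}"  # format written to cpu.max

def memory_max(limit_mib: int) -> str:
    return str(limit_mib * 1024 * 1024)  # bytes written to memory.max

print(cpu_max(500))     # → 50000 100000  (half a core per period)
print(memory_max(128))  # → 134217728    (128Mi limit)
```

Once these files are written under the Pod's cgroup, the kernel, not Kubernetes, enforces throttling and OOM kills.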
5. runc forks the process into the namespaces
Finally, the runtime delegates to runc or another OCI runtime.
That runtime sets up:
- namespaces
- mounts
- cgroups
- environment
- capabilities
- working directory
- entrypoint and arguments
and then launches the real process.
At that moment, the container becomes what it has always been underneath the abstraction:
a Linux process running under carefully prepared constraints.
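The very last step, stripped of namespaces and cgroups, is the classic fork-then-exec pattern. This sketch shows only that essence, using a subprocess and echo as a stand-in entrypoint (runc additionally enters namespaces, applies cgroups, and drops capabilities before exec):

```python
import subprocess
import sys

# The "child" sets cwd and environment, then execs the entrypoint,
# replacing itself with the real workload process.
child_code = (
    "import os\n"
    "os.chdir('/')\n"
    "os.environ['POD_NAME'] = 'demo'\n"
    "os.execvp('echo', ['echo', 'hello from pid', str(os.getpid())])\n"
)
out = subprocess.run([sys.executable, "-c", child_code],
                     capture_output=True, text=True)
print(out.stdout.strip())
```

After execvp, nothing of the setup code remains in memory — the PID now simply is the application, which is why "container" ultimately means "process plus prepared constraints".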
Pod termination mirror path
Part 3 is mostly about birth, but operational reasoning gets much easier if you also know the reverse sequence.
When a Pod is terminated, the high-level path is usually:
- Pod deletion is requested
- endpoint updates begin removing it from Service backends
- preStop hooks run first, if defined
- kubelet sends SIGTERM to container processes
- the graceful shutdown window (terminationGracePeriodSeconds) is honored
- any remaining processes receive SIGKILL
- sandbox and network resources are torn down
The same node layers are involved, just in reverse order.
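The SIGTERM-grace-SIGKILL sequence can be mimicked with a plain subprocess. This is a toy illustration of the timing contract, not kubelet code; the 2-second grace period stands in for terminationGracePeriodSeconds:

```python
import signal
import subprocess
import sys

# A stand-in workload that never exits on its own.
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
grace_seconds = 2

proc.send_signal(signal.SIGTERM)      # polite request to shut down
try:
    proc.wait(timeout=grace_seconds)  # the grace window
except subprocess.TimeoutExpired:
    proc.kill()                       # SIGKILL: no more negotiating
    proc.wait()

print(proc.returncode)  # → -15 (terminated by SIGTERM within grace)
```

A process that ignores SIGTERM would hit the except branch and exit with -9 instead — exactly the "force-killed" failure mode described below.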
Why this symmetry matters
Many production issues are not startup failures but shutdown timing failures:
- process exits too slowly and gets force-killed
- readiness drops too late and traffic arrives during shutdown
- hooks do too much work and exceed grace windows
Understanding startup without termination leaves an important half of Pod lifecycle reasoning missing.
Where storage joins this flow
Networking is the most visible part of sandbox setup, but storage joins the Pod lifecycle too.
During startup, kubelet and the runtime also coordinate storage work such as:
- asking the node-side volume manager to prepare required volumes
- invoking CSI-driven mount work for persistent volumes when needed
- bind-mounting prepared volume paths into the Pod sandbox and containers
- container writable layer creation on top of image layers
- ephemeral storage accounting for logs, writable layers, and emptyDir
That means Pod startup is not only about "network first, process second." It is also about making sure the container sees the right filesystem paths before the application begins.
This is why DiskPressure and startup behavior are often connected in real clusters.
We are intentionally not going deep on CSI in this part, because it deserves its own mental model. If networking gives the Pod an IP address, storage gives it a usable directory tree backed by something real. That full path is covered in Part 5 — CSI, volumes, and mounts on the node.
kube-proxy's role after pod start
The Pod is now running, but the cluster still needs to learn how to send traffic to it.
If the Pod matches a Service selector and is ready, it becomes part of that Service's backend set.
That causes endpoint data to update, and kube-proxy on nodes can refresh its local routing rules.
Endpoint added, iptables or IPVS rules updated
Once the endpoint is visible, kube-proxy updates the dataplane so Service traffic can reach the new Pod.
That may mean new iptables or ipvs entries, depending on cluster mode.
Service ClusterIP now routes to the Pod
This is the point where traffic sent to the Service can begin reaching the newly started backend.
So Pod birth is not complete merely when the process exists. It is complete when:
- the process is running
- the Pod is network-reachable
- readiness is satisfied
- service routing knows about it
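The endpoint-to-dataplane relationship reduces to a lookup table that kube-proxy keeps in sync. This is a toy model with hypothetical addresses, not iptables or IPVS:

```python
import random

# ClusterIP -> current ready backends. kube-proxy's job is keeping
# this mapping (rendered as iptables/IPVS rules) current.
endpoints = {"10.96.0.10": ["10.244.1.2"]}

def route(cluster_ip: str):
    backends = endpoints.get(cluster_ip, [])
    return random.choice(backends) if backends else None

# The newly started Pod becomes ready and its endpoint is added;
# from that moment it is eligible to receive Service traffic.
endpoints["10.96.0.10"].append("10.244.1.3")
print(route("10.96.0.10") in {"10.244.1.2", "10.244.1.3"})  # → True
```

The readiness gate is the important subtlety: the process can be running for seconds before its endpoint appears, and until then no Service traffic reaches it.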
Status reported back to the API server
Throughout the process, kubelet keeps reporting status upward.
That includes transitions such as:
- image pulling
- container creating
- running
- probe failures if any
- restart states
This is what eventually surfaces to you through kubectl get pods and kubectl describe pod.
In other words, the API server is not watching the process directly. The kubelet is the node-local observer that reports what happened.
Startup latency hotspots
If Pod startup feels slow, the delay is usually concentrated in a few stages:
- large image pulls or slow registry auth
- CNI setup and IP allocation latency
- heavy init container work
- probe configuration that delays readiness transition
This gives you a practical triage order before diving into low-level logs.
Key diagram — full birth sequence timeline
Here is the full sequence in one numbered flow.
1. You submit pod.yaml to the API server.
2. The scheduler chooses a node. The Pod is now assigned to that node.
3. The kubelet on that node sees the new Pod in its watch loop.
4. Kubelet asks the CRI runtime to create a Pod sandbox.
5. The runtime creates the pause container and sandbox namespaces.
6. The runtime invokes the CNI plugin to configure networking.
7. Volumes and DNS configuration are prepared.
8. Init containers run, one by one, if defined.
9. App images are pulled and unpacked; overlayfs layers are prepared.
10. runc launches the app process with the right namespaces and cgroups.
11. Kubelet reports status, and kube-proxy begins routing Service traffic once endpoints are ready.
If you want the shortest accurate summary of Pod creation, it is this:
kubelet asks the runtime to create a sandbox, wire networking, start containers, and then reports the results back to the control plane.
Concepts introduced
CNI spec — the interface contract
CNI stands for Container Network Interface.
Like CRI, it is a contract. It says, in effect:
"Given a container or sandbox network namespace, here is how a plugin should attach networking and report the result."
This is why Kubernetes can work with many networking implementations while keeping a consistent mental model.
overlayfs — lowerdir, upperdir, workdir
The image layers live in the read-only lowerdir stack.
The container's own writable changes go to upperdir.
The filesystem driver uses workdir internally to manage the merged view.
That is the filesystem reason containers feel lightweight.
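These three directories map directly onto the option string a runtime passes to the overlay mount. The helper below just builds that string (the paths are illustrative; an actual mount needs root and a real kernel overlay filesystem):

```python
# Build the option string overlayfs expects. lowerdir entries are
# colon-separated and ordered top layer first.
def overlay_options(lowers: list[str], upper: str, work: str) -> str:
    return f"lowerdir={':'.join(lowers)},upperdir={upper},workdir={work}"

opts = overlay_options(
    ["/var/lib/layers/l2", "/var/lib/layers/l1"],  # image layers
    "/var/lib/upper",  # container's writable layer
    "/var/lib/work",   # kernel scratch space for the merge
)
print(opts)
```

On a node you would see strings shaped like this in `mount` output for every running container, one merged mount per container rootfs.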
Pod sandbox vs app container distinction
This is one of the most useful conceptual distinctions in Kubernetes internals.
| Object | Purpose |
|---|---|
| Pod sandbox | Holds the shared Pod environment, especially networking |
| App container | Runs the application process inside that prepared environment |
The sandbox comes first. The app containers join it.
Once that distinction is clear, the pause container stops feeling weird and starts feeling necessary.
Final mental model
A Pod is not born in one step.
It is born through a chain of lower-level operations:
- watch
- queue
- sandbox
- networking
- filesystem setup
- cgroups
- process launch
- endpoint registration
- status reporting
That is what happened in the few hundred milliseconds between your YAML hitting the API server and your application logging its first line.
In Part 4, we will look at the other side of node life: not how a Pod starts, but how the node stays healthy, proves it is alive, handles pressure, and decides what to evict when things go wrong.
Storage deep dive: Part 5 — CSI, volumes, and mounts on the node