Graceful gRPC Server Shutdowns Done Right
A production guide on shutting down gRPC servers safely
Most shutdown bugs never show up in happy-path testing.
They appear during rolling deploys, node drains, spot interruptions, autoscaling churn, or the one bad morning when a service is already under pressure and Kubernetes starts moving pods around. That is when you discover whether your server exits like a well-behaved distributed system participant, or like a process that just vanished mid-conversation.
For gRPC services, the shutdown path matters even more than it does for typical REST APIs. HTTP/2 connections are long-lived, streams can stay open for a very long time, and a single TCP connection may carry a large number of in-flight RPCs. If you get termination wrong, clients do not just see a small blip. They see UNAVAILABLE, hanging streams, reset connections, or a wave of retries at exactly the wrong time.
This post walks through graceful shutdown from the infrastructure layer all the way to idiomatic Go implementation.
1. The anatomy of a termination: from hypervisor to your process
A pod delete is not an instant kill. It is a coordinated teardown across several layers.
The chain of command
At a high level, the shutdown path looks like this:
- Hypervisor / node lifecycle event: a node may be drained, preempted, upgraded, or simply host a pod that is being replaced during rollout.
- Kubernetes control plane: the pod gets a `deletionTimestamp` and enters `Terminating`.
- Kubelet on the node: kubelet notices the pod should stop and begins termination handling.
- Container runtime (CRI): containerd or another runtime delivers the stop signal to the container's main process.
- Your process: your Go binary, usually running as PID 1, receives `SIGTERM`.
That last detail matters more than many teams expect: if your binary is hidden behind a shell wrapper and signals never reach the real server process, graceful shutdown logic will never run. This is one reason exec-form container entrypoints are preferred.
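As a concrete illustration (binary path and flag are hypothetical), the difference is visible in the Dockerfile: with the shell form, `/bin/sh` becomes PID 1 and may never forward `SIGTERM` to the server, while the exec form makes your binary PID 1 and the direct recipient of the signal:

```dockerfile
# Shell form: /bin/sh is PID 1; SIGTERM may stop at the shell, not the server.
# ENTRYPOINT /app/server --port=9090

# Exec form: the Go binary is PID 1 and receives SIGTERM directly.
ENTRYPOINT ["/app/server", "--port=9090"]
```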
SIGTERM is a polite request, not a kill
When Kubernetes decides your pod should stop, the first important signal is usually SIGTERM.
That signal does not mean the pod is gone. It means the shutdown budget has started.
The budget is controlled by terminationGracePeriodSeconds, which defaults to 30 seconds for many deployments. Inside that window, your job is to:
- stop taking new work
- let in-flight work finish
- close dependencies cleanly
- exit before the deadline
If you do not finish in time, kubelet escalates to SIGKILL, and at that point there is no negotiation left.
What happens on the wire: FIN vs RST
This is where graceful shutdown stops being an application concern and becomes a networking concern.
If your process closes connections cleanly, TCP performs an orderly shutdown using a FIN exchange. From the client side, this is the "normal" close path. The peer is saying: I am done sending data; finish what remains and close cleanly.
If your process dies abruptly, the client often experiences the equivalent of a reset path instead. In practice that means a sudden RST, connection reset by peer, transport is closing, or a generic gRPC UNAVAILABLE depending on timing.
That difference matters:
- FIN path: clients can finish reads, observe a clean close, and reconnect with less chaos.
- RST path: in-flight RPCs fail immediately and retries pile up fast.
If you have ever seen a deployment create a sharp but short-lived spike in gRPC client errors, this is often the layer where the story starts.
The race condition most people meet in production
Kubernetes termination has a subtle race:
- the pod starts terminating
- the pod is removed from `EndpointSlice` / Service backends
- kube-proxy, ingress, service mesh sidecars, and upstream clients gradually observe that change
- at the same time, your process receives `SIGTERM`
These events are related, but they are not perfectly synchronized.
That means there is a short window where:
- your app has already decided to shut down
- but some clients or proxies still believe the pod is a valid target
This is why a small preStop delay is common in production. It buys time for endpoint and load balancer state to converge before your process actually disappears.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-grpc
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: app
          image: ghcr.io/acme/payments:1.42.0
          ports:
            - containerPort: 9090
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]
          readinessProbe:
            grpc:
              port: 9090
            periodSeconds: 5
```

Two important nuances:
- that `sleep 5` is not business logic; it is control-plane convergence time
- the `preStop` delay consumes your grace period budget
So if terminationGracePeriodSeconds is 30 and preStop sleeps for 5, your app effectively has about 25 seconds left to drain.
Termination timeline
| Time | Layer | What happens | Why it matters |
|---|---|---|---|
| t0 | API server | Pod gets deletionTimestamp | The pod is now terminating, but not dead yet |
| t0 + a few ms | kubelet | preStop hook runs | Gives the network path time to stop routing new traffic |
| t0 + hook end | CRI / process | SIGTERM reaches your app | Your graceful shutdown code must begin immediately |
| t0 + seconds | EndpointSlice / proxies / mesh | Traffic gradually drains away | There may still be some late arrivals |
| t0 + grace timeout | kubelet | SIGKILL if app is still alive | Any unfinished work is cut off |
2. Graceful shutdown: the GracefulStop() protocol
In grpc-go, there are two very different ways to stop a server.
server.Stop(): the hard stop
server.Stop() is immediate.
- listeners are closed
- active transports are closed
- in-flight RPCs are terminated
This is the right choice only when you have already exhausted your grace budget, or when you are intentionally choosing fail-fast behavior over waiting.
Think of Stop() as the emergency brake.
server.GracefulStop(): the drain path
server.GracefulStop() is the shutdown path you want most of the time.
Its behavior is roughly:
- stop accepting new connections and new RPCs
- let already-running RPCs continue
- wait until the active RPC set reaches zero
- then fully stop the server
Operationally, this is closer to what you want during a rollout: the server becomes unavailable for new work, but it tries hard not to punish the work already in progress.
The trap: GracefulStop() can hang forever
There is one sharp edge that bites many teams.
GracefulStop() has no timeout parameter.
If you have a streaming RPC that stays open indefinitely, the call can block forever. That might happen with:
- server-streaming subscriptions
- bidirectional streams used for agent connections
- long-lived watch APIs
- clients that never close properly after the server marked itself unavailable
So a production-safe pattern is not just:
`grpcServer.GracefulStop()`

It is:
```go
drained := make(chan struct{})
go func() {
	defer close(drained)
	grpcServer.GracefulStop()
}()

select {
case <-drained:
	// all active RPCs finished in time
case <-time.After(25 * time.Second):
	// grace budget exhausted
	grpcServer.Stop()
}
```

The key idea is simple: try graceful first, then force the issue before Kubernetes does it for you.
3. Resource and connection lifecycle: avoiding the leak
Getting the gRPC server shutdown right is necessary, but it is not sufficient.
Most real services are not just a socket listener. They also own worker pools, queue consumers, database handles, tracing exporters, caches, and background reconciliation loops. A clean process exit requires all of them to wind down coherently.
The "zombie" connection problem
gRPC uses HTTP/2, and HTTP/2 connections are intentionally sticky.
That is usually good for performance:
- one TCP connection can multiplex many RPCs
- connection setup cost is amortized
- latency is lower once channels are warm
But during shutdown, stickiness becomes a liability.
With classic REST intuition, people often assume "remove the pod from the load balancer" means the next request goes elsewhere. That is mostly true for short-lived HTTP/1.1 patterns.
With gRPC, a client may already have a warm HTTP/2 connection to the pod. Even after the pod is removed from service discovery, that existing connection can keep sending RPCs until one side closes it or the client re-resolves and reconnects.
That is why graceful shutdown is really a connection lifecycle problem, not just a process lifecycle problem.
It is not just the gRPC server
When shutdown begins, think in layers:
- gRPC server: stop taking new RPCs
- message consumers: stop pulling new work from Kafka, SQS, Pub/Sub, RabbitMQ, etc.
- background workers: stop scheduling new jobs
- database pool: close only after active handlers are done with it
- observability exporters: flush metrics, traces, and logs before exit
Closing these in the wrong order creates artificial failures. A common anti-pattern is:
- receive `SIGTERM`
- close database pool immediately
- let active RPC handlers continue running
Now your handlers fail not because the client canceled, but because you pulled the floor out from under them.
The drain pattern
The safest mental model is:
- stop admitting new work
- let current work finish
- close shared dependencies
- exit
For queue consumers, that usually means stop polling new messages first. Then let the currently claimed messages finish processing. Only after the worker pool drains should the process exit.
If you share infrastructure between the gRPC handlers and background consumers, they need a coordinated stop signal, usually a context.Context or a channel.
4. The implementation: Go, channels, and contexts
Here is the core shape of an idiomatic Go shutdown path:
- listen for `SIGTERM` and `SIGINT`
- mark the service as not ready
- stop background consumers from taking new work
- start `GracefulStop()`
- wait for graceful drain or timeout
- force `Stop()` if needed
- close the remaining resources
The high-level backbone looks like this:
```go
stop := make(chan os.Signal, 1)
signal.Notify(stop, syscall.SIGTERM, syscall.SIGINT)
<-stop

ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
defer cancel()

go func() {
	s.GracefulStop()
	cancel()
}()

<-ctx.Done()
```

In production, you usually want a bit more coordination around health state, background workers, and forced fallback. Here is a more complete example.
```go
package main

import (
	"context"
	"errors"
	"log/slog"
	"net"
	"os"
	"os/signal"
	"sync"
	"syscall"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	grpc_health_v1 "google.golang.org/grpc/health/grpc_health_v1"
	"google.golang.org/grpc/keepalive"
)

type App struct {
	grpcServer   *grpc.Server
	healthServer *health.Server
	stopWorkers  context.CancelFunc
	workerWG     sync.WaitGroup
	closers      []func() error
	logger       *slog.Logger
}

func (a *App) Shutdown(timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// 1. Fail readiness first so new traffic stops arriving.
	a.healthServer.SetServingStatus(
		"",
		grpc_health_v1.HealthCheckResponse_NOT_SERVING,
	)

	// 2. Stop background consumers from taking new work.
	a.stopWorkers()

	// 3. Drain active gRPC RPCs.
	grpcDrained := make(chan struct{})
	go func() {
		defer close(grpcDrained)
		a.grpcServer.GracefulStop()
	}()

	select {
	case <-grpcDrained:
		a.logger.Info("gRPC server drained cleanly")
	case <-ctx.Done():
		a.logger.Warn("grace period exhausted, forcing gRPC stop")
		a.grpcServer.Stop()
	}

	// 4. Wait for background workers to finish what they already pulled.
	workersDone := make(chan struct{})
	go func() {
		defer close(workersDone)
		a.workerWG.Wait()
	}()

	select {
	case <-workersDone:
		a.logger.Info("background workers drained")
	case <-ctx.Done():
		a.logger.Warn("worker drain timed out")
	}

	// 5. Close remaining dependencies.
	var errs []error
	for _, closeFn := range a.closers {
		if err := closeFn(); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}

func runConsumer(ctx context.Context, wg *sync.WaitGroup, logger *slog.Logger) {
	defer wg.Done()
	for {
		select {
		case <-ctx.Done():
			logger.Info("consumer stopped pulling new work")
			return
		default:
			// Poll queue, process one message, ack/nack, then repeat.
			// The important part is that once ctx is canceled, this loop
			// should stop claiming additional work.
			time.Sleep(500 * time.Millisecond)
		}
	}
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	workerCtx, stopWorkers := context.WithCancel(context.Background())

	healthServer := health.NewServer()
	healthServer.SetServingStatus(
		"",
		grpc_health_v1.HealthCheckResponse_SERVING,
	)

	grpcServer := grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			MaxConnectionAge:      5 * time.Minute,
			MaxConnectionAgeGrace: 30 * time.Second,
		}),
	)
	grpc_health_v1.RegisterHealthServer(grpcServer, healthServer)
	// Register your application services here.

	lis, err := net.Listen("tcp", ":9090")
	if err != nil {
		logger.Error("listen failed", "err", err)
		os.Exit(1)
	}

	app := &App{
		grpcServer:   grpcServer,
		healthServer: healthServer,
		stopWorkers:  stopWorkers,
		logger:       logger,
		closers: []func() error{
			// db.Close,
			// kafkaConsumer.Close,
			// func() error { return tracerProvider.Shutdown(context.Background()) },
		},
	}

	app.workerWG.Add(1)
	go runConsumer(workerCtx, &app.workerWG, logger)

	go func() {
		if err := grpcServer.Serve(lis); err != nil {
			logger.Error("gRPC server exited", "err", err)
		}
	}()

	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, syscall.SIGINT)
	defer signal.Stop(stop)
	<-stop

	if err := app.Shutdown(25 * time.Second); err != nil {
		logger.Error("shutdown finished with errors", "err", err)
	}
}
```

The important thing in that sample is not any one API. It is the ordering.
If your system has extra moving pieces, model them explicitly. Do not assume exiting the main process will somehow clean everything up in the right sequence.
5. Metrics, health, and observability
Shutdown quality is much easier to reason about when the service is instrumented for it.
Use the gRPC health check protocol
Do not settle for "the TCP port is still open, so the service must be healthy".
That signal is too weak for gRPC workloads.
What you want instead is the standard grpc.health.v1 protocol. It lets your service say something much more useful than "a socket exists": it tells the platform and other systems whether the server is actually ready to serve traffic.
In grpc-go, this is straightforward:
```go
healthServer := health.NewServer()
grpc_health_v1.RegisterHealthServer(grpcServer, healthServer)
healthServer.SetServingStatus("", grpc_health_v1.HealthCheckResponse_SERVING)

// During shutdown:
healthServer.SetServingStatus("", grpc_health_v1.HealthCheckResponse_NOT_SERVING)
```

That readiness flip should happen before `GracefulStop()`. This is how you stop fresh traffic from being admitted while allowing the old traffic to drain.
Also keep liveness and readiness conceptually separate. A downstream dependency wobble may justify readiness going false; it usually should not make Kubernetes kill the process immediately.
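One way to encode that separation (the `readiness` and `liveness` service names here are assumptions; each would be registered separately on the health server via `SetServingStatus`) is to point the two probes at different health-check services:

```yaml
# Readiness tracks "can this pod take traffic right now" and flips to
# NOT_SERVING during shutdown or a dependency wobble.
readinessProbe:
  grpc:
    port: 9090
    service: "readiness"
  periodSeconds: 5
# Liveness only asks "is the process itself responsive", so a transient
# dependency failure does not get the pod killed.
livenessProbe:
  grpc:
    port: 9090
    service: "liveness"
  periodSeconds: 10
  failureThreshold: 3
```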
Track in-flight RPCs during shutdown
Interceptors are the easiest place to attach observability.
At minimum, track:
- current in-flight unary RPC count
- current in-flight streaming RPC count
- shutdown start timestamp
- number of requests still running when shutdown began
- forced stop count after timeout
Prometheus middleware or OpenTelemetry interceptors make this cheap to add. These metrics are especially useful during rollouts because they answer the question, "Are we actually draining, or are we just waiting?"
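The mechanics are just a counter wrapped around each handler. This stdlib-only sketch shows the shape; in a real service the `track` wrapper would live inside a `grpc.UnaryServerInterceptor` (or stream interceptor) and the counter would back a Prometheus gauge:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// inFlight counts currently running handlers. During shutdown, watching this
// value tells you whether the drain is progressing or stuck.
var inFlight atomic.Int64

// track wraps a handler the way an interceptor would: increment on entry,
// decrement on exit, even if the handler panics.
func track(handler func()) {
	inFlight.Add(1)
	defer inFlight.Add(-1)
	handler()
}

func main() {
	started := make(chan struct{})
	release := make(chan struct{})
	finished := make(chan struct{})

	go func() {
		track(func() {
			close(started)
			<-release // hold the simulated RPC open so we can observe the gauge
		})
		close(finished)
	}()

	<-started
	fmt.Println("in flight during call:", inFlight.Load()) // prints 1
	close(release)
	<-finished
	fmt.Println("in flight after drain:", inFlight.Load()) // prints 0
}
```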
The last-gasp log problem
One more shutdown bug lives in the observability layer itself.
Your app may log "shutdown complete" and still lose that line if the logger, tracing exporter, or sidecar driver buffers output and the process exits too quickly afterward.
If you use structured logging with a buffered sink, or OpenTelemetry exporters, explicitly call their flush or shutdown hooks near the end of termination. Otherwise the final and often most useful logs vanish with the container.
Failure modes at a glance
| Symptom during rollout | Likely cause | Usually the fix |
|---|---|---|
| Short spike of UNAVAILABLE or reset errors | Process exited abruptly or Stop() used too early | Prefer GracefulStop(), add timeout wrapper, avoid immediate hard stop |
| Pod keeps hanging in Terminating | Long-lived streaming RPC blocked GracefulStop() | Add a shutdown deadline and call Stop() as fallback |
| New traffic still hits the pod after shutdown starts | Endpoint propagation lag | Flip readiness early and add a small preStop delay |
| Active RPCs fail with DB or queue errors during drain | Dependencies closed before handlers finished | Reorder shutdown: drain first, close shared resources last |
| Rollouts create uneven traffic distribution | Sticky HTTP/2 channels pin clients to old backends | Use connection age limits or client-side balancing |
| Process exits but memory / goroutines keep leaking in tests | Background goroutines ignored cancellation | Wire all workers to a context or done channel and assert cleanup |
6. Pro tips: the hidden details that matter later
These are the details teams usually learn only after operating gRPC services for a while.
Keepalives and MaxConnectionAge
One subtle reason old pods keep serving traffic is that healthy HTTP/2 connections can live for a very long time.
Setting MaxConnectionAge on the server side helps. It periodically nudges clients off long-lived connections by sending GOAWAY, which encourages them to reconnect and refresh service discovery.
In practice, this reduces the number of extremely stale connections you carry during rollouts or node movement.
```go
grpcServer := grpc.NewServer(
	grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge:      5 * time.Minute,
		MaxConnectionAgeGrace: 30 * time.Second,
	}),
)
```

This is not a substitute for graceful shutdown. It is a way to make the connection pool healthier even before shutdown begins.
Headless Services vs ClusterIP for gRPC balancing
This is one of the most misunderstood gRPC-on-Kubernetes topics.
With ClusterIP, a client often ends up with one long-lived HTTP/2 connection, and all multiplexed RPCs ride that connection. In effect, load balancing may happen only when the connection is created. That can produce surprisingly sticky backend selection.
With a Headless Service, the client can resolve individual pod IPs and, if configured with a proper client-side balancer such as round_robin or xDS, distribute load across multiple backend connections more intentionally.
The practical takeaway is not that headless is always better. It is that gRPC load balancing happens at the connection/channel layer, not the per-request layer most people expect from REST.
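In grpc-go, a client typically opts into per-connection balancing by resolving the headless Service through DNS and enabling `round_robin` via the channel's service config. A minimal service config fragment looks like this:

```json
{
  "loadBalancingConfig": [
    { "round_robin": {} }
  ]
}
```

This JSON is passed with `grpc.WithDefaultServiceConfig(...)` while dialing a `dns:///` target pointing at the headless Service (the hostname is whatever your Service resolves to), so the channel opens one subconnection per resolved pod IP instead of pinning everything to a single backend.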
Test for leaking goroutines
Local shutdown tests should verify more than "the process exited".
They should verify that the internal concurrency structure actually unwound.
One very simple check is to compare goroutine counts before and after a test shutdown sequence.
```go
before := runtime.NumGoroutine()

// start server, workers, background loops
// trigger shutdown

time.Sleep(200 * time.Millisecond)
after := runtime.NumGoroutine()
if after > before+2 {
	t.Fatalf("possible goroutine leak: before=%d after=%d", before, after)
}
```

This will not catch every issue, but it is a surprisingly effective early warning for workers that never listened to cancellation.
Budget your grace period backwards
A useful rule of thumb is to allocate the grace window deliberately:
- `5s` for endpoint propagation and traffic drain initiation
- `15-20s` for active RPC completion
- `5s` for forced fallback and final cleanup
The exact numbers depend on your workload, but the mindset is important: do not pick terminationGracePeriodSeconds arbitrarily. Pick it based on the longest legitimate in-flight work you are willing to honor.
Production checklist
Before calling your shutdown story production-ready, make sure the following are true:
- your binary reliably receives `SIGTERM`
- readiness flips to `NOT_SERVING` before drain begins
- `preStop` exists if your mesh / load balancer needs propagation time
- `GracefulStop()` is wrapped with a hard timeout
- long-lived streams have a shutdown strategy
- background consumers stop taking new work on cancellation
- shared dependencies close after handlers and workers drain
- metrics expose in-flight requests during shutdown
- log / trace exporters flush before process exit
- rollout tests confirm there are no goroutine leaks
Graceful shutdown is one of those engineering details that looks boring until the day it saves a rollout.
And in distributed systems, the difference between "boring" and "painful" is usually just whether you handled termination as a first-class protocol instead of an afterthought.