In a rapidly growing company, the end of each sprint is always a great time to ship new features. As engineers, we often focus on whether the feature works, without considering how the existing app version shuts down and how the new one starts up without causing any issues. During this regular process, the service may still be handling in-flight requests, and those requests will fail if the application does not shut down gracefully.
Entering the Exit
Imagine you're running an online booking service in a Kubernetes environment, where users reserve hotels, flights, or rentals. Your system processes thousands of transactions per minute, ensuring seamless experiences for travelers. Now, your team just rolled out a new feature, and Kubernetes is gradually replacing old pods with updated ones. Meanwhile, an instance handling a user's hotel reservation is terminated mid-request before confirming the booking. The request is dropped, leading to a failed reservation or, worse, an inconsistent state where the system charges the user but fails to secure the booking.
Scaling presents another challenge. Suppose traffic suddenly spikes during a flash sale: Kubernetes rapidly provisions new pods to handle the demand. When traffic unexpectedly drops, Kubernetes also scales down by terminating idle pods. If a pod processing a last-minute flight reservation is shut down mid-transaction, the seat might be double-booked or lost entirely. Similarly, if a rental car booking is interrupted, a user might receive a confirmation without an actual reservation. Without proper handling, these failures can break user trust and cause serious inconsistencies.
This is where graceful shutdown becomes essential. When Kubernetes terminates a pod, it first sends a SIGTERM signal, allowing the application to finish active requests, close database connections, and clean up resources. If the process doesn't exit within the grace period (terminationGracePeriodSeconds, 30 seconds by default), Kubernetes forcefully kills it with SIGKILL. Handling SIGTERM is therefore necessary to ensure the application terminates gracefully, preventing dropped requests during deployments or sudden traffic shifts.
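To make the idea concrete before we build a reusable helper in the next section, here is a minimal, self-contained sketch (not the implementation we develop below) that catches SIGTERM with the standard library's signal.NotifyContext and drains in-flight requests with http.Server.Shutdown:
package main

import (
    "context"
    "net/http"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    // ctx is cancelled as soon as SIGTERM or SIGINT is received.
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
    defer stop()

    server := &http.Server{Addr: ":8080"}
    go func() {
        _ = server.ListenAndServe() // error handling omitted for brevity
    }()

    <-ctx.Done() // block until a termination signal arrives

    // Give in-flight requests up to ten seconds to finish.
    shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    _ = server.Shutdown(shutdownCtx)
}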
Crafting the Exit
Let's take a deeper look at how we can implement graceful shutdown in Go.
First, we define an OnShutdown struct, which pairs a resource name with a function to execute during the shutdown process. Each cleanup function receives a context.Context so it can respect the shutdown timeout.
type OnShutdown struct {
    Resource string
    Fn       func(ctx context.Context) error
}
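For instance, a cleanup task for a hypothetical worker pool (pool and its Drain method are illustrative, not part of the code in this article) could look like this:
stopWorkers := OnShutdown{
    Resource: "worker-pool",
    Fn: func(ctx context.Context) error {
        // Stop accepting new jobs and wait for in-flight ones,
        // giving up when the shutdown deadline expires.
        return pool.Drain(ctx) // hypothetical API
    },
}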
Next, we create an Option struct to hold configuration details such as the server address, timeout duration, HTTP handler, and the list of shutdown tasks.
type Option struct {
    Address     string
    Timeout     time.Duration
    Handler     http.Handler
    OnShutdowns []OnShutdown
}
Now we set up the server and ensure it starts with sensible defaults if no values are provided. We also prepare a context that will allow us to handle shutdown signals gracefully.
// Imports used here; later snippets also need "os/signal", "sync", and "syscall".
import (
    "context"
    "errors"
    "fmt"
    "log/slog"
    "net/http"
    "os"
    "time"
)

func Start(option Option) error {
    if option.Address == "" {
        option.Address = "0.0.0.0:8081"
    }
    if option.Timeout == 0 {
        option.Timeout = 10 * time.Second
    }
    server := &http.Server{
        Addr:    option.Address,
        Handler: option.Handler,
    }
    // serverStopCtx is invoked by the shutdown goroutine added in the next step.
    serverCtx, serverStopCtx := context.WithCancel(context.Background())
    _ = serverStopCtx
    slog.InfoContext(serverCtx, fmt.Sprintf("starting server at %s", option.Address))
    // ListenAndServe blocks until Shutdown is called, then returns
    // http.ErrServerClosed, which we treat as a normal exit rather than a failure.
    if err := server.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
        slog.Error(fmt.Sprintf("failed to start server: %s", err.Error()))
        os.Exit(1)
    }
    // Block until the shutdown goroutine signals that cleanup has finished.
    <-serverCtx.Done()
    return nil
}
Moving on to handling termination signals gracefully, we set up a signal listener that triggers the shutdown process when the server receives an interrupt or termination signal. The channel is buffered with capacity one because signal.Notify does not block when delivering a signal, so an unbuffered channel could miss one.
func Start(option Option) error {
    ...
    server := &http.Server{
        Addr:    option.Address,
        Handler: option.Handler,
    }
    serverCtx, serverStopCtx := context.WithCancel(context.Background())
    s := make(chan os.Signal, 1)
    signal.Notify(s, syscall.SIGHUP, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT)
    go func() {
        received := <-s
        slog.Info(fmt.Sprintf("received signal: %s. initiating graceful shutdown", received.String()))
        slog.Info("server shutdown gracefully")
        serverStopCtx()
    }()
    <-serverCtx.Done()
    ...
}
To prevent the server from hanging indefinitely during shutdown, we implement a timeout mechanism. If the shutdown takes too long, the server forcefully exits.
func Start(option Option) error {
    ...
    go func() {
        received := <-s
        slog.Info(fmt.Sprintf("received signal: %s. initiating graceful shutdown", received.String()))
        // Bound the entire shutdown procedure by the configured timeout.
        shutdownCtx, shutdownStopCtx := context.WithTimeout(serverCtx, option.Timeout)
        defer shutdownStopCtx()
        // Watchdog: if the deadline expires before cleanup finishes, exit immediately.
        go func() {
            <-shutdownCtx.Done()
            if errors.Is(shutdownCtx.Err(), context.DeadlineExceeded) {
                slog.Error(fmt.Sprintf("graceful shutdown timeout. forcing exit: %s", shutdownCtx.Err()))
                os.Exit(1)
            }
        }()
        slog.Info("server shutdown gracefully")
        serverStopCtx()
    }()
    ...
}
Finally, we execute any necessary cleanup functions before shutting down the server. These functions run concurrently using goroutines and a sync.WaitGroup, ensuring all resources are properly released before termination.
func Start(option Option) error {
    ...
    go func() {
        ...
        shutdownCtx, shutdownStopCtx := context.WithTimeout(serverCtx, option.Timeout)
        defer shutdownStopCtx()
        ...
        // Run every cleanup task concurrently and wait for all of them to finish.
        var wg sync.WaitGroup
        for _, onShutdown := range option.OnShutdowns {
            wg.Add(1)
            slog.Info(fmt.Sprintf("cleaning up %s", onShutdown.Resource))
            go func(shutdownCtx context.Context, onShutdown OnShutdown) {
                defer wg.Done()
                if err := onShutdown.Fn(shutdownCtx); err != nil {
                    slog.Error(fmt.Sprintf("failed to shutdown %s: %s", onShutdown.Resource, err.Error()))
                } else {
                    slog.Info(fmt.Sprintf("%s resource closed successfully", onShutdown.Resource))
                }
            }(shutdownCtx, onShutdown)
        }
        wg.Wait()
        // Stop accepting new connections and drain in-flight requests.
        if err := server.Shutdown(shutdownCtx); err != nil {
            slog.Error(fmt.Sprintf("failed to shutdown server gracefully: %s", err.Error()))
        }
        ...
    }()
    ...
}
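As a design note, the same fan-out-and-wait could be written with golang.org/x/sync/errgroup, which also surfaces the first error. Here is a sketch under the same Option type; this is an alternative, not the implementation above:
// Alternative to the WaitGroup loop, using golang.org/x/sync/errgroup.
var g errgroup.Group
for _, onShutdown := range option.OnShutdowns {
    onShutdown := onShutdown // capture the loop variable (needed before Go 1.22)
    g.Go(func() error {
        slog.Info(fmt.Sprintf("cleaning up %s", onShutdown.Resource))
        return onShutdown.Fn(shutdownCtx)
    })
}
// Wait blocks until every task returns and yields the first non-nil error.
if err := g.Wait(); err != nil {
    slog.Error(fmt.Sprintf("cleanup failed: %s", err.Error()))
}
A plain errgroup.Group (without WithContext) matches the WaitGroup behavior: one task's failure doesn't interrupt its siblings, but you still get a single error to act on.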
By following these steps, we ensure that the server properly handles termination signals, executes cleanup tasks, and shuts down gracefully while respecting a timeout constraint.
The Perfect Ending
Now that we've implemented a robust shutdown mechanism, let's see how to use it in a real-world scenario. We initialize the server with its configuration and specify cleanup tasks for resources like databases and caches.
package main

// Imports omitted: config, provider, and httpx here are project-local packages.
func main() {
    cfg := config.Init()
    provider := provider.Init(cfg)
    opt := httpx.Option{
        Address: cfg.App.Address,
        Handler: provider.Handlers,
        OnShutdowns: []httpx.OnShutdown{
            {
                Resource: "database",
                Fn:       provider.DatabaseShutDownFn,
            },
            {
                Resource: "redis",
                Fn:       provider.RedisShutdownFn,
            },
        },
        Timeout: 4 * time.Second,
    }
    if err := httpx.Start(opt); err != nil {
        slog.Error(fmt.Sprintf("failed to start server: %s", err.Error()))
    }
}
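The provider's cleanup functions are not shown here, but as an assumption, DatabaseShutDownFn might wrap a *sql.DB like this, running Close in a goroutine so the shutdown deadline carried by ctx is still honored:
// Hypothetical implementation; assumes the provider holds a *sql.DB in p.db.
func (p *Provider) DatabaseShutDownFn(ctx context.Context) error {
    done := make(chan error, 1)
    go func() { done <- p.db.Close() }()
    select {
    case err := <-done:
        return err // the database closed within the deadline
    case <-ctx.Done():
        return ctx.Err() // deadline exceeded before the database closed
    }
}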
Here are example logs when a running server is interrupted (for instance with Ctrl+C, which sends SIGINT) or terminated.
{"time":"2025-02-10T23:50:07.118866+07:00","level":"INFO","msg":"starting server at 0.0.0.0:3000"}
{"time":"2025-02-10T23:50:08.737842+07:00","level":"INFO","msg":"received signal: interrupt. initiating graceful shutdown"}
{"time":"2025-02-10T23:50:08.73807+07:00","level":"INFO","msg":"cleaning up database"}
{"time":"2025-02-10T23:50:08.738084+07:00","level":"INFO","msg":"cleaning up redis"}
{"time":"2025-02-10T23:50:11.739503+07:00","level":"INFO","msg":"redis resource closed successfully"}
{"time":"2025-02-10T23:50:11.739503+07:00","level":"INFO","msg":"database resource closed successfully"}
{"time":"2025-02-10T23:50:11.739825+07:00","level":"INFO","msg":"server shutdown gracefully"}
Wrapping Up the Exit
A well-designed system should handle startup and shutdown with care, ensuring disposability, a key principle of The Twelve-Factor App. This means applications must start quickly and shut down gracefully, allowing seamless deployments, autoscaling, and failure recovery. By implementing a structured shutdown process, active requests complete without loss, resources like databases and caches are properly closed, and the system remains resilient and scalable. This approach ensures services are robust, production-ready, and leave no loose ends behind when they exit.