

From 200 Microservices to One Control Plane with Backstage

How Backstage evolved from a service catalog into the operational control plane for 200+ microservices, safer rollouts, and self-serve platform workflows.

March 1, 2026 · #tech #microservices #kubernetes


Once you cross 200 microservices, the problem is no longer just architecture. It is coordination.

At that scale, engineers are not struggling because they cannot find the right YAML field or deployment command. They are struggling because every operational task is fragmented across too many tools, too many repos, and too many handoffs. One tab has the catalog. Another has Argo CD. A third has GitHub Actions. Logs live somewhere else. Ownership is documented in a wiki that may or may not be current. The result is predictable: slower rollouts, more context switching, and an operations team that becomes an accidental bottleneck.

What we wanted instead was a single pane of glass that did more than observe the system. We wanted a control plane where developers could understand a service, inspect its dependencies, change safe operational settings, and drive deployments without leaving the same context.

Backstage became that layer.

The complexity wall

Early on, manual coordination works because people can hold the system in their heads. At 20 services, a release manager can still answer questions like:

  • Which downstream systems will this change affect?
  • Who owns this service today?
  • Where are the dashboards and logs?
  • What is the safest rollout path for this environment?

At 200+, that model collapses. Teams change faster than documentation. New services inherit patterns inconsistently. SREs spend too much time acting as human routers for tasks that should be self-serve. Even simple actions like increasing worker replicas or validating a canary become dependent on institutional memory.

That was the inflection point for us. We did not need another static developer portal. We needed an operational interface that connected visibility, governance, and action.

The visibility layer

Before Backstage could become an action surface, it had to become the source of truth for how the platform was organized.

Namespace-based cataloging

The first step was giving engineers a sane mental model. We grouped services using namespaces aligned to domains and platform boundaries, then represented subcomponents explicitly instead of flattening everything into a single service entry.

That mattered more than it sounds.

When checkout has APIs, workers, scheduled jobs, Kafka consumers, and shared data contracts, a flat list becomes noise. Namespace-based cataloging gave teams cognitive clarity:

  • domain-level grouping for faster discovery
  • subcomponent visibility for operational precision
  • cleaner ownership boundaries across platform and product teams

Instead of asking, "Which repo owns this consumer?" engineers could navigate from the domain to the exact component and immediately see the metadata they needed.
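As a concrete illustration, here is what one checkout subcomponent might look like, modeled as a plain object matching Backstage's Component entity shape. The names, namespace, and dependency are hypothetical, not our actual catalog:

```typescript
// A single subcomponent registered explicitly under the "checkout" namespace,
// rather than flattened into one monolithic "checkout" service entry.
type Entity = {
  apiVersion: string;
  kind: string;
  metadata: { name: string; namespace: string };
  spec: { type: string; owner: string; system?: string; dependsOn?: string[] };
};

const paymentEventsConsumer: Entity = {
  apiVersion: 'backstage.io/v1alpha1',
  kind: 'Component',
  metadata: {
    name: 'payment-events-consumer',
    namespace: 'checkout', // domain-aligned namespace for discovery
  },
  spec: {
    type: 'kafka-consumer', // subcomponent type surfaced for operational precision
    owner: 'group:checkout-team', // ownership boundary lives on the entity itself
    system: 'checkout',
    dependsOn: ['component:checkout/payment-api'],
  },
};
```

Because the owner, type, and dependencies live on the entity, the catalog page answers "which repo owns this consumer?" without a wiki lookup.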

The service graph

Once the catalog was structured properly, the service graph became genuinely useful.

For operational work, the graph was less about drawing pretty boxes and more about answering risk questions quickly:

  • What breaks upstream if this service is unhealthy?
  • Which downstream consumers should be monitored during a rollout?
  • Is this dependency synchronous, event-driven, or purely data-oriented?

That reduced the time between noticing a problem and forming a response plan. During releases, the graph became the fastest way to estimate blast radius without pulling senior engineers into every change review.
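The blast-radius question is, at its core, a reverse traversal of the dependency graph. A minimal sketch, assuming the edge list has been pulled from the catalog's relations (the service names here are made up):

```typescript
// "from" depends on "to": if "to" is unhealthy, "from" is at risk.
type Edge = { from: string; to: string };

// Walk reverse dependency edges to find everything upstream of a service.
function upstreamImpact(service: string, edges: Edge[]): string[] {
  const impacted = new Set<string>();
  const queue = [service];
  while (queue.length > 0) {
    const current = queue.pop()!;
    for (const e of edges) {
      if (e.to === current && !impacted.has(e.from)) {
        impacted.add(e.from); // this consumer breaks if `current` degrades
        queue.push(e.from);
      }
    }
  }
  return [...impacted].sort();
}

const edges: Edge[] = [
  { from: 'checkout-api', to: 'payment-api' },
  { from: 'web-frontend', to: 'checkout-api' },
  { from: 'payment-api', to: 'ledger-db' },
];

// If payment-api is unhealthy, checkout-api and web-frontend are at risk.
const atRisk = upstreamImpact('payment-api', edges);
```

The same traversal in the other direction answers the rollout question: which downstream consumers to watch while a change goes out.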

Team ownership, SSO, and RBAC

Visibility without governance creates confidence theater. Backstage only became trustworthy once ownership and access control were enforced.

We integrated team ownership metadata directly into catalog entities and tied authentication to SSO groups. That let us use RBAC for the actions that mattered:

  • only the owning team could change production-facing settings
  • platform teams could expose safe controls without exposing raw cluster access
  • auditability improved because approvals and operational changes were tied to identity

This changed the tone of operations. Instead of relying on Slack archaeology to find the right owner, people could see accountability in the same view where they were taking action.
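The core of the RBAC pattern is small: an ownership check against SSO groups, with an audit record emitted for every decision. This is illustrative logic under assumed shapes, not the actual Backstage permission framework API:

```typescript
type Identity = { user: string; groups: string[] }; // groups come from SSO
type AuditEvent = { actor: string; action: string; allowed: boolean; at: number };

// Only members of the owning group may change production-facing settings;
// every decision is recorded against a real identity either way.
function authorize(
  identity: Identity,
  entityOwner: string, // e.g. 'group:checkout-team' from the catalog entity
  action: string,
  now: number,
): AuditEvent {
  const allowed = identity.groups.includes(entityOwner);
  return { actor: identity.user, action, allowed, at: now };
}
```

Tying the check to the same owner string stored on the catalog entity is what makes accountability visible in the same view where action is taken.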

Operational empowerment

Once discovery and ownership were solid, the next step was enabling controlled self-service.

Self-serve Kubernetes

One of the highest-leverage improvements was exposing common Kubernetes operations inside Backstage.

Developers did not need unfettered cluster access. They needed the ability to make a narrow set of safe, reversible changes without filing an ops ticket or waiting for an SRE to act as a middleman. So we exposed opinionated controls for things like:

  • scaling replicas up or down within guardrails
  • adjusting HPA-related thresholds that had been pre-approved
  • restarting workloads safely
  • inspecting current deployment configuration per environment

The important design choice was that Backstage did not become a generic kubectl replacement. It became a constrained control surface. The platform team decided which knobs were safe, what ranges were allowed, and which identities could touch which environments.

That preserved governance while eliminating a huge class of low-value operational handoffs.
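The guardrail idea can be sketched as the validation the control surface runs before it ever touches the cluster. The environments and limits here are illustrative; in practice the platform team defines them per service:

```typescript
type Guardrail = { minReplicas: number; maxReplicas: number };

// Pre-approved ranges per environment, decided by the platform team.
const guardrails: Record<string, Guardrail> = {
  production: { minReplicas: 2, maxReplicas: 10 },
  staging: { minReplicas: 1, maxReplicas: 4 },
};

// Reject anything outside the approved range; only a valid request would
// then be translated into an actual Kubernetes API call by the backend.
function validateScaleRequest(
  env: string,
  desired: number,
): { ok: boolean; reason?: string } {
  const rail = guardrails[env];
  if (!rail) {
    return { ok: false, reason: `no guardrail defined for ${env}` };
  }
  if (desired < rail.minReplicas || desired > rail.maxReplicas) {
    return {
      ok: false,
      reason: `replicas must be in [${rail.minReplicas}, ${rail.maxReplicas}]`,
    };
  }
  return { ok: true };
}
```

This is the sense in which Backstage was a constrained control surface rather than a kubectl replacement: the knob exists, but only within the range someone already judged safe.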

Custom infrastructure tuning

The next layer was exposing infrastructure controls that were too context-specific for off-the-shelf plugins.

For example, we added custom plugins for runtime tuning tasks such as:

  • changing sampling rates for logging during an incident
  • updating dynamic configuration values without shipping a fresh build
  • surfacing environment-specific feature toggles that affected system behavior

These controls mattered because they turned Backstage into a place where developers could respond to live conditions. If error volume spiked during a rollout, the team could temporarily increase observability detail, inspect the effect, and decide on next steps without bouncing between three different consoles.

That is where the portal stopped being a documentation layer and started feeling like a platform product.

The deployment control plane

This was the most meaningful shift: Backstage became the frontend for release orchestration.

Triggering actions from one interface

We used Backstage as the operator-facing UI for GitHub Actions and Argo CD.

At a high level, the flow looked like this:

  1. A developer selected a service, environment, and target version in Backstage.
  2. Backstage triggered a GitHub Actions workflow to validate inputs, prepare release metadata, and update the GitOps source of truth.
  3. Argo CD reconciled the desired state and began the rollout process.
  4. Backstage continued to display rollout state, service health, and linked diagnostics in the same view.

This mattered because deployment intent, deployment execution, and deployment observation were no longer split across separate tools and mental contexts.
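Step 2 of that flow can be sketched as the request Backstage's backend would send to GitHub's `workflow_dispatch` endpoint. The repo, workflow file name, and input names are hypothetical; the URL shape and `ref`/`inputs` body follow the GitHub REST API:

```typescript
// Build the workflow_dispatch request for a release, from the service,
// environment, and version the developer selected in Backstage.
function buildDispatchRequest(
  service: string,
  environment: string,
  version: string,
): { url: string; body: { ref: string; inputs: Record<string, string> } } {
  return {
    url: 'https://api.github.com/repos/acme/deployments/actions/workflows/release.yaml/dispatches',
    body: {
      ref: 'main', // the branch holding the release workflow
      inputs: { service, environment, version },
    },
  };
}

const req = buildDispatchRequest('checkout-api', 'production', 'v2.4.1');
```

The workflow then validates those inputs and updates the GitOps repo, at which point Argo CD takes over reconciliation.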

The Backstage approval pattern

The hardest part was not the button to start a rollout. It was the human approval step in the middle.

In a pure GitOps model, the desired state is declarative and eventually reconciled. But real production releases often need a pause point after an initial canary slice or environment-specific verification. We needed a way to bridge those two worlds without falling back to manual side channels.

The pattern that worked well for us looked like this:

  1. GitHub Actions kicked off the release and advanced it to a pause point.
  2. At that checkpoint, the workflow published a pending approval record to a custom Backstage backend.
  3. The backend stored release state keyed by service, environment, rollout phase, actor, and expiry.
  4. Backstage rendered that pending state directly on the service page alongside health signals, logs, and ownership context.
  5. An authorized user clicked approve, promote, or abort inside Backstage.
  6. The backend verified RBAC, recorded the audit event, and signaled the waiting workflow or Argo CD integration to continue.

That custom backend was the glue between a GitOps desired-state engine and a human approval workflow. It gave us a clean place to model release state explicitly instead of encoding approvals in ad hoc comments, Slack messages, or tribal process.
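The release-state record and its approval transition can be sketched like this. Field names, phases, and the expiry rule are assumptions about what such a backend might store, not the plugin's actual schema:

```typescript
// One pending approval, keyed by service, environment, phase, actor, expiry.
type ApprovalRecord = {
  service: string;
  environment: string;
  phase: 'canary' | 'promote' | 'done' | 'aborted';
  pendingActor: string | null; // SSO group allowed to decide, null if none pending
  expiresAt: number;           // epoch ms; stale approvals must be restarted
};

// Apply an authorized decision and return the new release state.
// The real backend would also persist an audit event and signal the
// waiting GitHub Actions workflow or Argo CD integration.
function decide(
  record: ApprovalRecord,
  action: 'approve' | 'abort',
  actorGroups: string[],
  now: number,
): ApprovalRecord {
  if (record.pendingActor === null || !actorGroups.includes(record.pendingActor)) {
    throw new Error('actor is not authorized for this approval');
  }
  if (now > record.expiresAt) {
    throw new Error('approval window expired; restart the rollout');
  }
  return {
    ...record,
    phase: action === 'approve' ? 'promote' : 'aborted',
    pendingActor: null,
  };
}
```

Modeling the state explicitly like this is what replaced ad hoc comments and Slack messages: every pause point has a record, an authorized decider, and a deadline.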

The developer experience win was obvious: the person making the approval decision was doing it in the same context where they had just checked rollout health, upstream/downstream dependencies, logs, and ownership.

Progressive delivery and canary confidence

Once approvals lived inside Backstage, progressive delivery became much easier to operationalize.

We supported multi-phase rollouts such as canary and blue-green patterns, with the current phase and blast radius visible from the same dashboard. A rollout was no longer just "in progress" or "done." Engineers could see whether a service was at 5%, 25%, or 50%, what environment was active, and what the next promotion step would do.

This is where combining plugins paid off. During a canary, engineers could look at real-time error rates through the logging and observability integrations sitting next to the Argo CD rollout controls. That meant the question, "Should we promote this to 100%?" could be answered with live signals, not instinct.

In practice, that reduced hesitation and reduced reckless confidence at the same time. Teams were faster because the evidence was in front of them, and safer because they could judge the blast radius before taking the next action.
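The promotion ladder behind that UI is simple to state: given the current traffic slice, what would the next promotion do? A minimal sketch, with illustrative percentages:

```typescript
// Canary promotion steps as traffic percentages (illustrative values).
const canarySteps: number[] = [5, 25, 50, 100];

// Return the next traffic slice, or null if the rollout is complete
// (or the current value is not a known step).
function nextStep(current: number): number | null {
  const i = canarySteps.indexOf(current);
  if (i === -1 || i === canarySteps.length - 1) return null;
  return canarySteps[i + 1];
}
```

Rendering both `current` and `nextStep(current)` next to live error rates is what let the "promote to 100%?" question be answered with evidence rather than instinct.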

Democratizing APIs and documentation

Operational excellence is only half the story. Backstage also made the system easier to understand and use across teams.

Centralized proto definitions

In many microservice environments, service contracts are technically versioned but practically hard to find. The protobuf or API definitions exist somewhere, but discovering the right repo, branch, or package is a tax on every new integration.

We centralized proto definitions and surfaced them through Backstage so engineers were not hunting through repositories just to answer basic contract questions. That improved:

  • onboarding for new team members
  • confidence during cross-team integrations
  • release reviews when schema changes were involved

It also tightened the loop between service ownership and contract visibility. The same page that showed who owned a service also showed the interfaces it exposed.

The try-it experience

The next step was lowering the barrier to experimentation.

We exposed sample API calls and lightweight try-it workflows so PMs, developers, and adjacent teams could validate endpoints without needing a full local setup or a deep understanding of the repo layout. In the best cases, people could:

  • inspect request and response shapes
  • run sample calls against approved environments
  • validate integration assumptions before asking another team for help

That reduced back-and-forth and made collaboration much more practical. Documentation stopped being a passive asset and became an entry point for real interaction.

Lessons learned and cultural impact

The biggest outcome was not a specific plugin. It was a shift in how the organization thought about operations.

Reducing context switching

The productivity benefit of keeping an engineer in one tab is easy to underestimate.

Every tool switch imposes a tax: reloading context, re-checking identity, remembering where the last action happened, and translating between different representations of the same system. By putting catalog data, rollout controls, health checks, logs, and API docs in one place, we reduced that tax on nearly every operational workflow.

The gain was not just speed. It was better decisions under pressure.

From gatekeeping to platform building

Backstage also changed the role of operations teams.

Instead of acting as approval bottlenecks or manual executors for routine tasks, Ops and SRE teams could focus on building safe abstractions: defining guardrails, exposing the right controls, and improving the platform surface area over time.

That is the real DevEx shift. The goal is not to remove operational discipline. The goal is to encode it into the platform so more teams can move independently without increasing risk.

Final thought

At small scale, a developer portal can be mostly about discovery. At 200+ microservices, that is not enough.

The real value appears when Backstage becomes the place where teams understand the system, act on the system, and make release decisions with the right context in front of them. That is when it stops being a catalog and starts becoming a control plane.