Some years ago I had the pleasure of investigating a production outage in a customer-facing application. I say pleasure because these investigations can be genuinely enjoyable. You do meaningful work under pressure. When you find the answer and fix it, everyone is satisfied. The pressure at the time was heavy, but the work itself was good.

The root cause was a configuration file that a Spring Boot application tried to load from an (internal) GitHub repository at startup. The application ran in the cloud. During high business load, the containers restarted. GitHub happened to be unreachable at that moment. The startup failed. Customers couldn’t access the service.

We found and fixed the cause. We documented the lesson: production assets must not depend on development resources like GitHub. That dependency was eliminated.

Then someone asked the natural follow-up question.

“What other critical applications in our landscape depend on less critical ones?”

The Answers You Get

In most large enterprises, the answers sound like this:

  • “None that I know of.”
  • “We shouldn’t have that.”
  • “We need to ask the teams.”

So questionnaires go out, answers get collected in Excel, presentations get made in PowerPoint. Perhaps a dashboard appears in Power BI. Actions are defined and tracked.

This creates cost and friction because the basic information isn’t maintained at the source. The data quality is unreliable: manually collected rather than automatically discovered. And once the outage is fixed, the lesson recedes into memory.

Then the next production outage happens. Different application. Different dependency chain. Same fundamental problem.

The Database Solution

The obvious answer is to maintain a database with the dependencies. Add an inventory of all applications with their criticality levels. Query it: How critical is this application? How critical are its dependencies? Show me the mismatches.
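The mismatch query is the core of the idea, and it is short. A minimal sketch, assuming a hypothetical schema with `applications` and `dependencies` tables (all names and criticality values here are illustrative, not from any real system):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE applications (name TEXT PRIMARY KEY, criticality INTEGER);
CREATE TABLE dependencies (source TEXT, target TEXT);

INSERT INTO applications VALUES ('checkout', 3), ('github-config', 1);
INSERT INTO dependencies VALUES ('checkout', 'github-config');
""")

# Flag every case where an application depends on something less critical
# than itself -- the pattern behind the outage described above.
mismatches = conn.execute("""
    SELECT d.source, a1.criticality, d.target, a2.criticality
    FROM dependencies d
    JOIN applications a1 ON a1.name = d.source
    JOIN applications a2 ON a2.name = d.target
    WHERE a1.criticality > a2.criticality
""").fetchall()

for source, c1, target, c2 in mismatches:
    print(f"{source} (criticality {c1}) depends on {target} (criticality {c2})")
```

The query itself is trivial. As the rest of this section argues, the hard part is keeping the tables it runs against truthful.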

Many organizations have exactly this, and still suffer from persistent data quality problems.

Why? Maintaining it requires discipline most organizations struggle with. Applications change, teams restructure, dependencies shift. The database reflects reality only if someone updates it when reality changes.

But reliability requires more than process discipline. It requires structural support in the data model itself.

You need data types that distinguish “we checked and there are none” from “nobody has entered anything yet.” You need timestamps on every relationship, not just every asset. You need automatic staleness detection that flags entries older than some threshold. You need a process that prompts owners to reconfirm or update data when it ages.

This is infrastructure work that can be automated, but organizations rarely prioritize it. Without it, your dependency database is a collection of guesses with unknown freshness.

Two Different Questions

Organizations confuse two questions.

“What do we have, and where is it deployed?” That’s inventory.

“What will break, and what depends on it?” That’s prediction.

Configuration Management Databases (CMDBs) answer the first question well. Many have automatic discovery built in. They’re designed to maintain current state with decent data quality. That’s their job.

An architecture model should answer the second question. It should capture not just what exists, but what could exist, what should exist, and the dependencies between possible states.

One is a catalog of the present. The other is a map of risks and possibilities.

Most architecture repositories try to be both and end up being neither: too heavyweight to maintain as real-time inventory, too unreliable to trust as a forward-looking risk signal.

What Works

The organizations I’ve seen succeed maintain a clear division of labor. The CMDB tracks what exists right now. The architecture model captures criticality, strategic dependencies, and mismatches that represent risk. Risk platforms track the identified issues and their remediation. Portfolio management tools track the projects meant to fix them.

These are separate systems, but they reference the same entities. The cross-checking between them reveals data quality problems you’d miss if everything lived in one place.
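As an illustration of that cross-checking, the simplest useful version is just comparing the entity sets each system knows about. The system names and application names below are made up:

```python
# Hypothetical exports of application names from two systems that should
# reference the same entities.
cmdb_apps = {"checkout", "payments", "legacy-batch"}
architecture_model_apps = {"checkout", "payments", "mobile-gateway"}

# In the CMDB but missing from the architecture model: deployed right now,
# but nobody has assessed its criticality or dependencies.
unassessed = cmdb_apps - architecture_model_apps

# In the architecture model but not the CMDB: assessed, but possibly
# decommissioned or never deployed -- a stale model entry.
ghosts = architecture_model_apps - cmdb_apps

print("unassessed:", sorted(unassessed))
print("possibly stale:", sorted(ghosts))
```

Each mismatch is a concrete data quality finding with an obvious owner, which is precisely what a single combined tool tends to hide.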

This ecosystem approach requires active consistency management. Is it more work than a single tool? Yes. Does it produce reliable answers when someone asks “What will break if this fails?” Also yes.

The Pattern

Production outages reveal something about responsibilities.

Solution architects document the dependencies their services create. Enterprise architects oversee the dependencies that fall between services, the ones nobody owns until something breaks.

The job isn’t documenting what exists. It’s predicting what will fail.

The organizations that grasp this distinction stop having the same outage twice. They don’t wait for production to teach them where the risks are. They can answer “what depends on what, and are the criticality levels aligned?” before someone asks under pressure at 3am.

That’s not documentation. That’s architecture doing its job.