From Single-Model Queries to Multi-Agent Reasoning

A Config-Driven Architecture with Persistent Memory for Scaling Enterprise Decision Intelligence with Generative AI

By Karthick Sundaram, Engineering Leader, Enterprise AI & Supply Chain Technology, Microsoft Corporation

Abstract

Most enterprise AI deployments connect a single large language model to a single data source, producing answers that lack the cross-domain reasoning required for high-quality decisions. This paper presents a production architecture that overcomes this limitation through multi-agent orchestration: a system where specialized AI agents query different enterprise data sources, a synthesizer combines their findings, and an evaluator validates quality before the answer reaches the user.

The architecture is config-driven: adding an entirely new business domain with its own data agents, contextual grounding, routing rules, and user interface requires no code changes. Engineers add JSON configuration files and markdown prompt templates; the platform does the rest. The system implements a five-step reasoning pipeline (plan, hypothesize, gather evidence, synthesize, evaluate) with iterative re-planning when quality thresholds are not met.

A persistent memory layer allows the system to learn from past interactions: corrections, recurring patterns, and domain-specific insights are retained across sessions and shared among reasoning agents, while remaining isolated between business domains through profile-scoped tagging.

We describe the integration of a cloud AI platform for reasoning and memory with an enterprise data platform for structured data access, connected through identity-preserving authentication. Deployment results show that a new five-agent business domain was onboarded in under 75 minutes with zero lines of code changed, validating the config-driven scaling hypothesis.

Keywords: multi-agent systems, enterprise AI, decision intelligence, persistent memory, config-driven architecture, hypothesis-driven reasoning, generative AI, retrieval-augmented generation

1. Introduction

Enterprise organizations have invested heavily in generative AI. Chatbots answer employee questions, dashboards surface AI-generated summaries, and automated alerts flag anomalies. Yet a persistent gap remains: tasks move faster, outputs sound confident, but decision quality often stays unchanged.

The root cause is architectural, not algorithmic. Most enterprise AI systems follow a familiar pattern: one model connected to one data source, answering one question at a time. This works for simple lookups ("What is the price of part X?") but fails when a decision requires reasoning across multiple data domains, weighing competing explanations, and synthesizing evidence from sources that were never designed to work together.

Consider a supply chain analyst who asks: "Why did our production costs increase for this component?" Answering meaningfully requires querying budget systems for spend variances, pricing databases for supplier cost changes, and gap-analysis tools for contract deviations. These are three separate systems with different schemas, different owners, and different access controls. A single-agent architecture cannot perform this cross-domain reasoning.

A multi-agent system can, but building one traditionally requires extensive custom code for each new domain. Every new data source means new orchestration logic, new routing rules, and new integration code. This engineering overhead limits adoption to a handful of high-value use cases, leaving the long tail of enterprise decisions unserved.

This paper presents an architecture that addresses both challenges simultaneously: (1) deep cross-domain reasoning through a hypothesis-driven evidence synthesis pipeline, and (2) rapid domain scaling through a config-driven design that requires zero code changes to onboard new business domains. We describe a production system deployed at a major technology company that orchestrates up to eight specialized data agents across two business domains, coordinated by three reasoning agents with persistent memory.

Our key contributions are:

  • A five-step reasoning pipeline (plan, hypothesize, gather, synthesize, evaluate) that brings scientific methodology to enterprise data queries, with iterative re-planning on quality failures.
  • A config-driven agent architecture where new business domains are added through JSON configuration and markdown prompt templates alone, with no code changes required.
  • A dual-platform integration pattern that separates reasoning agents from data agents, connected through identity-preserving authentication for user-level data access control.
  • A persistent memory architecture with profile-scoped isolation that enables multiple business domains to share a single memory store without cross-domain contamination.
  • Production deployment results showing that a new five-agent domain was onboarded in under 75 minutes with zero backend code changes.

2. The Problem: Why Single-Agent AI Falls Short

2.1 The Single-Source Trap

Most enterprise AI chatbots are wrappers around a single large language model connected to a single knowledge base. The user asks a question, the model retrieves relevant documents, and it generates an answer. This pattern, often called Retrieval-Augmented Generation (RAG), works well for factual lookups within a single domain.

But enterprise decisions rarely depend on a single source. A procurement decision involves budget data, supplier pricing, market benchmarks, compliance records, and contract terms. Each lives in a different system. When the AI can only see one system, it produces answers that are technically correct but contextually incomplete: the equivalent of a doctor diagnosing a patient by looking at only one lab result.

2.2 The Scaling Wall

Organizations that recognize this limitation sometimes build multi-source AI systems. But these are typically hand-coded: an engineer writes custom logic to call each data source, combine results, and format the output. When a new data source or business domain is needed, the same engineer must write more custom code.

This creates a scaling wall. The system works for its original domain, but expanding to a second or third domain requires engineering effort proportional to the first. The bottleneck is not AI capability; it is the code required to connect new data sources, define new routing rules, and adapt the reasoning logic for each domain.

2.3 The Trust Gap

Even when AI systems produce technically accurate answers, enterprise users often distrust them because they cannot see the reasoning. The AI says costs increased by 12%, but does not show which data sources it consulted, what alternative explanations it considered, or how confident it is in the answer. Research on AI-assisted decision-making consistently shows that outcomes improve when people can evaluate the reasoning behind AI outputs, not just the outputs themselves.

2.4 The Amnesia Problem

Most AI systems are stateless: every conversation starts from scratch. If a user corrects a misinterpretation ("When I say 'cost variance,' I mean the delta between forecast and actual, not the standard deviation"), that correction is lost the moment the session ends. The user must repeat the same correction in the next session, and the next, and the next.

For enterprise decision support, this amnesia is particularly costly. Domain-specific terminology, organizational conventions, and recurring analytical patterns accumulate over time. A system that cannot remember past interactions forces users to re-teach it continuously, eroding both productivity and trust.

3. Architecture Overview

The proposed architecture addresses these four challenges (single-source limitation, scaling wall, trust gap, and amnesia problem) through four layers that work together:

  • A lightweight backend that controls the reasoning loop and streams progress to users in real time.
  • A cloud AI platform that hosts reasoning agents (orchestrator, synthesizer, evaluator) with persistent memory.
  • An enterprise data platform that hosts specialized data agents, each connected to a specific business dataset.
  • A configuration layer (JSON files and markdown templates) that makes the entire system domain-agnostic.
+-------------------+     +------------------------+     +----------------------+
| User Interface    | --> | Orchestration Backend  | --> | AI Reasoning Layer   |
| (Web Application) |     | (Reasoning Loop)       |     | - Orchestrator       |
| Profile Switcher  |     | Real-Time Streaming    |     | - Synthesizer        |
| Live Progress UI  |     | Identity Management    |     | - Evaluator          |
+-------------------+     +------------------------+     | - Web Search         |
                                      |                  | - Doc Export         |
                                      v                  | + Shared Memory      |
                          +------------------------+     +----------------------+
                          | Enterprise Data Layer  |
                          | Data Agent 1 (Budget)  |     +----------------------+
                          | Data Agent 2 (Pricing) |     | Configuration Layer  |
                          | Data Agent 3 (Gaps)    |     | Agent Registry       |
                          | Data Agent 4 (Audit)   |     | Domain Profiles      |
                          | Data Agent 5 (Supply)  |     | Prompt Templates     |
                          | ...up to N agents      |     | Grounding Context    |
                          +------------------------+     +----------------------+

Figure 1. Four-layer architecture: user interface, orchestration backend, AI reasoning platform with persistent memory, and enterprise data agents, all driven by a configuration layer.

The critical design principle is separation of concerns. The backend controls the reasoning loop: it parses the orchestrator's plan, dispatches data agents, feeds evidence to the synthesizer, and routes the synthesis to the evaluator. The AI models produce reasoning plans and natural language analysis, but they do not control execution flow. This prevents common failure modes where language models skip steps, hallucinate tool calls, or loop indefinitely.
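To make this control inversion concrete, the following Python sketch shows the shape of the backend loop. All names here (run_pipeline, orchestrator, dispatch_agents, deliver) are illustrative, not the production API; the point is that the models return structured payloads while deterministic code owns iteration and termination.

# Sketch of the backend-controlled reasoning loop (illustrative names only;
# orchestrator, synthesizer, evaluator, dispatch_agents, deliver are assumed
# to be provided by the application).
MAX_ITERATIONS = 3

def run_pipeline(query, profile, memory):
    feedback = None
    for attempt in range(1, MAX_ITERATIONS + 1):
        plan = orchestrator.plan(query, profile, memory, feedback)   # model proposes
        evidence = dispatch_agents(plan["questions"], profile)       # backend executes
        draft = synthesizer.combine(query, plan["hypotheses"], evidence)
        verdict = evaluator.score(query, draft, evidence)
        if verdict["label"] == "PASS":
            return deliver(draft, verdict, iterations=attempt)
        feedback = verdict["feedback"]       # evaluator critique drives re-planning
    return deliver(draft, verdict, iterations=MAX_ITERATIONS)        # best effort

Because the loop, not the model, enforces the step order and the iteration cap, the failure modes above (skipped steps, out-of-profile dispatch, infinite loops) become structurally impossible.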

4. The Five-Step Reasoning Pipeline

Every user query passes through a five-step pipeline that mirrors the scientific method: formulate hypotheses, gather evidence, analyze findings, and validate conclusions. This is fundamentally different from single-pass generation where a model produces one answer and hopes it is correct.

User Query
     |
     v
+-----------------------+
| Step 1: PLAN          |   Orchestrator generates hypotheses and
| Plan + Hypothesize    |   identifies which data agents to query.
+-----------+-----------+
            |
            v
+-----------------------+
| Step 2: GATHER        |   Data agents are dispatched in parallel.
| Evidence Collection   |   Each returns structured evidence.
+-----------+-----------+
            |
            v
+-----------------------+
| Step 3: SYNTHESIZE    |   Synthesizer combines all evidence into
| Evidence Synthesis    |   a draft answer with confidence score.
+-----------+-----------+
            |
            v
+-----------------------+
| Step 4: EVALUATE      |   Evaluator scores accuracy, relevance,
| Quality Gate          |   completeness, and clarity.
+-----------+-----------+
            |
    PASS?   |   INSUFFICIENT?
      |     +--------+
      v              v
+-----------+   Re-plan with
| Step 5:   |   feedback (max 3x),
| DELIVER   |   then back to
| to user   |   Step 2
+-----------+

Figure 2. The hypothesis-driven reasoning loop. On quality failure, the evaluator triggers re-planning with specific feedback, up to three iterations.

4.1 Step 1: Planning and Hypothesis Generation

The orchestrator, a reasoning-focused AI agent, receives the user's question along with descriptions of all available data agents and the current domain's routing rules. Rather than attempting to answer directly, it produces a structured plan: a set of hypotheses about what might explain the answer, and a list of specific questions to ask specific data agents.

For example, if a user asks why costs increased, the orchestrator might generate three hypotheses: (H1) raw material prices rose, (H2) supplier contract terms changed, (H3) production volumes shifted. It then maps each hypothesis to the data agent best positioned to provide evidence: a pricing agent for H1, a contract analysis agent for H2, and a budget agent for H3.

Before generating the plan, the orchestrator also searches its persistent memory for relevant context from previous interactions (past corrections, known data quirks, or recurring patterns) and incorporates these into the hypothesis generation process. This memory-informed planning means the system improves with use.
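As a hedged illustration (field names are hypothetical; the production schema is not reproduced here), the orchestrator's plan for the cost-increase question might be structured like this:

plan = {
    "hypotheses": [
        {"id": "H1", "statement": "Raw material prices rose"},
        {"id": "H2", "statement": "Supplier contract terms changed"},
        {"id": "H3", "statement": "Production volumes shifted"},
    ],
    "questions": [
        {"agent": "pricing_agent", "hypothesis": "H1",
         "question": "How did supplier unit prices for this component "
                     "change over the last two quarters?"},
        {"agent": "contract_agent", "hypothesis": "H2",
         "question": "Did contract terms for this component's suppliers "
                     "deviate from the negotiated baseline?"},
        {"agent": "budget_agent", "hypothesis": "H3",
         "question": "How did production volumes and spend variance move "
                     "for this component?"},
    ],
    "memory_context": [
        "User defines cost variance as forecast minus actual.",  # recalled from memory
    ],
}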

4.2 Step 2: Parallel Evidence Gathering

The backend dispatches questions to data agents using a parallel-serial strategy. Questions targeting different agents run concurrently (since they access different data sources). Questions targeting the same agent run sequentially to avoid overwhelming the data service with concurrent requests. Each dispatch includes retry logic with exponential backoff to handle transient failures gracefully.

Each data agent receives a carefully constructed message that includes: (a) grounding context describing the data schema it can access, (b) behavioral instructions from its prompt template, (c) user identity information for access control, and (d) the specific question with its hypothesis context. This ensures the data agent has sufficient context to query its underlying dataset accurately.
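A minimal sketch of the parallel-serial dispatch strategy using Python's asyncio, assuming an async data-agent client exposing an ask method; TransientError is a stand-in for whatever transient failures the data platform surfaces.

import asyncio
import random

class TransientError(Exception):
    pass   # stand-in for transient data-platform failures

async def ask_with_retry(agent, question, retries=3):
    # Exponential backoff with jitter on transient failures.
    for attempt in range(retries):
        try:
            return await agent.ask(question)
        except TransientError:
            if attempt == retries - 1:
                raise
            await asyncio.sleep(2 ** attempt + random.random())

async def gather_evidence(questions, agents):
    # Different agents run concurrently; questions for the same agent run
    # sequentially to avoid overwhelming a single data service.
    groups = {}
    for q in questions:
        groups.setdefault(q["agent"], []).append(q["question"])

    async def run_group(name):
        return [await ask_with_retry(agents[name], text) for text in groups[name]]

    results = await asyncio.gather(*(run_group(name) for name in groups))
    return [evidence for group in results for evidence in group]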

4.3 Step 3: Evidence Synthesis

The synthesizer, a second reasoning agent, receives all evidence from the data agents along with the original hypotheses and user query. Its job is to combine findings into a coherent answer, applying several analytical checks:

  • Temporal ordering: Does the proposed cause actually precede the effect in the data?
  • Confounding variables: Are there other factors that could explain both the cause and the effect?
  • Counterfactual reasoning: Would the outcome plausibly differ without the proposed cause?
  • Dose-response: Does more of the cause lead to more of the effect?

The synthesizer also produces alternative explanations (other plausible causes the data supports) rather than defaulting to the first plausible explanation. This guards against confirmation bias, a common failure mode in AI-generated analysis.

4.4 Step 4: Quality Evaluation

The evaluator, a third reasoning agent, acts as a quality gate. It scores the synthesized answer on four weighted criteria:

Criterion          Weight   What It Measures
Data Accuracy      50%      Do the numbers in the answer match the evidence collected?
Answer Relevance   30%      Does the answer directly address what the user asked?
Completeness       10%      Does it cover all aspects the user explicitly requested?
Clarity            10%      Is the format appropriate for the complexity of the data?

Table 1. Evaluation criteria and weights used by the quality gate.

If the weighted score falls below 0.70 or critical data mismatches are detected, the evaluator returns an INSUFFICIENT verdict with specific feedback: which data agents should be re-queried, what additional evidence is needed, and which aspects of the synthesis need improvement. The pipeline then loops back to Step 2 with this feedback, up to three times.
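The quality gate itself reduces to a small deterministic function. A sketch, assuming the evaluator model returns per-criterion scores in [0, 1] plus a flag for critical data mismatches:

WEIGHTS = {"data_accuracy": 0.50, "answer_relevance": 0.30,
           "completeness": 0.10, "clarity": 0.10}
THRESHOLD = 0.70

def quality_verdict(scores, critical_mismatch=False):
    # Weighted sum over the four criteria from Table 1.
    weighted = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    if critical_mismatch or weighted < THRESHOLD:
        return "INSUFFICIENT", weighted   # triggers re-planning with feedback
    return "PASS", weighted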

4.5 Step 5: Delivery with Transparency

On a PASS verdict, the final answer is delivered to the user alongside metadata that builds trust: which data sources were consulted, which hypotheses were tested, the confidence score, how many iterations the pipeline required, and the total elapsed time. The user sees not just the answer but the reasoning chain that produced it.

5. Config-Driven Architecture for Domain Scaling

The central thesis of this work is that enterprise multi-agent systems should be extensible through configuration rather than code. The architecture achieves this through three configuration layers: an agent registry, domain profiles, and prompt templates.

5.1 The Agent Registry

The agent registry is a single JSON file that serves as the source of truth for all agent definitions in the system. It contains the configuration for system agents (orchestrator, synthesizer, evaluator), shared memory settings, and the complete list of data agents. Each data agent entry specifies a unique name, a human-readable label, a detailed natural-language description of its capabilities, the identifier of the enterprise dataset it can query, a pointer to its grounding context file, and a pointer to its behavioral prompt template.

The description field is the most important routing signal in the entire system. When the orchestrator plans which agents to query, it reads these descriptions to understand each agent's capabilities. Descriptions that include specific column names, supported query patterns, and explicit exclusions ("Does NOT contain fiscal year data") lead to significantly more accurate routing decisions than vague summaries.
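An illustrative registry entry, shown here as a Python dict mirroring the JSON structure (all field names and identifiers are examples, not the production schema). Note how the description encodes routing signals, including explicit exclusions:

budget_agent_entry = {
    "name": "budget_agent",
    "label": "Budget & Spend Analysis",
    "description": (
        "Answers questions about budget allocations, spend variance, and "
        "forecast-vs-actual deltas by cost center and fiscal month. Supports "
        "trend and variance queries. Does NOT contain supplier pricing or "
        "contract data."
    ),
    "dataset_id": "<data-platform-dataset-id>",        # placeholder identifier
    "grounding_file": "grounding/context_budget.md",
    "prompt_file": "prompts/sourcing_budget.md",
}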

5.2 Domain Profiles

Profiles define business domains. Each profile is a JSON file that selects a subset of agents from the registry, provides domain-specific routing rules, configures the user interface (welcome text, icons, suggested questions), and defines data-access patterns. The system loads only the agents referenced by the active profile, so adding a new profile does not affect existing domains.

For example, a sourcing intelligence profile might reference three data agents covering budgets, pricing, and contract gaps, while a compliance governance profile references five different data agents covering audits, supplier sustainability, factory analytics, high-risk materials, and smelter tracking. Both profiles share the same system agents (orchestrator, synthesizer, evaluator), the same web search agent, and the same document export agent, but each sees only its own data agents.

Routing rules within each profile use tag-based matching with priority levels. A sourcing profile might define: "Route all data queries tagged [budget, spend, cost] to the budget agent with primary priority," and "Route queries tagged [web, market, external] to the web search agent with secondary priority." These rules guide the orchestrator's dispatch decisions without hard-coding them in application logic.
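A sketch of such a profile, again as a Python dict mirroring the JSON file (names and fields are illustrative):

sourcing_profile = {
    "profile_id": "sourcing_intelligence",
    "agents": ["budget_agent", "pricing_agent", "gap_agent",
               "web_search", "doc_export"],
    "routing_rules": [
        {"tags": ["budget", "spend", "cost"],
         "agent": "budget_agent", "priority": "primary"},
        {"tags": ["web", "market", "external"],
         "agent": "web_search", "priority": "secondary"},
    ],
    "ui": {
        "title": "Sourcing Intelligence",
        "welcome": "Ask about budgets, supplier pricing, and contract gaps.",
        "suggested_questions": [
            "Why did production costs increase for this component?",
        ],
    },
}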

5.3 Prompt Templates and Grounding Context

Each agent has two associated files. A prompt template (a markdown file) defines how the agent should behave with its analytical style, output format, and domain-specific rules. A grounding context file (also markdown) provides schema information, column descriptions, and query examples that help the data agent translate natural language questions into accurate database queries.

The orchestrator's own prompt template uses placeholder variables that are populated at runtime from the active profile. When a user switches from one domain to another, the same template is rendered with different agent names, routing rules, and domain context. The reasoning logic remains identical; only the domain knowledge changes.
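A minimal sketch of this runtime rendering using Python's standard string.Template. The fragment and placeholder names are invented for illustration; the production templates are markdown files rather than inline strings.

from string import Template

# Invented fragment; production templates live in markdown files.
ORCHESTRATOR_TEMPLATE = Template(
    "You are the planning agent for the $profile_name domain.\n"
    "Available data agents:\n$agent_descriptions\n"
    "Routing rules:\n$routing_rules\n"
    "Produce hypotheses and one question per relevant agent."
)

def render_orchestrator_prompt(profile, registry):
    # The same template is rendered with whatever agents and rules the
    # active profile selects; the reasoning logic never changes.
    descriptions = "\n".join(
        f"- {registry[name]['label']}: {registry[name]['description']}"
        for name in profile["agents"] if name in registry
    )
    rules = "\n".join(str(rule) for rule in profile["routing_rules"])
    return ORCHESTRATOR_TEMPLATE.substitute(
        profile_name=profile["profile_id"],
        agent_descriptions=descriptions,
        routing_rules=rules,
    )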

5.4 The Zero-Code Onboarding Process

To add a new business domain, an engineer performs these steps, none of which require modifying any application code:

  • Create a data agent on the enterprise data platform, pointing it at the relevant dataset.
  • Add the agent definition to the agent registry with its data platform identifiers.
  • Create a grounding context file describing the data schema the agent can access.
  • Create a prompt template defining the agent's behavior and output format.
  • Create a profile JSON file referencing the new agents and defining routing rules.
  • Deploy the updated configuration files. No infrastructure re-provisioning is needed.

What changes when adding a new domain:

+-- agent_registry.json            (add agent entries)    CONFIG ONLY
+-- profiles/new_domain.json       (new profile)          CONFIG ONLY
+-- grounding/context_agent_N.md   (schema docs)          MARKDOWN ONLY
+-- prompts/domain_agent_N.md      (behavior)             MARKDOWN ONLY

What does NOT change:

+-- Backend orchestration code                            ZERO CHANGES
+-- Frontend application code                             ZERO CHANGES
+-- Infrastructure / deployment scripts                   ZERO CHANGES
+-- System agent definitions (orchestrator, etc.)         ZERO CHANGES

Figure 3. Adding a new business domain requires only configuration files and markdown templates. All application code remains unchanged.

6. Dual-Platform Integration

The architecture integrates two distinct cloud platforms: a cloud AI platform for reasoning and memory, and an enterprise data platform for structured data access. This separation is intentional and provides several benefits.

6.1 Why Two Platforms?

  • Data agents are managed by data teams using the data platform's own tools, with no dependency on the AI application.
  • Reasoning agents benefit from the AI platform's built-in capabilities: persistent memory, web search, code execution.
  • Data agents and reasoning capacity can scale independently based on demand.
  • Schema changes in enterprise datasets are absorbed by the data agent's natural language interface without any code changes.

6.2 Identity-Preserving Authentication

Enterprise data access must respect user-level permissions. When a manager queries the system, they should see data appropriate to their role; a different user with different permissions should see different results from the same query.

The architecture achieves this through identity passthrough: the user authenticates in the browser, the backend exchanges the user's token for a data-platform-scoped token (preserving the user's identity), and the data agent executes the query under the user's own permissions. This means the AI system inherits the same data governance controls that already exist in the enterprise data platform: row-level security, column-level masking, and access policies all apply automatically.

When identity passthrough is not available (for example, if administrative consent has not been granted), the system gracefully degrades to a service identity. Data agents remain functional but user-specific access controls are not enforced. The system detects this condition automatically and can notify administrators.
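One concrete way to implement this pattern on Microsoft's identity stack is the on-behalf-of flow via the MSAL library. The sketch below uses placeholder identifiers and scopes, and the fallback branch is an assumption about how the degradation described above could be wired:

import msal

app = msal.ConfidentialClientApplication(
    client_id="<app-client-id>",                           # placeholders
    client_credential="<app-secret-or-certificate>",
    authority="https://login.microsoftonline.com/<tenant-id>",
)

SCOPES = ["<data-platform-resource>/.default"]             # placeholder scope

def data_platform_token(user_access_token):
    # On-behalf-of: exchange the user's token for a data-platform-scoped
    # token that still carries the user's identity.
    result = app.acquire_token_on_behalf_of(
        user_assertion=user_access_token, scopes=SCOPES)
    if "access_token" in result:
        return result["access_token"], True                # user ACLs enforced
    # Graceful degradation: fall back to the service identity.
    fallback = app.acquire_token_for_client(scopes=SCOPES)
    return fallback["access_token"], False                 # service-level access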

6.3 Data Agent Invocation Pattern

Each data agent exposes a conversational API similar to popular AI chat interfaces. The orchestration backend creates a conversation thread, sends a message containing the question and grounding context, starts a processing run, polls for completion, and retrieves the response. This interaction pattern is stateless from the backend's perspective: each query creates a fresh thread to avoid context contamination between unrelated questions.

The system supports data agents spanning multiple workspaces on the data platform. Each agent in the registry can specify its own workspace identifier, allowing a single system to query data across organizational boundaries, for example sourcing data in one workspace and compliance data in another.
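A sketch of the thread-message-run-poll cycle using plain HTTP calls. The endpoint paths and response fields below are hypothetical placeholders; only the interaction pattern (fresh thread per query, polling until a terminal state) reflects the design described here.

import time
import requests

def query_data_agent(base_url, token, message, poll_interval=2, timeout=120):
    headers = {"Authorization": f"Bearer {token}"}
    # Fresh thread per query avoids context contamination between questions.
    thread = requests.post(f"{base_url}/threads", headers=headers).json()
    requests.post(f"{base_url}/threads/{thread['id']}/messages",
                  headers=headers, json={"content": message})
    run = requests.post(f"{base_url}/threads/{thread['id']}/runs",
                        headers=headers).json()
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = requests.get(
            f"{base_url}/threads/{thread['id']}/runs/{run['id']}",
            headers=headers).json()
        if status["state"] in ("completed", "failed"):
            break
        time.sleep(poll_interval)
    messages = requests.get(f"{base_url}/threads/{thread['id']}/messages",
                            headers=headers).json()
    return messages[-1]["content"]          # the agent's structured evidence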

7. Persistent Memory Architecture and Profile Isolation

An effective enterprise reasoning system must learn from past interactions. If a user corrects the system's interpretation of a metric, that correction should persist across sessions. If the system discovers a recurring pattern in the data, it should remember that pattern for future queries. Without persistent memory, every conversation starts from scratch, and the system can never move beyond a novice's understanding of the domain.

7.1 Shared Memory Across Reasoning Agents

The architecture uses a shared memory store provided by the AI platform. All reasoning agents (the orchestrator, synthesizer, and evaluator) read from and write to the same memory. This shared access enables powerful cross-agent learning:

  • The orchestrator stores planning patterns that worked well for specific query types, so future similar queries benefit from improved hypothesis generation.
  • The synthesizer records analytical insights and user corrections, so future syntheses incorporate domain-specific knowledge accumulated over time.
  • The evaluator recalls quality thresholds and common failure modes from previous evaluations, improving its ability to catch subtle errors.
  • Post-pipeline consolidation writes a summary of each completed reasoning cycle to memory, creating a searchable history of past analyses.

7.2 What Gets Remembered

The memory store captures several categories of information:

  • User corrections: When a user says "That interpretation of cost variance is wrong; it should be forecast minus actual," the system stores this correction and applies it to all future queries involving that metric.
  • Domain patterns: Recurring data patterns (e.g., "Budget data for Q4 is typically finalized two weeks after quarter-end") are stored and surfaced when relevant.
  • Routing preferences: If the orchestrator discovers that a particular data agent consistently provides better answers for certain query types, that preference is remembered.
  • Evaluation feedback: When the evaluator triggers re-planning, the feedback and resolution are stored so the system can avoid the same mistake in the future.

7.3 Profile-Scoped Isolation

A shared memory store creates a contamination risk when multiple business domains use the same system. An insight from the sourcing domain should not influence compliance domain responses, and vice versa. A cost-variance correction specific to procurement budgets should not alter how the system interprets cost variance in a supply chain audit context.

Rather than maintaining separate memory stores per domain (which increases infrastructure cost and complexity), the architecture achieves isolation through prompt-based tagging. Every memory-touching step in the pipeline includes an instruction to tag stored memories with the active domain profile and to retrieve only memories matching the current profile.

This tagging is enforced consistently across six touchpoints in the pipeline:

  • Orchestrator planning: memories are recalled and stored with the active profile tag.
  • Synthesizer analysis: domain insights are tagged before storage.
  • Evaluator feedback: evaluation patterns are profile-scoped.
  • Post-pipeline consolidation: the reasoning summary is tagged with the profile.
  • User feedback storage: corrections are scoped to the domain where they were made.
  • Memory recall: all retrieval queries filter by the active profile tag.

The result is effective logical isolation within a single physical memory store. Each domain accumulates its own knowledge base over time, and switching between domains feels like switching between two separately trained assistants, even though they share the same underlying infrastructure.
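Because isolation is prompt-based, the enforcement mechanism is simply a tagging instruction appended identically at every memory-touching step. A sketch (the instruction wording is illustrative, not the production prompt):

PROFILE_MEMORY_INSTRUCTION = (
    "Memory discipline for this step:\n"
    "- Tag every memory you store with profile='{profile}'.\n"
    "- Recall ONLY memories tagged profile='{profile}'.\n"
    "- Never surface or reuse memories tagged with any other profile."
)

def with_memory_scope(step_prompt, active_profile):
    # Appended identically at all six touchpoints: planning, synthesis,
    # evaluation, consolidation, feedback storage, and recall.
    return step_prompt + "\n\n" + PROFILE_MEMORY_INSTRUCTION.format(
        profile=active_profile)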

7.4 Memory-Informed Reasoning in Practice

In practice, persistent memory transforms the system from a stateless query engine into a learning analytical partner. Consider a user who regularly queries cost trends for a specific product line. Over time, the system remembers:

  • Which data agents provide the most relevant data for that product line.
  • How the user prefers the data presented (tables vs. narrative, granularity level).
  • Domain-specific terminology and conventions the user has corrected in the past.
  • Seasonal patterns the system has identified in previous analyses.

Each subsequent query benefits from this accumulated context, resulting in faster, more accurate, and more relevant responses. The system effectively becomes more expert in each domain the more it is used, a qualitative improvement over stateless architectures that treat every interaction as if it were the first.

8. Real-Time Transparency and User Experience

Enterprise reasoning queries can take 15 to 50 seconds depending on complexity: planning takes a few seconds, each data agent call takes 5-15 seconds, synthesis and evaluation add more time, and re-planning multiplies the total. Without feedback, users perceive the system as broken or unresponsive.

The architecture addresses this through real-time streaming. As the pipeline executes, it emits a continuous stream of progress events to the user interface. Users see each step begin and complete: "Planning hypotheses..." then "Querying Budget Agent..." then "Querying Pricing Agent..." then "Synthesizing evidence..." then "Evaluating quality..."

This transparency serves two purposes. First, it reduces perceived latency by giving users a sense of progress. Second, it builds trust by making the reasoning process visible. Users can see that the system consulted multiple data sources, tested multiple hypotheses, and validated the answer before presenting it. This visibility directly addresses the trust gap identified in Section 2.
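A sketch of the progress stream using server-sent-events framing. The step names mirror the pipeline, while the helper functions (plan_step, gather_step, and so on) and the payload fields are assumptions for illustration:

import json

def sse_event(step, status, detail=""):
    # One progress update, framed as a server-sent event.
    payload = {"step": step, "status": status, "detail": detail}
    return f"data: {json.dumps(payload)}\n\n"

async def stream_pipeline(query, profile, send):
    # plan_step, gather_step, synthesize_step, evaluate_step are assumed
    # async pipeline functions; send() writes one event to the client.
    await send(sse_event("plan", "started"))
    plan = await plan_step(query, profile)
    await send(sse_event("plan", "done",
                         f"{len(plan['hypotheses'])} hypotheses"))
    for q in plan["questions"]:
        await send(sse_event("gather", "started", q["agent"]))
    evidence = await gather_step(plan)
    await send(sse_event("synthesize", "started"))
    draft = await synthesize_step(plan, evidence)
    await send(sse_event("evaluate", "started"))
    verdict = await evaluate_step(draft, evidence)
    await send(sse_event("deliver", "done",
                         f"confidence={verdict['score']:.2f}"))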

8.1 Profile-Driven Interface Customization

The user interface adapts to the active domain profile without code changes. Each profile defines its own branding (name, badge, welcome text), its own icon, and its own suggested questions. When a user switches domains via a dropdown, the interface updates instantly: new branding, new suggestions, new data agents in the sidebar. This is driven entirely by the profile's JSON configuration.

9. Deployment Architecture

The system is deployed as a containerized application on a cloud application hosting service. The infrastructure is defined as code (Infrastructure as Code) and includes: a container registry for the application image, a Linux-based application host with WebSocket support for real-time streaming, a system-assigned managed identity for secure service-to-service authentication, and an identity validation sidecar that handles authentication at the network level.

An important deployment distinction exists between system agent changes and domain agent changes. System agents (orchestrator, synthesizer, evaluator) are hosted on the AI platform and require a provisioning step when their prompt templates change. Domain agent changes (adding, modifying, or removing data agents) are purely configuration changes. Updated JSON and markdown files are deployed with the application; no AI platform re-provisioning is needed because domain agents query the data platform directly.

10. Evaluation and Results

10.1 Domain Onboarding Velocity

The primary evaluation criterion is domain onboarding velocity: how quickly can a new business domain be added to the system? We measured this during the onboarding of a compliance governance domain.

Task                                        Duration   Artifacts Created
Create 5 data agents on the data platform   ~30 min    5 data agent artifacts
Add 5 entries to the agent registry         ~10 min    Agent registry (5 new entries)
Create 5 grounding context files            ~10 min    5 markdown schema files
Create 5 domain prompt templates            ~5 min     5 markdown behavior files
Create the domain profile                   ~5 min     1 profile JSON file
Deploy and validate                         ~15 min    Application deployment
Total                                       ~75 min    0 lines of code changed

Table 2. Compliance domain onboarding timeline. The entire process required only configuration files and markdown templates.

10.2 Reasoning Quality

The five-step pipeline with iterative evaluation measurably improves answer quality compared to single-pass generation. In production usage, approximately 15-20% of queries trigger at least one re-planning iteration. The most common cause is incomplete evidence gathering: the evaluator identifies that an additional data agent should have been consulted and triggers re-planning with that specific feedback.

The quality gate's weighted scoring provides a quantitative, auditable record of every evaluation decision. This auditability is particularly valuable in regulated industries where decisions must be explainable and defensible.

10.3 Latency Profile

End-to-end latency depends on query complexity:

  • Simple queries (1 data agent, no re-plan): 8-15 seconds
  • Moderate queries (2-3 agents, no re-plan): 15-25 seconds
  • Complex queries (3-5 agents, 1 re-plan): 30-50 seconds

The real-time streaming architecture mitigates perceived latency: users see each agent start, complete, and contribute evidence as it happens, maintaining engagement during multi-second reasoning cycles.

10.4 Cross-Domain Reasoning

The system's distinguishing capability is cross-domain reasoning within a profile. A single user query can trigger parallel evidence gathering from three to five data agents simultaneously. The synthesizer then correlates findings across all sources, applying temporal, confounding-variable, and counterfactual checks to distinguish genuine causes from coincidental correlations.

10.5 Memory Impact

Systems with persistent memory enabled show measurable improvement in response quality over time. After approximately 50 interactions within a domain, the orchestrator's agent routing accuracy improves as it recalls which data agents were most useful for specific query patterns. The synthesizer produces more contextually appropriate analyses by incorporating stored corrections and domain conventions. The evaluator benefits from accumulated quality patterns, reducing false-positive INSUFFICIENT verdicts on subsequent similar queries.

11. Lessons Learned

Building and deploying this system in a production enterprise environment yielded several insights that may benefit teams building similar architectures.

1. Let the backend control the loop, not the language model.

Early prototypes used the orchestrator language model to drive the entire reasoning loop, including dispatching agents and deciding when to stop. This approach was unreliable: the model would sometimes skip the evaluation step, dispatch agents outside the current profile, or enter infinite re-planning loops. Moving loop control to deterministic backend code, where the model produces plans but the code controls execution, dramatically improved reliability and predictability.

2. Agent descriptions matter more than sophisticated routing algorithms.

We experimented with embedding-based routing, tag-matching algorithms, and classifier-based dispatch. In every case, the orchestrator's routing quality depended more on the richness of agent descriptions than on the routing mechanism. Descriptions that list specific data fields, supported query types, and explicit exclusions consistently outperform vague capability summaries.

3. Memory contamination is a prompt engineering problem, not an infrastructure problem.

Rather than provisioning separate memory stores per domain, we achieved effective isolation through consistent tagging instructions across all pipeline steps. This approach is simpler and less expensive than infrastructure-level separation, though it depends on the language model's reliability in following tagging instructions.

4. Graceful degradation is essential for enterprise adoption.

Enterprise environments have complex authentication requirements that may not be fully configured at launch. The system's ability to detect missing permissions at runtime and fall back to service-level identity, keeping data agents functional even when user-level identity passthrough is unavailable, was critical for early adoption. Users could start using the system immediately while administrators completed the permissions setup.

5. Real-time streaming converts a liability into an asset.

A 30-second response time is unacceptable for a traditional chatbot. But when users can watch the system plan hypotheses, query data sources, and synthesize evidence in real time, the same 30 seconds becomes an asset: it demonstrates thoroughness and builds confidence in the answer. Streaming transforms latency from a negative user experience into evidence of rigorous analysis.

6. Persistent memory requires disciplined scope management.

The most subtle challenge with shared memory is not storage or retrieval; it is scope. Without rigorous profile tagging at every memory touchpoint, insights from one domain inevitably leak into another. We learned that memory isolation must be treated as a first-class architectural concern, not an afterthought, and that every pipeline step that touches memory must enforce the same tagging discipline.

12. Future Work

  • Automated agent description generation from data platform metadata, removing the last manual step in domain onboarding.
  • Cross-profile reasoning that allows a single query to span multiple business domains (e.g., connecting cost data to compliance risk).
  • Quantitative A/B evaluation framework for systematically comparing pipeline variations such as different evaluation thresholds and hypothesis strategies.
  • Enterprise context integration (calendar, email, organizational data) to enable queries informed by the user's schedule and role.
  • Federated memory with explicit profile partitioning for stronger isolation guarantees in regulated environments.
  • Memory lifecycle management: automated archival, relevance scoring, and expiration of stale memories to maintain retrieval quality over time.

13. Conclusion

Enterprise AI does not fail because models lack capability. It fails because architectures do not match how decisions actually work. Decisions require reasoning across multiple data sources, weighing competing explanations, and validating conclusions, not generating a single answer from a single source.

The architecture presented in this paper addresses this gap through four design principles. First, hypothesis-driven reasoning: the system does not generate answers in a single pass but formulates hypotheses, gathers evidence, and validates quality through a five-step pipeline with iterative re-planning. Second, config-driven scaling: new business domains are added through JSON and markdown files, not code, enabling domain experts to extend the system without engineering dependencies. Third, persistent memory: the system learns from past interactions, accumulating domain expertise and user preferences across sessions while maintaining strict isolation between business domains. Fourth, transparent reasoning: every step of the pipeline is streamed to the user in real time, building the trust required for enterprise adoption.

Production deployment across two business domains validates the scaling hypothesis: a new five-agent domain was onboarded in under 75 minutes with zero code changes. The architecture shifts the bottleneck from engineering capacity to domain configuration, a task achievable by domain experts who understand the data, not platform engineers who understand the code.

The next frontier is not more automation. It is intentional decision system design: architectures that strengthen human judgment, preserve accountability, and earn trust through transparency. Multi-agent reasoning, governed by configuration, enriched by persistent memory, and validated by quality gates, offers a practical path forward.

  8. Wu, Q., Bansal, G., Zhang, J., et al. (2023). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv. https://arxiv.org/abs/2308.08155