Agentic AI for Process Capability

The problem

Three engineers. Three interpretations. One process.

North River Diagnostics manufactures diagnostic components with tight performance specs. Their engineers spent most of their analysis time on low-value work — extracting data, cleaning spreadsheets, running calculations — before they could think about what the numbers actually meant. And when they did interpret, they often disagreed.

⏱️

Slow response time

Capability and behavior assessments took 1.5–2.5 hours per process. Corrective action was delayed by days while engineers worked through the backlog manually.

🔀

Inconsistent interpretation

The same Cpk value was described as "capable enough," "marginal," and "not capable at all" by different engineers. Leadership received conflicting signals on the same process.

🕳️

Limited coverage

Only the most visible problem processes got regular reviews. Entire production lines ran with undetected risk because there weren't enough engineers to cover them.

⚠️

The core analytical failure

Capability indices like Cpk are only valid when the process is statistically stable. An unstable process makes Cpk meaningless — yet engineers routinely calculated and acted on Cpk without checking stability first. This single failure directed improvement resources at the wrong problem: firefighting variation instead of redesigning capability.

The workflow

Eight nodes. One decision pipeline.

Built in Azure AI Foundry Prompt Flow as a staged workflow in which each node has a single, clear responsibility and only the two analysis nodes run in parallel. Stability is assessed before capability is interpreted. Every time, without exception.

Ingestion

1

data_access · Python

Cloud data retrieval

Connects to Azure Blob Storage and retrieves the process CSV. Decouples data access from analysis logic — the workflow pulls programmatically, mirroring real production systems.
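
A minimal sketch of what a retrieval node like this could look like, assuming the azure-storage-blob SDK and a UTF-8 CSV with a header row. The function names, the connection-string approach, and the sample column names are illustrative assumptions, not details taken from the actual node.

```python
from __future__ import annotations

import csv
import io


def fetch_blob_bytes(conn_str: str, container: str, blob_name: str) -> bytes:
    """Download one blob's raw bytes (needs the azure-storage-blob package)."""
    # Imported lazily so the parsing helper below works without the Azure SDK.
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string(conn_str)
    blob = service.get_blob_client(container=container, blob=blob_name)
    return blob.download_blob().readall()


def parse_process_csv(raw: bytes) -> list[dict[str, str]]:
    """Decode CSV bytes into a list of row dictionaries keyed by header."""
    text = raw.decode("utf-8-sig")  # tolerate a leading byte-order mark
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical bytes standing in for a downloaded blob (assumed column names).
sample = b"sample,delivery_time\n1,48.2\n2,51.7\n"
rows = parse_process_csv(sample)
```

Keeping the download and the parsing in separate functions means the parsing step can be exercised without cloud credentials.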

2

process_data_formatter · LLM (GPT-4o)

Intelligent preprocessing

Uses an LLM with a Jinja2 prompt template to normalize raw CSV data — identifying measurement columns, aligning with spec limits (LSL, USL), and outputting a consistent data object for all downstream nodes.

Parallel analysis

3a

cap_metrics · Python + MCP Agent

Capability metrics calculation

On-the-fly agent with a Process Capability MCP tool. Computes Cp, Cpk, Pp, Ppk deterministically — not via LLM estimation. Results are auditable and identical across runs.

MCP · deterministic
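
For readers who want the arithmetic behind the four indices, here is a hedged sketch. It assumes individual measurements in time order, estimates within-process sigma as the mean moving range divided by 1.128 (the usual d2 constant for a span of two), and uses the sample standard deviation for overall sigma. The source does not say which estimator the MCP tool uses, so treat these choices, and the function name, as assumptions.

```python
from statistics import mean, stdev

D2_FOR_N2 = 1.128  # standard constant relating the mean moving range (span 2) to sigma


def capability_indices(values, lsl, usl):
    """Compute Cp, Cpk, Pp, Ppk for individual measurements in time order."""
    x_bar = mean(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    sigma_within = mean(moving_ranges) / D2_FOR_N2  # short-term estimate (assumed)
    sigma_overall = stdev(values)                   # long-term sample std dev

    def potential(sigma):
        # Spec width relative to six sigma; ignores centering.
        return (usl - lsl) / (6 * sigma)

    def actual(sigma):
        # Distance from the mean to the nearer spec limit, in units of three sigma.
        return min(usl - x_bar, x_bar - lsl) / (3 * sigma)

    return {
        "mean": x_bar,
        "sigma_within": sigma_within,
        "sigma_overall": sigma_overall,
        "Cp": potential(sigma_within),
        "Cpk": actual(sigma_within),
        "Pp": potential(sigma_overall),
        "Ppk": actual(sigma_overall),
    }


# Hypothetical readings, not data from the case study.
readings = [48, 52, 47, 53, 50, 49, 51, 50]
result = capability_indices(readings, lsl=35, usl=65)
```

Because every step is plain arithmetic, the same input always yields the same output, which is the property the node relies on.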

3b

chart_creator · Python + MCP Agent

SPC chart generation

On-the-fly agent with a chart-generation MCP tool. Produces an Individuals chart and Moving Range chart showing process behavior over time. Outputs a URL for downstream visual interpretation.

MCP · visual artifact

↑ Nodes 3a and 3b run in parallel from the formatter output
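
The Individuals and Moving Range charts rest on a standard control-limit calculation. The sketch below uses the conventional constants for a moving range of span two (2.66 for the Individuals chart, 3.267 for the MR upper limit). Whether the chart-generation tool computes its limits exactly this way is an assumption, and the function name is hypothetical.

```python
from statistics import mean


def imr_limits(values):
    """Return center lines and control limits for an I-MR chart pair."""
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    x_bar = mean(values)
    mr_bar = mean(moving_ranges)
    return {
        "i_center": x_bar,
        "i_ucl": x_bar + 2.66 * mr_bar,   # 2.66 = 3 / d2, with d2 = 1.128
        "i_lcl": x_bar - 2.66 * mr_bar,
        "mr_center": mr_bar,
        "mr_ucl": 3.267 * mr_bar,         # D4 constant for span-2 ranges
        "mr_lcl": 0.0,                    # D3 is zero for span-2 ranges
    }


# Hypothetical readings, not data from the case study.
limits = imr_limits([10, 12, 11, 13, 12])
```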

Interpretation

4

process_behavior · Python + Multimodal Agent

Visual stability assessment

A vision-capable GPT-4o agent "looks at" the SPC chart image and applies process behavior rules — detecting points beyond control limits, runs, trends, and cycles. Produces a clear stability verdict before any capability interpretation begins.

Multimodal · vision reasoning
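
This node works from the chart image, not from code like the following. Still, two of the rules it applies can be stated numerically, which may be useful as an independent cross-check of the agent's verdict. The sketch assumes a run is eight or more consecutive points on one side of the center line (a common convention; the source does not state the run length), and both function names are hypothetical.

```python
def points_beyond_limits(values, lcl, ucl):
    """Return indices of points outside the control limits."""
    return [i for i, v in enumerate(values) if v < lcl or v > ucl]


def long_runs(values, center, min_length=8):
    """Return (start, end) index pairs for runs of at least min_length
    consecutive points strictly on one side of the center line."""
    runs = []
    start = None
    side = 0  # +1 above the center line, -1 below, 0 on the line or not started
    for i, v in enumerate(values):
        current = (v > center) - (v < center)
        if current != side or current == 0:
            # The previous run ends here; keep it if it was long enough.
            if side != 0 and i - start >= min_length:
                runs.append((start, i - 1))
            start, side = i, current
    if side != 0 and len(values) - start >= min_length:
        runs.append((start, len(values) - 1))
    return runs


# Hypothetical series: eight points above 50, one below, one far above.
series = [51, 52, 51, 53, 52, 51, 54, 52, 49, 70]
beyond = points_beyond_limits(series, lcl=40, ucl=60)
runs = long_runs(series, center=50)
```

A disagreement between a check like this and the vision agent's verdict would be a natural trigger for engineer review.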

5

aggregator · Python → Persistent RAG Agent (Foundry)

Standards-grounded synthesis

Calls a persistent Aggregator Agent hosted in Azure AI Foundry with a RAG knowledge base built from NRD's internal capability guidelines. Synthesizes capability results and stability findings into a leadership-ready narrative whose thresholds are drawn from the internal standards rather than from the model's priors.

RAG · governed interpretation

Output

6

report_generator + report_writer · Python

Structured report packaging

Transforms the aggregated narrative into a structured HTML report, publishable to Azure Blob Storage. Separates analysis logic from presentation — the same report format regardless of which process or dataset is analyzed.
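
One way to keep presentation separate from analysis is a single rendering function that accepts already-computed values and returns HTML. The sketch below is an assumption about structure, not the actual report_generator code. The function name, the page layout, and the narrative string are hypothetical; the metric values are the Scenario A figures reported later on this page.

```python
from html import escape


def render_report(title, verdict, metrics, narrative):
    """Return a standalone HTML page; all dynamic text is escaped."""
    rows = "\n".join(
        f"      <tr><th>{escape(str(name))}</th><td>{escape(str(value))}</td></tr>"
        for name, value in metrics.items()
    )
    return f"""<!DOCTYPE html>
<html lang="en">
  <head><meta charset="utf-8"><title>{escape(title)}</title></head>
  <body>
    <h1>{escape(title)}</h1>
    <p class="verdict">{escape(verdict)}</p>
    <table>
{rows}
    </table>
    <p>{escape(narrative)}</p>
  </body>
</html>
"""


report_html = render_report(
    title="Hot Metal Delivery Times",
    verdict="Statistically stable but incapable",
    metrics={"Cp": 0.441, "Cpk": 0.396, "OOC points (I-chart)": 0},
    narrative="Cpk < 1 against LSL 35 / USL 65; engineer review required.",
)
```

Escaping every dynamic value matters here because the narrative comes from a language model and may contain characters such as < that would otherwise break the markup.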

Agent patterns

Three patterns. Each chosen deliberately.

The system uses three agent patterns, and the design matches each one to the task it handles best.

Pattern 1 · On-the-fly MCP

Deterministic calculation agents

Used for capability metrics (Cp, Cpk, Pp, Ppk) and chart generation. Created per-run, attached to a specific MCP tool, discarded after use. Reliable, auditable, identical output every time.

Why here: Numerical calculations must be deterministic. LLMs should never estimate statistics that have a right answer.

Pattern 2 · Multimodal on-the-fly

Visual stability interpreter

Uses GPT-4o's vision capability to analyze the SPC chart image directly. Detects out-of-control signals, runs, and trends the way a trained engineer would — from the visual pattern, not just numeric thresholds.

Why here: Stability assessment is inherently visual. Encoding all chart rules as numeric logic would be brittle and miss pattern-level signals.

Pattern 3 · Persistent RAG agent

Governed corporate expert

A persistent agent hosted in Azure AI Foundry with a knowledge base built from NRD's internal capability PDFs. Produces consistent, standards-aligned narrative across every run. Updating the RAG documents updates the "corporate standard" without touching the workflow.

Why here: Interpretation must reflect internal standards — not LLM priors. RAG is the mechanism that makes outputs governable and auditable.

Sample outputs

Two scenarios. Real process data. Opposite conclusions.

Both datasets are Hot Metal Delivery Times — 117 observations each, spec limits LSL 35 / USL 65. The workflow ran both and produced structurally identical reports with genuinely different findings.

Scenario A · Stable but incapable

Hot Metal Delivery Times — In Control

The process is predictable and stable — no points beyond control limits. But the variation is far too wide for the specification window. The correct action is process redesign, not firefighting. The agent made this distinction clearly and explicitly.

n observations: 117

process mean: 48.46 (target: 50)

within σ: 11.33

Cp: 0.441 (incapable)

Cpk: 0.396 (incapable)

Pp / Ppk: 0.411 / 0.369

OOC points (I-chart): 0 (stable)

MR violations: 1 (samples ~95–120)

Agent verdict: Statistically stable but incapable. Cpk valid to interpret. Root cause is variation width and centering, not instability. Action: redesign, not corrective action.

Scenario B · Unstable and incapable

Hot Metal Delivery Times — Out of Control

Five points beyond control limits on the Individuals chart. Five violations on the MR chart. The mean has shifted to 64.25 — near the upper spec limit. The agent correctly refused to anchor recommendations on Cpk and redirected focus to finding and eliminating assignable causes first.

n observations: 117

process mean: 64.25 (near USL 65)

within σ: 13.97

Cp: 0.358 (incapable)

Cpk: 0.018 (near zero)

Pp / Ppk: 0.270 / 0.014

OOC points (I-chart): 5 (unstable)

MR violations: 5 (high instability)

Agent verdict: Statistically unstable. Cpk of 0.018 is not reliable — do not use for decisions. Action: eliminate assignable causes before any capability interpretation.
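
The reported Cp and Cpk values in both scenarios can be reproduced from the summary figures alone, using the standard definitions Cp = (USL - LSL) / 6σ and Cpk = min(USL - mean, mean - LSL) / 3σ. The short check below is a sketch with a hypothetical helper name; its inputs are the means, within σ values, and spec limits quoted above.

```python
def indices_from_summary(mean, sigma_within, lsl, usl):
    """Recompute Cp and Cpk from summary statistics, rounded to 3 decimals."""
    cp = (usl - lsl) / (6 * sigma_within)
    cpk = min(usl - mean, mean - lsl) / (3 * sigma_within)
    return round(cp, 3), round(cpk, 3)


scenario_a = indices_from_summary(48.46, 11.33, lsl=35, usl=65)
scenario_b = indices_from_summary(64.25, 13.97, lsl=35, usl=65)

print(scenario_a)  # (0.441, 0.396)
print(scenario_b)  # (0.358, 0.018)
```

In Scenario B the mean sits only 0.75 below the upper limit, which is why Cpk collapses toward zero even though Cp differs only modestly between the two scenarios.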

Design principles

Four rules that kept the system trustworthy.

🔒

Stability gating

The workflow enforces a hard rule: stability is always assessed before capability is interpreted. The process_behavior node runs before the aggregator receives any results. This cannot be skipped, regardless of how clean the data looks.
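
The gating rule can be expressed as a small function that decides how Cpk may be used once a stability verdict exists. This is a sketch of the idea only. In the workflow the gate comes from node ordering, and the class, function, and field names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StabilityVerdict:
    stable: bool
    ooc_points: int  # points beyond control limits on the Individuals chart


def gate_capability(verdict, cpk):
    """Attach a validity flag and next action to Cpk based on stability."""
    if not verdict.stable:
        # Cpk is still reported, but flagged as unfit for decisions.
        return {
            "cpk": cpk,
            "cpk_valid": False,
            "action": "eliminate assignable causes before interpreting capability",
        }
    return {
        "cpk": cpk,
        "cpk_valid": True,
        "action": "interpret capability against internal guidelines",
    }


# Values taken from the two sample scenarios on this page.
stable_case = gate_capability(StabilityVerdict(stable=True, ooc_points=0), cpk=0.396)
unstable_case = gate_capability(StabilityVerdict(stable=False, ooc_points=5), cpk=0.018)
```

Because the function cannot be called without a verdict, capability interpretation has no code path that bypasses the stability check.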

📖

RAG-grounded interpretation

The Aggregator Agent references NRD's internal capability guidelines — not LLM priors — for every threshold it cites. Updating the guidance PDFs updates the "corporate standard" without touching a single line of workflow code.

⚙️

Deterministic vs. AI reasoning split

Numbers that have a right answer (Cp, Cpk, control limits) are computed in Python via MCP tools. Language models handle only what they're good at: normalizing ambiguous data, interpreting visual patterns, and synthesizing narrative. Never estimating statistics.

👤

Human-in-the-loop by design

Every output is explicitly framed as a decision-support tool. The Aggregator always qualifies its conclusions with uncertainty language and requires engineer sign-off. The system never issues a recommendation without making the human's role in reviewing it explicit.

Personal reflection

Behind the build

The most important design decision was where to put the intelligence. I kept numerical calculations and chart generation in Python and MCP tools — they need to be deterministic and auditable. I used LLMs where flexibility genuinely adds value: data normalization, visual pattern interpretation, and narrative synthesis. That separation reduced hallucination risk while keeping the benefits of AI-driven reasoning where they actually belong.

The three agent patterns behaved very differently in practice. The on-the-fly MCP agents were the most predictable — reusable utilities for standardized analytics. The multimodal agent was effective at interpreting chart patterns but sensitive to prompt wording. The persistent RAG-backed Aggregator was where agentic AI clearly outperformed traditional scripting: reconciling stability, capability, and recommended actions into a consistent, standards-aligned narrative that any engineer would recognize as correct.

The broader lesson: the first pilot I would propose in any new domain would focus on explanation rather than automation. This project showed that trust is built through clarity before control. An AI system that explains its reasoning well — and is honest about its limits — gets adopted. One that doesn't, doesn't.

Agentic AI isn't about replacing engineers. It's about fixing the right problems.
