Engineering · AI Agents · Production Systems

From Advisory to Autonomous: What It Actually Takes to Make AI Agents Safe in Manufacturing ERP

Shekhar Nirkhe — Co-Founder & CTO  ·  March 10, 2026  ·  12 min read

"Agentic AI" in manufacturing means the AI doesn't just surface a recommendation — it takes action. Reschedule a production run. Flag a PO for hold. Update a reorder threshold. Create a supplier non-conformance ticket. The whole pitch is that the system acts autonomously, without a human approving each step.

Every major ERP vendor is announcing this right now. SAP has Joule. Oracle has Fusion agents. Infor, Epicor, IFS — they're all using the same language: AI that perceives, decides, and executes inside a production ERP with minimal human input.

What nobody talks about in those announcements: what actually happens when an AI agent has write access to production data.

We've been building exactly this. Here's the engineering that doesn't make the press releases.

The Read-Write Gap

There's a category difference between an AI that reads ERP data and produces a recommendation, and an AI that writes back to ERP data and changes plant behavior. Most "agentic" demos sit on the read side — they pull inventory levels, scan a purchase order, surface a summary — and then post a nicely formatted recommendation to a Slack channel or dashboard. A human clicks approve.

That's hybrid advisory, not autonomous. It's useful, but it's not what the industry is promising.

True autonomous action means the agent routes around the "click approve" step. It writes directly: updates an ERP record, commits a schedule change, opens a corrective action ticket, adjusts a reorder parameter. The human is notified after, not before.

The engineering challenge shifts completely when you cross that line. On the read side, a wrong answer is embarrassing. On the write side, a wrong action is a production incident.

An AI agent with unconstrained write access to a manufacturing ERP is not dangerous because the model is untrustworthy in some abstract sense. It's dangerous because manufacturing systems have tight coupling between data and physical operations. A rescheduled work order triggers material picks, shifts labor, and moves equipment windows. A modified reorder point changes how much raw material gets purchased next week. Errors compound before anyone notices.

What We Built: The Action Layer

We built a layer that sits between the agent and the ERP. Every write the agent wants to make passes through it. The layer does four things:

  1. Intent verification — re-checks the agent's proposed action against the original user instruction before allowing execution
  2. Business rule validation — checks the action against domain constraints (inventory floors, supplier lead time windows, regulatory holds, approval thresholds)
  3. Scope enforcement — restricts the agent to tables and records in a pre-declared scope; ERP data outside that scope is always read-only
  4. Audit materialization — logs every write together with the agent's reasoning: not just what changed, but what the model said about why

The agent never calls an ERP API directly. It calls action primitives we control.

One design constraint worth being explicit about: the agent's output is treated as untrusted data. The model returns structured JSON that gets parsed into a typed ActionRequest — a schema with known fields, known types, known constraints. Only that parsed object reaches the gateway. Free-form model text never touches execution logic directly.

This matters because any content the agent reads — a supplier PDF, an incoming email, a quality report — could contain text crafted to manipulate the model's output. If that manipulation succeeds and changes what the model returns, it still has to fit through a schema that doesn't have a field for "reassign approval authority" or "suppress audit entry." The schema is the first defense. The rules layer behind it is the second.

Here's what an action primitive looks like in simplified form:

# Illustrative — not production code

# Agent has access to these action primitives only.
# Not direct DB access, not raw API calls.

class ActionGateway:
    def update_reorder_point(
        self,
        part_id: str,
        new_rop: float,
        reasoning: str,   # agent must supply its reasoning
    ) -> ActionResult:
        current = self.db.get_rop(part_id)
        proposed = ReorderPointChange(
            part_id=part_id,
            current=current,
            proposed=new_rop,
            # percent change vs. current (guard current == 0 in real code)
            delta_pct=(new_rop - current) / current * 100,
        )
        # Validate against business rules before any write
        verdict = self.rules.evaluate(proposed)
        if verdict.blocked:
            return ActionResult.blocked(reason=verdict.reason)
        if verdict.requires_confirmation:
            return ActionResult.pending(proposed, reason=verdict.reason)
        # If auto-approved: write, log with full reasoning
        self.db.set_rop(part_id, new_rop)
        self.audit.log(proposed, reasoning=reasoning, outcome="executed")
        return ActionResult.executed(proposed)

The agent doesn't get to choose whether its action is blocked, held for confirmation, or executed; the gateway decides. The agent gets back an ActionResult and decides what to do next. If blocked, it can try a different approach or surface the constraint to the user. If pending, it stops and waits.
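Before any of this runs, the gateway has to get from free-form model text to a typed request. Here is a minimal sketch of that parsing step using only the standard library; ActionRequest's fields, the ALLOWED_ACTIONS set, and parse_agent_output are illustrative names, not our production schema:

```python
import json
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class ActionRequest:
    """Typed schema for a parsed agent action. Unknown fields are rejected."""
    action_type: str
    part_id: str
    new_rop: float
    reasoning: str

ALLOWED_ACTIONS = {"update_reorder_point"}

def parse_agent_output(raw: str) -> ActionRequest:
    """Parse model output into a typed request, or raise ValueError."""
    data = json.loads(raw)  # malformed JSON fails here, before any logic runs
    expected = {f.name for f in fields(ActionRequest)}
    if set(data) != expected:
        # extra or missing fields: there is no field for "suppress audit entry"
        raise ValueError(f"unexpected fields: {set(data) ^ expected}")
    if data["action_type"] not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {data['action_type']}")
    if not isinstance(data["new_rop"], (int, float)) or data["new_rop"] < 0:
        raise ValueError("new_rop must be a non-negative number")
    return ActionRequest(
        action_type=data["action_type"],
        part_id=str(data["part_id"]),
        new_rop=float(data["new_rop"]),
        reasoning=str(data["reasoning"]),
    )
```

The point of the sketch is the shape, not the specifics: only a successfully parsed ActionRequest ever reaches the gateway, so a manipulated model response has to survive a schema that simply has no vocabulary for out-of-scope actions.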

There's a timing issue the update_reorder_point sketch above glosses over. The agent made its decision based on ERP state it read seconds or minutes earlier. By the time the write reaches the gateway, something may have changed — a planner just manually cancelled the work order, a goods receipt posted that moved inventory above reorder threshold, a quality hold was placed on the supplier. The gateway re-fetches the records the action depends on at execution time and re-validates the preconditions before committing. If the world has moved, the action is rejected with the current state returned to the agent. Stale-read writes are one of the more insidious failure modes in any system that mixes human and automated operations on the same data.
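A minimal sketch of that execution-time re-validation, with a plain dict standing in for the ERP read/write path; Part, StaleReadError, and commit_rop_change are hypothetical names, and a real system would do this inside a transaction or with a row-version check:

```python
from dataclasses import dataclass

@dataclass
class Part:
    rop: float
    on_quality_hold: bool

class StaleReadError(Exception):
    """The world moved between the agent's read and the write."""

def commit_rop_change(db: dict, part_id: str,
                      expected_current: float, new_rop: float) -> float:
    """Re-fetch at execution time and re-validate preconditions before writing."""
    part = db[part_id]                  # fresh read at execution time
    if part.rop != expected_current:    # someone changed it since the agent read it
        raise StaleReadError(
            f"ROP is now {part.rop}, agent assumed {expected_current}")
    if part.on_quality_hold:            # precondition re-check
        raise StaleReadError("part went on quality hold since the agent's read")
    part.rop = new_rop
    return part.rop
```

The key property: the agent's stale snapshot is only ever used as an expectation to verify, never as the state to write against.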

Business Rule Validation: The Non-Obvious Part

Intent verification is the obvious thing — you check that the agent is doing what the user asked. Business rule validation is harder and more important.

Worth being direct about what this layer is: not an LLM. It's traditional imperative logic — Python conditions evaluating typed values, SQL queries against reference tables, date comparisons against calendar records. Using a second model to validate a first model's output is probabilistic checking on top of probabilistic generation. It's expensive, it's slow, and it can still be wrong. The business rules here are deterministic. Given a typed ReorderPointChange and the current part record, the verdict is the same every time, regardless of what the model was thinking when it proposed the action.

Manufacturing operations have constraints that aren't in any database schema. They live in operating agreements, procurement contracts, regulatory requirements, and plant floor knowledge. An AI agent has no visibility into these unless you explicitly encode them.

Some examples of rules we've had to encode appear in the rule definitions below.

These rules can't all be written ahead of time — they're customer-specific, change with regulatory audits and supplier agreements, and are sometimes contradictory. We ended up building a rule authoring interface that lets ERP admins define them in a structured format, so the gateway can evaluate them at runtime without a code deploy.

# Illustrative — business rule definition format

rule:
  id: min_rop_sole_source
  applies_to: update_reorder_point
  condition: part.supplier_count == 1
  constraint: |
    proposed_rop >= CEIL(
      part.avg_daily_usage * part.supplier_lead_days * 1.3
    )
  on_violation:
    action: block
    message: |
      {part_id} is sole-sourced. ROP cannot fall below
      {min_rop} (lead time * 1.3 safety factor).

rule:
  id: change_freeze_window
  applies_to: [update_reorder_point, create_purchase_order, reschedule_work_order]
  condition: current_date in plant.change_freeze_windows
  on_violation:
    action: block
    message: Plant is in change freeze until {freeze_end_date}.
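A sketch of how rules like these might be evaluated deterministically at runtime. The in-memory shape here (predicates as plain callables, a Verdict dataclass) is an assumption for illustration; the actual rules are authored in the declarative format above:

```python
import math
from dataclasses import dataclass

@dataclass
class Verdict:
    blocked: bool
    reason: str = ""

# Hypothetical runtime form of the declarative rules: each rule has a
# predicate saying when it applies and a constraint that must hold.
RULES = [
    {
        "id": "min_rop_sole_source",
        "applies": lambda part, proposed: part["supplier_count"] == 1,
        "ok": lambda part, proposed: proposed >= math.ceil(
            part["avg_daily_usage"] * part["supplier_lead_days"] * 1.3
        ),
        "message": "sole-sourced part: ROP below lead-time floor",
    },
]

def evaluate(part: dict, proposed_rop: float) -> Verdict:
    """Deterministic: the same inputs always produce the same verdict."""
    for rule in RULES:
        if rule["applies"](part, proposed_rop) and not rule["ok"](part, proposed_rop):
            return Verdict(blocked=True,
                           reason=f'{rule["id"]}: {rule["message"]}')
    return Verdict(blocked=False)
```

No model call anywhere in the path: given a typed proposal and the current part record, the verdict is pure arithmetic and comparison.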

The Three-Tier Approval Model

Not every action needs the same level of scrutiny. We settled on a three-tier model:

  1. Auto-execute — applies when the action is within pre-approved scope, below a value threshold, and passes all rules. Executes immediately, is logged, and the user is notified in a summary. Examples: reorder flag below $5K, updating a part attribute, creating a draft NCR.
  2. Confirm-first — applies when the action is valid but above a value threshold, or touches high-criticality records. Pauses, surfaces the proposed action and reasoning to a human approver, and executes on approval. Examples: PO creation over $50K, any scheduling change affecting >20 work orders.
  3. Blocked — applies when the action violates a business rule or is outside the agent's declared scope. Rejected; the agent receives a structured explanation and can surface it to the user or try a different path. Examples: modifying a part under regulatory hold, writing outside the declared table scope, triggering a supplier contract violation.

The thresholds and scope for each tier are configured per customer, per agent. An agent deployed for production scheduling gets a different scope than one deployed for procurement analytics. They share the same gateway; the rules are just different.

An important design decision we landed on: the agent doesn't get to negotiate tier assignment. If a proposed action lands in confirm-first, the agent can't argue its way into auto-execute. The human always gets the final call on actions above the threshold. This is intentional — it keeps the human authority boundary clear even as the agent's autonomy expands.
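A sketch of tier assignment as plain policy code, assuming a per-customer configuration with a value threshold and an approved-table set; CONFIG, assign_tier, and the specific thresholds are illustrative:

```python
from enum import Enum

class Tier(Enum):
    AUTO_EXECUTE = "auto-execute"
    CONFIRM_FIRST = "confirm-first"
    BLOCKED = "blocked"

# Hypothetical per-customer, per-agent configuration.
CONFIG = {
    "allowed_tables": {"reorder_points", "purchase_orders"},
    "confirm_value_threshold": 50_000,  # dollars
}

def assign_tier(table: str, value_usd: float, rules_passed: bool) -> Tier:
    """Tier assignment is policy, not negotiation: the agent never picks its tier."""
    if table not in CONFIG["allowed_tables"] or not rules_passed:
        return Tier.BLOCKED
    if value_usd > CONFIG["confirm_value_threshold"]:
        return Tier.CONFIRM_FIRST
    return Tier.AUTO_EXECUTE
```

Nothing in the function takes the agent's reasoning as an input, which is the point: no argument, however persuasive, changes which tier an action lands in.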

Audit Logs That Actually Carry Information

Most systems log what changed. We log why the model said it should change. That distinction matters for manufacturing more than it might in other domains, because questions like "who changed our reorder point for BRG-4401 last Tuesday and why" have real operational consequences — they surface in supplier audits, corrective action investigations, and regulatory reviews.

Every action the gateway executes records:

# Illustrative audit record structure

{
  "action_id": "act_8f2kp...",
  "timestamp": "2026-03-17T14:32:01Z",
  "agent_id": "procurement-agent-v4",
  "action_type": "update_reorder_point",
  "subject": { "part_id": "BRG-4401", "site": "SITE-03" },
  "change": {
    "field": "reorder_point",
    "before": 40,
    "after": 55,
    "unit": "units"
  },
  "model_reasoning": "Consumption over the last 90 days averaged 6.1 units/day,
    up from 4.4 units/day in the prior period. At a 6-day supplier lead time
    with a 1.5x safety multiplier (ROP = daily_usage * lead_days * 1.5),
    the correct ROP is CEIL(6.1 * 6 * 1.5) = 55. The previous value of 40
    was set when consumption was 4.4/day (CEIL(4.4 * 6 * 1.5) = 40).",
  "triggered_by": { "user_instruction": "Review and correct any reorder points
    that are out of sync with current consumption rates." },
  "rules_evaluated": ["min_rop_sole_source", "change_freeze_window"],
  "rules_passed": ["min_rop_sole_source", "change_freeze_window"],
  "tier": "auto-execute",
  "outcome": "executed"
}

That reasoning field is the one that takes engineering effort to get right. The model has to be prompted to produce a specific, factual explanation tied to the data it used — not "I updated this because it seemed low." When we walked through audit logs with a human reviewer for the first time and they could tell exactly what the model had considered, including which rules it had evaluated and passed, it changed how they thought about trusting the system.

Dry-Run Mode

Before any customer goes live with write-enabled agents, they run in dry-run mode. Same agent, same rules, same gateway — but writes are journaled instead of committed. The customer gets a log of every action the agent would have taken, with its reasoning, and reviews it manually for a week or two.
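Structurally, dry-run can be as simple as a flag at the gateway's write path; a minimal sketch, with Journal, Gateway, and set_rop as hypothetical stand-ins for the real components:

```python
class Journal:
    """In dry-run mode, writes land here instead of in the ERP."""
    def __init__(self):
        self.entries = []

    def record(self, action: dict):
        self.entries.append(action)

class Gateway:
    def __init__(self, db: dict, dry_run: bool = True):
        self.db = db
        self.dry_run = dry_run      # live mode is the explicit opt-in
        self.journal = Journal()

    def set_rop(self, part_id: str, new_rop: float) -> str:
        entry = {"action": "update_reorder_point",
                 "part_id": part_id, "new_rop": new_rop}
        if self.dry_run:
            self.journal.record(entry)  # journaled, never committed
            return "journaled"
        self.db[part_id] = new_rop      # committed for real
        return "executed"
```

Because the same rules and the same gateway run in both modes, the journal is a faithful record of what live mode would have done, not a simulation of it.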

The review surfaces rule gaps — actions the customer wants blocked that the rules don't catch, and actions the rules are blocking that should be fine. It also builds operational trust in a way a demo never does. Operators who've worked through 200 dry-run logs and found the reasoning sound are in a genuinely different headspace when they flip to live mode.

We found that dry-run review consistently catches 3–5 things the customer hadn't thought to encode as rules. The most common miss: parts that are on a slow wind-down from a product line that's being discontinued. The consumption history looks normal, the reorder model would keep them stocked, but the right action is to let inventory run down without reordering. That business context doesn't exist in the ERP schema — it exists in someone's head. Dry-run surfaces it.

The Scope Declaration Problem

Scope enforcement is conceptually simple — the agent can only touch declared tables and records. In practice it's subtle.

The issue is that ERP data is heavily referenced. A work order connects to a routing, which connects to a work center, which connects to a calendar, which affects capacity planning for every other work order in the queue. You can declare "this agent can reschedule work orders" and think you've been specific. But rescheduling a work order has cascading reads and sometimes cascading writes across a much wider surface than the declared scope suggests.

Our current approach: scope declarations are verified by a static analysis pass over the action primitives the agent has access to. Every database table those primitives touch — directly or through foreign key traversal — is surfaced during agent configuration. The operator then explicitly approves or restricts each table in that set. It's tedious the first time. It catches the coverage gaps you'd otherwise find in production.

We also added runtime scope shadowing: if an action primitive tries to write to a table not in the approved set, it's blocked at the gateway layer even if the business rules passed. This is a defense-in-depth measure — it catches cases where a new code path in a primitive reads more data than the static analysis anticipated.
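A minimal sketch of that runtime check; APPROVED_TABLES, ScopeViolation, and guarded_write are illustrative names:

```python
# Approved set comes from the static analysis pass plus operator sign-off.
APPROVED_TABLES = {"reorder_points", "purchase_orders"}

class ScopeViolation(Exception):
    """Raised when a primitive tries to write outside the declared scope."""

def guarded_write(table: str, write_fn, *args):
    """Defense in depth: block writes to any table outside the approved set,
    even if the business rules already passed."""
    if table not in APPROVED_TABLES:
        raise ScopeViolation(f"write to '{table}' is outside declared scope")
    return write_fn(*args)
```

The check is deliberately last and deliberately dumb: it knows nothing about business rules or intent, only the table name, so a new code path in a primitive can't slip past it.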

Partial Failures and Idempotency

An agent working through a multi-step workflow — check stock levels, find an alternative supplier, draft a PO, attach a non-conformance record — is issuing a sequence of actions, not one. If the system crashes or the agent is interrupted after step two, you have a half-executed transaction in your ERP. Or you retry, and now a second PO is created on top of the first.

We handle this two ways. First, every action primitive in the gateway is idempotent by construction. Calling create_purchase_order twice with the same arguments in the same workflow session returns the existing PO rather than creating a duplicate — the gateway checks for an existing record with a matching idempotency key before writing. Second, multi-step workflows have a transaction lifecycle: steps are checkpointed as they complete, and if the workflow is resumed after a crash, it picks up from the last confirmed checkpoint rather than restarting from scratch.
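A sketch of the idempotent-create pattern with a session-scoped idempotency key; PurchaseOrders and the key derivation here are illustrative, not our actual implementation:

```python
import hashlib
import json

class PurchaseOrders:
    """Idempotent create: the same (session, arguments) returns the existing PO."""
    def __init__(self):
        self._by_key = {}
        self._next_id = 1

    def create(self, session_id: str, supplier: str,
               part_id: str, qty: int) -> str:
        # Key is derived from the workflow session plus the full argument set,
        # so a retry within the same session maps to the same record.
        key = hashlib.sha256(
            json.dumps([session_id, supplier, part_id, qty]).encode()
        ).hexdigest()
        if key in self._by_key:          # retry: return existing PO, no duplicate
            return self._by_key[key]
        po_id = f"PO-{self._next_id:04d}"
        self._next_id += 1
        self._by_key[key] = po_id
        return po_id
```

Scoping the key to the session matters: the same PO requested in a genuinely new workflow should create a new record, while a crash-and-retry within one workflow should not.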

The edge case that bit us before we had this: an agent that timed out mid-run was retried by the user. The first run had completed steps 1 and 2 but not step 3. The retry ran all three steps again. Two supplier outreach emails went out to the same contact 40 seconds apart. Not catastrophic, but embarrassing — and the kind of thing that erodes trust in an automated system faster than almost anything else.

Where the Human Stays Critical

There are three categories of decision we will not automate, even in principle, for current manufacturing customers:

  1. First-time exceptions. If the agent has never seen a pattern before — a new type of supplier failure, a new regulatory citation, a demand spike with no historical analog — it flags for human review. Automation is appropriate for recurring patterns where learned behavior is trustworthy. Novel situations require human judgment.
  2. Cross-functional impact. Actions that affect people across departments — a schedule change that moves a shift's overtime, a procurement decision that affects a supplier relationship, a quality hold that stops a customer shipment — need human authorization even if the business logic validates. The people affected deserve to know, and to object, before the system moves.
  3. Irreversible or hard-to-reverse writes. Marking a batch as non-conforming in a regulated industry, committing to a supplier quantity, confirming a shipment — these create downstream obligations that are expensive or impossible to unwind. The confirm-first tier exists for exactly these.

These aren't permanent restrictions — they change as customers build operational confidence and we build more robust validation. But getting this wrong early destroys trust in the system in a way that's very hard to rebuild. The right answer is to start narrow and expand, not to start broad and clamp down reactively.

The actual goal of the Action Layer isn't to prevent the agent from taking actions. It's to make the agent's actions legible enough that operators can reason about them — review them, audit them, understand why they happened, and confidently expand or restrict scope over time.

An autonomous system that nobody understands isn't autonomous. It's a black box pretending to be autonomous. The audit trail and rule transparency are what make genuine autonomy extensible.

What This Opens Up

Once you have the Action Layer, write-enabled agents become useful in ways that pure advisory AI never achieves: continuously correcting reorder points as consumption shifts, repairing schedules after disruptions, opening supplier non-conformance workflows automatically. These are the kinds of actions this post has walked through, running without a human approving each step.

Most of these have been on ERP vendor roadmaps for years. The gap was never the idea — it was having a framework where you could actually deploy them safely enough that a plant manager is comfortable leaving the system running overnight without someone watching it.

What We'd Build Differently

A few things we got wrong the first time:

We under-invested in the rule authoring interface early on. We hardcoded business rules as Python for the first few months. Every customer needed an engineering deployment to change a rule. That was unacceptable at scale — rule changes happen constantly as operating conditions shift. Moving to a declarative rule format that business users can edit was the right call but we waited too long.

We didn't track rule evaluation rates. We know whether an action passed or failed rules. We didn't originally track which rules fired and how often. That data turned out to be critical — a rule that fires on 80% of actions and always passes is probably misconfigured. A rule that has never fired might be dead code covering a scenario that no longer exists. Add evaluation telemetry from day one.
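A minimal sketch of the kind of telemetry we mean, with an illustrative "suspicious rule" heuristic; RuleTelemetry is a hypothetical name and the 80% threshold is arbitrary:

```python
from collections import Counter

class RuleTelemetry:
    """Track, per rule, how often it applied and how often it blocked."""
    def __init__(self):
        self.fired = Counter()
        self.blocked = Counter()

    def record(self, rule_id: str, did_block: bool):
        self.fired[rule_id] += 1
        if did_block:
            self.blocked[rule_id] += 1

    def suspicious(self, total_actions: int) -> list:
        """Rules that fire on most actions but never block deserve a second look;
        a rule that never fires at all may be dead code (absent from `fired`)."""
        out = []
        for rule_id, n in self.fired.items():
            if total_actions and n / total_actions > 0.8 and self.blocked[rule_id] == 0:
                out.append(rule_id)
        return out
```

A few counters per rule is enough to answer the questions that matter: is this rule doing work, is it misconfigured, or is it covering a scenario that no longer exists.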

Dry-run was an afterthought. We built dry-run mode because a large customer asked for it before go-live. It should have been the default from the beginning. Every agent should debut in dry-run. The activation to live mode should be the intentional step, not the default.

The Status

The Action Layer is in production for us — it's the infrastructure our write-enabled agents run through. The gateway, the three-tier model, the business rule engine, the audit trail, the dry-run mode: all running in customer environments.

The hard part of agentic AI in manufacturing isn't the model and it isn't the agent architecture. It's the trust infrastructure — the thing that tells operators what the agent did, why it did it, and which constraints it respected. Without that, nobody with real accountability for plant operations will let the system run unsupervised. They're right.

If you're working on similar problems — write-enabled agents in production systems, constraint modeling for autonomous actions, or building audit trails for non-deterministic systems — we'd like to compare notes. hello@zeehub.ai.

Written by Shekhar Nirkhe, Co-Founder & CTO at ZeeHub. Previously Staff Engineer at Meta. Building agentic AI for manufacturing at zeehub.ai.