How ZeeHub Uses a Skills Architecture to Handle Any Manufacturing Use Case on the Fly
One of the stranger parts of building for manufacturing is that the requests look completely unrelated. "Which parts are below reorder point?" is a multi-table SQL problem over consumption history and lead times. "Does this supplier's FDA letter cover our allergen spec?" means reading a PDF and cross-referencing structured data. "Translate this inspection report from German" is a document transformation pipeline. Same platform, same session, entirely different execution paths.
Our first approach was to give a single agent everything — all the tools it might conceivably need. That worked fine in demos. In production it created a new problem: the agent spent too much time reasoning about which tool applied, the context window filled up with irrelevant descriptions, and every new capability meant editing the agent itself. It was getting unwieldy fast.
So we pulled the capabilities out into independent units we call Skills — self-contained, self-describing, and completely decoupled from the agents that use them.
What a Skill Actually Is
Every Skill in ZeeHub implements a three-part contract: a name the registry uses to look it up, a description the language model reads when deciding whether to invoke it, and an execute method with a typed input/output signature. That's the whole interface.
In practice it looks roughly like this:
# Illustrative — not production code
class InventoryReorderSkill(Skill):
name = "inventory_reorder_check"
description = """
Calculates which parts are below their reorder threshold.
Use when the user asks about stock levels, replenishment,
parts running low, or reorder points. Accepts an optional
criticality filter (critical / non-critical / all).
"""
def execute(self, company: str, filter: str = "all") -> SkillResult:
...
The description is load-bearing. The LLM reads it verbatim when choosing between available Skills. "Which parts need reordering?" matches inventory_reorder_check. "Do we have the PFAS declaration from supplier X?" matches the compliance Skill. The model makes that call — no routing logic, no intent classifier, no keyword matching. Getting descriptions right matters more than most of the implementation work.
Something we learned the hard way: we spent a lot of time early on debugging wrong-tool invocations before realizing the descriptions were too vague or too similar to each other. We've rewritten descriptions more often than execute methods at this point.
The Skills in Production Today
Six Skills are running in production. What's interesting is how different the internals are — one runs deterministic SQL from a config file, another has the LLM write transformation code on the fly, another wraps its own sub-agent. The interface is the same for all of them.
📦 Inventory Reorder
SQL-based. Config-driven formula chain: consumption rate → lead time demand → safety stock → reorder point. Runs deterministic SQL. Returns structured part list with gap values.
✅ Compliance Check
Sub-agent. Spins up an inner agent with PDF vector search + structured spec query tools. Maps findings to a regulatory disclosure template. Outer agent sees one call.
🧹 Data Normalizer
LLM-generated transforms. Samples the file, identifies quality issues, generates a per-file transformation script, applies it. Picks up site-specific rules from the working directory.
🔍 Document Comparison
Diff + LLM. Structural diff of two document versions (spec sheets, SOPs, contracts). LLM summarizes semantic changes on top of the raw diff output.
🌐 Doc Translation
TM-backed LLM. Looks up a per-customer Translation Memory before hitting the model. Stores new translations back. Builds a domain-specific glossary over time from real documents.
🔎 Root Cause Analysis
LLM + structured data. Accepts a production event description + optional data context. Currently generates hypothesis lists. Expanding to causal graph reasoning over event sequences.
Registry, Dispatch, and the Zero-Routing Rule
Skills are registered by name at startup. Agents request Skills by name — never by importing a class directly. The registry handles construction, injecting any dependencies (database handles, config paths, credentials). At runtime the agent wraps each Skill as a callable tool and hands the full list to the model. The model reads the names and descriptions, picks the right one, extracts the arguments, and returns a structured call. The agent executes it. That's the entire dispatch path.
# Illustrative
registry.register("inventory_reorder_check", InventoryReorderSkill)
registry.register("compliance_check", ComplianceSkill)
registry.register("data_normalizer", DataNormalizerSkill)
registry.register("doc_translation", TranslationSkill)
# Agent at init — only loads what this tenant actually uses
agent = WorkflowAgent(skills=["inventory_reorder_check", "doc_translation"])
We haven't written a single line of routing logic. No intent classifier, no keyword matching, no conditional dispatch. Adding a new Skill means writing the class, registering it, and it's immediately available to any agent that requests it by name.
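A minimal sketch of the registry pattern described above. The class and method names here (`SkillRegistry`, `create`) are hypothetical, and dependency injection is reduced to plain keyword arguments — it's the shape, not the production implementation:

```python
# Illustrative sketch — names and the injection mechanism are assumptions.
class Skill:
    name = ""
    description = ""

class SkillRegistry:
    def __init__(self, **dependencies):
        self._classes = {}
        self._deps = dependencies  # e.g. db handles, config paths, credentials

    def register(self, name, skill_cls):
        self._classes[name] = skill_cls

    def create(self, name):
        # Construct on demand, injecting shared dependencies.
        return self._classes[name](**self._deps)

class InventoryReorderSkill(Skill):
    name = "inventory_reorder_check"
    description = "Flags parts below their reorder threshold."
    def __init__(self, **deps):
        self.deps = deps

registry = SkillRegistry(db="postgres://example")  # hypothetical connection string
registry.register("inventory_reorder_check", InventoryReorderSkill)

# An agent asks for Skills by name only — it never imports the class directly.
skill = registry.create("inventory_reorder_check")
```

The point of routing construction through the registry is that swapping a Skill's implementation, or its dependencies, never touches agent code.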
SQL-Based Skills: Deterministic Logic in a Readable Config
Some manufacturing KPIs have well-defined, correct formulas. Reorder point isn't a judgment call — it's consumption rate × lead time + safety stock, adjusted for open purchase orders. We don't want an LLM approximating that. We want it executed exactly, every time.
For these, we use a config-driven Skill pattern. The formula logic lives in a structured definition file — human-readable, reviewable by domain experts without touching application code. The Skill framework reads it and executes it as parameterized SQL. Here's what that looks like in simplified form:
# Illustrative config (not production)
skill: inventory_reorder_check
description: Flags parts below reorder threshold using consumption + lead time data
parameters:
lookback_days: 90 # how far back to measure consumption
safety_factor: 1.5 # buffer multiplier, editable per customer
formulas:
daily_usage: total_consumed / lookback_days
lead_time_demand: daily_usage * avg_supplier_lead_days
safety_stock: daily_usage * avg_supplier_lead_days * safety_factor
reorder_point: CEIL(lead_time_demand + safety_stock)
flag_condition: current_stock <= reorder_point
output_columns: part_code, description, current_stock, reorder_point, gap
The Skill reads this config, materializes the formula chain as a SQL query with CTEs, substitutes the customer's actual table names, and runs it. A structured list of parts needing action comes back out.
Adjusting the safety factor or the lookback window is a config edit — no code change, no redeploy. More usefully: when a customer's ops team asks how reorder point is calculated, we can hand them this file. They can read it. That conversation goes very differently than "trust us, the AI figured it out."
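The config-to-SQL step can be sketched as a small generator that layers each formula into its own CTE, so later formulas can reference earlier aliases by name. Everything here — the base table name, the inlined parameter values — is illustrative, not the production materializer:

```python
# Illustrative — a toy version of turning the formula chain into CTE SQL.
config = {
    "formulas": {
        "daily_usage": "total_consumed / 90",
        "lead_time_demand": "daily_usage * avg_supplier_lead_days",
        "safety_stock": "daily_usage * avg_supplier_lead_days * 1.5",
        "reorder_point": "CEIL(lead_time_demand + safety_stock)",
    },
    "flag_condition": "current_stock <= reorder_point",
    "output_columns": ["part_code", "current_stock", "reorder_point"],
}

def build_query(cfg, base_table="inventory_metrics"):
    # Each formula becomes a CTE stacked on the previous one, so
    # lead_time_demand can use daily_usage, and so on down the chain.
    ctes, prev = [], base_table
    for i, (alias, expr) in enumerate(cfg["formulas"].items()):
        name = f"step{i}"
        ctes.append(f"{name} AS (SELECT *, {expr} AS {alias} FROM {prev})")
        prev = name
    cols = ", ".join(cfg["output_columns"])
    return ("WITH " + ",\n".join(ctes)
            + f"\nSELECT {cols} FROM {prev} WHERE {cfg['flag_condition']}")

sql = build_query(config)
```

Because each formula gets its own CTE, the generated SQL reads in the same order as the config file — useful when a domain expert wants to trace a number back to its inputs.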
The Normalizer: LLM-Written Transformations Per File
ERP exports are messy in ways that vary unpredictably. One customer's goods receipt file uses "1,500.00" as a number. Another uses 1500. A third exports dates as 3/7/26 in one column and 03-08-2026 in another. You can't write a single pipeline that handles all of it — the specific issues change with every customer and every ERP version bump.
We tried maintaining per-customer transformation scripts for a while. That got old quickly. The Normalizer Skill does it differently: sample the file, have the LLM identify what's broken, generate a targeted fix for that specific file, apply it. Here's roughly what that loop looks like:
# Illustrative — what the Skill does internally
# Step 1: Sample the file and identify issues
issues = llm.analyze("""
File: goods_receipt_export_2026-03-15.xlsx
Sample (5 rows):
| qty_received | unit | receipt_date | part_no |
| "1,500.00" | "ea" | "3/7/26" | "BRG4401" |
| "200.00" | "EA" | "03-08-26" | "BRG-4401" |
| "50.00" | "Ea " | "3-9-2026" | "brg4401" |
""")
# LLM identifies:
# - qty_received: string with locale commas → should be float
# - unit: inconsistent casing and whitespace → normalize to uppercase, strip
# - receipt_date: mixed formats (M/D/YY, MM-DD-YY, M-D-YYYY) → parse to ISO 8601
# - part_no: inconsistent casing and hyphenation → normalize to upper + add hyphen
# Step 2: Generate and run the fix
transform_script = llm.generate_transform(issues)
apply(transform_script, source_file, output_file)
The generated script runs immediately if the output clears validation, otherwise it's flagged for review. Customer-specific conventions — "part numbers at this site don't use hyphens", "qty fields are always in dozens" — can be dropped in as a plain-text rules file and the Skill picks them up on the next run without any config changes.
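The apply-or-flag gate might look like the following sketch. The validation rules, the review queue, and the hardcoded ISO date in the example transform are all stand-ins for the real machinery:

```python
# Illustrative sketch of the validation gate — rules and names are assumptions.
def validate(rows):
    # Minimal structural checks; the real validation is much richer.
    problems = []
    for i, row in enumerate(rows):
        if not isinstance(row.get("qty_received"), float):
            problems.append((i, "qty_received not numeric"))
        if not row.get("receipt_date", "").startswith("20"):
            problems.append((i, "receipt_date not ISO 8601"))
    return problems

review_queue = []

def apply_or_flag(transform, rows):
    cleaned = [transform(r) for r in rows]
    problems = validate(cleaned)
    if problems:
        review_queue.append({"rows": cleaned, "problems": problems})
        return None      # held for human review
    return cleaned       # clears validation → used immediately

# The kind of transform the LLM might generate for the sample above
# (the date parsing is hardcoded here purely for the sketch):
def transform(row):
    return {
        "qty_received": float(row["qty_received"].replace(",", "")),
        "receipt_date": "2026-03-07",
        "unit": row["unit"].strip().upper(),
    }

result = apply_or_flag(transform, [
    {"qty_received": "1,500.00", "unit": "ea ", "receipt_date": "3/7/26"},
])
```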
The thing we didn't anticipate: because the transformation is generated fresh per-file, it just works when the source system changes its export format. We've had ERP vendors push schema changes that would have broken a static pipeline. The Normalizer figured it out on the next file without us touching anything.
When a Skill Runs Its Own Agent
The Compliance Skill is probably the most interesting one architecturally. From the outside it's a simple callable — instruction in, structured compliance report out. Under the hood it spins up its own sub-agent with its own specialized tool set: vector search over uploaded supplier PDFs, structured queries against the spec database, and a formatter that maps findings to a regulatory disclosure template.
# Illustrative — Skill that wraps its own agent
class ComplianceSkill(Skill):
name = "compliance_check"
description = """
Checks supplier documents against a regulatory or
allergen specification. Returns a structured report
mapping findings to the required disclosure template.
"""
def execute(self, instruction: str, company: str) -> SkillResult:
# Internally: spin up a sub-agent with specialist tools
sub_agent = Agent(tools=[
SupplierPDFSearchTool(company), # vector search over uploaded docs
SpecDatabaseQueryTool(company), # structured allergen/regulatory data
ComplianceReportFormatter(), # maps results to disclosure template
])
return sub_agent.run(instruction)
The outer agent has no idea a sub-agent ran. It just gets a result. The practical benefit is that the outer agent's context window stays clean — it doesn't see the sub-agent's tool calls, intermediate steps, or the back-and-forth over conflicting supplier documents. All of that is contained inside the Skill.
Failure Modes We've Hit in Production
The architecture is clean in theory. In practice these are the failure categories that have actually bitten us.
Description collision
The most common failure: two Skills with descriptions that overlap enough for the model to pick the wrong one. Early on, a query like "which suppliers are sending us short shipments?" landed in the Inventory Reorder Skill instead of Compliance because both descriptions mentioned suppliers and parts. The fix is always tightening the description — but it took us a while to realize we were debugging a wording problem, not a model capability problem. We now explicitly test description disambiguation as part of Skill QA: give the agent the new Skill alongside every existing one, run 20 edge-case prompts, confirm zero mis-dispatches before registering.
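That QA loop can be sketched roughly as follows. The `dispatch` function here is a crude word-overlap stand-in for the model's actual tool choice, and the prompts and descriptions are invented examples:

```python
# Illustrative — dispatch() approximates LLM tool choice for the sketch only.
def dispatch(prompt, skills):
    # Pick the skill whose description shares the most words with the prompt.
    words = set(prompt.lower().split())
    return max(skills, key=lambda s: len(words & set(s["description"].lower().split())))

skills = [
    {"name": "inventory_reorder_check",
     "description": "stock levels replenishment reorder points parts running low"},
    {"name": "compliance_check",
     "description": "supplier documents regulatory allergen specification disclosure"},
]

# Edge-case prompts mapped to the Skill we expect the model to pick.
edge_cases = {
    "which parts are below reorder points": "inventory_reorder_check",
    "does this supplier letter cover our allergen specification": "compliance_check",
}

mis_dispatches = [p for p, expected in edge_cases.items()
                  if dispatch(p, skills)["name"] != expected]
```

The real test harness runs the actual agent against the actual model; the structure is the same — a table of prompts, an expected Skill per prompt, and a zero-mis-dispatch gate before registration.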
Silent data corruption in the Normalizer
The Normalizer generates a transformation script and applies it. If the LLM misidentifies an issue — say, it decides product weights in a column are in grams and converts to kilograms, but they were already in kilograms — the transform silently corrupts data. The script passes its own validation check, the file looks clean structurally, and the error surfaces downstream when someone notices inventory quantities are off by 1000x.
Our mitigation is a post-transform statistical check: we compare numeric column distributions before and after and flag anything where the mean or range shifts beyond a configured threshold. For high-stakes files we require a human sign-off on the generated script before it runs.
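A minimal version of that statistical check, assuming a relative mean-shift threshold (the real thresholds are configured per column):

```python
# Illustrative — the 0.5 threshold and column choice are assumptions.
from statistics import mean

def column_drift(before, after, threshold=0.5):
    """Flag if the column mean shifts by more than `threshold` relative to before."""
    m_before, m_after = mean(before), mean(after)
    return abs(m_after - m_before) / abs(m_before) > threshold

# A grams→kg "fix" applied to values already in kg: mean shifts ~1000x → flagged.
weights_kg = [12.0, 8.5, 15.0]
corrupted = [w / 1000 for w in weights_kg]
flag_bad = column_drift(weights_kg, corrupted)

# A legitimate cleanup (one value lightly corrected) stays under the threshold.
flag_ok = column_drift([12.0, 8.5, 15.0], [12.0, 8.5, 14.8])
```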
Sub-agent loops in Compliance
The Compliance Skill's inner agent occasionally loops — it searches for a document, finds an ambiguous result, reruns the search with a slightly different query, and repeats. We added a max-iterations guard and an "unable to determine" fallback that surfaces whatever intermediate findings exist rather than letting the agent spin indefinitely. A compliance report that says "searched these three documents, found conflicting allergen declarations, could not resolve" is still useful to a reviewer.
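The guard amounts to a bounded loop that surfaces partial findings instead of spinning. The `step` protocol and the fallback structure below are assumptions for the sketch, not the production agent loop:

```python
# Illustrative — a bounded agent loop with an "unable to determine" fallback.
MAX_ITERATIONS = 5

def run_with_guard(step, max_iterations=MAX_ITERATIONS):
    findings = []
    for _ in range(max_iterations):
        result = step(findings)
        if result.get("resolved"):
            return {"status": "ok", "report": result["report"]}
        findings.append(result["finding"])
    # Surface whatever intermediate findings exist rather than spin forever.
    return {"status": "unable_to_determine", "findings": findings}

# A step that never resolves — e.g. conflicting allergen declarations:
def looping_step(findings):
    return {"resolved": False,
            "finding": f"search attempt {len(findings) + 1}: conflicting declarations"}

outcome = run_with_guard(looping_step)
```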
Formula edge cases in SQL Skills
New parts with no consumption history hit division-by-zero in the daily_usage formula. Parts recently added to the BOM but never ordered produce null lead time lookups. These aren't LLM problems — they're SQL edge cases that would bite any data pipeline. We added explicit null guards and a separate output bucket for parts with insufficient data. The config format now has a dedicated edge-case section:
# Illustrative
edge_cases:
no_history: flag as "new_part", skip reorder calculation
null_lead_time: use category_default_lead_days fallback
zero_safety: floor at min_safety_stock_days * daily_usage
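In Python terms, those three guards might look like the sketch below. The production Skill emits equivalent SQL (COALESCE-style fallbacks and a separate output bucket), and every field name and default here is invented:

```python
# Illustrative — Python rendering of the edge-case guards; field names invented.
import math

def reorder_row(part, category_default_lead_days=14, min_safety_stock_days=3):
    if part["total_consumed"] == 0:
        # no_history → separate "new_part" bucket, skip the reorder calculation
        return {"part": part["part_code"], "bucket": "new_part"}
    daily_usage = part["total_consumed"] / part["lookback_days"]
    # null_lead_time → category default fallback (SQL: COALESCE)
    lead_days = part["avg_supplier_lead_days"] or category_default_lead_days
    # zero_safety → floor safety stock at a minimum number of days of usage
    safety = max(daily_usage * lead_days * part["safety_factor"],
                 min_safety_stock_days * daily_usage)
    reorder_point = math.ceil(daily_usage * lead_days + safety)
    return {"part": part["part_code"], "bucket": "active",
            "reorder_point": reorder_point}

new_part = reorder_row({"part_code": "BRG-9001", "total_consumed": 0,
                        "lookback_days": 90, "avg_supplier_lead_days": None,
                        "safety_factor": 1.5})
active = reorder_row({"part_code": "BRG-4401", "total_consumed": 900,
                      "lookback_days": 90, "avg_supplier_lead_days": None,
                      "safety_factor": 1.5})
```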
Context window pressure with many Skills
When an agent has too many Skills loaded, the model's ability to reliably distinguish similar-sounding descriptions degrades — we found the practical ceiling around 6–8 Skills per agent before dispatch accuracy drops. Our response: agents are configured only with Skills the tenant's active workflows actually use. A customer who's never touched document translation doesn't load that Skill.
How We Measure Skill Quality
Each Skill type needs a different evaluation approach. Deterministic Skills have objectively correct answers. Generative Skills need something smarter.
SQL Skills: golden set regression
For the Inventory Reorder Skill we maintain a golden test set — 15 representative inventory snapshots with known expected outputs: which parts are below threshold, their gap values, criticality classification. Every run against a test snapshot is compared exactly to the expected output. Any deviation — different row count, wrong gap value, different sort order — fails. These run automatically on every config change and every model update, before anything reaches production.
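The comparison itself is just exact equality against stored expectations — row count, values, and order all included. In this sketch the Skill call and snapshot loading are stubbed out:

```python
# Illustrative — run_skill() stands in for executing against a test snapshot.
def run_skill(snapshot_name):
    return [{"part_code": "BRG-4401", "gap": 120},
            {"part_code": "SEA-0907", "gap": 35}]

# Invented golden entry; the real set holds 15 snapshots with known outputs.
golden = {
    "snapshot_01": [{"part_code": "BRG-4401", "gap": 120},
                    {"part_code": "SEA-0907", "gap": 35}],
}

def regression(snapshots):
    failures = []
    for name, expected in snapshots.items():
        actual = run_skill(name)
        # Exact match: any deviation — row count, gap value, order — fails.
        if actual != expected:
            failures.append(name)
    return failures

failures = regression(golden)
```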
Normalizer: LLM-as-a-judge with a rubric
The Normalizer's output can't be evaluated with exact match — a correctly cleaned file looks right but there's no single ground truth. We use an LLM judge with a structured rubric:
# Illustrative evaluation rubric
judge_prompt: |
Evaluate the cleaned data file. Score each criterion 0–3:
1. Type correctness — are numeric columns numeric? dates parseable?
2. Value preservation — did any non-null values get dropped or nulled?
3. Format consistency — same format in every row of each column?
4. Part number format — do part numbers follow the expected site convention?
5. No hallucinations — did the transform fabricate values not in the source?
Return: { criterion: score, notes: "..." } for each.
Score 2+ on all five criteria passes. Criterion 5 is a hard gate — a zero there blocks the file regardless of other scores. Over the last 90 days: 94% of files passed on first attempt. The remaining 6% were flagged for human review with the judge's notes surfaced inline to the reviewer.
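The pass/fail gate reduces to a few lines. The criterion keys below mirror the five rubric items, and the example scores are invented:

```python
# Illustrative — the gate applied to a judge's scores; score values invented.
CRITERIA = ["type_correctness", "value_preservation", "format_consistency",
            "part_number_format", "no_hallucinations"]

def gate(scores):
    # Criterion 5 is a hard gate: a fabricated value blocks the file outright.
    if scores["no_hallucinations"] == 0:
        return "blocked"
    if all(scores[c] >= 2 for c in CRITERIA):
        return "pass"
    return "human_review"

clean = {c: 3 for c in CRITERIA}
fabricated = {**clean, "no_hallucinations": 0}
borderline = {**clean, "format_consistency": 1}
```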
Translation: domain fluency, not BLEU
BLEU is useless for manufacturing documents. A translation that scores well on BLEU can still get "corrugated liner" wrong in a way that matters. We use an LLM judge with three criteria: terminology accuracy against a domain reference, fluency, and consistency with the customer's existing Translation Memory. Any segment that diverges from a TM entry with high confidence is flagged for review — because if the TM was right before, the new translation probably isn't.
We also track TM hit rate per customer over time. Rising hit rate means the system is building a genuine domain glossary. A flat hit rate after several months usually signals that the documents being processed are too varied for useful memory accumulation — which is its own useful signal.
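The TM-first lookup and the divergence flag can be sketched like this. The TM is a plain dict here and `translate_with_llm` is a stub, not ZeeHub's pipeline; the German segment is an invented example:

```python
# Illustrative — TM-backed translation with a divergence flag; all stubs.
tm = {"Wellpappe-Einlage": "corrugated liner"}   # per-customer memory

def translate_with_llm(segment):
    return "cardboard insert"  # stand-in for a model call

flagged = []

def translate(segment):
    if segment in tm:
        return tm[segment]            # TM hit: no model call at all
    candidate = translate_with_llm(segment)
    tm[segment] = candidate           # store back for next time
    return candidate

def check_against_tm(segment, new_translation):
    # If a high-confidence TM entry disagrees, flag for review:
    # if the TM was right before, the new translation probably isn't.
    if segment in tm and tm[segment] != new_translation:
        flagged.append(segment)

check_against_tm("Wellpappe-Einlage", "cardboard insert")
hit = translate("Wellpappe-Einlage")
```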
Tracking drift across model upgrades
Every model update triggers a full Skill suite run against the golden sets and a sample of recent real inputs. We track a per-Skill accuracy score and alert if any Skill drops more than 2% from its baseline. When we moved to a newer checkpoint last quarter, the Normalizer's generated transformation code changed subtly — same logical result, different column ordering in the output — which broke a downstream pipeline that expected a fixed schema. We caught it in staging. Without the automated suite we would have found out from a customer.
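The alerting rule itself is simple to state precisely — a sketch with invented baseline numbers and the 2% threshold from above:

```python
# Illustrative — per-Skill drift alert; accuracy figures are invented.
def drift_alerts(baseline, current, max_drop=0.02):
    alerts = []
    for skill, base in baseline.items():
        drop = base - current.get(skill, 0.0)
        if drop > max_drop:
            alerts.append((skill, round(drop, 3)))
    return alerts

baseline = {"inventory_reorder_check": 1.00, "data_normalizer": 0.94}
after_upgrade = {"inventory_reorder_check": 1.00, "data_normalizer": 0.88}
alerts = drift_alerts(baseline, after_upgrade)
```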
What Actually Changed for Us Engineering-Side
Before this, shipping a new workflow meant touching a lot of things at once: a new agent class, new tool wiring, a new system prompt, API routing changes, end-to-end tests. You couldn't really validate anything until the whole stack was assembled. Iteration was slow.
A new Skill is just its own class. We can test it in isolation with fixed inputs before any agent ever calls it. Registration is one line. The agent layer and API don't change. That alone has made a meaningful difference in how fast we can ship new workflows.
Composition also works better than we expected. "Clean this goods receipt file, then check reorder levels" runs as two Skill calls in sequence — the output path from the normalizer becomes the input for the inventory Skill. The agent orchestrates the handoff. Neither Skill knows the other exists.
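The handoff is nothing more than passing one Skill's output path as the next Skill's input. In this sketch both Skills are stubs; only the shape of the handoff matters:

```python
# Illustrative — sequential composition of two Skill calls; both are stubs.
def data_normalizer(path):
    # Would clean the file and return where the cleaned version landed.
    return {"output_path": path.replace(".xlsx", "_clean.parquet")}

def inventory_reorder_check(source_path):
    # Would run the reorder calculation against the cleaned file.
    return {"source": source_path, "parts_below_threshold": 3}

# The agent orchestrates the handoff; neither Skill knows the other exists.
step1 = data_normalizer("goods_receipt_export.xlsx")
step2 = inventory_reorder_check(step1["output_path"])
```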
What We're Working on Next
A few things in active development:
- Typed chaining — the output schema of one Skill becomes a validated input for the next, with the agent orchestrating the sequence. Right now chaining works but it's manual; we want it to be declared and enforced.
- Operator-defined Skills — the SQL/formula config is readable enough that a non-engineer could write one. Getting there for real means building a validation layer and a way to test a new config before it runs in production.
- Snapshot testing per Skill — golden-set regression runs automatically, but full output snapshots for the generative Skills are still assembled by hand. We want the whole suite snapshotted and diffed automatically on every model update before deploy. The goal is catching output drift before it reaches a customer.
- RCA v2 — the current version is basically "LLM + structured prompt over a production event description." Useful but shallow. The version we're designing builds a causal graph over event sequences, shift transitions, and machine state changes. Different problem.
The Skill interface itself probably won't change much — name, description, typed execute method. What changes is everything layered on top: better chaining, better testing, more of the logic living in config rather than code.
If you're working on similar problems — AI in industrial environments, deterministic vs. generative logic, agent tool design — we'd genuinely like to compare notes. Reach us at hello@zeehub.app.