How it would scale

Suggested production architecture

The same layered design, hardened for a real customer. Crucially, everything runs inside the plant network, so operational data never leaves it.

What changes for production

On-prem LLM (data privacy). Replace the cloud Claude API with a self-hosted open-weight model (Llama or Mistral, served via vLLM) inside the plant network, so operational data never leaves it. The assistant layer is intentionally thin, so a smaller local model should suffice, and the model is swappable behind one env var.
Multi-model routing (faster, cheaper, more accurate). Split the two LLM jobs across models: a stronger model for tool routing and SQL reasoning (its output is small, so the premium costs little), and a leaner, faster model for narrating rows (the large-output step). Combined with prompt caching, this should yield more accurate SQL on hard questions, faster answers, and lower cost, with each model still swappable behind its own env var.
Real data ingestion. Stream live sensor data from the historian and PLCs into Exasol via change data capture (CDC), replacing the mock data. The SQL views keep the same shape.
Pooled connections. A long-running backend holds a pool of Exasol connections instead of opening a new connection on every request, as the serverless setup does.
Security & governance. SSO + role-based access, per-plant data scoping, secrets in a vault, network controls (VPC / security groups) instead of an IP allowlist.
Operationalize. Configurable, plant-specific thresholds, alerting and escalation, scheduled monitoring, and auto-scaling or multiple clusters for workload isolation.
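As a rough illustration of the "swappable behind one env var" idea combined with multi-model routing, the sketch below resolves a model per task from the environment. The env var names (ROUTER_MODEL, NARRATOR_MODEL, LLM_BASE_URL), the placeholder model names, and the default URL are assumptions made for this example, not names taken from the prototype.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Hypothetical env var names and defaults, chosen for illustration only.
DEFAULTS = {
    "ROUTER_MODEL": "local-large-model",    # placeholder: stronger model for routing / SQL
    "NARRATOR_MODEL": "local-small-model",  # placeholder: leaner model for narrating rows
    "LLM_BASE_URL": "http://localhost:8000/v1",  # assumed address of the in-network model server
}

# Which env var controls which of the two LLM jobs.
ENV_VAR_BY_TASK = {"route": "ROUTER_MODEL", "narrate": "NARRATOR_MODEL"}


@dataclass(frozen=True)
class ModelChoice:
    task: str
    model: str
    base_url: str


def choose_model(task: str, env: Optional[Mapping[str, str]] = None) -> ModelChoice:
    """Resolve the model for a task, letting one env var per task override the default."""
    env = os.environ if env is None else env
    if task not in ENV_VAR_BY_TASK:
        raise ValueError(f"unknown task: {task!r}")
    var = ENV_VAR_BY_TASK[task]
    return ModelChoice(
        task=task,
        model=env.get(var, DEFAULTS[var]),
        base_url=env.get("LLM_BASE_URL", DEFAULTS["LLM_BASE_URL"]),
    )
```

With this shape, swapping the routing model means changing ROUTER_MODEL and restarting the backend. The calling code only ever asks for choose_model("route") or choose_model("narrate").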

Assumptions for the production design

On-prem GPU. The customer can host a GPU on-premises (or in their private cloud) to run the local LLM, sized for the chosen model.
A small model is enough. A self-hosted open-weight model (Llama or Mistral) is good enough for this thin routing and narration task, to be confirmed by evaluation. Smaller models may need a stricter router and some prompt tuning.
Exasol inside the network. Exasol runs on-premises or in the customer’s private VPC, with network access to the plant data sources.
A data pipeline exists. A real-time or near-real-time ingestion pipeline (historian, PLCs, CDC) exists or can be built to feed Exasol, and the SQL views keep the same shape.
Identity provider. An SSO provider and role definitions exist, enabling authentication and per-plant data scoping.
Ops capacity. There is capacity to run, patch, and monitor the on-prem stack: database, model server, and app.
Data must stay local. Security and data-residency policy requires data to stay inside the plant network, which is the reason for an on-prem LLM rather than a cloud API.
Calibrated thresholds. Risk thresholds and weights will be calibrated with the customer’s engineers, replacing the prototype’s illustrative defaults.
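One way to make thresholds and weights configurable per plant is to keep defaults in one place and layer plant-specific overrides on top. The sketch below is a minimal version of that idea. The metric names, numbers, plant ID, and scoring rule (sum the weights of every metric above its threshold) are all hypothetical and are not the prototype's actual risk formula.

```python
from dataclasses import dataclass
from typing import Dict, Mapping


@dataclass(frozen=True)
class RiskConfig:
    thresholds: Dict[str, float]  # metric -> value above which the metric counts as at risk
    weights: Dict[str, float]     # metric -> contribution of that metric to the score


# Illustrative defaults only; metric names and numbers are made up.
DEFAULT_CONFIG = RiskConfig(
    thresholds={"temperature_c": 80.0, "vibration_mm_s": 7.0},
    weights={"temperature_c": 0.5, "vibration_mm_s": 0.5},
)

# Hypothetical per-plant overrides, e.g. values agreed with that plant's engineers.
PLANT_OVERRIDES: Dict[str, Dict[str, Dict[str, float]]] = {
    "plant_a": {"thresholds": {"temperature_c": 90.0}},
}


def config_for(plant_id: str) -> RiskConfig:
    """Merge the defaults with any overrides registered for the given plant."""
    override = PLANT_OVERRIDES.get(plant_id, {})
    return RiskConfig(
        thresholds={**DEFAULT_CONFIG.thresholds, **override.get("thresholds", {})},
        weights={**DEFAULT_CONFIG.weights, **override.get("weights", {})},
    )


def risk_score(readings: Mapping[str, float], config: RiskConfig) -> float:
    """Sum the weights of every metric whose reading exceeds its threshold."""
    return sum(
        (
            config.weights.get(metric, 0.0)
            for metric, value in readings.items()
            if metric in config.thresholds and value > config.thresholds[metric]
        ),
        0.0,
    )
```

For example, a reading of temperature_c = 85.0 and vibration_mm_s = 5.0 scores 0.5 under the defaults but 0.0 for plant_a, whose temperature threshold is raised to 90.0. Calibration then becomes a matter of editing the override table rather than the scoring code.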