Four modes. Four different claim boundaries. Never collapse them into one vague category.
The reason mode separation matters is not rigor for its own sake. It is that collapsing all local results into one vague category ("local LLM result") makes every claim uninterpretable. These boundaries are what make results reproducible and trustworthy.
| Mode | What the model sees | Default use | Claim boundary |
|---|---|---|---|
| raw | User request only | Baseline / debug | Does not prove domain usefulness |
| grounded | Request + verified local context | Default user-facing mode | Proves constrained usefulness, not independent reasoning |
| artifact | Answer returned from validated precomputed layer | Stable demos, presentations | Proves validated answer availability, not live execution |
| implementation_agent | Model runs inside deterministic workflow | Escalation / traceability | Proves workflow capability, not raw-model strength |
These are not a quality ranking. They are a claim boundary map. A grounded result is not "better than" a raw result in every context — it is a result with a different, clearer claim about what it proves.
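Because the point of the table is that every result carries a claim boundary, the map can also live as data rather than prose, so no logged result escapes its category. A minimal sketch, assuming nothing about TourAgent's real internals (the `CLAIM_BOUNDARIES` dict and `tag_result` helper are hypothetical names):

```python
# Hypothetical sketch: encode each mode's claim boundary as data so every
# result record states what it does and does not prove.
CLAIM_BOUNDARIES = {
    "raw": "does not prove domain usefulness",
    "grounded": "proves constrained usefulness, not independent reasoning",
    "artifact": "proves validated answer availability, not live execution",
    "implementation_agent": "proves workflow capability, not raw-model strength",
}

def tag_result(mode: str, answer: str) -> dict:
    """Attach the mode and its claim boundary to a result record."""
    if mode not in CLAIM_BOUNDARIES:
        # Refusing unknown modes is the data-level version of "never
        # collapse them into one vague category".
        raise ValueError("unknown mode: " + mode)
    return {"mode": mode, "answer": answer, "claim": CLAIM_BOUNDARIES[mode]}
```

Rejecting unlisted modes at record time is what keeps "local LLM result" from creeping back in as a category.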
What it is: In raw mode, the local model sees only the user question. No verified context is injected first.

What it tests: Baseline local-model behavior on its own. Whether the model can answer plausibly without grounding.

Typical failures: Plausible but wrong answers. Omission on list or set questions. Refusal on precise statistical questions.

When to use: Baseline and debugging only. Raw results do not justify claims about domain usefulness. If you are tempted to publish a raw result as a domain capability claim, you are miscategorizing the experiment.
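In code, raw mode is nothing more than the model with no injected context. A minimal sketch, where `complete` stands in for any local-model call (the function name is illustrative, not TourAgent's real interface):

```python
# Hypothetical sketch of raw mode: the model receives the user question
# and nothing else -- useful as a baseline, proof of nothing domain-level.
def answer_raw(complete, user_question: str) -> str:
    """Raw mode: no verified context is injected before the model answers."""
    return complete(user_question)
```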
What it is: In grounded mode, the local model sees the question plus verified domain context. In TourAgent this means a deterministic answer seed, tool path, and evidence bundle injected into the prompt before the model responds.

What it tests: Constrained local usefulness. Whether the model can produce a good user-facing answer when facts are already verified. The practical value of grounding without full workflow execution.

Limits: Good wording over a fixed validated surface, but no ability to go beyond it honestly. The model is a constrained answer renderer here, not an independent reasoner.

Why it is the default: Grounded mode balances usability, rigor, and local feasibility. The model adds presentation quality on top of deterministic correctness. That is a real contribution without overclaiming what the model is doing.
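The "verified context first" step can be sketched as a prompt assembled from a pre-verified seed and evidence, with the model called only afterward. The names (`answer_seed`, `evidence`) echo the description above but are hypothetical, not TourAgent's actual prompt format:

```python
# Hypothetical sketch of grounded mode: facts arrive pre-verified; the
# model only renders an answer over that validated surface.
def answer_grounded(complete, user_question: str,
                    answer_seed: str, evidence: list) -> str:
    """Grounded mode: inject verified context before the model responds."""
    context = "\n".join(
        ["Verified answer seed: " + answer_seed, "Evidence:"]
        + ["- " + item for item in evidence]
    )
    prompt = (context + "\n\nUser question: " + user_question
              + "\nAnswer using only the verified context above.")
    return complete(prompt)
```

The design point is the ordering: the deterministic layer decides the facts, and the model never sees the question without them.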
What it is: In artifact mode, the system returns an answer from a validated precomputed answer layer. No live tool execution happens at answer time.

What it tests: Stable validated answer availability. Repeatable presentation of a frozen or overlaid answer surface.

Typical failure: Overclaim if presented as though tools are running live. An artifact answer is fast and stable precisely because it is not running live; that is a feature, not a limitation, but it must be declared.

When to use: Stable demonstrations, presentations, frozen validation surfaces. Note that the TourAgent CLI `--mode agent` flag actually selects artifact mode; the internal implementation label and the audience-facing label are deliberately separated.
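At its simplest, the precomputed layer is a lookup that fails loudly on a miss rather than improvising, which is what keeps the "no live execution" claim honest. A minimal sketch under that assumption (the dict-based layer is hypothetical, not TourAgent's storage format):

```python
# Hypothetical sketch of artifact mode: answers come from a validated
# precomputed layer; a miss raises instead of falling back to live tools.
def answer_artifact(answer_layer: dict, question_key: str) -> str:
    """Artifact mode: return a precomputed validated answer or fail loudly."""
    if question_key not in answer_layer:
        raise KeyError("no validated answer for: " + question_key)
    return answer_layer[question_key]
```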
What it is: In implementation_agent mode, the local model operates inside a deterministic workflow. That can include tools, file reads, validation checks, logging, and artifact preservation. The model is one step in a traceable pipeline, not a freeform answerer.

What it tests: Workflow capability under controlled local constraints. This is the strongest local claim among the current modes, and the most meaningful one for capability claims, because it shows the model contributing usefully inside a real deterministic system.

Costs: Slow recovery on multi-step tasks. Higher scaffolding cost even when the final answer is correct. This mode earns the most trust but costs the most execution time.

When to use: When traceability matters. When the answer will be saved as an artifact. When the user needs to know exactly which tools ran and what evidence supported the answer.
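The "one step in a traceable pipeline" idea can be sketched as a workflow that logs every tool call and the model step into a trace, so the evidence behind the answer is reconstructable afterward. All names here (`run_workflow`, the `trace` record shape) are hypothetical, not TourAgent's actual pipeline:

```python
# Hypothetical sketch of implementation_agent mode: the model is one logged
# step in a deterministic pipeline; the trace records which tools ran and
# what evidence supported the final answer.
def run_workflow(complete, user_question: str, tools: dict, trace: list) -> str:
    evidence = []
    for name, tool in tools.items():
        result = tool(user_question)                 # deterministic tool step
        trace.append({"step": "tool", "tool": name, "result": result})
        evidence.append(name + ": " + str(result))
    prompt = ("Evidence:\n" + "\n".join(evidence)
              + "\n\nQuestion: " + user_question)
    answer = complete(prompt)                        # model step, also traced
    trace.append({"step": "model", "answer": answer})
    return answer
```

Because the trace is built as the workflow runs, saving it alongside the answer gives exactly the artifact this mode promises: which tools ran, in what order, with what results.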
For external explanation, the cleanest wording for talks, papers, and demos:
| Internal label | Plain-language equivalent |
|---|---|
| raw | Model alone |
| grounded | Model plus verified context |
| artifact | Validated answer layer |
| implementation_agent | Controlled local workflow |
The internal labels match the CLI flags, with the deliberate exception noted above: `--mode agent` selects artifact mode. The plain-language equivalents are for external communication. Use both consistently: the internal labels preserve technical precision; the audience labels preserve clarity.
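One way to keep the two vocabularies from drifting apart is to hold the mapping in a single place and translate through it everywhere external wording is produced. A minimal sketch (the `PLAIN_LABELS` dict and `audience_label` helper are hypothetical names, not part of TourAgent):

```python
# Hypothetical sketch: one mapping from internal mode labels to the
# audience-facing wording used in talks, papers, and demos.
PLAIN_LABELS = {
    "raw": "model alone",
    "grounded": "model plus verified context",
    "artifact": "validated answer layer",
    "implementation_agent": "controlled local workflow",
}

def audience_label(internal_label: str) -> str:
    """Translate an internal mode label into its external wording."""
    return PLAIN_LABELS[internal_label]
```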