Four modes. Four different claim boundaries. Never collapse them into one vague category.
The reason mode separation matters is not rigor for its own sake. It is that collapsing all local results into one vague category ("local LLM result") makes every claim uninterpretable. These boundaries are what make results reproducible and trustworthy.
| Mode | What the model sees | Default use | Claim boundary |
|---|---|---|---|
| raw | User request only | Baseline / debug | Does not prove domain usefulness |
| grounded | Request + verified local context | Default user-facing mode | Proves constrained usefulness, not independent reasoning |
| artifact | Answer returned from validated precomputed layer | Stable demos, presentations | Proves validated answer availability, not live execution |
| implementation_agent | Model runs inside deterministic workflow | Escalation / traceability | Proves workflow capability, not raw-model strength |
These are not a quality ranking. They are a claim boundary map. A grounded result is not "better than" a raw result in every context — it is a result with a different, clearer claim about what it proves.
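Because the point of the table is that every result carries a claim boundary, the map can also live as data rather than prose, so no logged result escapes its category. A minimal sketch, assuming nothing about TourAgent's real internals (the `CLAIM_BOUNDARIES` dict and `tag_result` helper are hypothetical names):

```python
# Hypothetical sketch: encode each mode's claim boundary as data so every
# result record states what it does and does not prove.
CLAIM_BOUNDARIES = {
    "raw": "does not prove domain usefulness",
    "grounded": "proves constrained usefulness, not independent reasoning",
    "artifact": "proves validated answer availability, not live execution",
    "implementation_agent": "proves workflow capability, not raw-model strength",
}

def tag_result(mode: str, answer: str) -> dict:
    """Attach the mode and its claim boundary to a result record."""
    if mode not in CLAIM_BOUNDARIES:
        # Refusing unknown modes is the data-level version of "never
        # collapse them into one vague category".
        raise ValueError("unknown mode: " + mode)
    return {"mode": mode, "answer": answer, "claim": CLAIM_BOUNDARIES[mode]}
```

Rejecting unlisted modes at record time is what keeps "local LLM result" from creeping back in as a category.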
What it is: In raw mode, the local model sees only the user question. No verified context is injected first.

What it tests: Baseline local-model behavior on its own. Whether the model can answer plausibly without grounding.

Typical failures: Plausible but wrong answers. Omission on list or set questions. Refusal on precise statistical questions.

When to use: Baseline and debugging only. Raw results do not justify claims about domain usefulness. If you are tempted to publish a raw result as a domain capability claim, you are miscategorizing the experiment.
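In code, raw mode is nothing more than the model with no injected context. A minimal sketch, where `complete` stands in for any local-model call (the function name is illustrative, not TourAgent's real interface):

```python
# Hypothetical sketch of raw mode: the model receives the user question
# and nothing else -- useful as a baseline, proof of nothing domain-level.
def answer_raw(complete, user_question: str) -> str:
    """Raw mode: no verified context is injected before the model answers."""
    return complete(user_question)
```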
What it is: In grounded mode, the local model sees the question plus verified domain context. In TourAgent this means a deterministic answer seed, tool path, and evidence bundle injected into the prompt before the model responds.

What it tests: Constrained local usefulness. Whether the model can produce a good user-facing answer when facts are already verified. The practical value of grounding without full workflow execution.

Limits: Good wording over a fixed validated surface, but no ability to go beyond it honestly. The model is a constrained answer renderer here, not an independent reasoner.

Why it is the default: Grounded mode balances usability, rigor, and local feasibility. The model adds presentation quality on top of deterministic correctness. That is a real contribution without overclaiming what the model is doing.
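The "verified context first" step can be sketched as a prompt assembled from a pre-verified seed and evidence, with the model called only afterward. The names (`answer_seed`, `evidence`) echo the description above but are hypothetical, not TourAgent's actual prompt format:

```python
# Hypothetical sketch of grounded mode: facts arrive pre-verified; the
# model only renders an answer over that validated surface.
def answer_grounded(complete, user_question: str,
                    answer_seed: str, evidence: list) -> str:
    """Grounded mode: inject verified context before the model responds."""
    context = "\n".join(
        ["Verified answer seed: " + answer_seed, "Evidence:"]
        + ["- " + item for item in evidence]
    )
    prompt = (context + "\n\nUser question: " + user_question
              + "\nAnswer using only the verified context above.")
    return complete(prompt)
```

The design point is the ordering: the deterministic layer decides the facts, and the model never sees the question without them.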
What it is: In artifact mode, the system returns an answer from a validated precomputed answer layer. No live tool execution happens at answer time.

What it tests: Stable validated answer availability. Repeatable presentation of a frozen or overlaid answer surface.

Typical failure: Overclaim if presented as though tools are running live. An artifact answer is fast and stable precisely because it is not running live; that is a feature, not a limitation, but it must be declared.

When to use: Stable demonstrations, presentations, frozen validation surfaces. Note that the TourAgent CLI `--mode agent` flag actually selects artifact mode; the internal implementation label and the audience-facing label are deliberately separated.
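At its simplest, the precomputed layer is a lookup that fails loudly on a miss rather than improvising, which is what keeps the "no live execution" claim honest. A minimal sketch under that assumption (the dict-based layer is hypothetical, not TourAgent's storage format):

```python
# Hypothetical sketch of artifact mode: answers come from a validated
# precomputed layer; a miss raises instead of falling back to live tools.
def answer_artifact(answer_layer: dict, question_key: str) -> str:
    """Artifact mode: return a precomputed validated answer or fail loudly."""
    if question_key not in answer_layer:
        raise KeyError("no validated answer for: " + question_key)
    return answer_layer[question_key]
```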
What it is: In implementation_agent mode, the local model operates inside a deterministic workflow. That can include tools, file reads, validation checks, logging, and artifact preservation. The model is one step in a traceable pipeline, not a freeform answerer.

What it tests: Workflow capability under controlled local constraints. This is the strongest local claim among the current modes, and the most meaningful one for capability claims, because it shows the model contributing usefully inside a real deterministic system.

Costs: Slow recovery on multi-step tasks. Higher scaffolding cost even when the final answer is correct. This mode earns the most trust but costs the most execution time.

When to use: When traceability matters. When the answer will be saved as an artifact. When the user needs to know exactly which tools ran and what evidence supported the answer.
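The "one step in a traceable pipeline" idea can be sketched as a workflow that logs every tool call and the model step into a trace, so the evidence behind the answer is reconstructable afterward. All names here (`run_workflow`, the `trace` record shape) are hypothetical, not TourAgent's actual pipeline:

```python
# Hypothetical sketch of implementation_agent mode: the model is one logged
# step in a deterministic pipeline; the trace records which tools ran and
# what evidence supported the final answer.
def run_workflow(complete, user_question: str, tools: dict, trace: list) -> str:
    evidence = []
    for name, tool in tools.items():
        result = tool(user_question)                 # deterministic tool step
        trace.append({"step": "tool", "tool": name, "result": result})
        evidence.append(name + ": " + str(result))
    prompt = ("Evidence:\n" + "\n".join(evidence)
              + "\n\nQuestion: " + user_question)
    answer = complete(prompt)                        # model step, also traced
    trace.append({"step": "model", "answer": answer})
    return answer
```

Because the trace is built as the workflow runs, saving it alongside the answer gives exactly the artifact this mode promises: which tools ran, in what order, with what results.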
For external explanation, the cleanest wording for talks, papers, and demos:
| Internal label | Plain-language equivalent |
|---|---|
| raw | Model alone |
| grounded | Model plus verified context |
| artifact | Validated answer layer |
| implementation_agent | Controlled local workflow |
The internal labels match the CLI flags, with the deliberate exception noted above: `--mode agent` selects artifact mode. The plain-language equivalents are for external communication. Use both consistently: the internal labels preserve technical precision; the audience labels preserve clarity.
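One way to keep the two vocabularies from drifting apart is to hold the mapping in a single place and translate through it everywhere external wording is produced. A minimal sketch (the `PLAIN_LABELS` dict and `audience_label` helper are hypothetical names, not part of TourAgent):

```python
# Hypothetical sketch: one mapping from internal mode labels to the
# audience-facing wording used in talks, papers, and demos.
PLAIN_LABELS = {
    "raw": "model alone",
    "grounded": "model plus verified context",
    "artifact": "validated answer layer",
    "implementation_agent": "controlled local workflow",
}

def audience_label(internal_label: str) -> str:
    """Translate an internal mode label into its external wording."""
    return PLAIN_LABELS[internal_label]
```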