Four Results
Raw local behavior is not enough (lower boundary)
Recent local-mode evidence shows that raw local Gemma is too weak to stand in for useful offline domain work. On fixed domain question sets, the raw model produces answers that are too vague, too incomplete, and too weakly anchored to the domain to be reliable.
The bare-mirror version of the ski-chalet fantasy — one machine, one model, no structure, no harness — gets a firm practical answer: no.
This is still a useful result. It defines the lower boundary precisely. The question is not whether raw local works — it does not. The question is what is needed above it.
A harnessed local mode is materially better (positive claim)
Grounded Gemma performed much better than raw Gemma on the fixed tennis domain question set:
| Dimension | Raw Gemma | Grounded Gemma |
|---|---|---|
| Specificity | weak | stronger |
| Alignment to domain | weak | stronger |
| Answer omission rate | higher | lower |
| Wrong-or-missing answers | non-zero | zero (grounded) |
This is the first real positive ski-chalet result. It shows that useful offline answering emerges once verified local harness support exists. The harness — not a model upgrade — is what produces the improvement.
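The counts behind the last two rows of the table above can be tallied with a small scoring helper. What follows is a minimal sketch, not the evaluation code used for these results. The `Verdict` labels, the `GradedAnswer` record, and the `summarize` function are hypothetical names introduced for illustration, and the sketch assumes each answer has already been graded by hand or by a separate checker.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    """Hypothetical grading labels for one answer on the fixed question set."""
    CORRECT = "correct"
    WRONG = "wrong"
    OMITTED = "omitted"


@dataclass(frozen=True)
class GradedAnswer:
    question_id: str
    verdict: Verdict


@dataclass(frozen=True)
class Summary:
    total: int
    omission_rate: float      # share of questions with no usable answer
    wrong_or_missing: int     # wrong answers plus omitted answers


def summarize(graded: list[GradedAnswer]) -> Summary:
    """Tally omission rate and wrong-or-missing count for one model mode."""
    total = len(graded)
    if total == 0:
        raise ValueError("cannot summarize an empty question set")
    omitted = sum(1 for g in graded if g.verdict is Verdict.OMITTED)
    wrong = sum(1 for g in graded if g.verdict is Verdict.WRONG)
    return Summary(
        total=total,
        omission_rate=omitted / total,
        wrong_or_missing=wrong + omitted,
    )
```

Running `summarize` once on the raw-mode grades and once on the grounded-mode grades, over the same question IDs, yields one column of the comparison per call. Keeping the question set fixed is what makes the two summaries comparable.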
The V4 rows exposed the real failure mode (diagnostic)
The most interpretable V4 ski-chalet evidence showed a specific, recoverable failure pattern:
- The local layer kept user intent ✓
- It drifted off the Phoenix tool surface on the first pass ✗
- One structured redirect was enough to recover ✓
- Final answers were grounded and useful ✓
The weakness was not total failure. It was tool-surface drift and local instability in the absence of enough control pressure. That is far more tractable: it is a control-layer problem, not a model-quality problem.
One redirect recovered the session. That is the control layer doing its job.
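The drift-then-redirect pattern described above can be sketched as a small control loop. This is an illustrative sketch under stated assumptions, not the harness used in the V4 runs. The tool names in `ALLOWED_TOOLS`, the `Step` record, and the `propose` callable (a stand-in for the local model) are all hypothetical; the actual Phoenix tool surface is not specified here.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical tool names standing in for the real tool surface.
ALLOWED_TOOLS = frozenset({"lookup_record", "list_records"})


@dataclass(frozen=True)
class Step:
    tool: str       # tool the model wants to call
    argument: str   # argument it wants to pass


def run_with_redirect(
    propose: Callable[[str], Step],
    request: str,
    allowed: frozenset[str] = ALLOWED_TOOLS,
    max_redirects: int = 1,
) -> tuple[Step, int]:
    """Return the first on-surface step and the number of redirects used.

    `propose` stands in for the local model: it maps a prompt to a proposed
    tool call. If the proposal is off the allowed surface, the loop re-asks
    with a structured redirect that restates the original request.
    """
    prompt = request
    for redirects_used in range(max_redirects + 1):
        step = propose(prompt)
        if step.tool in allowed:
            return step, redirects_used
        # Structured redirect: keep the user intent, name the drift,
        # and list the tools that are actually available.
        prompt = (
            f"{request}\n\n"
            f"The tool '{step.tool}' is not available. "
            f"Use only: {', '.join(sorted(allowed))}."
        )
    raise RuntimeError("model stayed off the tool surface after redirects")
```

With `max_redirects=1`, a session that drifts on the first pass and recovers on the second returns a redirect count of 1, which matches the pattern in the list above. A session that is still off-surface after the redirect raises instead of silently proceeding, so the failure is visible to the audit layer.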
The practical variable is harness level, not model size (key finding)
The current evidence supports a simple interpretation:
The odds of offline success rise with preparation, grounding, auditability, and control — not only with a stronger raw model. This shifts the investment question: instead of "should I upgrade the model?", the more productive question is "should I improve the harness?"
Interpretation
The ski-chalet result is best understood as a harness boundary result.
When the question is framed as "can a raw local model do this alone?", the answer is no. That framing was useful for establishing the lower boundary, but it is not where the interesting result lives.
When the question is framed as "can a prepared local system with a working harness do this?", the answer is yes — for Tier 2 operations with the full four-layer harness present.
That reframing is not softening the claim. It is making it more honest and more useful. The original thought experiment imagined a blank-box scenario that mostly does not occur in practice. A local-LLM hobbyist who owns a 3090 is very likely to already have Ollama, a model, and some local project structure. The portable bundle is an incremental addition to that existing setup, not a heroic leap.
What This Paper Does Not Claim
- That a blank random machine can do this
- That a raw local model is enough
- That the setup is ordinary consumer behavior
- That every offline environment will have the same success profile
- That the current harness level is final or optimal
- That the strictest possible version of the original thought experiment passed — it mostly yields a useful "no," and that stricter lane should be kept conceptually distinct
Why It Matters
This matters because it gives local inference a more honest evaluation frame.
Instead of asking: can the model do it alone?
The better question is: what local preparation and harness level are enough to preserve useful work when the cloud disappears?
That is more relevant to real resilience and continuity than a pure prompt-only test. It is also the question that a builder can actually act on — because preparation, harness level, and bundle contents are all things you can improve before you leave for the chalet.
Working Conclusion
The ski-chalet experiment should be explained as a prepared offline continuity scenario — not as a raw-local-model miracle.
A local 3090 setup can remain useful offline when it is supported by a defined harness of deterministic data, grounding, control, and auditability. That is the clearest current ski-chalet result.
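One way to make the harness definition actionable is a pre-departure readiness check over the four layers named above. The sketch below is only an illustration: `HarnessStatus`, `missing_layers`, and `is_ready` are hypothetical names, and how each layer is verified in practice is left as an assumption outside the sketch.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class HarnessStatus:
    """Whether each harness layer has been verified on the local machine."""
    deterministic_data: bool
    grounding: bool
    control: bool
    auditability: bool


def missing_layers(status: HarnessStatus) -> list[str]:
    """Return the names of layers that are not yet in place, in field order."""
    return [f.name for f in fields(status) if not getattr(status, f.name)]


def is_ready(status: HarnessStatus) -> bool:
    """The setup counts as prepared only when no layer is missing."""
    return not missing_layers(status)
```

A check like this reports which layer to fix before going offline, rather than giving a single pass or fail verdict.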