Results

What the Evidence Says

Four results: one lower boundary, one positive claim, one diagnostic, and one key finding about the practical variable.

  • Raw model alone: no (lower boundary)
  • Harnessed local: yes (positive claim)
  • One redirect recovers a drifted session (diagnostic)
  • Harness level, not model size, is the practical variable (key finding)

These results are a boundary paper, not a final benchmark against every possible local setup. They establish the lower bound, the positive claim, the dominant failure mode, and the variable that matters most.

Four Results

1. Raw local behavior is not enough (lower boundary)

Recent local-mode evidence shows that raw local Gemma is too weak to stand in for useful offline domain work. On fixed domain question sets, the raw model produces answers that are too soft, too incomplete, and too unanchored to be reliable.

The bare-mirror version of the ski-chalet fantasy — one machine, one model, no structure, no harness — gets a firm practical answer: no.

This is still a useful result. It defines the lower boundary precisely. The question is not whether raw local works — it does not. The question is what is needed above it.

2. A harnessed local mode is materially better (positive claim)

Grounded Gemma performed much better than raw Gemma on the fixed tennis domain question set:

Dimension                  Raw Gemma   Grounded Gemma
Specificity                weak        stronger
Alignment to domain        weak        stronger
Answer omission rate       higher      lower
Wrong-or-missing answers   non-zero    zero

This is the first real positive ski-chalet result. It shows that useful offline answering emerges once verified local harness support exists. The harness — not a model upgrade — is what produces the improvement.
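
The grounding step described above can be sketched as a thin retrieval layer that assembles verified local snippets into the prompt before the model sees the question. This is an illustrative sketch only: the function names, the word-overlap retriever, and the corpus are hypothetical stand-ins, not the paper's actual harness code.

```python
# Hypothetical sketch of the grounding layer: prepend verified local
# snippets to the question so the model answers from local evidence.
# A real harness would use a proper retriever; word overlap is a stand-in.

def ground(question: str, corpus: dict[str, str], k: int = 2) -> str:
    """Pick the k snippets sharing the most words with the question
    and build a grounded prompt around them."""
    q_words = set(question.lower().split())
    scored = sorted(
        corpus.items(),
        key=lambda kv: len(q_words & set(kv[1].lower().split())),
        reverse=True,
    )
    context = "\n".join(text for _, text in scored[:k])
    return (
        "Answer ONLY from the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Toy local corpus for the fixed tennis domain (illustrative content).
corpus = {
    "rules":   "a tennis set is won at six games with a two-game margin",
    "scoring": "points run 15 30 40 then game",
}

prompt = ground("how many games win a set", corpus)
print(prompt.splitlines()[0])  # the "answer only from context" instruction
```

The design point matches the result above: the improvement comes from what surrounds the model call (verified context plus a refusal instruction), not from the model itself.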

3. The V4 rows exposed the real failure mode (diagnostic)

The most interpretable V4 ski-chalet evidence showed a specific, recoverable failure pattern:

  • The local layer kept user intent
  • It drifted off the Phoenix tool surface on the first pass ✗
  • One structured redirect was enough to recover
  • Final answers were grounded and useful

The weakness was not total failure. It was tool-surface drift and local instability without enough control pressure. That is a much more tractable problem than total failure — it is a control-layer problem, not a model-quality problem.

One redirect recovered the session. That is the control layer doing its job.
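
The single-redirect pattern can be sketched as a small control loop: detect a reply that steps off the allowed tool surface, issue one structured redirect, and accept the recovered answer. Everything here is hypothetical: the tool names, the drift check, and the stub model are illustrative, not the actual Phoenix tool surface or control layer.

```python
# Illustrative control-layer loop: one structured redirect when the model
# drifts off the allowed tool surface. The model is a stub; in the real
# setup this would be a local Gemma call. All names are hypothetical.

ALLOWED_TOOLS = {"phoenix_query", "phoenix_list"}

def drifted(reply: str) -> bool:
    """Treat any tool mention outside the allowed surface as drift."""
    called = {w.split("(")[0] for w in reply.split()
              if w.startswith(("phoenix_", "web_"))}
    return bool(called - ALLOWED_TOOLS)

def run_with_redirect(ask, prompt: str, max_redirects: int = 1) -> str:
    """Apply control pressure: at most max_redirects structured redirects."""
    reply = ask(prompt)
    for _ in range(max_redirects):
        if not drifted(reply):
            break
        redirect = (prompt + "\nUse only these tools: "
                    + ", ".join(sorted(ALLOWED_TOOLS)))
        reply = ask(redirect)
    return reply

# Stub model: drifts on the first pass, recovers after the redirect,
# mirroring the V4 pattern described above.
calls = {"n": 0}
def fake_model(prompt: str) -> str:
    calls["n"] += 1
    return "web_search(...)" if calls["n"] == 1 else "phoenix_query(...)"

print(run_with_redirect(fake_model, "List court bookings"))
```

One call, one drift, one redirect, one grounded reply: the loop encodes the claim that this is a control-layer problem, so the fix lives in the loop, not in the model.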

4. The practical variable is harness level, not model size (key finding)

The current evidence supports a simple interpretation:

  • More preparation → better odds
  • More harness → better odds
  • Stronger grounding → better odds
  • Better auditability → better odds
  • Clearer controls → better odds
  • A bigger model alone → limited improvement without the above

The odds of offline success rise with preparation, grounding, auditability, and control — not only with a stronger raw model. This shifts the investment question: instead of "should I upgrade the model?", the more productive question is "should I improve the harness?"

Interpretation

The ski-chalet result is best understood as a harness boundary result.

When the question is framed as "can a raw local model do this alone?", the answer is no. That framing was useful for establishing the lower boundary, but it is not where the interesting result lives.

When the question is framed as "can a prepared local system with a working harness do this?", the answer is yes — for Tier 2 operations with the full four-layer harness present.

That reframing is not softening the claim. It is making it more honest and more useful. The original thought experiment imagined a blank-box scenario that mostly does not occur in practice. A local-LLM hobbyist who owns a 3090 is very likely to already have Ollama, a model, and some local project structure. The portable bundle is an incremental addition to that existing setup, not a heroic leap.

What This Paper Does Not Claim

  • That a blank random machine can do this
  • That a raw local model is enough
  • That the setup is ordinary consumer behavior
  • That every offline environment will have the same success profile
  • That the current harness level is final or optimal
  • That the strictest possible version of the original thought experiment passed — it mostly yields a useful "no," and that stricter lane should be kept conceptually distinct

Why It Matters

This matters because it gives local inference a more honest evaluation frame.

Instead of asking: can the model do it alone?

The better question is: what local preparation and harness level are enough to preserve useful work when the cloud disappears?

That is more relevant to real resilience and continuity than a pure prompt-only test. It is also the question that a builder can actually act on — because preparation, harness level, and bundle contents are all things you can improve before you leave for the chalet.

Working Conclusion

The ski-chalet experiment should be explained as a prepared offline continuity scenario — not as a raw-local-model miracle.

A local 3090 setup can remain useful offline when it is supported by a defined harness of deterministic data, grounding, control, and auditability. That is the clearest current ski-chalet result.
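
The conclusion's notion of "harness level" can be made concrete as a checklist over the four layers named above (deterministic data, grounding, control, auditability). This is a minimal sketch under that assumption; the class name, fields, and the Tier 2 readiness rule are illustrative, not the paper's formal definition.

```python
# Hypothetical checklist for the four-layer harness named in the paper:
# deterministic data, grounding, control, auditability. Illustrative only.

from dataclasses import dataclass

@dataclass
class Harness:
    deterministic_data: bool = False  # fixed local corpus / bundle contents
    grounding: bool = False           # verified context fed to the model
    control: bool = False             # redirect pressure on tool drift
    auditability: bool = False        # answers traceable to local evidence

    def level(self) -> int:
        """Count of layers present: the practical variable, per result 4."""
        return sum([self.deterministic_data, self.grounding,
                    self.control, self.auditability])

    def tier2_ready(self) -> bool:
        # The positive claim above holds with the full four-layer harness.
        return self.level() == 4

print(Harness(True, True, True, True).tier2_ready())  # prints True
```

Framing the harness as a checklist makes the investment question from result 4 actionable: raise `level()` before reaching for a bigger model.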