Results

What the Evidence Says

Four results: one lower boundary, one positive claim, one diagnostic, and one key finding about the practical variable.

  • Raw model alone: no (lower boundary)
  • Harnessed local: yes (positive claim)
  • One redirect recovers a drifted session (diagnostic)
  • Harness level, not model size, is the practical variable (key finding)

These results are a boundary paper, not a final benchmark against every possible local setup. They establish the lower bound, the positive claim, the dominant failure mode, and the variable that matters most.

Four Results

1. Raw local behavior is not enough (lower boundary)

Recent local-mode evidence shows that raw local Gemma is too weak to stand in for useful offline domain work. On fixed domain question sets, the raw model produces answers that are too soft, too incomplete, and too unanchored to be reliable.

The bare-mirror version of the ski-chalet fantasy — one machine, one model, no structure, no harness — gets a firm practical answer: no.

This is still a useful result. It defines the lower boundary precisely. The question is not whether raw local works — it does not. The question is what is needed above it.

2. A harnessed local mode is materially better (positive claim)

Grounded Gemma performed much better than raw Gemma on the fixed tennis domain question set:

Dimension                  Raw Gemma   Grounded Gemma
Specificity                weak        stronger
Alignment to domain        weak        stronger
Answer omission rate       higher      lower
Wrong-or-missing answers   non-zero    zero

This is the first real positive ski-chalet result. It shows that useful offline answering emerges once verified local harness support exists. The harness — not a model upgrade — is what produces the improvement.
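
The grounding step described above can be sketched as a thin retrieval layer that assembles verified local snippets into the prompt before the model sees the question. This is an illustrative sketch only: the function names, the word-overlap retriever, and the corpus are hypothetical stand-ins, not the paper's actual harness code.

```python
# Hypothetical sketch of the grounding layer: prepend verified local
# snippets to the question so the model answers from local evidence.
# A real harness would use a proper retriever; word overlap is a stand-in.

def ground(question: str, corpus: dict[str, str], k: int = 2) -> str:
    """Pick the k snippets sharing the most words with the question
    and build a grounded prompt around them."""
    q_words = set(question.lower().split())
    scored = sorted(
        corpus.items(),
        key=lambda kv: len(q_words & set(kv[1].lower().split())),
        reverse=True,
    )
    context = "\n".join(text for _, text in scored[:k])
    return (
        "Answer ONLY from the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Toy local corpus for the fixed tennis domain (illustrative content).
corpus = {
    "rules":   "a tennis set is won at six games with a two-game margin",
    "scoring": "points run 15 30 40 then game",
}

prompt = ground("how many games win a set", corpus)
print(prompt.splitlines()[0])  # the "answer only from context" instruction
```

The design point matches the result above: the improvement comes from what surrounds the model call (verified context plus a refusal instruction), not from the model itself.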

3. The V4 rows exposed the real failure mode (diagnostic)

The most interpretable V4 ski-chalet evidence showed a specific, recoverable failure pattern:

  • The local layer kept user intent
  • It drifted off the Phoenix tool surface on the first pass ✗
  • One structured redirect was enough to recover
  • Final answers were grounded and useful

The weakness was not total failure. It was tool-surface drift and local instability without enough control pressure. That is a much more tractable problem than total failure — it is a control-layer problem, not a model-quality problem.

One redirect recovered the session. That is the control layer doing its job.
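
The single-redirect pattern can be sketched as a small control loop: detect a reply that steps off the allowed tool surface, issue one structured redirect, and accept the recovered answer. Everything here is hypothetical: the tool names, the drift check, and the stub model are illustrative, not the actual Phoenix tool surface or control layer.

```python
# Illustrative control-layer loop: one structured redirect when the model
# drifts off the allowed tool surface. The model is a stub; in the real
# setup this would be a local Gemma call. All names are hypothetical.

ALLOWED_TOOLS = {"phoenix_query", "phoenix_list"}

def drifted(reply: str) -> bool:
    """Treat any tool mention outside the allowed surface as drift."""
    called = {w.split("(")[0] for w in reply.split()
              if w.startswith(("phoenix_", "web_"))}
    return bool(called - ALLOWED_TOOLS)

def run_with_redirect(ask, prompt: str, max_redirects: int = 1) -> str:
    """Apply control pressure: at most max_redirects structured redirects."""
    reply = ask(prompt)
    for _ in range(max_redirects):
        if not drifted(reply):
            break
        redirect = (prompt + "\nUse only these tools: "
                    + ", ".join(sorted(ALLOWED_TOOLS)))
        reply = ask(redirect)
    return reply

# Stub model: drifts on the first pass, recovers after the redirect,
# mirroring the V4 pattern described above.
calls = {"n": 0}
def fake_model(prompt: str) -> str:
    calls["n"] += 1
    return "web_search(...)" if calls["n"] == 1 else "phoenix_query(...)"

print(run_with_redirect(fake_model, "List court bookings"))
```

One call, one drift, one redirect, one grounded reply: the loop encodes the claim that this is a control-layer problem, so the fix lives in the loop, not in the model.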

4. The practical variable is harness level, not model size (key finding)

The current evidence supports a simple interpretation:

  • More preparation → better odds
  • More harness → better odds
  • Stronger grounding → better odds
  • Better auditability → better odds
  • Clearer controls → better odds
  • A bigger model alone → limited improvement without the above

The odds of offline success rise with preparation, grounding, auditability, and control — not only with a stronger raw model. This shifts the investment question: instead of "should I upgrade the model?", the more productive question is "should I improve the harness?"

Interpretation

The ski-chalet result is best understood as a harness boundary result.

When the question is framed as "can a raw local model do this alone?", the answer is no. That framing was useful for establishing the lower boundary, but it is not where the interesting result lives.

When the question is framed as "can a prepared local system with a working harness do this?", the answer is yes — for Tier 2 operations with the full four-layer harness present.

That reframing is not softening the claim. It is making it more honest and more useful. The original thought experiment imagined a blank-box scenario that mostly does not occur in practice. A local-LLM hobbyist who owns a 3090 is very likely to already have Ollama, a model, and some local project structure. The portable bundle is an incremental addition to that existing setup, not a heroic leap.

What This Paper Does Not Claim

  • That a blank random machine can do this
  • That a raw local model is enough
  • That the setup is ordinary consumer behavior
  • That every offline environment will have the same success profile
  • That the current harness level is final or optimal
  • That the strictest possible version of the original thought experiment passed — it mostly yields a useful "no," and that stricter lane should be kept conceptually distinct

Why It Matters

This matters because it gives local inference a more honest evaluation frame.

Instead of asking: can the model do it alone?

The better question is: what local preparation and harness level are enough to preserve useful work when the cloud disappears?

That is more relevant to real resilience and continuity than a pure prompt-only test. It is also the question that a builder can actually act on — because preparation, harness level, and bundle contents are all things you can improve before you leave for the chalet.

Working Conclusion

The ski-chalet experiment should be explained as a prepared offline continuity scenario — not as a raw-local-model miracle.

A local 3090 setup can remain useful offline when it is supported by a defined harness of deterministic data, grounding, control, and auditability. That is the clearest current ski-chalet result.
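
The conclusion's notion of "harness level" can be made concrete as a checklist over the four layers named above (deterministic data, grounding, control, auditability). This is a minimal sketch under that assumption; the class name, fields, and the Tier 2 readiness rule are illustrative, not the paper's formal definition.

```python
# Hypothetical checklist for the four-layer harness named in the paper:
# deterministic data, grounding, control, auditability. Illustrative only.

from dataclasses import dataclass

@dataclass
class Harness:
    deterministic_data: bool = False  # fixed local corpus / bundle contents
    grounding: bool = False           # verified context fed to the model
    control: bool = False             # redirect pressure on tool drift
    auditability: bool = False        # answers traceable to local evidence

    def level(self) -> int:
        """Count of layers present: the practical variable, per result 4."""
        return sum([self.deterministic_data, self.grounding,
                    self.control, self.auditability])

    def tier2_ready(self) -> bool:
        # The positive claim above holds with the full four-layer harness.
        return self.level() == 4

print(Harness(True, True, True, True).tier2_ready())  # prints True
```

Framing the harness as a checklist makes the investment question from result 4 actionable: raise `level()` before reaching for a bigger model.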