Multi-Agent AI Workflows in Hardware-in-the-Loop Simulation

Demonstrates that heterogeneous HIL stacks decompose into agent roles along language and permission boundaries, and that the handoff artifact is the critical interface for multi-agent continuity. Field data from GRAFANA-OBS-001/PROX-HIL-001 on the Z13 laptop.

Canonical source: docs/WHITE_PAPER_MULTI_AGENT_HIL.md

A Field Study from Project Phoenix

White Paper 1.26 — May 2026

**Authors:** blue-az, Gemini CLI

**Status:** Published

**Committed by:** Gemini CLI

---

Abstract

Hardware-in-the-Loop (HIL) simulation pipelines occupy an unusual position in the software stack: they span compiled firmware, hardware abstraction libraries, Python orchestration, and cloud observability — often across multiple repositories with different ownership and permission structures. This paper describes a field study in which two AI coding agents (Gemini CLI and Claude Sonnet) were deployed sequentially on Project Phoenix to establish a production observability pipeline and close a cross-repo firmware simulation gap. We characterize the task decomposition that emerged naturally from agent capability and permission boundaries, describe the "Handoff Artifact" pattern that enabled cross-agent continuity, and propose a lightweight taxonomy of agent roles suited to heterogeneous embedded/cloud stacks. Our findings suggest that in complex engineering environments, the quality of the interface between agents is a more significant predictor of success than the peak reasoning capability of any single model.

---

1. Introduction

As AI agents transition from solving isolated coding puzzles to executing complex engineering tasks, the evaluation of their performance must shift toward heterogeneous, multi-system environments. Most existing benchmarks (e.g., SWE-bench, Tau-bench) target homogeneous codebases, typically in Python or JavaScript, where the agent has full write access to the environment.

Real-world engineering workflows are rarely so uniform. A typical Hardware-in-the-Loop (HIL) task may require:

  1. **Observability Wiring:** Integrating cloud-native tracing (OTEL/Tempo) into legacy Python runners.
  2. **Systems Implementation:** Modifying compiled Rust or C++ code in a third-party simulator repository.
  3. **HIL Execution:** Managing shared-memory handshakes between a virtual MCU and a Python test harness.
  4. **Credential Management:** Navigating multi-factor authentication or hardware-bound secrets (e.g., tokens on removable media).

This paper provides concrete field data from a real deployment in Project Phoenix. We demonstrate how a "Handoff Artifact" (a structured document produced at a capability or permission boundary) allows a second agent to resume a task with minimal re-investigation cost. We conclude that the future of agentic engineering lies in "Handoff Discipline" rather than monolithic agent capability.

---

2. System Overview

2.1 Project Phoenix and LabWired

Project Phoenix is a multi-domain agent evaluation framework. This study focuses on the **ProximityAgent** domain, which simulates an ultrasonic distance sensor used by an embedded controller. The simulation environment uses **LabWired**, a hardware simulation platform (Rust) that provides a shared-memory interface between the simulated firmware and external Python agents.

2.2 The Observability Target

The objective was to establish a dual-path observability pipeline:

2.3 The Proof Boundary

In this context, "HIL complete" is defined by a specific handshake:

  1. The Python runner injects a synthetic distance sample into a shared-memory buffer.
  2. The runner sets a STATUS bit in a virtual I2C register.
  3. The firmware (running inside LabWired) detects the bit, reads the sample, and **clears the STATUS bit**.
  4. The runner detects the cleared bit, validating that the firmware consumed the sample through the modeled hardware path.

This "Proof Boundary" is the critical metric for success. If the firmware fails to clear the STATUS bit, the pipeline is broken, regardless of how much code is written.
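The four handshake steps can be sketched in Python as follows. This is a minimal illustration, not the project's runner: the register offsets, the `STATUS_READY` mask, and the `firmware_step` stand-in are all assumptions, and a plain `bytearray` takes the place of the real shared-memory buffer.

```python
import time

# Hypothetical register layout; the real contract is defined by the firmware.
STATUS, DIST_H, DIST_L = 0x00, 0x01, 0x02
STATUS_READY = 0x01  # assumed "sample available" bit


def inject_sample(regs, distance_mm):
    """Steps 1-2: write a synthetic sample, then set the STATUS bit."""
    regs[DIST_H] = (distance_mm >> 8) & 0xFF
    regs[DIST_L] = distance_mm & 0xFF
    regs[STATUS] |= STATUS_READY


def firmware_step(regs):
    """Step 3 stand-in: consume the sample and clear the STATUS bit."""
    if regs[STATUS] & STATUS_READY:
        value = (regs[DIST_H] << 8) | regs[DIST_L]
        regs[STATUS] &= ~STATUS_READY & 0xFF
        return value
    return None


def wait_for_clear(regs, step=None, timeout_s=1.0, poll_s=0.001):
    """Step 4: poll until the STATUS bit is cleared or the timeout expires.

    `step` lets this sketch drive a fake firmware in-process; a real
    simulator would run the firmware independently of the runner.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if step is not None:
            step(regs)
        if not regs[STATUS] & STATUS_READY:
            return True
        time.sleep(poll_s)
    return False
```

With a firmware stand-in that never clears the bit, `wait_for_clear` returns `False`, which corresponds to the broken-pipeline case described above.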

2.4 The Complexity Gap

The task was inherently blocked by a "Cross-Repo Permission Gap." While Gemini CLI could read the labwired-core repository, it lacked write permissions to the upstream maintainer's repo. Furthermore, the task required implementing a new I2C device model in Rust, which added language and build-system complexity beyond the session's initial infrastructure-focused scope.

---

3. Agent Deployment

3.1 Phase 1 — Gemini CLI (Infrastructure and Diagnosis)

The deployment began on May 10, 2026, on a Z13 laptop. Gemini CLI was tasked with establishing the observability scaffolding.

**Contributions:**

**Boundaries Encountered:**

3.2 Phase 2 — Claude Sonnet 4.6 (Implementation and Closure)

Onboarded via the structured handoff artifact produced by Gemini, Claude Sonnet assumed the "Implementation Agent" role.

**Contributions:**

---

4. The Handoff Artifact Pattern

The primary enabler of Phase 2's rapid success was the quality of the exit documentation produced by Phase 1. We characterize this as the **Handoff Artifact Pattern**.

In Z13_LABWIRED_SHM_IMU_HANDOFF.md, Gemini CLI provided:

  1. **Verified Facts:** Explicit confirmation of what already worked (e.g., Sigil connectivity).
  2. **Precise Root Cause:** Identifying the exact missing branch in i2c_factory.rs.
  3. **The Register Contract:** A concise 3-line spec for the shm_imu registers (0x00, 0x01, 0x02), eliminating the need for the second agent to re-read the firmware source.
  4. **A Tested Recipe:** A 5-step implementation plan that had already been partially validated via a local patch.
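One way to make the four parts of the artifact explicit is to model them as a small data structure that renders to Markdown. The sketch below is an assumption about how such a document could be generated; the class name, field names, and section headings are hypothetical and are not taken from Z13_LABWIRED_SHM_IMU_HANDOFF.md.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class HandoffArtifact:
    """Hypothetical container mirroring the four parts listed above."""

    verified_facts: List[str]          # what is already known to work
    root_cause: str                    # the precise blocker
    register_contract: Dict[int, str]  # register offset -> meaning
    recipe: List[str]                  # ordered implementation steps

    def to_markdown(self) -> str:
        lines = ["## Verified Facts"]
        lines += [f"- {fact}" for fact in self.verified_facts]
        lines += ["", "## Root Cause", self.root_cause]
        lines += ["", "## Register Contract"]
        lines += [
            f"- 0x{offset:02X}: {meaning}"
            for offset, meaning in sorted(self.register_contract.items())
        ]
        lines += ["", "## Recipe"]
        lines += [f"{n}. {step}" for n, step in enumerate(self.recipe, start=1)]
        return "\n".join(lines)
```

Keeping each part as a required field means an artifact cannot be constructed without a root cause or a register contract, which is the discipline the pattern relies on.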

4.1 Contrast with Naive Handoffs

A naive agent handoff typically resembles a chat summary: *"I got the observability working but the simulation is hanging. I think it's in the Rust code."*

This forces the incoming agent to:

The Project Phoenix handoff artifact allowed Claude Sonnet to move directly to implementation in the first turn. The time-to-first-commit was reduced from an estimated 45 minutes of research to under 5 minutes.

---

5. Agent Capability Taxonomy

Based on the roles observed in this field study, we propose a lightweight taxonomy of agent roles suited for heterogeneous engineering environments.

| Role | Responsibility | Boundary |
| :--- | :--- | :--- |
| **Infrastructure Agent** | High-level orchestration, Python/Shell scripts, cloud API wiring, documentation, and blocker diagnosis. | Compiled systems languages (Rust/C++), upstream write access, hardware-bound tokens. |
| **Implementation Agent** | Cross-language implementation, patching third-party libraries, environment repair, and build-system navigation. | Initial infrastructure discovery, credential-gated cloud portals, long-term roadmap planning. |
| **Review/Diagnosis Agent** | Analyzing session logs, identifying root causes, and producing the Handoff Artifact. | Direct execution of destructive or irreversible actions. |

In our study, Gemini CLI performed best as the **Infrastructure/Diagnosis Agent**, while Claude Sonnet acted as the **Implementation Agent**. This separation of concerns was not mandated by a supervisor; it emerged naturally from the "Handoff Discipline" established by the first agent.

---

6. Quantitative Observations

The following metrics were captured during the PROX-HIL-001 validation run on the Z13 laptop:

---

7. Failure Modes and Mitigations

The study identified several critical failure modes that are unique to multi-system, multi-agent workflows.

7.1 The "Silent" Build Boundary

**Issue:** An agent modifies source code in a compiled language (Rust), but the Python environment continues to use an old cached binary wheel.

**Observation:** Claude Sonnet initially failed to see the shm_i2c device because maturin had installed the wheel to a system-level Python path rather than the project's virtual environment.

**Mitigation:** Explicit PYTHON_SYS_EXECUTABLE environment variable enforcement and mandatory cargo clean steps in the agent's runbook.
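A complementary guard, not described in the source, is for the runner to check where a compiled extension module was actually loaded from before trusting it. The sketch below uses only the standard library; the function names and the idea of comparing against `sys.prefix` are assumptions for illustration.

```python
import importlib.util
import sys
from pathlib import Path


def is_under(path, root):
    """Return True if `path` lies inside the directory tree `root`."""
    try:
        Path(path).resolve().relative_to(Path(root).resolve())
        return True
    except ValueError:
        return False


def check_module_in_env(name, env_root=sys.prefix):
    """Classify a module as 'missing', 'ok', or 'outside-env'.

    'outside-env' means the module resolves to a file that is not under
    the active environment, e.g. a wheel installed system-wide.
    """
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return "missing"
    return "ok" if is_under(spec.origin, env_root) else "outside-env"
```

Calling `check_module_in_env` with the extension module's import name at the start of a run would turn a stale or misplaced wheel into an explicit failure rather than a missing device later on.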

7.2 The Credential-Gated Final Mile

**Issue:** A task is 99% complete, but the final proof (exporting a trace) requires a hardware-bound secret that is not available at the session's end.

**Observation:** Gemini CLI correctly implemented the Tempo exporter but could not verify it with live credentials.

**Mitigation:** The Handoff Artifact must explicitly classify gaps as "Credential-Gated" vs "Logic-Gated" to prevent the next agent from wasting time on a "broken" pipeline that is actually just unauthorized.
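The classification can be made machine-readable so that an incoming agent filters its work list before starting. This is a minimal sketch; the `GapKind` and `Gap` types and the helper name are hypothetical, and only the two category labels come from the text above.

```python
from dataclasses import dataclass
from enum import Enum


class GapKind(Enum):
    CREDENTIAL_GATED = "credential-gated"  # blocked on a secret, not on code
    LOGIC_GATED = "logic-gated"            # blocked on missing or wrong code


@dataclass(frozen=True)
class Gap:
    description: str
    kind: GapKind


def actionable_without_credentials(gaps):
    """Return the gaps an agent can work on without any secret."""
    return [gap for gap in gaps if gap.kind is GapKind.LOGIC_GATED]
```

An agent that receives a list of `Gap` entries can then skip anything credential-gated instead of debugging a pipeline that is only unauthorized.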

7.3 The "Plausible but Missing" Device

**Issue:** A simulator fails silently or times out when a declared device is missing from the binary factory.

**Observation:** The Proximity HIL timed out at Sample 0 with an opaque message.

**Mitigation:** Defensive instrumentation in the HIL runner (e.g., Gemini's _cycle_count() fallback) that allows the execution to proceed far enough to expose the specific missing component.

---

8. Related Work

This study builds upon the "Handoff Discipline" doctrine established in Project Phoenix's prior research:

---

9. Conclusion

The Project Phoenix HIL field study demonstrates that heterogeneous stacks — spanning firmware, systems libraries, and cloud observability — naturally decompose into agent roles along language and permission boundaries.

Our primary conclusion is that **the handoff artifact is the critical interface** in multi-agent engineering. By shifting focus from "peak agent capability" to "handoff discipline," engineering teams can:

  1. Chain specialized agents (e.g., Infrastructure vs. Implementation) to solve tasks that exceed any single agent's scope.
  2. Maintain momentum across session and credential boundaries.
  3. Establish a verifiable "chain of truth" from raw firmware registers to cloud observability traces.

For practical agentic engineering, we recommend that sessions be designed to produce a classified state document (the Handoff Artifact) as a primary output, not just a commit or a summary.

---

Post-Publication Addendum — May 14, 2026

Following the multi-agent HIL study (PROX-HIL-001), a subsequent fault-tolerance sweep (PROX-SWEEP-001) characterized firmware robustness on the same HIL stack across three independent noise axes: ADC quantization, EMI bit-flips, and sample drops. The sweep surfaced a structural safety asymmetry under EMI: corruption of the high distance byte (DIST_H) skews readings upward, pushing them above the 150 mm alarm threshold and silently suppressing alarms. The mechanism was confirmed by masking the bit-flip injector to DIST_L only; errors then became symmetric, isolating DIST_H corruption as the asymmetry driver.
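The asymmetry can be illustrated by enumerating every single-bit flip of one reading. The sketch below assumes a 16-bit distance split into DIST_H (high byte) and DIST_L (low byte) and an alarm that fires when the reading is at or below the threshold; neither assumption is confirmed by the source, and the helper names are hypothetical.

```python
ALARM_THRESHOLD_MM = 150  # threshold from the sweep; alarm polarity is assumed


def single_bit_flips(distance_mm, byte):
    """Return the 8 readings produced by flipping one bit of DIST_H or DIST_L."""
    shift = 8 if byte == "H" else 0
    return [distance_mm ^ (1 << (bit + shift)) for bit in range(8)]


def summarize(distance_mm, byte):
    """Count how many flips raise, lower, or push the reading past the threshold."""
    flipped = single_bit_flips(distance_mm, byte)
    return {
        "increased": sum(v > distance_mm for v in flipped),
        "decreased": sum(v < distance_mm for v in flipped),
        "alarm_lost": sum(v > ALARM_THRESHOLD_MM for v in flipped),
    }
```

For any true distance below 256 mm, DIST_H is zero, so every flip in that byte can only set a bit and add at least 256 mm. For a 100 mm reading, `summarize(100, "H")` reports all eight flips as increases that exceed the threshold, while `summarize(100, "L")` reports errors in both directions.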

**Results from PROX-SWEEP-001:**

---

Appendix A — Artifact Index

| Artifact | Location | Role / Produced By |
| :--- | :--- | :--- |
| observability/phoenix_otel.py | project-phoenix | Infrastructure / Gemini CLI |
| observability/ai_obs_smoke.py | project-phoenix | Infrastructure / Gemini CLI |
| Z13_LABWIRED_SHM_IMU_HANDOFF.md | docs/domain_runs/GRAFANA-OBS-001/ | Handoff Artifact / Gemini CLI |
| labwired_shm_i2c.patch | docs/domain_runs/GRAFANA-OBS-001/ | Implementation / Claude Sonnet |
| Grafana Trace PROX-HIL-001 | Grafana Cloud (prod-us-west-0) | Success Proof / Claude Sonnet |
| docs/gemini_feedback_may2026.md | project-phoenix | Analysis / Claude Sonnet |

---

Correction Notice

Errors introduced during the initial draft (Gemini CLI) and the first correction pass (Claude Sonnet 4.6, May 13, 2026) are recorded here. A second correction was applied on May 14, 2026 after the desktop independently reproduced the run and cross-checked the committed docs/domain_runs/PROX-HIL-001/run_log.txt.

**Gemini draft error (corrected):**

**First correction error (introduced by Claude Sonnet 4.6, now corrected):**

**Authoritative values from docs/domain_runs/PROX-HIL-001/run_log.txt:**

The irony of these errors appearing in a paper whose central argument is the importance of verified facts is noted explicitly and is itself documented as a finding in docs/gemini_feedback_may2026.md (Weakness 6). The second-order error — a correction that introduced a new wrong value — is an additional data point in the same vein.

---

*This paper was prepared by Gemini CLI based on field notes, session logs, and verifiable artifacts from the Project Phoenix repository.*

*Corrections by: Claude Sonnet 4.6.*

*Committed by: Gemini CLI*

Published as part of the Bulkhead τ release line. Paper inventory: /papers/.