This paper analyzes the text-to-image prompt engineering characteristics of the local gemma4:12b and gemma4:26b models. Through three distinct evaluation briefs (a satirical corporate cartoon, a tennis racket physics infographic, and a humorous golf swing caricature), we establish a consistent, model-size-dependent stylistic divergence. While gemma4:12b favors narrative-driven, conversational prose with multiple format choices, gemma4:26b produces dense, tag-heavy engineered prompts containing explicit technical rendering directives. We detail how this behavioral contrast reflects their relative stop-token and instruction-adherence discipline, providing a blueprint for selecting local models in creative asset pipelines.
---
In structured multi-agent workflows, local models are increasingly used to generate creative assets (prompts, textures, interface sketches) for downstream diffusion models (here, Stable Diffusion XL via ComfyUI). A key question for pipeline design is whether local models can translate abstract semantic requirements into high-fidelity image prompts, and how model size affects the quality of that prompt engineering.
This paper builds on the findings of **Paper 1.20** (Handoff Discipline) and **Paper 1.25** (Orchestration vs. Reasoning) by extending our evaluation to prompt generation. We contrast the generation styles of gemma4:12b (7.6 GB footprint) and gemma4:26b (17.0 GB footprint) across three specific domain briefs.
---
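We ran a comparison matrix using a structured system prompt that directed each model to write detailed positive prompts for Stable Diffusion XL. The models were evaluated across three briefs:

* **Brief 1**: a satirical corporate cartoon
* **Brief 2**: a tennis racket physics infographic
* **Brief 3**: a humorous golf swing caricature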
*Execution Script: scripts/compare_prompt_styles.py*

*Logged Responses: docs/domain_runs/PROMPT_STYLE_COMPARISON_002.md*
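The shape of such a comparison matrix can be sketched as follows. This is a minimal illustration and not the contents of scripts/compare_prompt_styles.py: the brief texts, the `SYSTEM_PROMPT` wording, the `Run` record, and the `generate` callable are all assumptions introduced here. The model backend is injected as a function so the sketch runs without any particular local inference server.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable

MODELS = ["gemma4:12b", "gemma4:26b"]

# Placeholder brief texts (assumption): the real briefs are longer.
BRIEFS = {
    "brief_1_cartoon": "A satirical corporate cartoon.",
    "brief_2_infographic": "A tennis racket physics infographic.",
    "brief_3_caricature": "A humorous golf swing caricature.",
}

# Hypothetical wording; the actual system prompt is not reproduced here.
SYSTEM_PROMPT = "Write a detailed positive prompt for Stable Diffusion XL."


@dataclass
class Run:
    model: str
    brief_id: str
    response: str


def run_matrix(generate: Callable[[str, str, str], str]) -> list[Run]:
    """Call generate(model, system_prompt, brief) once per model/brief pair."""
    return [
        Run(model, brief_id, generate(model, SYSTEM_PROMPT, brief))
        for model, (brief_id, brief) in product(MODELS, BRIEFS.items())
    ]


def fake_generate(model: str, system_prompt: str, brief: str) -> str:
    """Stand-in backend so the sketch runs offline."""
    return f"[{model}] {brief}"


runs = run_matrix(fake_generate)
for run in runs:
    print(run.model, run.brief_id)
```

In a real run, `fake_generate` would be swapped for a function that calls the local model server and the `Run` records would be written to the response log.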
---
**Brief 1 (satirical corporate cartoon)**

* **gemma4:12b**: Wrote a single, highly narrative paragraph that described the scene in full English sentences: *"A satirical political cartoon in the style of a classic newspaper ink drawing, featuring a massive, bloated corporate elephant... awkwardly perched atop a tiny, flimsy child's tricycle..."*
* **gemma4:26b**: Wrote a comma-separated list of dense artistic tags: *"Editorial political cartoon, black and white ink illustration, high contrast monochrome, a massive bloated corporate elephant... sharp satirical caricature, exaggerated proportions, intricate pen and ink crosshatching..."*
* **Visual Result**: The 26b prompt successfully guided Stable Diffusion XL to render a tricycle. With the 12b prompt, this key detail was lost to semantic bleed, and the image showed a standard bicycle instead of a tricycle.
**Brief 2 (tennis racket physics infographic)**

* **gemma4:12b**: Provided three separate conversational "options" (Infographic, 3D Technical, Schematic), wrapped in helpful introductory prose and concluding tips on Stable Diffusion sampling settings.
* **gemma4:26b**: Lean and highly engineered. It provided two options and immediately backed them with an **Engineering Breakdown** explaining why specific phrases (e.g. *"volumetric studio lighting"*, *"subsurface scattering"*) were chosen to force realistic textures.
* **Style Contrast**: 12b treated the task as an advisory interaction (conversing with a user); 26b treated it as a precise engineering problem, focusing entirely on token structure.
**Brief 3 (humorous golf swing caricature)**

* **gemma4:12b**: Again provided a conversational list of three stylistic directions (Pixar 3D, Classic Comic 2D, Hyper-Realistic Photo) using full descriptive sentences.
* **gemma4:26b**: Produced a single, heavily optimized prompt using precise anatomical keywords (*"spine curved like a question mark,"* *"knees buckling"*), followed by a professional engineering breakdown of lighting, aspect ratio, and texture parameters.
---
Our evaluation reveals three clear behavioral distinctions between the model sizes:
* The **12b model** writes prompts like a human describer, using standard English syntax. While highly readable, standard grammar introduces extra connector tokens (e.g., "in the style of a," "featuring a") that dilute the attention weights of the CLIP text encoder.
* The **26b model** mimics professional prompt engineering practices, using comma-separated keywords and technical modifiers (e.g., *Octane Render*, *volumetric lighting*, *macro close-up*). This aligns much more closely with the dataset distributions used to train text-to-image models.
* gemma4:12b includes significant preamble and postamble (e.g., *"To get the best results..."*, *"Expert Tip for Success..."*). This conversational overhead is a liability in machine-to-machine pipelines where outputs must be parsed programmatically.
* gemma4:26b exhibits superior stop-token discipline. Its explanations are not conversational filler; they are structured breakdowns of prompt semantics.
* In Brief 1, the 26b model's tag-based prompt preserved the distinction between a tricycle and a bicycle, whereas the 12b model lost this detail in its narrative sentences. In this brief, the 26b model's structured layout translated directly into a more accurate generated image.
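The connector-token point above can be made concrete with a rough word-level measure. The sketch below is an illustration under stated assumptions: the `CONNECTORS` set and the `connector_share` helper are hypothetical, words are counted with a simple regex instead of the CLIP tokenizer, and the figure says nothing about attention weights directly. The two sample strings are the opening fragments of the Brief 1 prompts quoted earlier.

```python
import re

# Small illustrative set of function words (assumption, not exhaustive).
CONNECTORS = {
    "a", "an", "the", "of", "in", "on", "at", "with",
    "and", "by", "for", "to", "is", "that",
}


def connector_share(prompt: str) -> float:
    """Fraction of words in the prompt that are connector words."""
    words = re.findall(r"[a-z']+", prompt.lower())
    if not words:
        return 0.0
    return sum(word in CONNECTORS for word in words) / len(words)


narrative = (
    "A satirical political cartoon in the style of a classic newspaper "
    "ink drawing, featuring a massive, bloated corporate elephant"
)
tags = (
    "Editorial political cartoon, black and white ink illustration, "
    "high contrast monochrome, a massive bloated corporate elephant"
)

print(f"narrative: {connector_share(narrative):.2f}")
print(f"tags:      {connector_share(tags):.2f}")
```

On these fragments the narrative style spends a noticeably larger share of its words on connectors than the tag style does. A closer approximation would count tokens with the actual CLIP tokenizer used by the diffusion pipeline.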
---
For automated, machine-to-machine asset pipelines, gemma4:26b is the superior choice. Its tag-heavy output requires less parsing and maps cleanly to CLIP. gemma4:12b remains valuable where conversational diversity and multiple style suggestions are wanted.
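Where a chattier model is still used in an automated step, its preamble and postamble have to be stripped before the prompt reaches the diffusion model. The following is a minimal heuristic sketch of that parsing cost, not part of the evaluated pipeline: the `extract_prompt` helper, the `CHATTY_PREFIXES` list, and the sample response are all made up for illustration, and a production pipeline would more likely ask the model for a fixed output format.

```python
import re

# Hypothetical markers of conversational filler (assumption).
CHATTY_PREFIXES = ("here", "to get", "expert tip", "sure")


def extract_prompt(response: str) -> str:
    """Return the longest paragraph that does not look like filler."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", response) if p.strip()]
    candidates = []
    for paragraph in paragraphs:
        # Drop markdown emphasis/code markers and surrounding quotes.
        cleaned = re.sub(r"[*`]", "", paragraph).strip().strip('"').strip()
        if cleaned.lower().startswith(CHATTY_PREFIXES) or cleaned.endswith(":"):
            continue
        candidates.append(cleaned)
    if not candidates:
        raise ValueError("no prompt-like paragraph found")
    return max(candidates, key=len)


# Made-up response in the style described for the 12b model.
sample = (
    "Here is a prompt you can use:\n\n"
    '*"A satirical political cartoon in the style of a classic newspaper ink drawing"*\n\n'
    "Expert Tip for Success: adjust your sampler settings."
)

print(extract_prompt(sample))
```

A heuristic like this is brittle, since it depends on the filler phrases a model happens to use. That brittleness is the practical cost of the conversational overhead described above.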