Prompt engineering feedback taxonomy: the ledger format

Across enough image-generation projects, a pattern shows up. Around generation 400 of any one of them, the operator stops trying to write better prompts in a vacuum and starts trusting the small file they have been writing reviews into. That file is the actual prompt engineering feedback taxonomy for the project. It looks nothing like the general advice you get from a prompt-engineering blog post.

I have watched this happen across three projects on my own desk now. The shape repeats, but the contents do not.

The pattern: review verdicts become prompt grammar

The pattern is simple to describe and a little harder to believe until you have run it. You keep a per-project file that records, for every generation round, a verdict (GOOD, MAYBE, BAD) and a free-text note about why. After a few hundred entries, the notes start clustering. The same complaints come up against the same kinds of prompts. The same wins come up against the same kinds of subjects. You write a regex pass that pulls recurring fragments out of the notes (things like "tighter in the frame", "single subject only", "no animal-named geography") and feeds them back into the next prompt as grammar.
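A minimal sketch of that regex pass, assuming the notes are plain strings. The rule list, the `min_count` threshold, and the function names are hypothetical; the three cue strings are the ones quoted above, and the sample notes are invented.

```python
import re
from collections import Counter

# Hypothetical cue rules: each pairs a regex over note text with a canonical cue.
CUE_RULES = [
    (re.compile(r"\btighter\b.*\bframe\b", re.I), "tighter in the frame"),
    (re.compile(r"\bsingle subject\b", re.I), "single subject only"),
    (re.compile(r"\bno animal[- ]named\b", re.I), "no animal-named geography"),
]


def extract_cues(notes, min_count=2):
    """Return cues matched by at least min_count notes, most frequent first."""
    counts = Counter()
    for note in notes:
        for pattern, cue in CUE_RULES:
            if pattern.search(note):
                counts[cue] += 1
    return [cue for cue, n in counts.most_common() if n >= min_count]


def apply_cues(prompt, cues):
    """Append recurring cues to the next prompt as comma-separated grammar."""
    return ", ".join([prompt.rstrip(" ,.")] + cues)


# Invented sample notes, for illustration only.
notes = [
    "Subject too small, needs to be tighter in the frame",
    "Two figures crept in; single subject only",
    "Tighter framing? push tighter in frame",
    "single subject please",
]
cues = extract_cues(notes)
next_prompt = apply_cues("A lone monolith on a cliff edge.", cues)
print(next_prompt)
```

A cue that only one note mentions stays out of the prompt, which keeps a single bad morning from rewriting the grammar.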

What you end up with is a taxonomy, not a list of patterns to copy: this category of prompt at this rate of success on this model in this pipeline. The taxonomy is yours. It will not match anyone else's, and it will not survive a model swap intact.

Three instances below. Two are pipelines I run on this site for hero images, and a third is a smaller batch that crossed both models and made the case for the per-project ledger more cleanly than either of the larger ones did. The format is the same across all three; the content under the format is not.

[Image: Tight macro of a single ice shard with sharp crystalline fractures, refractive surfaces catching cold light. Caption: the shard close in · fractures, cold light]

Instance 1: a hero-image batch on Z-Image Turbo

The first instance is the bespoke hero pipeline on this site, running mostly on a Windows render box with Z-Image Turbo via the Pinokio Z-Fusion package. Around 500 generations across 21 review rounds. About 100 article slugs in the active queue. The ledger started as a memory aid because I kept making the same mistakes round after round, and stabilized into a usable taxonomy somewhere around round 10.

Three positive categories hardened. Atmospheric single-subject landscapes hit 77 of 150, a 51% rate, with winners that share the same shape: one dominant natural form, no people, no foreground architecture, simple light. Monoliths in nature came in at 27 of 110, a 25% rate, tightly clustered around specific placements like cliff edges, glacier peaks, and mirror lakes. Dune-style work split into three buckets (desert landscapes, palace interiors, planetary orbital views) and landed at 26 of 66, 39%. None of the wins included multiple subjects.
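The percentages follow directly from the counts. A small sketch of the arithmetic, using the win and total counts quoted above; the dictionary and function names are hypothetical.

```python
# (wins, total generations) per category, as quoted in the paragraph above.
CATEGORY_COUNTS = {
    "atmospheric single-subject landscapes": (77, 150),
    "monoliths in nature": (27, 110),
    "dune-style work": (26, 66),
}


def hit_rate(wins, total):
    """Hit rate as a whole-number percentage; 0 when nothing was generated."""
    return round(100 * wins / total) if total else 0


for name, (wins, total) in CATEGORY_COUNTS.items():
    print(f"{name}: {wins}/{total} = {hit_rate(wins, total)}%")
```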

Three negative categories also hardened, and in operator economics they matter more than the wins. Spacecraft never landed (0 of 14). Hooded humans in portals or caves rendered cleanly twice in fifteen attempts. Looks-like metaphors (dragon-spine ridges, kraken islands, sky whale skeletons) failed because the model reads the animal name and renders the literal animal. After those failures were named in the ledger, I stopped wasting overnight render time on them.
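One way to act on a negative taxonomy is to screen the queue before the overnight batch starts. The sketch below assumes each failing category can be recognized with a regex over the prompt text; the patterns, category labels, and function names are hypothetical and would need tuning against real prompts.

```python
import re

# Hypothetical blocklist: category label -> regex that flags a prompt.
NEGATIVE_TAXONOMY = {
    "spacecraft": re.compile(r"\b(spacecraft|spaceship|starship)\b", re.I),
    "hooded figure in portal or cave": re.compile(
        r"\bhooded\b.*\b(portal|cave)\b", re.I
    ),
    "animal-named geography": re.compile(
        r"\b(dragon|kraken|whale)[- ](spine|island|skeleton)", re.I
    ),
}


def blocked_category(prompt):
    """Return the first negative category the prompt falls into, or None."""
    for name, pattern in NEGATIVE_TAXONOMY.items():
        if pattern.search(prompt):
            return name
    return None


def filter_queue(prompts):
    """Split a queue into prompts worth rendering and prompts to skip."""
    keep, skipped = [], []
    for prompt in prompts:
        category = blocked_category(prompt)
        if category is None:
            keep.append(prompt)
        else:
            skipped.append((prompt, category))
    return keep, skipped


# Invented sample queue, for illustration only.
queue = [
    "A dragon-spine ridge at dawn",
    "A lone monolith on a glacier peak",
    "Spacecraft over a desert",
]
keep, skipped = filter_queue(queue)
print(keep)
print(skipped)
```

Returning the skipped prompts with their category keeps the decision visible, so a blocked prompt can still be routed to a different model by hand.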

No general best-practices article would have told me any of that. The categories are specific to this model, this resolution, this sampler, this hardware.

Instance 2: a Flux Schnell pipeline for layered scene work

The second instance is the same site's secondary pipeline: Flux Schnell on the Mac through mflux for prompts where the scene has depth. Different machine, different model, different category of work. Same ledger format. The full routing logic, including where each model wins, lives in the Flux Schnell vs Z-Image Turbo comparison piece, which is the hub for everything in this cluster.

The taxonomy that emerged on the Flux side is almost the inverse of the Z-Image one. Layered prompts (back layer, middle layer, foreground subject with its own focal treatment) survive on Flux at roughly three in five for the kinds of work I run. Surface drama, ornate detail, intricate edge work all do better on Flux than on Z-Image. The opposite is also true: simple geometric primitives that win in Z-Image come out a little muddy in Flux, because the model is reaching for richness that the prompt did not ask for.

Failure modes are different too. Flux loses cohesion when the prompt asks for a tight macro shot of a single object with no context. Z-Image loves that prompt. The two models are not interchangeable, and any prompt-grammar advice that does not name the model under the hood is implicitly recommending one of them and miscalibrating the other.

The throttle that keeps this Flux pipeline alive on a 16GB Mac is the same wrapper from the mflux memory safety wrapper post. Without it, batches of layered prompts at full Flux quality crash the whole machine before round five. With it, the pipeline runs reliably and the ledger gets to keep accumulating data instead of falling over every other batch.

The ledger schema for this pipeline is identical to the Z-Image one: JSON, slug-keyed, history array of round-verdict-note triples, plus a flat list of preserved cues. The cue-extraction regex pack is different. The format is portable; the content is not.
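A sketch of what that schema could look like in Python. The key names `history` and `cues`, the placement of the cue list under each slug, and the `record` helper are assumptions; the text only fixes the overall shape (slug-keyed JSON, round-verdict-note entries, a flat cue list). The slug and note are invented.

```python
import json

VERDICTS = {"GOOD", "MAYBE", "BAD"}


def record(ledger, slug, round_no, verdict, note=""):
    """Append one round-verdict-note entry under a slug, creating it if needed."""
    if verdict not in VERDICTS:
        raise ValueError(f"unknown verdict: {verdict!r}")
    # Assumption: the preserved cues are stored per slug, next to the history.
    entry = ledger.setdefault(slug, {"history": [], "cues": []})
    entry["history"].append({"round": round_no, "verdict": verdict, "note": note})
    return entry


# Invented sample data, for illustration only.
ledger = {}
record(ledger, "ice-cave", 1, "MAYBE", "tighter in the frame")
record(ledger, "ice-cave", 2, "GOOD")
ledger["ice-cave"]["cues"].append("tighter in the frame")

# The whole ledger round-trips through JSON unchanged.
serialized = json.dumps(ledger, indent=2)
print(serialized)
```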

[Image: Wide atmospheric interior of an icy cave at dusk, soft pink and blue ambient haze filling the space. Caption: the cave at dusk · pink and blue in the haze]

Instance 3: the cross-model fantasy batch

The third instance is smaller and the most useful for making the point that the per-project ledger is the unit of analysis. About 40 fantasy-themed prompts, the same prompts run on both Z-Image and Flux Schnell, with verdicts captured in the same ledger but tagged by model. The plumbing for running both pipelines side by side is in the cross-machine setup post.

Three things showed up that I would not have predicted from running either pipeline alone. First, animal-named geography failed in both, but for different reasons. Z-Image renders the literal animal when you write "dragon-spine ridge" and pretends the geography part of the prompt is decoration. Flux honors the geography intent more often, but the result still feels slightly wrong because the model is reaching for animal-evoking shapes instead of leaving the ridge alone. Both bad outcomes, both classified as the same prompt category in the ledger.

Second, hooded figures in portals failed in Z-Image (2 of 15) but survived in Flux at a marginal rate. The Flux versions were not great, but they were usable for filler images. That is a routing decision that lives in the ledger and nowhere else. A blanket "do not prompt hooded figures" rule would have cost me five usable Flux images.

Third, the cross-model rows let me see when a category was failing because of the prompt itself versus because of the model. That distinction is invisible across two separate ledgers and obvious inside one ledger that tags the model on every row.
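A sketch of how model-tagged rows make that distinction computable. The row fields, the model labels, the `floor` threshold, and the verdict data are all invented for illustration; they are not figures from the batch.

```python
from collections import defaultdict


def rates_by_category_and_model(rows):
    """Map (category, model) to the share of GOOD verdicts."""
    tally = defaultdict(lambda: [0, 0])  # key -> [good, total]
    for row in rows:
        key = (row["category"], row["model"])
        tally[key][1] += 1
        if row["verdict"] == "GOOD":
            tally[key][0] += 1
    return {key: good / total for key, (good, total) in tally.items()}


def diagnose(rates, category, floor=0.2):
    """Guess whether a category fails because of the prompt or the model.

    floor is a hypothetical cut-off below which a category counts as failing.
    """
    per_model = {m: r for (c, m), r in rates.items() if c == category}
    if len(per_model) < 2:
        return "needs a second model"
    failing = [m for m, r in per_model.items() if r < floor]
    if len(failing) == len(per_model):
        return "prompt problem"
    if failing:
        return "model problem: " + ", ".join(sorted(failing))
    return "working"


# Invented rows, for illustration only.
rows = [
    {"category": "animal-named geography", "model": "z-image", "verdict": "BAD"},
    {"category": "animal-named geography", "model": "flux-schnell", "verdict": "BAD"},
    {"category": "hooded figure", "model": "z-image", "verdict": "BAD"},
    {"category": "hooded figure", "model": "z-image", "verdict": "BAD"},
    {"category": "hooded figure", "model": "flux-schnell", "verdict": "GOOD"},
    {"category": "hooded figure", "model": "flux-schnell", "verdict": "BAD"},
]
rates = rates_by_category_and_model(rows)
print(diagnose(rates, "animal-named geography"))
print(diagnose(rates, "hooded figure"))
```

With only one model on record, `diagnose` declines to answer, which mirrors the point that two separate ledgers cannot tell a bad prompt from a bad fit.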

The reason this small instance matters more than the larger two is that it forces the format into being right. A ledger that cannot record "this prompt category fails on model A but works at half rate on model B" is too narrow. A ledger that can is doing real work.

“Once you have written down a cross-model row, you stop believing in general prompt-engineering advice forever.”

[Image: Close fragment of split ice with crystalline edges and a deep interior glow at the break. Caption: fragment of ice · glow inside the break]

What the pattern tells us

Three things, in descending order of how surprising they were when I worked them out.

Categories are model-specific in ways that no general advice can encode. The same prompt category lands at very different rates on Z-Image, Flux Schnell, and, in the limited testing I have done so far, Qwen Image. Any prompt-engineering content that does not name the model under the hood is calibrating against a model that does not exist in your pipeline.

Free-text notes are first-order data. Verdicts are bookkeeping. MAYBE notes carry most of the signal because those are the rerolls you can rescue. BAD notes get dictated fast on a phone in the morning and tend to be sparse, and GOOD notes are often empty because nothing needed saying. If I were rebuilding the tooling I would optimize for capturing the MAYBE note, not for the verdict ratio.
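A sketch of tooling built around the MAYBE note instead of the verdict ratio. It assumes a slug-keyed ledger whose entries hold a `history` list of round, verdict, and note; the function names and the sample ledger are invented.

```python
def rescue_candidates(ledger):
    """List (slug, round, note) for every MAYBE entry with a non-empty note."""
    found = []
    for slug, entry in ledger.items():
        for item in entry["history"]:
            note = item.get("note", "").strip()
            if item["verdict"] == "MAYBE" and note:
                found.append((slug, item["round"], note))
    return found


def note_coverage(ledger):
    """Fraction of entries per verdict that carry a non-empty note."""
    totals, noted = {}, {}
    for entry in ledger.values():
        for item in entry["history"]:
            verdict = item["verdict"]
            totals[verdict] = totals.get(verdict, 0) + 1
            if item.get("note", "").strip():
                noted[verdict] = noted.get(verdict, 0) + 1
    return {v: noted.get(v, 0) / totals[v] for v in totals}


# Invented sample ledger, for illustration only.
ledger = {
    "ice-cave": {
        "history": [
            {"round": 1, "verdict": "BAD", "note": "two subjects"},
            {"round": 2, "verdict": "MAYBE", "note": "tighter in the frame"},
            {"round": 3, "verdict": "GOOD", "note": ""},
        ],
        "cues": [],
    },
}
print(rescue_candidates(ledger))
print(note_coverage(ledger))
```

If the coverage figure for MAYBE drops, the review step is losing the sentence that says what would move a generation to GOOD.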

A negative taxonomy saves more batch time than the positive taxonomy makes. Knowing that spacecraft never land in this pipeline saves me six hours of overnight render time per month and six morning reviews. The positive side ("atmospheric landscapes work") just gives me more confidence on prompts I would have written anyway. What makes the loop economically real is the list of categories I refuse to render.

How to spot it early in your own work

Cost of running a ledger is small (about 150 lines of Python plus a JSON file) but the payback curve has a knee in it. I use the second-time mistake on the same slug as the signal to switch from a spreadsheet to a ledger. The moment I make a correction in round four that I had already made and forgotten in round two, the spreadsheet has stopped helping.
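The second-time mistake can be detected mechanically once notes are recorded per round. A minimal sketch, assuming a repeated correction shows up as the same note text (after light normalization) in two different rounds of one slug's history; the normalization rule and function names are hypothetical, and real notes would likely need fuzzier matching.

```python
import re


def normalize(note):
    """Lowercase a note and strip punctuation so near-identical notes match."""
    return re.sub(r"[^a-z0-9 ]+", "", note.lower()).strip()


def second_time_mistakes(history):
    """Return (note, first_round, repeat_round) for notes that recur later."""
    first_seen = {}
    repeats = []
    for item in sorted(history, key=lambda h: h["round"]):
        key = normalize(item.get("note", ""))
        if not key:
            continue
        if key in first_seen and first_seen[key] != item["round"]:
            repeats.append((key, first_seen[key], item["round"]))
        else:
            first_seen.setdefault(key, item["round"])
    return repeats


# Invented history for one slug, for illustration only.
history = [
    {"round": 2, "verdict": "MAYBE", "note": "Tighter in the frame."},
    {"round": 3, "verdict": "GOOD", "note": ""},
    {"round": 4, "verdict": "MAYBE", "note": "tighter in the frame"},
]
repeats = second_time_mistakes(history)
print(repeats)
```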

Below 50 generations the lessons fit in working memory. Between 50 and 200 a spreadsheet works fine. Above 200 the regex cue extraction starts paying back, and above 500 the negative taxonomy alone justifies the whole tool. Exact thresholds shift with how much variety is in the work, but the underlying signal to watch is always the second-time mistake.
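Those thresholds can be written down as a small lookup. The boundary handling (which side 50, 200, and 500 fall on) and the stage labels are assumptions, since the text gives only approximate cut-offs.

```python
def tooling_stage(generations):
    """Suggest the lightest tooling that fits a project's generation count."""
    if generations < 50:
        return "working memory"
    if generations <= 200:
        return "spreadsheet"
    if generations <= 500:
        return "ledger with regex cue extraction"
    return "ledger, justified by the negative taxonomy alone"


for count in (30, 120, 350, 800):
    print(count, "->", tooling_stage(count))
```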

The other thing to watch for is the temptation to use someone else's taxonomy. There are good prompt-engineering blog posts. Most of them are written against models that the author is not currently running, and the rules are too coarse. Take the format if it is useful. Your taxonomy is the one you write down from your own reviews.

[Image: Ultra-wide distant view of an ice cave mouth as a small bright opening in a vast cold field. Caption: the mouth from afar · one point of light in cold field]

Why I trust per-project ledgers over the general advice

The general prompt-engineering advice is not wrong, exactly. It is calibrated against a generic model behavior that approximates an average across a half-dozen releases, and the averaging hides the specifics that decide whether your batch ships. Resolution, sampler, step count, distillation, hardware constraints all shift the taxonomy in ways the generic advice cannot describe.

A per-project ledger is narrow on purpose. It will not help anyone else with their pipeline. It will help you with yours, because the unit of analysis (one operator, one model, one configuration) matches the unit of work that actually ships. Generic advice optimizes for being shared; the ledger optimizes for being right inside one project.

The other piece is durability. Models change. Z-Image Turbo will be replaced, Flux Schnell will be replaced, and the cue rules I have written for both will not survive those transitions. What does survive: the ledger format, the free-text notes, and the slug history, because each entry records a specific moment with the model that existed at the time. The only piece of prompt engineering I believe outlasts any one model release is the format you write your reviews into.

If you want to see the rest of the cluster this fits into, the retrospective on the original ledger build covers the tooling and the loop in more detail, and the grammar differences between the two diffusion models I run locally is where the model-specificity argument originally lived. Both are worth reading if you are going to set this up yourself.

Is a feedback ledger different from a prompt library?

Yes, and the difference is the unit. A prompt library stores prompts you want to reuse. A feedback ledger stores verdicts and notes against generations, then derives reusable fragments from them. Most operators end up running both, but the ledger is the one that produces the taxonomy.

How big does the ledger need to be before the taxonomy is real?

The taxonomy starts to firm up around 200 generations and stabilizes around 400 to 500. Below 200 you have observations. Between 200 and 400 you have hypotheses. Above 400 you have categories you can defend with hit rates and reroll cycles to back them up.

Does the taxonomy transfer between models?

The format does. The content does not. When I move a batch from Z-Image to Flux, the regex cue rules break and the hit rates per category shift in ways I cannot predict. The slug history and the free-text notes stay useful because they record specific historical moments. The rules I rewrite.

Why per-project instead of one global ledger across all my work?

Two reasons. The taxonomy is model-specific and most operators run different models for different projects, so a global ledger would have to be partitioned by model anyway. And the per-project frame matches how you actually decide whether a batch is shipping. The unit of work and the unit of memory should be the same unit.

What goes in the free-text note that the verdict alone cannot capture?

The MAYBE rescue. A MAYBE verdict says "this is not done but it is not dead." The note says what would move it to GOOD. That sentence is the actual prompt-engineering data. The verdict is just the bookkeeping that points at it.

When should I skip the ledger entirely?

Below 50 generations the ledger is overhead. If the work is short-run (one campaign, one client deliverable) and you can finish in a single sitting, hold the lessons in working memory. If the work is recurring or the volume crosses 200 generations, build the ledger.

Sources and specifics

Hit-rate figures are from a single operator's review of ~500 generations on Z-Image Turbo across 21 review rounds, plus a smaller Flux Schnell pipeline running for layered scene work on a 16GB Mac through mflux.
Ledger format: a JSON file at the project root, slug-keyed, with a history array of {round, verdict, note} entries plus a flat list of preserved prompt cues.
Cue extraction is roughly 30 regex rules as of the most recent rebuild, regenerated whenever the underlying model is swapped.
Hardware in scope: an RTX 3070 with 8GB VRAM running Pinokio Z-Fusion on Windows for Z-Image Turbo, and a 16GB Mac running mflux for Flux Schnell. Other configurations would shift the taxonomy.
All taxonomy claims are scoped to one operator's pipelines and should not be read as general prompt-engineering advice. The point of the article is that the per-project frame beats the general one.