GPU hardware selection for local image generation: a log

Last year I sat between two machines and one ticking meter. A Mac M4 Max on the desk, a Windows tower with an aging RTX 3070 in the corner, and a fal.ai bill that kept tapping me on the shoulder. The pipeline needed thousands of bespoke hero images for this site. The question was not whether to run image generation. The question was where to run it, and what hardware was honestly worth buying. This is the log of how I picked, with actual prices and the break-even point where the API stops being the cheap answer.

The fork: Mac, RTX 4090, used RTX 3090, or stay on the API

I already owned a Mac and a Windows tower, which biased the question. If I had been buying everything from scratch, the answer would have been different (more on that below). What I had to decide was whether to keep paying fal.ai per image, build a pipeline against the Windows box I already had, or upgrade either machine to make local generation comfortable.

Volume drove the math. I needed roughly 200 unique prompts a month for the next quarter, plus rerolls. A real number, but not so large that throwing money at it was obvious.

Four candidates were on the bench. The Mac M4 Max alone running mflux. The Mac plus a new RTX 4090 in a fresh Windows build. The Mac plus a used RTX 3090 in the existing tower. Or staying entirely on fal.ai or Replicate. Each had a different story for cost, speed, model coverage, and the quiet line item nobody puts on the spec sheet: how much your time costs when the machine misbehaves.

[Image: single jagged glass fragment isolated on a dark studio backdrop, broken edge under cold electric-blue rim, faint hot-pink dispersion bleeding across the inner face. Caption: the fragment · broken edge under twin rim lights]

The Mac M4 Max as a generation machine

The M4 Max is a wonderful prompt and review machine. Writing prompts, captioning, ledger review, design QA, the MDX in your editor, all of it stays fast. Where it stumbles is on the diffusion side, and only when you push past one image at a time.

mflux is the native MLX port of Flux that runs Schnell and dev locally on Apple silicon. On an M4 Max with 64GB unified memory, Flux Schnell at 4 steps and 1344x768 takes about 30 to 40 seconds warm. Honest performance for a laptop chip. It is also the fastest the Mac will go, because running two mflux processes in parallel pushes unified memory into aggressive paging: swap fills and the machine becomes unresponsive for a minute or more. I shipped a throttle that keeps the M4 Max from paging itself into the floor under Flux, specifically so that would stop happening.
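The serialization idea behind that throttle can be sketched with an exclusive file lock. This is a minimal illustration under assumptions, not the actual wrapper: the lock path and the `run_serialized` name are hypothetical, and the render command is whatever you pass in.

```python
import fcntl  # Unix-only, which covers macOS
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical lock location; any path all callers agree on works.
LOCK_PATH = Path(tempfile.gettempdir()) / "render-serial.lock"


def run_serialized(cmd, lock_path=LOCK_PATH):
    """Run cmd while holding an exclusive lock so only one render runs at a time."""
    with open(lock_path, "w") as lock_file:
        # Blocks here until any other process holding the lock releases it.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            return subprocess.run(cmd, check=False).returncode
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


if __name__ == "__main__":
    # Placeholder command standing in for a real render invocation.
    code = run_serialized([sys.executable, "-c", "print('render placeholder')"])
    print("exit code:", code)
```

A second caller started while the first is rendering simply waits at the lock instead of loading a second copy of the model into unified memory.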

The Mac also caps out at one model family. mflux runs Flux. It does not run Z-Image Turbo or Qwen Image. For a pipeline that routes different article archetypes to different models, that ceiling matters more than wall-clock seconds.

An M4 Max configured with 64GB memory was around $3,200 at evaluation time. If you already own one for design or development work, the marginal cost for image generation is zero. Buying a Mac specifically for image work spends a lot on a serial-only, Flux-only rig.

RTX 4090 new, RTX 3090 used, or no GPU at all

The Windows side of the question is where the GPU hardware selection for image generation gets interesting. Three branches mattered: a new RTX 4090, a used RTX 3090, and a no-purchase option leaning on the API for whatever an 8GB card cannot run.

The RTX 4090 with 24GB VRAM ran $1,800 to $2,200 new during the evaluation window. It runs Flux dev at full resolution and every quantized variant of every diffusion model shipped recently. Wall-clock per image stops mattering in a solo workflow. It also comes with a warranty, which the secondary market does not.

The RTX 3090 with 24GB VRAM ran $700 to $900 on the used market in the same window. Same 24GB as the 4090. Meaningfully slower (call it 60 to 70 percent of 4090 speed on Flux dev), but at solo volume that gap rarely costs a real minute. The risks are real: secondary-market 3090s are often former mining cards that have run hot for years, sellers lie about hours, and the fans are usually the first thing to go. No warranty, no return path.

The no-GPU option keeps the API on the table. fal.ai ran $0.04 to $0.05 per Flux Schnell image at volume during my evaluation. Replicate was in the same band. You pay nothing when you are not generating, which matters more than the per-image rate when volume is low or bursty.

The 8GB cards deserve a brief mention because that is what I actually have. An RTX 3070 8GB will run Z-Image Turbo at 1024x1024 in roughly 18 seconds warm, and it also runs GGUF-quantized Flux Schnell variants. It does not run Flux dev at full quality. If you are buying anything new, skip the 8GB tier.

[Image: wide atmospheric haze over a dim plain, single distant translucent monolith barely visible through electric-blue mist, hot-pink ember on the far horizon. Caption: the haze · monolith barely visible through mist]

The break-even math at 2026 prices

Numbers I trust because I ran them on my own pipeline.

At 200 images per month, fal.ai at $0.04 per image is $96 per year. At $0.05 per image it is $120 per year. That is the API line.

A new RTX 4090 at $2,000 amortized over 12 months is $167 per month and never breaks even at this volume in year one. Over 36 months it drops to $56 per month, which means break-even somewhere around 1,400 images per month sustained at the $0.04 rate. That is well past what a solo practice runs.

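A used RTX 3090 at $800 amortized over 24 months is $33 per month, or $400 of year-one cost. Against an API line of $8 to $10 per month at 200 images, that is break-even at roughly 670 to 830 billed images per month, depending on the per-image rate. Rerolls narrow the gap, because every reroll is a billed image on the API and nearly free on a card you own. Of the cards on the bench, the 3090 is the one whose math comes closest to working at solo volume.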
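The amortization arithmetic is easy to re-run with your own numbers. A minimal sketch, assuming only the hardware price and the per-image API rate count; electricity, rerolls, and time are deliberately left out, and the function names are illustrative.

```python
def api_cost_per_year(images_per_month: float, api_rate: float) -> float:
    """Yearly API bill at a flat per-image rate."""
    return images_per_month * 12 * api_rate


def monthly_amortized(price: float, months: int) -> float:
    """Hardware price spread evenly over an amortization window, in dollars per month."""
    return price / months


def break_even_images_per_month(price: float, months: int, api_rate: float) -> float:
    """Monthly image count at which the API bill equals the amortized hardware cost."""
    return monthly_amortized(price, months) / api_rate


def payback_months(price: float, images_per_month: float, api_rate: float) -> float:
    """Months of avoided API spend needed to cover the full hardware price."""
    return price / (images_per_month * api_rate)


if __name__ == "__main__":
    for rate in (0.04, 0.05):
        print(f"API at ${rate:.2f}: ${api_cost_per_year(200, rate):.0f} per year")
    for name, price, months in (("RTX 4090 new", 2000, 36), ("RTX 3090 used", 800, 24)):
        per_month = monthly_amortized(price, months)
        threshold = break_even_images_per_month(price, months, 0.04)
        print(f"{name}: ${per_month:.0f}/month, break-even near {threshold:.0f} images/month")
```

With the figures above, `break_even_images_per_month(2000, 36, 0.04)` comes out near 1,400, matching the 36-month 4090 line. `payback_months` answers the other direction: given a monthly volume, how long until avoided API spend covers the card.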

Electricity is real but not a tiebreaker. The Windows box draws roughly 350W under render and idles around 80W. At 12 cents per kWh and two hours of rendering per day, rendering electricity is about $30 per year. Idle power, if the machine is on 24/7, is closer to $80 per year. Wake-on-LAN before a batch drops idle to near zero.
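The electricity figures follow from watts, hours, and the kWh price. A small sketch for substituting your own draw and tariff; the function name is illustrative, and the inputs are the ones quoted above.

```python
def annual_electricity_cost(watts: float, hours_per_day: float,
                            price_per_kwh: float, days: int = 365) -> float:
    """Yearly cost in dollars of a constant draw for a fixed number of hours per day."""
    kwh_per_year = watts / 1000 * hours_per_day * days
    return kwh_per_year * price_per_kwh


if __name__ == "__main__":
    render = annual_electricity_cost(350, 2, 0.12)     # two hours of rendering per day
    idle_24_7 = annual_electricity_cost(80, 24, 0.12)  # machine left on around the clock
    print(f"render: ${render:.2f} per year, idle: ${idle_24_7:.2f} per year")
```

Both results land in the same range as the rounded figures in the paragraph, and both stay small next to the hardware price, which is why electricity does not decide the fork.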

The line item nobody costs in is time spent fighting the machine. The hours I lost to Mac swap-thrash incidents before the wrapper landed were worth more than any GPU. Hours lost to a 3090 with dying fans would be the same trade running the other direction. Time is the volatile cost in this calculation, not the parts.

[Image: macro close-up of crystalline dispersion on a single glass slab, refractive bands of cold blue splitting into hot-pink across the chipped corner, fine surface detail. Caption: the dispersion · refractive bands close up]

What I chose and why

I chose the path with the lowest dollar cost, which was the RTX 3070 already in the Windows tower I owned, with jobs dispatched from the Mac over SSH. That zeroed out the hardware decision and let me ship the pipeline against real constraints. The write-up on the dual-machine wiring I settled on after the Mac kept thrashing covers the actual SSH and ComfyUI plumbing.
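Dispatch over SSH with a dedicated key can be sketched as a thin wrapper around the `ssh` client. This is an illustration under assumptions, not the pipeline's actual plumbing: the host, user, key path, and remote command are all hypothetical placeholders, and only standard OpenSSH options (`-i`, `-o BatchMode=yes`) are used.

```python
import os
import subprocess


def build_dispatch_cmd(host: str, user: str, key_path: str, remote_cmd: str) -> list[str]:
    """Build the argv for running remote_cmd on the render box over SSH."""
    return [
        "ssh",
        "-i", os.path.expanduser(key_path),  # dedicated key for the render box
        "-o", "BatchMode=yes",               # fail fast instead of prompting for a password
        f"{user}@{host}",
        remote_cmd,
    ]


def dispatch(host: str, user: str, key_path: str, remote_cmd: str,
             timeout_s: float = 600.0) -> int:
    """Run the remote command and return its exit code."""
    cmd = build_dispatch_cmd(host, user, key_path, remote_cmd)
    return subprocess.run(cmd, timeout=timeout_s, check=False).returncode


if __name__ == "__main__":
    # Hypothetical values; nothing here is the real host or runner script.
    print(build_dispatch_cmd("render-box.local", "render", "/keys/render_key", "echo ready"))
```

Keeping the argv construction separate from the call makes the command easy to log and test without touching the network.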

The 8GB VRAM is a real constraint. Z-Image Turbo runs comfortably, Flux Schnell GGUF runs slowly but works, Flux dev at full quality does not run. The tradeoff is acceptable for the model mix I use, which leans on Z-Image for single-subject hero work and pushes layered prompts to Flux Schnell. The model-by-model quality comparison, where the Z-Image versus Flux choice gets explicit, is in the hub article.
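The routing rule described here amounts to a small lookup table. A minimal sketch, assuming two archetype labels; the label strings and model identifiers are hypothetical names for illustration, not the pipeline's real keys.

```python
# Hypothetical archetype labels mapped to the model each one is routed to.
ROUTES = {
    "single_subject": "z-image-turbo",  # single-subject hero work
    "layered": "flux-schnell-gguf",     # layered prompts
}


def pick_model(archetype: str, default: str = "z-image-turbo") -> str:
    """Return the model for an article archetype, falling back to the default."""
    return ROUTES.get(archetype, default)


if __name__ == "__main__":
    for archetype in ("single_subject", "layered", "unknown"):
        print(archetype, "->", pick_model(archetype))
```

Defaulting to the model that runs comfortably on the 8GB card keeps an unlabeled prompt from landing on the slow path by accident.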

If I were buying a GPU today for this pipeline, I would buy a used RTX 3090 24GB. The 24GB VRAM removes the model-selection ceiling. The price gap to a 4090 buys two years of cloud API for anything I cannot run locally. And the 3090 has been on the market long enough that the surviving cards have proven they survive; the early-burnout failures already happened.

If I were buying a Mac today specifically for image generation, I would not. The unified memory ceiling keeps moving up but the parallelism ceiling does not. A Mac is the right machine for the prompt and review side. It is not the right machine for the diffusion side at any price point I can rationalize.

[Image: ultra-wide backlit silhouette of a single tall translucent slab against a deep electric-blue evening sky, hot-pink magic-hour band along the low horizon. Caption: the silhouette · slab against deep dusk sky]

What I would revisit

The decision is dated, so the conditions that change it are worth writing down.

A new diffusion model that needs more than 24GB VRAM shifts the answer toward cloud rental or a card with more than 24GB, depending on frequency. A 4090 does not help here, since it has the same 24GB as the 3090. I have seen this threaten on Flux dev at full resolution, where headroom on 24GB cards is already thin.

A 96GB or larger unified memory Mac that runs Flux dev in parallel without thrashing changes the Mac story entirely. If a future M-series chip handles parallel diffusion gracefully, the dual-machine pipeline becomes redundant.

Used 4090 prices dropping below $1,200 would collapse the 3090 case. As 5090s ship and creators upgrade, secondary 4090s will drop into 3090 territory. When they do, a used 4090 becomes the obvious choice.

A fal.ai or Replicate price drop below $0.02 per image makes the API the right answer for almost any solo workload. Inference economics keep improving.

Volume itself is the other variable. The math assumed 200 prompts per month. If programmatic content at the scale that justifies any of this pushes me to a thousand per month, the 4090 case opens up and the API math collapses. The same machine-split logic shows up in agent orchestration patterns, which push the buy-versus-rent calculation in directions you do not expect.

Frequently asked questions about GPU hardware selection for image generation

Why not just rent a cloud GPU by the hour?

RunPod and Vast.ai rent RTX time at reasonable hourly rates. The math flips against renting once you account for setup overhead per session, model weight downloads, and the friction of remembering to spin a pod down. For a daily-use solo pipeline, owning the card is cleaner. For sporadic-use studios with unpredictable schedules, hourly cloud is probably the right call.

Is the M4 Max ever the right answer alone?

For low volume on Flux only with the safety wrapper in place, yes. If you are under roughly 50 images per month and willing to serialize, the Mac alone holds up fine. It stops being enough the moment you want a model mflux does not run, or the moment your volume crosses into wanting parallel renders.

Does an 8GB card still have a job in 2026?

Mine does, because I already owned it. An RTX 3070 or 3060 Ti runs Z-Image Turbo and quantized Flux Schnell at hero-image resolutions. It does not run Flux dev at full quality. If you already own an 8GB card, build the pipeline against it before you spend money. If you are buying anything new, skip the 8GB tier.

Used RTX 3090 versus new RTX 4070 Ti Super, which is the real bench choice?

The tightest comparison on the bench. The 4070 Ti Super has 16GB of VRAM and is faster per watt, but 16GB is the wrong side of the model-selection cliff for Flux dev and any future model that pushes past 16. The 3090 has 24GB at half the price used. For image generation, VRAM beats raw speed in 2026. I would take the 3090.

Does this math change if I also do video diffusion?

Yes, materially. Video diffusion eats VRAM and time at a different scale. A 24GB card is the floor for serious video work, and even a 4090 is tight on longer clips. If video is on the roadmap, the 4090 case opens up early. The API also gets harder to lean on because video pricing per second wipes out per-image savings quickly.

Does the cost-payback math hold for non-glass aesthetics?

The dollar math holds because pricing is per image, not per style. Model selection underneath might change. The hardware decision is upstream of the model decision. Buy enough VRAM to run whichever models your aesthetic actually wants, and decide the aesthetic on the model side.

Takeaways

GPU hardware selection for image generation at solo volume is a four-way fork: Mac M4 Max alone, RTX 4090 new, RTX 3090 used, or staying on the API. Volume and existing hardware drive which branch is honest.
The API stays cheaper than buying any GPU until you cross roughly 1,000 images per month. Below that, fal.ai or Replicate is the right answer and the hardware case is hobbyist.
A used RTX 3090 24GB is the consumer card whose math comes closest to working at solo volume. The 3090 used market also has the deepest selection of cards that have already survived their early-burnout window.
The Mac M4 Max is a great prompt and review machine and a limited generation machine. Parallel diffusion thrashes unified memory, and mflux is Flux-only. Use the Mac for the creative side and dispatch generation to a discrete GPU.
Time is the volatile cost in this calculation, not the parts. Hours lost to a thrashing Mac, a dying used card, or a flaky cloud API are worth more than the spec-sheet differences.

Sources and specifics

Dual-machine pipeline shipped the bespoke hero image inventory for this site across Q1 and Q2 2026. Roughly 200 unique prompts per month plus rerolls. Per-slug feedback ledger and runner JSON are version-controlled on disk.
M4 Max with 64GB unified memory used for prompts, captioning, design review, and the writing toolchain. Windows tower with an RTX 3070 8GB used for diffusion via Pinokio and ComfyUI. Dispatch is plain SSH with a dedicated key.
Reference benchmarks: Z-Image Turbo at roughly 17.8 seconds warm at 1024x1024 on the RTX 3070; Flux Schnell at roughly 30 to 40 seconds warm on the M4 Max via mflux; cold starts add about 30 seconds either way.
Cloud pricing during evaluation: fal.ai roughly $0.04 to $0.05 per Flux Schnell image at volume. Replicate in the same band. Varies by provider, model, and contract.
Hardware pricing during evaluation: new RTX 4090 24GB roughly $1,800 to $2,200; used RTX 3090 24GB roughly $700 to $900 on the secondary market; Mac M4 Max 64GB starting around $3,200 new.
Break-even math assumes 200 images per month, 12, 24, and 36-month amortization horizons, 12 cents per kWh electricity, and two hours per day of active rendering. Substitute your own numbers; the framework is the part that travels.
Pipeline output is visible across the hero images on the case studies this pipeline was built for. The broader stack is also available in a productized version.