GPU swap management when image generation upgrades go wrong

The front-end says the model is unloaded. The card should have room. The first generation OOMs anyway, and the second one OOMs faster. I hit the GPU swap management wall on a Windows render box mid-batch, halfway through an article cluster, twelve images in, with a queue full of prompts that all wanted a different model than the one resident. The fix turned out not to be a Pinokio setting or a ComfyUI flag. It was a protocol, written down, that I now run every time I swap models on the 8GB card.

The incident: Pinokio refused to load the second model

I was on round 28 of the bespoke hero pipeline. The Windows box, an RTX 3070 with 8GB of VRAM running Pinokio with the Z-Fusion ComfyUI bundle, had been running Z-Image Turbo at 1024x1024 for the front half of the batch. Twelve images in, I decided the back half would render better on Flux Schnell GGUF Q4. The two models share roughly the same memory band on this card, around seven gigabytes resident, so the swap should have been simple. Drop one, load the other, continue.

The ComfyUI front-end has a /free node for exactly this. Send a free signal, the model unloads, the next one boots. I queued it. The status bar said the model was unloaded. I dragged in the Flux Schnell GGUF loader and queued the first prompt of the back half.

CUDA OOM. No preview, no partial render, just the red error in the queue panel. I retried with a smaller variant. CUDA OOM again. I retried with the smallest GGUF I had. Same error.

The front-end was telling me the model had unloaded. The error was telling me there was no room to load anything else. Those two statements cannot both be true on a card that had room for Z-Image Turbo five minutes ago. So one of them was lying, and it was easy enough to figure out which.

I opened a Windows terminal and ran nvidia-smi. The python process from Pinokio was holding 5.1GB of VRAM. Z-Image was nominally gone, but the allocator had not actually given the memory back. The card had less contiguous space free than it would have had at boot. That was the whole problem.

Timeline

Times are minutes from the start of the incident, rough but recovered from the queue logs.

T+0 min - Z-Image Turbo running clean at 1024x1024 on the RTX 3070, 7.3GB resident, batch on item 12 of 20. Card is happy.
T+15 min - Decided to upgrade the workflow to Flux Schnell GGUF Q4 for the layered prompts in the back half of the cluster. Different model family, different prompt grammar, different VRAM profile.
T+16 min - Sent /free from the ComfyUI front-end. Status bar reports model unloaded. Front-end is convinced the card is now empty.
T+17 min - nvidia-smi shows the python process at 5.1GB still held. The card sees something the front-end does not.
T+18 min - Queued the first Flux Schnell GGUF generation. CUDA OOM at the first inference call.
T+22 min - Tried two more variants and a smaller resolution. Same OOM, same allocator state. The card is not getting any more empty by retrying.
T+25 min - Opened the Pinokio sidebar. The ComfyUI Python process is still pinned. The front-end can ask, but it cannot make the runtime release.
T+30 min - Stopped ComfyUI from the Pinokio sidebar. Watched nvidia-smi drain. Process disappeared. VRAM dropped to 380MB residual (the desktop compositor and a couple of background apps).
T+32 min - Relaunched Pinokio Z-Fusion. Loaded the Flux Schnell GGUF workflow directly at boot. First generation came through in seven seconds. Finished the back half of the batch with no further incidents.
T+45 min - Wrote the protocol down so I would not lose another twenty minutes to the same allocator state on the next swap.

Root cause: allocator fragmentation under a managed Python

Pinokio runs ComfyUI inside a managed Python environment with its own conda-style isolation. The model weights live in PyTorch CUDA caches behind a few layers of abstraction. The ComfyUI /free node is a request to that runtime, not a guaranteed action on the card. When you send it, ComfyUI tells PyTorch to drop its references to the model. PyTorch then hands the freed tensors back to its caching allocator, which keeps freed blocks in its own pool for reuse instead of returning them to the driver right away. The allocator, depending on what else has happened in the session, may or may not actually return that memory to the OS-visible pool.
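To make the "request, not guarantee" point concrete, here is a minimal sketch of what the unload request can look like on the wire. It assumes the front-end action ends up as an HTTP POST to a /free route on the ComfyUI server, with the unload_models and free_memory flags, and that the server listens on the default local port 8188. Check both against your own ComfyUI version; the helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: ComfyUI is listening on its default local address.
COMFYUI_URL = "http://127.0.0.1:8188"


def build_free_request(base_url=COMFYUI_URL):
    """Build (but do not send) the unload request."""
    # Assumption: these are the two flags the /free route understands.
    payload = {"unload_models": True, "free_memory": True}
    return urllib.request.Request(
        url=f"{base_url}/free",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_free(base_url=COMFYUI_URL, timeout_s=5.0):
    """Send the request and return the HTTP status code.

    A 200 only means the runtime accepted the request. It says nothing
    about how much VRAM the Python process still holds afterwards.
    """
    request = build_free_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.status
```

The return value is the whole lesson: a successful status code is an acknowledgement from the request layer, so the real check still has to happen on the card.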

On a card with plenty of headroom, the residual rarely matters. A 24GB card has so much slack that even a fragmented allocator can fit the next model. On 8GB, every megabyte counts. Z-Image Turbo at 1024x1024 sits at roughly 7.3GB resident. That leaves the card with about 700MB of headroom out of the box. After a partial unload, you might have 2 to 3GB nominally free, but the largest contiguous block left could be a few hundred megabytes. The next model's biggest tensor wants more than that contiguous block in one allocation, and the allocator refuses.

This is the difference between "free memory" and "available contiguous memory", and nvidia-smi does not show it cleanly. The tool reports what each process is holding and how much the card has free, in aggregate. It cannot show you the per-allocation gaps inside that aggregate. A card can report 1.2GB free and still fail to allocate a 900MB contiguous block, because the 1.2GB is shattered across dozens of fragments left over from the prior model's runtime.
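A toy model of the 1.2GB example above shows the gap between total free memory and the largest contiguous block. The fragment sizes are invented for illustration, and real allocators are more involved than a flat list of blocks, so treat this as a sketch of the arithmetic rather than a model of PyTorch's allocator.

```python
def largest_free_block(free_blocks_mib):
    """Size of the biggest single free fragment, in MiB."""
    return max(free_blocks_mib, default=0)


def can_allocate(free_blocks_mib, request_mib):
    """A contiguous request fits only if one fragment is big enough."""
    return any(block >= request_mib for block in free_blocks_mib)


# Hypothetical layout: about 1.2GB free in total, split into small pieces.
free_blocks_mib = [64] * 8 + [128] * 3 + [256, 76]

total_free_mib = sum(free_blocks_mib)          # 1228 MiB, roughly 1.2GB
biggest_mib = largest_free_block(free_blocks_mib)  # 256 MiB

# The aggregate says there is room; the contiguous view says there is not.
fits = can_allocate(free_blocks_mib, 900)      # False
```

The same 1228 MiB held as one block would satisfy the 900MB request, which is exactly what a fresh process gives you.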

“The card is not full. The card is fragmented. Those are different problems, and only one of them has a clean fix.”

The clean fix is process restart. When you kill the ComfyUI Python process, every allocation it was holding goes with it. The driver returns the memory to the OS, the next launch starts from a clean allocator state, and the next model fits because there is nothing else competing for the largest contiguous block. The dirty fix, swapping inside a session, is the one that wasted my twenty minutes.

What I changed: the forced-restart triage

The change was protocol, not code. A short written rule I run every time I want to swap models on the 8GB card. The rules look obvious in retrospect, which is the usual signal that a fix is real.

Rule 1: never swap models inside a ComfyUI session on an 8GB card

If I am going to render with Z-Image and Flux Schnell in the same batch, I split the batch by model. Z-Image runs to completion, ComfyUI shuts down, Pinokio Z-Fusion relaunches against the Flux workflow, the back half runs. The cost is the relaunch overhead, maybe twenty seconds. The saving is the half hour I do not lose to a fragmented allocator.

On a 24GB card I would not bother with this rule. There is enough headroom on a 4090 or a used 3090 that the fragmentation tolerance covers most swaps. The rule is specific to the 8GB tier, and that band is where most cheap entry pipelines actually run. The economics on this card are documented in the GPU hardware decision that put me on this 8GB card in the first place; the math is not the point of this piece, but it is the reason the protocol matters.
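Splitting a batch by model, as Rule 1 describes, can be sketched as a grouping step that runs before anything is sent to the card. The job format and the model names below are hypothetical; the only point is that each model gets exactly one session and the jobs inside it keep their order.

```python
def split_by_model(jobs):
    """Group jobs into one session per model.

    Models keep first-seen order and prompts keep their original order,
    so each model is loaded once and never revisited.
    """
    sessions = {}
    for job in jobs:
        sessions.setdefault(job["model"], []).append(job["prompt"])
    return list(sessions.items())


# Hypothetical mixed queue: two model families interleaved.
queue = [
    {"model": "z-image-turbo", "prompt": "hero shot A"},
    {"model": "flux-schnell-q4", "prompt": "layered scene B"},
    {"model": "z-image-turbo", "prompt": "hero shot C"},
]

plan = split_by_model(queue)
```

Each entry in plan is one Pinokio session: launch, render every prompt for that model, shut down, then move to the next entry.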

Rule 2: watch nvidia-smi during the restart, not just before

After I kill ComfyUI, I do not just trust the front-end to come back clean. I keep nvidia-smi running in another terminal until the python process is gone and the VRAM reading is below 1GB. The exact one-liner I keep in shell history:

nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader -l 1

The -l 1 flag refreshes once a second. When the python process dies, the used number drops in two stages: first by whatever the front-end was actively rendering, then by the larger model weights and caches as the runtime tears down. If the second drop does not happen within five or six seconds, something else is wrong, and I would rather know about it before I queue the next batch.
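If you would rather have a script read those numbers than watch them scroll, the same query can be parsed in a few lines. This is a sketch under two assumptions: it adds nounits to the format string so the fields come back as bare MiB integers, and it reads only the first GPU. The function names and the 1024 MiB threshold (the "below 1GB" rule from above) are illustrative.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.free",
    "--format=csv,noheader,nounits",
]


def parse_memory_line(line):
    """Parse one CSV row into (used_mib, free_mib).

    Accepts rows with or without the 'MiB' unit suffix.
    """
    fields = [field.replace("MiB", "").strip() for field in line.split(",")]
    return int(fields[0]), int(fields[1])


def read_gpu_memory():
    """Run nvidia-smi once and return (used_mib, free_mib) for the first GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_line(result.stdout.splitlines()[0])


def is_clean(used_mib, threshold_mib=1024):
    """True once the card is back under the restart threshold."""
    return used_mib < threshold_mib
```

Calling read_gpu_memory() in a loop gives the same once-a-second view as -l 1, with a yes-or-no answer from is_clean() at the end of it.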

Rule 3: if generation OOMs on the first call, do not retry

This is the rule the incident taught me. The first OOM after a model swap is the signal that the allocator is in a state retries cannot recover from. Each retry leaves more allocator state behind, not less. The right move is to skip to the restart immediately. I have lost more time to retrying than to anything else in this whole pipeline.
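Rule 3 is simple enough to encode as a decision function. The sketch below assumes the error text contains the phrase "out of memory", which is how the PyTorch CUDA OOM message reads, and it uses a hypothetical counter of generation calls since the last model swap. What to do about other failures is outside the rule, so the function just hands them back for a human to look at.

```python
def next_action(error_text, calls_since_swap):
    """Decide what to do after a generation call.

    Returns 'continue' on success, 'restart' for an OOM on the first
    call after a model swap, and 'investigate' for anything else.
    """
    if error_text is None:
        return "continue"
    is_oom = "out of memory" in error_text.lower()
    if is_oom and calls_since_swap == 1:
        # First-call OOM after a swap: retries cannot recover this state.
        return "restart"
    return "investigate"
```

There is deliberately no "retry" branch. If the first call after a swap runs out of memory, the only output is a restart.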

Rule 4: dispatch one model per Pinokio session

The dispatcher on the Mac side now treats a model switch the same way it treats a machine reboot. The script SSHes into the Windows box, asks Pinokio to stop the running app, waits for the python process to disappear from the GPU, and only then launches the next model's workflow. The whole sequence takes about thirty seconds and never produces a fragmented start.

The dispatcher pattern fits cleanly into the dual-machine setup that runs both ends of the pipeline. The same SSH plumbing that sends prompts and pulls renders also sends the kill signal between model families. One protocol, one transport, two failure modes that are now both handled the same way.
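The stop, wait, launch sequence can be sketched as one function with the three steps injected as callables. How Pinokio is asked to stop or launch an app is not shown here, so stop_app and launch_workflow are placeholders you would supply. The sketch also uses total GPU memory under a threshold as a stand-in for "the python process has disappeared", and the host name passed to remote_used_mib is hypothetical.

```python
import subprocess
import time


def remote_used_mib(host):
    """Read used VRAM (MiB) on the first GPU of a remote box over SSH."""
    cmd = [
        "ssh", host, "nvidia-smi",
        "--query-gpu=memory.used", "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return int(result.stdout.splitlines()[0].strip())


def swap_model(stop_app, read_used_mib, launch_workflow,
               threshold_mib=1024, timeout_s=60.0, poll_s=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Stop the running app, wait for a clean card, then launch.

    Raises TimeoutError instead of launching into a dirty state.
    Returns the last VRAM reading in MiB.
    """
    stop_app()
    deadline = clock() + timeout_s
    while True:
        used = read_used_mib()
        if used < threshold_mib:
            break
        if clock() >= deadline:
            raise TimeoutError(f"GPU still holds {used} MiB after {timeout_s}s")
        sleep(poll_s)
    launch_workflow()
    return used
```

A call might look like swap_model(stop, lambda: remote_used_mib("render-box"), launch), where stop and launch wrap whatever commands your setup uses. The launch step is unreachable unless the card has drained, which is the guarantee the protocol is after.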

The Mac side has its own version of this problem. Unified memory means model swaps fight the desktop session for the same pool, and the failure mode is swap-thrash rather than allocator fragmentation. The Mac-side wrapper for the unified-memory case covers that side. The Windows version in this article and the Mac version live in the same protocol family: do not trust the runtime to clean up after a model swap, and gate the next launch on a known-clean state.

What I would do differently

Three things, in order of how much time they would have saved.

First, I would have written this protocol after the first failed load, not the third. The pattern was visible in the first OOM. I retried twice because the front-end UI told me the model had unloaded, and I trusted the UI more than nvidia-smi. That trust was misplaced for this stack. The runtime is the source of truth, the front-end is a request layer, and on a small card the gap between request and reality is where time goes to die.

Second, I would not have started by tuning Pinokio settings or chasing ComfyUI flags. My first instinct after the third OOM was to look for a "force unload" toggle or a hidden environment variable that would change the allocator's behavior. None of that was the answer. The runtime was doing what it was designed to do. The protocol around the runtime was what needed to change. Configuration tuning, on a card with this little headroom, is rearranging deck chairs on a fragmented allocator.

Third, I would have wired the kill signal into the dispatcher earlier. The Mac was already doing the work of sending prompts and pulling renders. Adding a model-swap step to the same script took about twenty minutes and would have prevented the entire incident if I had written it before round 28. It follows the same pattern as the per-concept routing rule that decides which model the prompt goes to: the routing decision happens at dispatch time, with full information, before the card is asked to do anything ambiguous.

A small anti-feature worth calling out: the protocol does not try to swap models without a restart. That is a deliberate choice. On 8GB the savings of a clean swap, when it works, are not worth the half-hour cost when it does not. The relaunch overhead of twenty seconds is the reliable price. I would rather pay it every time than gamble on the dirty path occasionally.

The same instinct shows up in the verdict ledger that catches downstream regressions when a swap-related bug slips through. If a generation comes back wrong because the model loaded into a polluted state, the visual signal usually shows up before the next render does, and the ledger is what flags it. The protocol prevents the bug; the ledger catches the survivors.

The render box itself has earned its keep on enough hero work that the protocol is now a small line item in a much larger story. The case studies built on that box all ran through this pipeline in some form, which is why the half-hour I lost in round 28 was worth turning into a written rule. The full toolchain pattern lives in the productized version of this stack, where this kind of small protocol fix gets documented next to the larger workflow patterns it supports.

Takeaways

On an 8GB card, never swap diffusion models inside a single ComfyUI session. The CUDA allocator fragments under a managed Python environment, and the next model's largest contiguous tensor will not fit in the free fragments.
The ComfyUI /free node is a soft request, not a guaranteed unload. Verify with nvidia-smi, not the front-end status bar. The runtime is the source of truth.
If generation OOMs on the first call after a model swap, do not retry. Restart Pinokio. Each retry leaves more allocator state behind, not less.
Watch nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader -l 1 until the python process is gone and VRAM is below 1GB before launching the next model. Anything else is a zombie state.
Wire the model-swap kill signal into the dispatcher. Treat a model family change the same way you would treat a machine reboot. The relaunch overhead is twenty seconds; the failure cost is half an hour.

GPU swap management FAQ

Why does ComfyUI's /free node not actually free VRAM?

The /free node sends a request to the ComfyUI runtime to drop its references to the loaded model. ComfyUI then asks PyTorch to release the underlying tensors, and PyTorch asks the CUDA allocator to return the memory to the GPU. Each of those layers can succeed at its part without the next layer following through. Under Pinokio's managed Python environment, the most common failure is that PyTorch's caching allocator holds the memory in its internal pool rather than returning it to the OS-visible pool. The front-end thinks the model is unloaded, the card disagrees, and nvidia-smi is the one telling the truth.

Does this only happen on 8GB cards?

The fragmentation happens on every card. The pain only happens on cards with thin headroom. On a 24GB card the fragmented free space is usually larger than any single tensor the next model needs, so the swap silently succeeds. On 8GB the headroom is so thin (about 700MB after a Z-Image load at 1024x1024) that even minor fragmentation closes the door on the next model. If you are on a 12GB card you might see this occasionally; on 16GB or larger you probably will not see it during normal model swaps.

Is this a Pinokio-specific bug?

No. The same allocator behavior happens in raw ComfyUI installs, Automatic1111, and any PyTorch-based diffusion stack on Windows. Pinokio is just the launcher I happen to use because Z-Fusion installs cleanly. The fix protocol works across all of them: restart the Python process, do not try to swap inside a session, watch nvidia-smi during the restart. The vendor changes; the protocol does not.

Can I avoid this with PYTORCH_CUDA_ALLOC_CONF or other env vars?

Tuning the allocator config can reduce fragmentation in long-running sessions, but it does not eliminate it. Settings like expandable_segments:True help on Linux and have mixed support on Windows depending on the PyTorch build and driver version. I tried two variants of PYTORCH_CUDA_ALLOC_CONF and got marginally fewer fragmentation events, but I still hit zombie state often enough that the restart protocol stayed worth it. Configuration moves the bar; protocol clears it.
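For reference, PYTORCH_CUDA_ALLOC_CONF takes a comma-separated list of option:value pairs, and it has to be in the environment before the Python process initializes CUDA. The sketch below only builds that string. The option names are real PyTorch allocator options, but the values are illustrative and are not the two variants I tried; the helper names are hypothetical.

```python
import os


def build_alloc_conf(**options):
    """Render allocator options as 'key:value,key:value'."""
    parts = []
    for key, value in options.items():
        if isinstance(value, bool):
            # Booleans are written as True/False in the config string.
            value = "True" if value else "False"
        parts.append(f"{key}:{value}")
    return ",".join(parts)


def apply_alloc_conf(env, **options):
    """Write the config into an environment mapping and return the string."""
    conf = build_alloc_conf(**options)
    env["PYTORCH_CUDA_ALLOC_CONF"] = conf
    return conf


# Illustrative values only. Set this before torch touches the GPU.
example_env = dict(os.environ)
example_conf = apply_alloc_conf(
    example_env,
    max_split_size_mb=128,
    garbage_collection_threshold=0.8,
)
```

Under a launcher like Pinokio the variable has to reach the environment of the ComfyUI process it starts, which is one more reason the setting is easier to get wrong than a restart.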

How do I know I am in zombie state without nvidia-smi?

Three signals together: the front-end says the model is unloaded, the next generation OOMs immediately on the first inference, and a second retry OOMs in the same way without any meaningful delay. Any one of those alone is ambiguous. All three together is zombie state, and the only fix is restart. That said, nvidia-smi ships with the NVIDIA driver, so there is nothing extra to install: open a terminal, run it, and skip the indirect signals. The direct reading is faster and surer.

Sources and specifics

The incident occurred during round 28 of the bespoke hero image pipeline for this site, Q1 2026. The render box is a Windows tower with an RTX 3070 8GB running Pinokio with the Z-Fusion ComfyUI bundle. The Mac on the prompt side dispatches over SSH.
Resident memory bands measured by nvidia-smi during the incident: Z-Image Turbo at 1024x1024 around 7.3GB; the python process held 5.1GB after /free; residual after process kill dropped to 380MB.
The diagnostic command kept in shell history: nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader -l 1. The -l 1 flag refreshes once per second; ample resolution to watch the two-stage drop on process tear-down.
The protocol is four written rules: no in-session model swaps on 8GB, watch nvidia-smi during restart, no retries on first-call OOM, dispatch one model per Pinokio session. All four are now in the dispatcher's prologue on the Mac side.
Pipeline context: the same dual-machine setup runs Z-Image Turbo for single-subject heroes and Flux Schnell for layered prompts, with routing per concept rather than per project. The model-swap protocol is the layer that makes the per-concept routing actually safe to execute on the 8GB tier.
The fragmentation pattern is not Pinokio-specific. It reproduces across Automatic1111, vanilla ComfyUI, and any PyTorch-based diffusion stack on Windows when total VRAM headroom is below roughly 1GB after the resident model load.