This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/Finanzamt_Endgegner on 2025-02-08 18:06:13+00:00.


(could also improve max resolution for low-end cards in Flux)

Simply put, my goal is to gather data on how long a Hunyuan Video you can generate with your setup. Please share your setup (primarily your GPU) along with your generation settings, including the model/quantization, FPS/resolution, and any additional parameters (s/it). The aim is to see how far we can push the generation process with various optimizations. Tip: for improved generation speed, install Triton and Sage Attention.
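
If you want to verify that both packages are actually visible to the Python environment that runs ComfyUI, a quick import check is enough (a minimal sketch; it assumes the usual package/module names `triton` and `sageattention`):

```python
# Check that the optional speed-up packages can be imported by the same
# Python interpreter that launches ComfyUI.
def check(name: str) -> None:
    try:
        __import__(name)
        print(f"{name}: OK")
    except ImportError as exc:
        print(f"{name}: not available ({exc})")

for pkg in ("triton", "sageattention"):
    check(pkg)
```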

This optimization relies on the multi-GPU nodes available in ComfyUI-MultiGPU, specifically the torchdist nodes. Without going into too much detail, the developer discovered that most of the model loaded into VRAM isn't really needed there; it can be offloaded to free up VRAM for the latent space. This means you can produce longer and/or higher-resolution videos at the same generation speed. At the moment, the process is somewhat finicky: you need to use the multi-GPU nodes for each loader in your Hunyuan Video workflow and load everything onto either a secondary GPU or the CPU/system memory, except for the main model. For the main model, you'll need to use the torchdist node and set the main GPU as the primary device (not sure if it only works with GGUFs though), allocating only about 1% to it while offloading the rest to the CPU. This forces all non-essential data to be moved to system memory.
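
To make the idea concrete, here is a rough, generic PyTorch sketch of the principle. This is not how the ComfyUI-MultiGPU nodes are implemented; it just illustrates why parking most of the weights in system RAM leaves the VRAM free for the latent tensor:

```python
import torch
import torch.nn as nn

# Conceptual sketch: keep the model's blocks in system RAM and stream each
# one onto the GPU only for its forward pass, so almost all VRAM stays
# available for the (large) video latents.
class StreamedBlocks(nn.Module):
    def __init__(self, blocks: nn.ModuleList, device: str = "cuda"):
        super().__init__()
        self.blocks = blocks.to("cpu")   # weights live in system memory
        self.device = device

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.to(self.device)            # latents stay on the GPU
        for block in self.blocks:
            block.to(self.device)        # copy one block's weights in
            x = block(x)
            block.to("cpu")              # ...and evict them right away
        return x
```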

My current settings with the old version, which has already received an update!

This won’t affect your generation performance, since that portion is still processed on the GPU. You can now iteratively increase the number of frames or the resolution and see whether you run into out-of-memory errors; if you do, you’ve found the maximum capacity of your current hardware and quantization settings. For example, I have an RTX 4070 Ti with 12 GB of VRAM, and I was able to generate 24 fps videos with 189 frames (approximately 8 seconds) in about 6 minutes. Although the current implementation isn’t perfect, it works as a proof of concept for me, the developer, and several others. With your help, we’ll see whether this method works across different configurations and maybe revolutionize ComfyUI video generation! All credit to Silent-Adagio-444!
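
The trial-and-error loop described above can also be scripted. A minimal sketch: `generate` is a hypothetical stand-in for whatever kicks off your workflow, and it assumes the generation runs in the same process so the failure surfaces as `torch.cuda.OutOfMemoryError`:

```python
import torch

# Bump the frame count step by step until the first out-of-memory error
# and report the last value that still worked.
def find_max_frames(generate, start=49, step=16):
    frames, best = start, None
    while True:
        try:
            generate(frames)
            best = frames
            frames += step
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            return best
```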

Workflow: 

(The VAE is currently loaded onto the CPU, but decoding there takes ages. If you want to go for maximum resolution/frames, keep it there; if you have a secondary GPU, load it onto that one for speed. It's also not that big of a deal if it gets loaded onto the main GPU.)
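
In plain PyTorch terms, the trade-off boils down to a device choice. This is a sketch only; `vae` and `latents` stand in for whatever your workflow produces, and `decode` for whatever method your VAE object actually exposes:

```python
import torch

def decode_on(vae, latents, prefer_secondary_gpu: bool = True):
    if prefer_secondary_gpu and torch.cuda.device_count() > 1:
        device = "cuda:1"   # fast decode, main GPU's VRAM stays free
    else:
        device = "cpu"      # frees the most VRAM, but decoding is slow
    vae = vae.to(device)
    with torch.no_grad():
        return vae.decode(latents.to(device))
```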

Here is an example of the power of this node:

720x1280@24fps for ~3s at high quality

(would be considerably faster overall if the models were already in RAM, btw)

The image quality can obviously be improved with better prompting, etc.