Repository: huggingface/diffusers
Branch: main
Commit: e9b9f25f677b
Files: 2484
Total size: 42.4 MB

Directory structure:
gitextract_5vldy3zt/ ├── .ai/ │ ├── AGENTS.md │ └── skills/ │ ├── model-integration/ │ │ ├── SKILL.md │ │ └── modular-conversion.md │ └── parity-testing/ │ ├── SKILL.md │ ├── checkpoint-mechanism.md │ └── pitfalls.md ├── .github/ │ ├── ISSUE_TEMPLATE/ │ │ ├── bug-report.yml │ │ ├── config.yml │ │ ├── feature_request.md │ │ ├── feedback.md │ │ ├── new-model-addition.yml │ │ ├── remote-vae-pilot-feedback.yml │ │ └── translate.md │ ├── PULL_REQUEST_TEMPLATE.md │ ├── actions/ │ │ └── setup-miniconda/ │ │ └── action.yml │ └── workflows/ │ ├── benchmark.yml │ ├── build_docker_images.yml │ ├── build_documentation.yml │ ├── build_pr_documentation.yml │ ├── codeql.yml │ ├── mirror_community_pipeline.yml │ ├── nightly_tests.yml │ ├── notify_slack_about_release.yml │ ├── pr_dependency_test.yml │ ├── pr_modular_tests.yml │ ├── pr_style_bot.yml │ ├── pr_test_fetcher.yml │ ├── pr_tests.yml │ ├── pr_tests_gpu.yml │ ├── pr_torch_dependency_test.yml │ ├── push_tests.yml │ ├── push_tests_fast.yml │ ├── push_tests_mps.yml │ ├── pypi_publish.yaml │ ├── release_tests_fast.yml │ ├── run_tests_from_a_pr.yml │ ├── ssh-pr-runner.yml │ ├── ssh-runner.yml │ ├── stale.yml │ ├── trufflehog.yml │ ├── typos.yml │ ├── update_metadata.yml │ └── upload_pr_documentation.yml ├── .gitignore ├── CITATION.cff ├── CODE_OF_CONDUCT.md ├── LICENSE ├── MANIFEST.in ├── Makefile ├── PHILOSOPHY.md ├── README.md ├── _typos.toml ├── benchmarks/ │ ├── README.md │ ├── __init__.py │ ├── benchmarking_flux.py │ ├── benchmarking_ltx.py │ ├── benchmarking_sdxl.py │ ├── benchmarking_utils.py │ ├── benchmarking_wan.py │ ├── push_results.py │ ├── requirements.txt │ └── run_all.py ├── docker/ │ ├── diffusers-doc-builder/ │ │ └── Dockerfile │ ├── diffusers-onnxruntime-cpu/ │ │ └── Dockerfile │ ├── diffusers-onnxruntime-cuda/ │ │ └── Dockerfile │ ├── diffusers-pytorch-cpu/ │ │ └── Dockerfile │ ├──
diffusers-pytorch-cuda/ │ │ └── Dockerfile │ ├── diffusers-pytorch-minimum-cuda/ │ │ └── Dockerfile │ └── diffusers-pytorch-xformers-cuda/ │ └── Dockerfile ├── docs/ │ ├── README.md │ ├── TRANSLATING.md │ └── source/ │ ├── _config.py │ ├── en/ │ │ ├── _toctree.yml │ │ ├── advanced_inference/ │ │ │ └── outpaint.md │ │ ├── api/ │ │ │ ├── activations.md │ │ │ ├── attnprocessor.md │ │ │ ├── cache.md │ │ │ ├── configuration.md │ │ │ ├── image_processor.md │ │ │ ├── internal_classes_overview.md │ │ │ ├── loaders/ │ │ │ │ ├── ip_adapter.md │ │ │ │ ├── lora.md │ │ │ │ ├── peft.md │ │ │ │ ├── single_file.md │ │ │ │ ├── textual_inversion.md │ │ │ │ ├── transformer_sd3.md │ │ │ │ └── unet.md │ │ │ ├── logging.md │ │ │ ├── models/ │ │ │ │ ├── allegro_transformer3d.md │ │ │ │ ├── asymmetricautoencoderkl.md │ │ │ │ ├── aura_flow_transformer2d.md │ │ │ │ ├── auto_model.md │ │ │ │ ├── autoencoder_dc.md │ │ │ │ ├── autoencoder_kl_hunyuan_video.md │ │ │ │ ├── autoencoder_kl_hunyuan_video15.md │ │ │ │ ├── autoencoder_kl_hunyuanimage.md │ │ │ │ ├── autoencoder_kl_hunyuanimage_refiner.md │ │ │ │ ├── autoencoder_kl_wan.md │ │ │ │ ├── autoencoder_oobleck.md │ │ │ │ ├── autoencoder_rae.md │ │ │ │ ├── autoencoder_tiny.md │ │ │ │ ├── autoencoderkl.md │ │ │ │ ├── autoencoderkl_allegro.md │ │ │ │ ├── autoencoderkl_audio_ltx_2.md │ │ │ │ ├── autoencoderkl_cogvideox.md │ │ │ │ ├── autoencoderkl_cosmos.md │ │ │ │ ├── autoencoderkl_ltx_2.md │ │ │ │ ├── autoencoderkl_ltx_video.md │ │ │ │ ├── autoencoderkl_magvit.md │ │ │ │ ├── autoencoderkl_mochi.md │ │ │ │ ├── autoencoderkl_qwenimage.md │ │ │ │ ├── bria_transformer.md │ │ │ │ ├── chroma_transformer.md │ │ │ │ ├── chronoedit_transformer_3d.md │ │ │ │ ├── cogvideox_transformer3d.md │ │ │ │ ├── cogview3plus_transformer2d.md │ │ │ │ ├── cogview4_transformer2d.md │ │ │ │ ├── consisid_transformer3d.md │ │ │ │ ├── consistency_decoder_vae.md │ │ │ │ ├── controlnet.md │ │ │ │ ├── controlnet_flux.md │ │ │ │ ├── controlnet_hunyuandit.md │ │ │ │ ├── 
controlnet_sana.md │ │ │ │ ├── controlnet_sd3.md │ │ │ │ ├── controlnet_sparsectrl.md │ │ │ │ ├── controlnet_union.md │ │ │ │ ├── cosmos_transformer3d.md │ │ │ │ ├── dit_transformer2d.md │ │ │ │ ├── easyanimate_transformer3d.md │ │ │ │ ├── flux2_transformer.md │ │ │ │ ├── flux_transformer.md │ │ │ │ ├── glm_image_transformer2d.md │ │ │ │ ├── helios_transformer3d.md │ │ │ │ ├── hidream_image_transformer.md │ │ │ │ ├── hunyuan_transformer2d.md │ │ │ │ ├── hunyuan_video15_transformer_3d.md │ │ │ │ ├── hunyuan_video_transformer_3d.md │ │ │ │ ├── hunyuanimage_transformer_2d.md │ │ │ │ ├── latte_transformer3d.md │ │ │ │ ├── longcat_image_transformer2d.md │ │ │ │ ├── ltx2_video_transformer3d.md │ │ │ │ ├── ltx_video_transformer3d.md │ │ │ │ ├── lumina2_transformer2d.md │ │ │ │ ├── lumina_nextdit2d.md │ │ │ │ ├── mochi_transformer3d.md │ │ │ │ ├── omnigen_transformer.md │ │ │ │ ├── overview.md │ │ │ │ ├── ovisimage_transformer2d.md │ │ │ │ ├── pixart_transformer2d.md │ │ │ │ ├── prior_transformer.md │ │ │ │ ├── qwenimage_transformer2d.md │ │ │ │ ├── sana_transformer2d.md │ │ │ │ ├── sana_video_transformer3d.md │ │ │ │ ├── sd3_transformer2d.md │ │ │ │ ├── skyreels_v2_transformer_3d.md │ │ │ │ ├── stable_audio_transformer.md │ │ │ │ ├── stable_cascade_unet.md │ │ │ │ ├── transformer2d.md │ │ │ │ ├── transformer_bria_fibo.md │ │ │ │ ├── transformer_temporal.md │ │ │ │ ├── unet-motion.md │ │ │ │ ├── unet.md │ │ │ │ ├── unet2d-cond.md │ │ │ │ ├── unet2d.md │ │ │ │ ├── unet3d-cond.md │ │ │ │ ├── uvit2d.md │ │ │ │ ├── vq.md │ │ │ │ ├── wan_animate_transformer_3d.md │ │ │ │ ├── wan_transformer_3d.md │ │ │ │ └── z_image_transformer2d.md │ │ │ ├── modular_diffusers/ │ │ │ │ ├── guiders.md │ │ │ │ ├── pipeline.md │ │ │ │ ├── pipeline_blocks.md │ │ │ │ ├── pipeline_components.md │ │ │ │ └── pipeline_states.md │ │ │ ├── normalization.md │ │ │ ├── outputs.md │ │ │ ├── parallel.md │ │ │ ├── pipelines/ │ │ │ │ ├── allegro.md │ │ │ │ ├── amused.md │ │ │ │ ├── animatediff.md │ │ │ │ ├── 
attend_and_excite.md │ │ │ │ ├── audioldm.md │ │ │ │ ├── audioldm2.md │ │ │ │ ├── aura_flow.md │ │ │ │ ├── auto_pipeline.md │ │ │ │ ├── blip_diffusion.md │ │ │ │ ├── bria_3_2.md │ │ │ │ ├── bria_fibo.md │ │ │ │ ├── bria_fibo_edit.md │ │ │ │ ├── chroma.md │ │ │ │ ├── chronoedit.md │ │ │ │ ├── cogvideox.md │ │ │ │ ├── cogview3.md │ │ │ │ ├── cogview4.md │ │ │ │ ├── consisid.md │ │ │ │ ├── consistency_models.md │ │ │ │ ├── control_flux_inpaint.md │ │ │ │ ├── controlnet.md │ │ │ │ ├── controlnet_flux.md │ │ │ │ ├── controlnet_hunyuandit.md │ │ │ │ ├── controlnet_sana.md │ │ │ │ ├── controlnet_sd3.md │ │ │ │ ├── controlnet_sdxl.md │ │ │ │ ├── controlnet_union.md │ │ │ │ ├── controlnetxs.md │ │ │ │ ├── controlnetxs_sdxl.md │ │ │ │ ├── cosmos.md │ │ │ │ ├── dance_diffusion.md │ │ │ │ ├── ddim.md │ │ │ │ ├── ddpm.md │ │ │ │ ├── deepfloyd_if.md │ │ │ │ ├── diffedit.md │ │ │ │ ├── dit.md │ │ │ │ ├── easyanimate.md │ │ │ │ ├── flux.md │ │ │ │ ├── flux2.md │ │ │ │ ├── framepack.md │ │ │ │ ├── glm_image.md │ │ │ │ ├── helios.md │ │ │ │ ├── hidream.md │ │ │ │ ├── hunyuan_video.md │ │ │ │ ├── hunyuan_video15.md │ │ │ │ ├── hunyuandit.md │ │ │ │ ├── hunyuanimage21.md │ │ │ │ ├── i2vgenxl.md │ │ │ │ ├── kandinsky.md │ │ │ │ ├── kandinsky3.md │ │ │ │ ├── kandinsky5_image.md │ │ │ │ ├── kandinsky5_video.md │ │ │ │ ├── kandinsky_v22.md │ │ │ │ ├── kolors.md │ │ │ │ ├── latent_consistency_models.md │ │ │ │ ├── latent_diffusion.md │ │ │ │ ├── latte.md │ │ │ │ ├── ledits_pp.md │ │ │ │ ├── longcat_image.md │ │ │ │ ├── ltx2.md │ │ │ │ ├── ltx_video.md │ │ │ │ ├── lumina.md │ │ │ │ ├── lumina2.md │ │ │ │ ├── marigold.md │ │ │ │ ├── mochi.md │ │ │ │ ├── musicldm.md │ │ │ │ ├── omnigen.md │ │ │ │ ├── overview.md │ │ │ │ ├── ovis_image.md │ │ │ │ ├── pag.md │ │ │ │ ├── paint_by_example.md │ │ │ │ ├── panorama.md │ │ │ │ ├── pia.md │ │ │ │ ├── pix2pix.md │ │ │ │ ├── pixart.md │ │ │ │ ├── pixart_sigma.md │ │ │ │ ├── prx.md │ │ │ │ ├── qwenimage.md │ │ │ │ ├── sana.md │ │ │ │ ├── sana_sprint.md │ 
│ │ │ ├── sana_video.md │ │ │ │ ├── self_attention_guidance.md │ │ │ │ ├── semantic_stable_diffusion.md │ │ │ │ ├── shap_e.md │ │ │ │ ├── skyreels_v2.md │ │ │ │ ├── stable_audio.md │ │ │ │ ├── stable_cascade.md │ │ │ │ ├── stable_diffusion/ │ │ │ │ │ ├── adapter.md │ │ │ │ │ ├── depth2img.md │ │ │ │ │ ├── gligen.md │ │ │ │ │ ├── image_variation.md │ │ │ │ │ ├── img2img.md │ │ │ │ │ ├── inpaint.md │ │ │ │ │ ├── latent_upscale.md │ │ │ │ │ ├── ldm3d_diffusion.md │ │ │ │ │ ├── overview.md │ │ │ │ │ ├── sdxl_turbo.md │ │ │ │ │ ├── stable_diffusion_2.md │ │ │ │ │ ├── stable_diffusion_3.md │ │ │ │ │ ├── stable_diffusion_safe.md │ │ │ │ │ ├── stable_diffusion_xl.md │ │ │ │ │ ├── svd.md │ │ │ │ │ ├── text2img.md │ │ │ │ │ └── upscale.md │ │ │ │ ├── stable_unclip.md │ │ │ │ ├── text_to_video.md │ │ │ │ ├── text_to_video_zero.md │ │ │ │ ├── unclip.md │ │ │ │ ├── unidiffuser.md │ │ │ │ ├── value_guided_sampling.md │ │ │ │ ├── visualcloze.md │ │ │ │ ├── wan.md │ │ │ │ ├── wuerstchen.md │ │ │ │ └── z_image.md │ │ │ ├── quantization.md │ │ │ ├── schedulers/ │ │ │ │ ├── cm_stochastic_iterative.md │ │ │ │ ├── consistency_decoder.md │ │ │ │ ├── cosine_dpm.md │ │ │ │ ├── ddim.md │ │ │ │ ├── ddim_cogvideox.md │ │ │ │ ├── ddim_inverse.md │ │ │ │ ├── ddpm.md │ │ │ │ ├── deis.md │ │ │ │ ├── dpm_discrete.md │ │ │ │ ├── dpm_discrete_ancestral.md │ │ │ │ ├── dpm_sde.md │ │ │ │ ├── edm_euler.md │ │ │ │ ├── edm_multistep_dpm_solver.md │ │ │ │ ├── euler.md │ │ │ │ ├── euler_ancestral.md │ │ │ │ ├── flow_match_euler_discrete.md │ │ │ │ ├── flow_match_heun_discrete.md │ │ │ │ ├── helios.md │ │ │ │ ├── helios_dmd.md │ │ │ │ ├── heun.md │ │ │ │ ├── ipndm.md │ │ │ │ ├── lcm.md │ │ │ │ ├── lms_discrete.md │ │ │ │ ├── multistep_dpm_solver.md │ │ │ │ ├── multistep_dpm_solver_cogvideox.md │ │ │ │ ├── multistep_dpm_solver_inverse.md │ │ │ │ ├── overview.md │ │ │ │ ├── pndm.md │ │ │ │ ├── repaint.md │ │ │ │ ├── score_sde_ve.md │ │ │ │ ├── score_sde_vp.md │ │ │ │ ├── singlestep_dpm_solver.md │ │ │ │ ├── 
stochastic_karras_ve.md │ │ │ │ ├── tcd.md │ │ │ │ ├── unipc.md │ │ │ │ └── vq_diffusion.md │ │ │ ├── utilities.md │ │ │ └── video_processor.md │ │ ├── community_projects.md │ │ ├── conceptual/ │ │ │ ├── contribution.md │ │ │ ├── ethical_guidelines.md │ │ │ ├── evaluation.md │ │ │ └── philosophy.md │ │ ├── hybrid_inference/ │ │ │ ├── api_reference.md │ │ │ └── overview.md │ │ ├── index.md │ │ ├── installation.md │ │ ├── modular_diffusers/ │ │ │ ├── auto_pipeline_blocks.md │ │ │ ├── components_manager.md │ │ │ ├── custom_blocks.md │ │ │ ├── loop_sequential_pipeline_blocks.md │ │ │ ├── mellon.md │ │ │ ├── modular_diffusers_states.md │ │ │ ├── modular_pipeline.md │ │ │ ├── overview.md │ │ │ ├── pipeline_block.md │ │ │ ├── quickstart.md │ │ │ └── sequential_pipeline_blocks.md │ │ ├── optimization/ │ │ │ ├── attention_backends.md │ │ │ ├── cache.md │ │ │ ├── cache_dit.md │ │ │ ├── coreml.md │ │ │ ├── deepcache.md │ │ │ ├── fp16.md │ │ │ ├── habana.md │ │ │ ├── memory.md │ │ │ ├── mps.md │ │ │ ├── neuron.md │ │ │ ├── onnx.md │ │ │ ├── open_vino.md │ │ │ ├── para_attn.md │ │ │ ├── pruna.md │ │ │ ├── speed-memory-optims.md │ │ │ ├── tgate.md │ │ │ ├── tome.md │ │ │ ├── xdit.md │ │ │ └── xformers.md │ │ ├── quantization/ │ │ │ ├── bitsandbytes.md │ │ │ ├── gguf.md │ │ │ ├── modelopt.md │ │ │ ├── overview.md │ │ │ ├── quanto.md │ │ │ └── torchao.md │ │ ├── quicktour.md │ │ ├── stable_diffusion.md │ │ ├── training/ │ │ │ ├── adapt_a_model.md │ │ │ ├── cogvideox.md │ │ │ ├── controlnet.md │ │ │ ├── create_dataset.md │ │ │ ├── custom_diffusion.md │ │ │ ├── ddpo.md │ │ │ ├── distributed_inference.md │ │ │ ├── dreambooth.md │ │ │ ├── instructpix2pix.md │ │ │ ├── kandinsky.md │ │ │ ├── lcm_distill.md │ │ │ ├── lora.md │ │ │ ├── overview.md │ │ │ ├── sdxl.md │ │ │ ├── t2i_adapters.md │ │ │ ├── text2image.md │ │ │ ├── text_inversion.md │ │ │ ├── unconditional_training.md │ │ │ └── wuerstchen.md │ │ ├── tutorials/ │ │ │ ├── autopipeline.md │ │ │ ├── basic_training.md │ │ │ └── 
using_peft_for_inference.md │ │ └── using-diffusers/ │ │ ├── automodel.md │ │ ├── batched_inference.md │ │ ├── callback.md │ │ ├── conditional_image_generation.md │ │ ├── consisid.md │ │ ├── controlling_generation.md │ │ ├── controlnet.md │ │ ├── create_a_server.md │ │ ├── custom_pipeline_overview.md │ │ ├── depth2img.md │ │ ├── diffedit.md │ │ ├── dreambooth.md │ │ ├── guiders.md │ │ ├── helios.md │ │ ├── image_quality.md │ │ ├── img2img.md │ │ ├── inference_with_lcm.md │ │ ├── inference_with_tcd_lora.md │ │ ├── inpaint.md │ │ ├── ip_adapter.md │ │ ├── kandinsky.md │ │ ├── loading.md │ │ ├── marigold_usage.md │ │ ├── omnigen.md │ │ ├── other-formats.md │ │ ├── pag.md │ │ ├── push_to_hub.md │ │ ├── reusing_seeds.md │ │ ├── schedulers.md │ │ ├── sdxl.md │ │ ├── sdxl_turbo.md │ │ ├── shap-e.md │ │ ├── svd.md │ │ ├── t2i_adapter.md │ │ ├── text-img2vid.md │ │ ├── textual_inversion_inference.md │ │ ├── unconditional_image_generation.md │ │ ├── weighted_prompts.md │ │ └── write_own_pipeline.md │ ├── ja/ │ │ ├── _toctree.yml │ │ ├── index.md │ │ ├── installation.md │ │ ├── quicktour.md │ │ ├── stable_diffusion.md │ │ └── tutorials/ │ │ ├── autopipeline.md │ │ └── tutorial_overview.md │ ├── ko/ │ │ ├── _toctree.yml │ │ ├── api/ │ │ │ └── pipelines/ │ │ │ └── stable_diffusion/ │ │ │ └── stable_diffusion_xl.md │ │ ├── conceptual/ │ │ │ ├── contribution.md │ │ │ ├── ethical_guidelines.md │ │ │ ├── evaluation.md │ │ │ └── philosophy.md │ │ ├── in_translation.md │ │ ├── index.md │ │ ├── installation.md │ │ ├── optimization/ │ │ │ ├── coreml.md │ │ │ ├── fp16.md │ │ │ ├── habana.md │ │ │ ├── mps.md │ │ │ ├── onnx.md │ │ │ ├── open_vino.md │ │ │ ├── tome.md │ │ │ ├── torch2.0.md │ │ │ └── xformers.md │ │ ├── quicktour.md │ │ ├── stable_diffusion.md │ │ ├── training/ │ │ │ ├── adapt_a_model.md │ │ │ ├── controlnet.md │ │ │ ├── create_dataset.md │ │ │ ├── custom_diffusion.md │ │ │ ├── distributed_inference.md │ │ │ ├── dreambooth.md │ │ │ ├── instructpix2pix.md │ │ │ ├── lora.md │ 
│ │ ├── overview.md │ │ │ ├── text2image.md │ │ │ ├── text_inversion.md │ │ │ └── unconditional_training.md │ │ ├── tutorials/ │ │ │ ├── basic_training.md │ │ │ └── tutorial_overview.md │ │ └── using-diffusers/ │ │ ├── conditional_image_generation.md │ │ ├── controlling_generation.md │ │ ├── custom_pipeline_overview.md │ │ ├── depth2img.md │ │ ├── diffedit.md │ │ ├── img2img.md │ │ ├── inpaint.md │ │ ├── kandinsky.md │ │ ├── loading.md │ │ ├── loading_adapters.md │ │ ├── other-formats.md │ │ ├── push_to_hub.md │ │ ├── schedulers.md │ │ ├── sdxl_turbo.md │ │ ├── shap-e.md │ │ ├── stable_diffusion_jax_how_to.md │ │ ├── svd.md │ │ ├── textual_inversion_inference.md │ │ ├── unconditional_image_generation.md │ │ ├── weighted_prompts.md │ │ └── write_own_pipeline.md │ ├── pt/ │ │ ├── _toctree.yml │ │ ├── index.md │ │ ├── installation.md │ │ ├── quicktour.md │ │ └── stable_diffusion.md │ └── zh/ │ ├── _toctree.yml │ ├── community_projects.md │ ├── conceptual/ │ │ ├── contribution.md │ │ ├── ethical_guidelines.md │ │ ├── evaluation.md │ │ └── philosophy.md │ ├── hybrid_inference/ │ │ ├── api_reference.md │ │ ├── overview.md │ │ └── vae_encode.md │ ├── index.md │ ├── installation.md │ ├── modular_diffusers/ │ │ ├── auto_pipeline_blocks.md │ │ ├── components_manager.md │ │ ├── loop_sequential_pipeline_blocks.md │ │ ├── modular_diffusers_states.md │ │ ├── modular_pipeline.md │ │ ├── overview.md │ │ ├── pipeline_block.md │ │ ├── quickstart.md │ │ └── sequential_pipeline_blocks.md │ ├── optimization/ │ │ ├── cache.md │ │ ├── coreml.md │ │ ├── deepcache.md │ │ ├── fp16.md │ │ ├── habana.md │ │ ├── memory.md │ │ ├── mps.md │ │ ├── neuron.md │ │ ├── onnx.md │ │ ├── open_vino.md │ │ ├── para_attn.md │ │ ├── pruna.md │ │ ├── speed-memory-optims.md │ │ ├── tgate.md │ │ ├── tome.md │ │ ├── xdit.md │ │ └── xformers.md │ ├── quicktour.md │ ├── stable_diffusion.md │ ├── training/ │ │ ├── adapt_a_model.md │ │ ├── controlnet.md │ │ ├── distributed_inference.md │ │ ├── dreambooth.md │ │ ├── 
instructpix2pix.md │ │ ├── kandinsky.md │ │ ├── lora.md │ │ ├── overview.md │ │ ├── text2image.md │ │ ├── text_inversion.md │ │ └── wuerstchen.md │ └── using-diffusers/ │ ├── consisid.md │ ├── guiders.md │ ├── helios.md │ └── schedulers.md ├── examples/ │ ├── README.md │ ├── advanced_diffusion_training/ │ │ ├── README.md │ │ ├── README_flux.md │ │ ├── requirements.txt │ │ ├── requirements_flux.txt │ │ ├── test_dreambooth_lora_flux_advanced.py │ │ ├── train_dreambooth_lora_flux_advanced.py │ │ ├── train_dreambooth_lora_sd15_advanced.py │ │ └── train_dreambooth_lora_sdxl_advanced.py │ ├── amused/ │ │ ├── README.md │ │ └── train_amused.py │ ├── cogvideo/ │ │ ├── README.md │ │ ├── requirements.txt │ │ ├── train_cogvideox_image_to_video_lora.py │ │ └── train_cogvideox_lora.py │ ├── cogview4-control/ │ │ ├── README.md │ │ ├── requirements.txt │ │ └── train_control_cogview4.py │ ├── community/ │ │ ├── README.md │ │ ├── README_community_scripts.md │ │ ├── adaptive_mask_inpainting.py │ │ ├── bit_diffusion.py │ │ ├── checkpoint_merger.py │ │ ├── clip_guided_images_mixing_stable_diffusion.py │ │ ├── clip_guided_stable_diffusion.py │ │ ├── clip_guided_stable_diffusion_img2img.py │ │ ├── cogvideox_ddim_inversion.py │ │ ├── composable_stable_diffusion.py │ │ ├── ddim_noise_comparative_analysis.py │ │ ├── dps_pipeline.py │ │ ├── edict_pipeline.py │ │ ├── fresco_v2v.py │ │ ├── gluegen.py │ │ ├── hd_painter.py │ │ ├── iadb.py │ │ ├── imagic_stable_diffusion.py │ │ ├── img2img_inpainting.py │ │ ├── instaflow_one_step.py │ │ ├── interpolate_stable_diffusion.py │ │ ├── ip_adapter_face_id.py │ │ ├── kohya_hires_fix.py │ │ ├── latent_consistency_img2img.py │ │ ├── latent_consistency_interpolate.py │ │ ├── latent_consistency_txt2img.py │ │ ├── llm_grounded_diffusion.py │ │ ├── lpw_stable_diffusion.py │ │ ├── lpw_stable_diffusion_onnx.py │ │ ├── lpw_stable_diffusion_xl.py │ │ ├── magic_mix.py │ │ ├── marigold_depth_estimation.py │ │ ├── masked_stable_diffusion_img2img.py │ │ ├── 
masked_stable_diffusion_xl_img2img.py │ │ ├── matryoshka.py │ │ ├── mixture_canvas.py │ │ ├── mixture_tiling.py │ │ ├── mixture_tiling_sdxl.py │ │ ├── mod_controlnet_tile_sr_sdxl.py │ │ ├── multilingual_stable_diffusion.py │ │ ├── one_step_unet.py │ │ ├── pipeline_animatediff_controlnet.py │ │ ├── pipeline_animatediff_img2video.py │ │ ├── pipeline_animatediff_ipex.py │ │ ├── pipeline_controlnet_xl_kolors.py │ │ ├── pipeline_controlnet_xl_kolors_img2img.py │ │ ├── pipeline_controlnet_xl_kolors_inpaint.py │ │ ├── pipeline_demofusion_sdxl.py │ │ ├── pipeline_fabric.py │ │ ├── pipeline_faithdiff_stable_diffusion_xl.py │ │ ├── pipeline_flux_differential_img2img.py │ │ ├── pipeline_flux_kontext_multiple_images.py │ │ ├── pipeline_flux_rf_inversion.py │ │ ├── pipeline_flux_semantic_guidance.py │ │ ├── pipeline_flux_with_cfg.py │ │ ├── pipeline_hunyuandit_differential_img2img.py │ │ ├── pipeline_kolors_differential_img2img.py │ │ ├── pipeline_kolors_inpainting.py │ │ ├── pipeline_null_text_inversion.py │ │ ├── pipeline_prompt2prompt.py │ │ ├── pipeline_sdxl_style_aligned.py │ │ ├── pipeline_stable_diffusion_3_differential_img2img.py │ │ ├── pipeline_stable_diffusion_3_instruct_pix2pix.py │ │ ├── pipeline_stable_diffusion_boxdiff.py │ │ ├── pipeline_stable_diffusion_pag.py │ │ ├── pipeline_stable_diffusion_upscale_ldm3d.py │ │ ├── pipeline_stable_diffusion_xl_attentive_eraser.py │ │ ├── pipeline_stable_diffusion_xl_controlnet_adapter.py │ │ ├── pipeline_stable_diffusion_xl_controlnet_adapter_inpaint.py │ │ ├── pipeline_stable_diffusion_xl_differential_img2img.py │ │ ├── pipeline_stable_diffusion_xl_instandid_img2img.py │ │ ├── pipeline_stable_diffusion_xl_instantid.py │ │ ├── pipeline_stable_diffusion_xl_ipex.py │ │ ├── pipeline_stable_diffusion_xl_t5.py │ │ ├── pipeline_stg_cogvideox.py │ │ ├── pipeline_stg_hunyuan_video.py │ │ ├── pipeline_stg_ltx.py │ │ ├── pipeline_stg_ltx_image2video.py │ │ ├── pipeline_stg_mochi.py │ │ ├── pipeline_stg_wan.py │ │ ├── 
pipeline_z_image_differential_img2img.py │ │ ├── pipeline_zero1to3.py │ │ ├── pipline_flux_fill_controlnet_Inpaint.py │ │ ├── regional_prompting_stable_diffusion.py │ │ ├── rerender_a_video.py │ │ ├── run_onnx_controlnet.py │ │ ├── run_tensorrt_controlnet.py │ │ ├── scheduling_ufogen.py │ │ ├── sd_text2img_k_diffusion.py │ │ ├── sde_drag.py │ │ ├── seed_resize_stable_diffusion.py │ │ ├── speech_to_image_diffusion.py │ │ ├── stable_diffusion_comparison.py │ │ ├── stable_diffusion_controlnet_img2img.py │ │ ├── stable_diffusion_controlnet_inpaint.py │ │ ├── stable_diffusion_controlnet_inpaint_img2img.py │ │ ├── stable_diffusion_controlnet_reference.py │ │ ├── stable_diffusion_ipex.py │ │ ├── stable_diffusion_mega.py │ │ ├── stable_diffusion_reference.py │ │ ├── stable_diffusion_repaint.py │ │ ├── stable_diffusion_tensorrt_img2img.py │ │ ├── stable_diffusion_tensorrt_inpaint.py │ │ ├── stable_diffusion_tensorrt_txt2img.py │ │ ├── stable_diffusion_xl_controlnet_reference.py │ │ ├── stable_diffusion_xl_reference.py │ │ ├── stable_unclip.py │ │ ├── text_inpainting.py │ │ ├── tiled_upscaling.py │ │ ├── unclip_image_interpolation.py │ │ ├── unclip_text_interpolation.py │ │ └── wildcard_stable_diffusion.py │ ├── conftest.py │ ├── consistency_distillation/ │ │ ├── README.md │ │ ├── README_sdxl.md │ │ ├── requirements.txt │ │ ├── test_lcm_lora.py │ │ ├── train_lcm_distill_lora_sd_wds.py │ │ ├── train_lcm_distill_lora_sdxl.py │ │ ├── train_lcm_distill_lora_sdxl_wds.py │ │ ├── train_lcm_distill_sd_wds.py │ │ └── train_lcm_distill_sdxl_wds.py │ ├── controlnet/ │ │ ├── README.md │ │ ├── README_flux.md │ │ ├── README_sd3.md │ │ ├── README_sdxl.md │ │ ├── requirements.txt │ │ ├── requirements_flax.txt │ │ ├── requirements_flux.txt │ │ ├── requirements_sd3.txt │ │ ├── requirements_sdxl.txt │ │ ├── test_controlnet.py │ │ ├── train_controlnet.py │ │ ├── train_controlnet_flax.py │ │ ├── train_controlnet_flux.py │ │ ├── train_controlnet_sd3.py │ │ └── train_controlnet_sdxl.py │ ├── 
custom_diffusion/ │ │ ├── README.md │ │ ├── requirements.txt │ │ ├── retrieve.py │ │ ├── test_custom_diffusion.py │ │ └── train_custom_diffusion.py │ ├── dreambooth/ │ │ ├── README.md │ │ ├── README_flux.md │ │ ├── README_flux2.md │ │ ├── README_hidream.md │ │ ├── README_lumina2.md │ │ ├── README_qwen.md │ │ ├── README_sana.md │ │ ├── README_sd3.md │ │ ├── README_sdxl.md │ │ ├── README_z_image.md │ │ ├── convert_to_imagefolder.py │ │ ├── requirements.txt │ │ ├── requirements_flax.txt │ │ ├── requirements_flux.txt │ │ ├── requirements_hidream.txt │ │ ├── requirements_sana.txt │ │ ├── requirements_sd3.txt │ │ ├── requirements_sdxl.txt │ │ ├── test_dreambooth.py │ │ ├── test_dreambooth_flux.py │ │ ├── test_dreambooth_lora.py │ │ ├── test_dreambooth_lora_edm.py │ │ ├── test_dreambooth_lora_flux.py │ │ ├── test_dreambooth_lora_flux2.py │ │ ├── test_dreambooth_lora_flux2_klein.py │ │ ├── test_dreambooth_lora_flux_kontext.py │ │ ├── test_dreambooth_lora_hidream.py │ │ ├── test_dreambooth_lora_lumina2.py │ │ ├── test_dreambooth_lora_qwenimage.py │ │ ├── test_dreambooth_lora_sana.py │ │ ├── test_dreambooth_lora_sd3.py │ │ ├── test_dreambooth_sd3.py │ │ ├── train_dreambooth.py │ │ ├── train_dreambooth_flax.py │ │ ├── train_dreambooth_flux.py │ │ ├── train_dreambooth_lora.py │ │ ├── train_dreambooth_lora_flux.py │ │ ├── train_dreambooth_lora_flux2.py │ │ ├── train_dreambooth_lora_flux2_img2img.py │ │ ├── train_dreambooth_lora_flux2_klein.py │ │ ├── train_dreambooth_lora_flux2_klein_img2img.py │ │ ├── train_dreambooth_lora_flux_kontext.py │ │ ├── train_dreambooth_lora_hidream.py │ │ ├── train_dreambooth_lora_lumina2.py │ │ ├── train_dreambooth_lora_qwen_image.py │ │ ├── train_dreambooth_lora_sana.py │ │ ├── train_dreambooth_lora_sd3.py │ │ ├── train_dreambooth_lora_sdxl.py │ │ ├── train_dreambooth_lora_z_image.py │ │ └── train_dreambooth_sd3.py │ ├── flux-control/ │ │ ├── README.md │ │ ├── requirements.txt │ │ ├── train_control_flux.py │ │ └── train_control_lora_flux.py │ ├── 
inference/ │ │ ├── README.md │ │ ├── image_to_image.py │ │ └── inpainting.py │ ├── instruct_pix2pix/ │ │ ├── README.md │ │ ├── README_sdxl.md │ │ ├── requirements.txt │ │ ├── test_instruct_pix2pix.py │ │ ├── train_instruct_pix2pix.py │ │ └── train_instruct_pix2pix_sdxl.py │ ├── kandinsky2_2/ │ │ └── text_to_image/ │ │ ├── README.md │ │ ├── requirements.txt │ │ ├── train_text_to_image_decoder.py │ │ ├── train_text_to_image_lora_decoder.py │ │ ├── train_text_to_image_lora_prior.py │ │ └── train_text_to_image_prior.py │ ├── model_search/ │ │ ├── README.md │ │ ├── pipeline_easy.py │ │ └── requirements.txt │ ├── reinforcement_learning/ │ │ ├── README.md │ │ ├── diffusion_policy.py │ │ └── run_diffuser_locomotion.py │ ├── research_projects/ │ │ ├── README.md │ │ ├── anytext/ │ │ │ ├── README.md │ │ │ ├── anytext.py │ │ │ ├── anytext_controlnet.py │ │ │ └── ocr_recog/ │ │ │ ├── RNN.py │ │ │ ├── RecCTCHead.py │ │ │ ├── RecModel.py │ │ │ ├── RecMv1_enhance.py │ │ │ ├── RecSVTR.py │ │ │ ├── common.py │ │ │ └── en_dict.txt │ │ ├── autoencoder_rae/ │ │ │ ├── README.md │ │ │ └── train_autoencoder_rae.py │ │ ├── autoencoderkl/ │ │ │ ├── README.md │ │ │ ├── requirements.txt │ │ │ └── train_autoencoderkl.py │ │ ├── colossalai/ │ │ │ ├── README.md │ │ │ ├── inference.py │ │ │ ├── requirement.txt │ │ │ └── train_dreambooth_colossalai.py │ │ ├── consistency_training/ │ │ │ ├── README.md │ │ │ ├── requirements.txt │ │ │ └── train_cm_ct_unconditional.py │ │ ├── control_lora/ │ │ │ ├── README.md │ │ │ └── control_lora.py │ │ ├── controlnet/ │ │ │ └── train_controlnet_webdataset.py │ │ ├── diffusion_dpo/ │ │ │ ├── README.md │ │ │ ├── requirements.txt │ │ │ ├── train_diffusion_dpo.py │ │ │ └── train_diffusion_dpo_sdxl.py │ │ ├── diffusion_orpo/ │ │ │ ├── README.md │ │ │ ├── requirements.txt │ │ │ ├── train_diffusion_orpo_sdxl_lora.py │ │ │ └── train_diffusion_orpo_sdxl_lora_wds.py │ │ ├── dreambooth_inpaint/ │ │ │ ├── README.md │ │ │ ├── requirements.txt │ │ │ ├── 
train_dreambooth_inpaint.py │ │ │ └── train_dreambooth_inpaint_lora.py │ │ ├── flux_lora_quantization/ │ │ │ ├── README.md │ │ │ ├── accelerate.yaml │ │ │ ├── compute_embeddings.py │ │ │ ├── ds2.yaml │ │ │ └── train_dreambooth_lora_flux_miniature.py │ │ ├── geodiff/ │ │ │ ├── README.md │ │ │ └── geodiff_molecule_conformation.ipynb │ │ ├── gligen/ │ │ │ ├── README.md │ │ │ ├── dataset.py │ │ │ ├── demo.ipynb │ │ │ ├── make_datasets.py │ │ │ ├── requirements.txt │ │ │ └── train_gligen_text.py │ │ ├── instructpix2pix_lora/ │ │ │ ├── README.md │ │ │ └── train_instruct_pix2pix_lora.py │ │ ├── intel_opts/ │ │ │ ├── README.md │ │ │ ├── inference_bf16.py │ │ │ ├── textual_inversion/ │ │ │ │ ├── README.md │ │ │ │ ├── requirements.txt │ │ │ │ └── textual_inversion_bf16.py │ │ │ └── textual_inversion_dfq/ │ │ │ ├── README.md │ │ │ ├── requirements.txt │ │ │ ├── text2images.py │ │ │ └── textual_inversion.py │ │ ├── ip_adapter/ │ │ │ ├── README.md │ │ │ ├── requirements.txt │ │ │ ├── tutorial_train_faceid.py │ │ │ ├── tutorial_train_ip-adapter.py │ │ │ ├── tutorial_train_plus.py │ │ │ └── tutorial_train_sdxl.py │ │ ├── lora/ │ │ │ ├── README.md │ │ │ ├── requirements.txt │ │ │ └── train_text_to_image_lora.py │ │ ├── lpl/ │ │ │ ├── README.md │ │ │ ├── lpl_loss.py │ │ │ └── train_sdxl_lpl.py │ │ ├── multi_subject_dreambooth/ │ │ │ ├── README.md │ │ │ ├── requirements.txt │ │ │ └── train_multi_subject_dreambooth.py │ │ ├── multi_subject_dreambooth_inpainting/ │ │ │ ├── README.md │ │ │ ├── requirements.txt │ │ │ └── train_multi_subject_dreambooth_inpainting.py │ │ ├── multi_token_textual_inversion/ │ │ │ ├── README.md │ │ │ ├── multi_token_clip.py │ │ │ ├── requirements.txt │ │ │ ├── requirements_flax.txt │ │ │ ├── textual_inversion.py │ │ │ └── textual_inversion_flax.py │ │ ├── onnxruntime/ │ │ │ ├── README.md │ │ │ ├── text_to_image/ │ │ │ │ ├── README.md │ │ │ │ ├── requirements.txt │ │ │ │ └── train_text_to_image.py │ │ │ ├── textual_inversion/ │ │ │ │ ├── README.md │ │ │ │ ├── 
requirements.txt │ │ │ │ └── textual_inversion.py │ │ │ └── unconditional_image_generation/ │ │ │ ├── README.md │ │ │ ├── requirements.txt │ │ │ └── train_unconditional.py │ │ ├── pixart/ │ │ │ ├── .gitignore │ │ │ ├── controlnet_pixart_alpha.py │ │ │ ├── pipeline_pixart_alpha_controlnet.py │ │ │ ├── requirements.txt │ │ │ ├── run_pixart_alpha_controlnet_pipeline.py │ │ │ ├── train_controlnet_hf_diffusers.sh │ │ │ └── train_pixart_controlnet_hf.py │ │ ├── promptdiffusion/ │ │ │ ├── README.md │ │ │ ├── convert_original_promptdiffusion_to_diffusers.py │ │ │ ├── pipeline_prompt_diffusion.py │ │ │ └── promptdiffusioncontrolnet.py │ │ ├── pytorch_xla/ │ │ │ ├── inference/ │ │ │ │ └── flux/ │ │ │ │ ├── README.md │ │ │ │ └── flux_inference.py │ │ │ └── training/ │ │ │ └── text_to_image/ │ │ │ ├── README.md │ │ │ ├── requirements.txt │ │ │ └── train_text_to_image_xla.py │ │ ├── rdm/ │ │ │ ├── README.md │ │ │ ├── pipeline_rdm.py │ │ │ └── retriever.py │ │ ├── realfill/ │ │ │ ├── README.md │ │ │ ├── infer.py │ │ │ ├── requirements.txt │ │ │ └── train_realfill.py │ │ ├── sana/ │ │ │ ├── README.md │ │ │ ├── train_sana_sprint_diffusers.py │ │ │ └── train_sana_sprint_diffusers.sh │ │ ├── scheduled_huber_loss_training/ │ │ │ ├── README.md │ │ │ ├── dreambooth/ │ │ │ │ ├── train_dreambooth.py │ │ │ │ ├── train_dreambooth_lora.py │ │ │ │ └── train_dreambooth_lora_sdxl.py │ │ │ └── text_to_image/ │ │ │ ├── train_text_to_image.py │ │ │ ├── train_text_to_image_lora.py │ │ │ ├── train_text_to_image_lora_sdxl.py │ │ │ └── train_text_to_image_sdxl.py │ │ ├── sd3_lora_colab/ │ │ │ ├── README.md │ │ │ ├── compute_embeddings.py │ │ │ ├── sd3_dreambooth_lora_16gb.ipynb │ │ │ └── train_dreambooth_lora_sd3_miniature.py │ │ ├── sdxl_flax/ │ │ │ ├── README.md │ │ │ ├── sdxl_single.py │ │ │ └── sdxl_single_aot.py │ │ ├── vae/ │ │ │ ├── README.md │ │ │ └── vae_roundtrip.py │ │ └── wuerstchen/ │ │ └── text_to_image/ │ │ ├── README.md │ │ ├── __init__.py │ │ ├── modeling_efficient_net_encoder.py │ │ 
├── requirements.txt │ │ ├── train_text_to_image_lora_prior.py │ │ └── train_text_to_image_prior.py │ ├── server/ │ │ ├── README.md │ │ ├── requirements.in │ │ ├── requirements.txt │ │ └── server.py │ ├── server-async/ │ │ ├── Pipelines.py │ │ ├── README.md │ │ ├── requirements.txt │ │ ├── serverasync.py │ │ ├── test.py │ │ └── utils/ │ │ ├── __init__.py │ │ ├── requestscopedpipeline.py │ │ ├── scheduler.py │ │ ├── utils.py │ │ └── wrappers.py │ ├── t2i_adapter/ │ │ ├── README.md │ │ ├── README_sdxl.md │ │ ├── requirements.txt │ │ ├── test_t2i_adapter.py │ │ └── train_t2i_adapter_sdxl.py │ ├── test_examples_utils.py │ ├── text_to_image/ │ │ ├── README.md │ │ ├── README_sdxl.md │ │ ├── requirements.txt │ │ ├── requirements_flax.txt │ │ ├── requirements_sdxl.txt │ │ ├── test_text_to_image.py │ │ ├── test_text_to_image_lora.py │ │ ├── train_text_to_image.py │ │ ├── train_text_to_image_flax.py │ │ ├── train_text_to_image_lora.py │ │ ├── train_text_to_image_lora_sdxl.py │ │ └── train_text_to_image_sdxl.py │ ├── textual_inversion/ │ │ ├── README.md │ │ ├── README_sdxl.md │ │ ├── requirements.txt │ │ ├── requirements_flax.txt │ │ ├── test_textual_inversion.py │ │ ├── test_textual_inversion_sdxl.py │ │ ├── textual_inversion.py │ │ ├── textual_inversion_flax.py │ │ └── textual_inversion_sdxl.py │ ├── unconditional_image_generation/ │ │ ├── README.md │ │ ├── requirements.txt │ │ ├── test_unconditional.py │ │ └── train_unconditional.py │ └── vqgan/ │ ├── README.md │ ├── discriminator.py │ ├── requirements.txt │ ├── test_vqgan.py │ └── train_vqgan.py ├── pyproject.toml ├── scripts/ │ ├── __init__.py │ ├── change_naming_configs_and_checkpoints.py │ ├── conversion_ldm_uncond.py │ ├── convert_amused.py │ ├── convert_animatediff_motion_lora_to_diffusers.py │ ├── convert_animatediff_motion_module_to_diffusers.py │ ├── convert_animatediff_sparsectrl_to_diffusers.py │ ├── convert_asymmetric_vqgan_to_diffusers.py │ ├── convert_aura_flow_to_diffusers.py │ ├── 
convert_blipdiffusion_to_diffusers.py │ ├── convert_cogvideox_to_diffusers.py │ ├── convert_cogview3_to_diffusers.py │ ├── convert_cogview4_to_diffusers.py │ ├── convert_cogview4_to_diffusers_megatron.py │ ├── convert_consistency_decoder.py │ ├── convert_consistency_to_diffusers.py │ ├── convert_cosmos_to_diffusers.py │ ├── convert_dance_diffusion_to_diffusers.py │ ├── convert_dcae_to_diffusers.py │ ├── convert_ddpm_original_checkpoint_to_diffusers.py │ ├── convert_diffusers_sdxl_lora_to_webui.py │ ├── convert_diffusers_to_original_sdxl.py │ ├── convert_diffusers_to_original_stable_diffusion.py │ ├── convert_dit_to_diffusers.py │ ├── convert_flux2_to_diffusers.py │ ├── convert_flux_to_diffusers.py │ ├── convert_flux_xlabs_ipadapter_to_diffusers.py │ ├── convert_gligen_to_diffusers.py │ ├── convert_hunyuan_image_to_diffusers.py │ ├── convert_hunyuan_video1_5_to_diffusers.py │ ├── convert_hunyuan_video_to_diffusers.py │ ├── convert_hunyuandit_controlnet_to_diffusers.py │ ├── convert_hunyuandit_to_diffusers.py │ ├── convert_i2vgen_to_diffusers.py │ ├── convert_if.py │ ├── convert_k_upscaler_to_diffusers.py │ ├── convert_kakao_brain_unclip_to_diffusers.py │ ├── convert_kandinsky3_unet.py │ ├── convert_kandinsky_to_diffusers.py │ ├── convert_ldm_original_checkpoint_to_diffusers.py │ ├── convert_lora_safetensor_to_diffusers.py │ ├── convert_ltx2_to_diffusers.py │ ├── convert_ltx_to_diffusers.py │ ├── convert_lumina_to_diffusers.py │ ├── convert_mochi_to_diffusers.py │ ├── convert_models_diffuser_to_diffusers.py │ ├── convert_ms_text_to_video_to_diffusers.py │ ├── convert_music_spectrogram_to_diffusers.py │ ├── convert_ncsnpp_original_checkpoint_to_diffusers.py │ ├── convert_omnigen_to_diffusers.py │ ├── convert_original_audioldm2_to_diffusers.py │ ├── convert_original_audioldm_to_diffusers.py │ ├── convert_original_controlnet_to_diffusers.py │ ├── convert_original_musicldm_to_diffusers.py │ ├── convert_original_stable_diffusion_to_diffusers.py │ ├── 
convert_original_t2i_adapter.py │ ├── convert_ovis_image_to_diffusers.py │ ├── convert_pixart_alpha_to_diffusers.py │ ├── convert_pixart_sigma_to_diffusers.py │ ├── convert_prx_to_diffusers.py │ ├── convert_rae_to_diffusers.py │ ├── convert_sana_controlnet_to_diffusers.py │ ├── convert_sana_to_diffusers.py │ ├── convert_sana_video_to_diffusers.py │ ├── convert_sd3_controlnet_to_diffusers.py │ ├── convert_sd3_to_diffusers.py │ ├── convert_shap_e_to_diffusers.py │ ├── convert_skyreelsv2_to_diffusers.py │ ├── convert_stable_audio.py │ ├── convert_stable_cascade.py │ ├── convert_stable_cascade_lite.py │ ├── convert_stable_diffusion_checkpoint_to_onnx.py │ ├── convert_stable_diffusion_controlnet_to_onnx.py │ ├── convert_stable_diffusion_controlnet_to_tensorrt.py │ ├── convert_svd_to_diffusers.py │ ├── convert_tiny_autoencoder_to_diffusers.py │ ├── convert_unclip_txt2img_to_image_variation.py │ ├── convert_unidiffuser_to_diffusers.py │ ├── convert_vae_diff_to_onnx.py │ ├── convert_vae_pt_to_diffusers.py │ ├── convert_versatile_diffusion_to_diffusers.py │ ├── convert_vq_diffusion_to_diffusers.py │ ├── convert_wan_to_diffusers.py │ ├── convert_wuerstchen.py │ ├── convert_zero123_to_diffusers.py │ ├── extract_lora_from_model.py │ └── generate_logits.py ├── setup.py ├── src/ │ └── diffusers/ │ ├── __init__.py │ ├── callbacks.py │ ├── commands/ │ │ ├── __init__.py │ │ ├── custom_blocks.py │ │ ├── diffusers_cli.py │ │ ├── env.py │ │ └── fp16_safetensors.py │ ├── configuration_utils.py │ ├── dependency_versions_check.py │ ├── dependency_versions_table.py │ ├── experimental/ │ │ ├── README.md │ │ ├── __init__.py │ │ └── rl/ │ │ ├── __init__.py │ │ └── value_guided_sampling.py │ ├── guiders/ │ │ ├── __init__.py │ │ ├── adaptive_projected_guidance.py │ │ ├── adaptive_projected_guidance_mix.py │ │ ├── auto_guidance.py │ │ ├── classifier_free_guidance.py │ │ ├── classifier_free_zero_star_guidance.py │ │ ├── frequency_decoupled_guidance.py │ │ ├── guider_utils.py │ │ ├── 
magnitude_aware_guidance.py │ │ ├── perturbed_attention_guidance.py │ │ ├── skip_layer_guidance.py │ │ ├── smoothed_energy_guidance.py │ │ └── tangential_classifier_free_guidance.py │ ├── hooks/ │ │ ├── __init__.py │ │ ├── _common.py │ │ ├── _helpers.py │ │ ├── context_parallel.py │ │ ├── faster_cache.py │ │ ├── first_block_cache.py │ │ ├── group_offloading.py │ │ ├── hooks.py │ │ ├── layer_skip.py │ │ ├── layerwise_casting.py │ │ ├── mag_cache.py │ │ ├── pyramid_attention_broadcast.py │ │ ├── smoothed_energy_guidance_utils.py │ │ ├── taylorseer_cache.py │ │ └── utils.py │ ├── image_processor.py │ ├── loaders/ │ │ ├── __init__.py │ │ ├── ip_adapter.py │ │ ├── lora_base.py │ │ ├── lora_conversion_utils.py │ │ ├── lora_pipeline.py │ │ ├── peft.py │ │ ├── single_file.py │ │ ├── single_file_model.py │ │ ├── single_file_utils.py │ │ ├── textual_inversion.py │ │ ├── transformer_flux.py │ │ ├── transformer_sd3.py │ │ ├── unet.py │ │ ├── unet_loader_utils.py │ │ └── utils.py │ ├── models/ │ │ ├── README.md │ │ ├── __init__.py │ │ ├── _modeling_parallel.py │ │ ├── activations.py │ │ ├── adapter.py │ │ ├── attention.py │ │ ├── attention_dispatch.py │ │ ├── attention_flax.py │ │ ├── attention_processor.py │ │ ├── auto_model.py │ │ ├── autoencoders/ │ │ │ ├── __init__.py │ │ │ ├── autoencoder_asym_kl.py │ │ │ ├── autoencoder_dc.py │ │ │ ├── autoencoder_kl.py │ │ │ ├── autoencoder_kl_allegro.py │ │ │ ├── autoencoder_kl_cogvideox.py │ │ │ ├── autoencoder_kl_cosmos.py │ │ │ ├── autoencoder_kl_flux2.py │ │ │ ├── autoencoder_kl_hunyuan_video.py │ │ │ ├── autoencoder_kl_hunyuanimage.py │ │ │ ├── autoencoder_kl_hunyuanimage_refiner.py │ │ │ ├── autoencoder_kl_hunyuanvideo15.py │ │ │ ├── autoencoder_kl_ltx.py │ │ │ ├── autoencoder_kl_ltx2.py │ │ │ ├── autoencoder_kl_ltx2_audio.py │ │ │ ├── autoencoder_kl_magvit.py │ │ │ ├── autoencoder_kl_mochi.py │ │ │ ├── autoencoder_kl_qwenimage.py │ │ │ ├── autoencoder_kl_temporal_decoder.py │ │ │ ├── autoencoder_kl_wan.py │ │ │ ├── 
autoencoder_oobleck.py │ │ │ ├── autoencoder_rae.py │ │ │ ├── autoencoder_tiny.py │ │ │ ├── autoencoder_vidtok.py │ │ │ ├── consistency_decoder_vae.py │ │ │ ├── vae.py │ │ │ └── vq_model.py │ │ ├── cache_utils.py │ │ ├── controlnets/ │ │ │ ├── __init__.py │ │ │ ├── controlnet.py │ │ │ ├── controlnet_cosmos.py │ │ │ ├── controlnet_flax.py │ │ │ ├── controlnet_flux.py │ │ │ ├── controlnet_hunyuan.py │ │ │ ├── controlnet_qwenimage.py │ │ │ ├── controlnet_sana.py │ │ │ ├── controlnet_sd3.py │ │ │ ├── controlnet_sparsectrl.py │ │ │ ├── controlnet_union.py │ │ │ ├── controlnet_xs.py │ │ │ ├── controlnet_z_image.py │ │ │ ├── multicontrolnet.py │ │ │ └── multicontrolnet_union.py │ │ ├── downsampling.py │ │ ├── embeddings.py │ │ ├── embeddings_flax.py │ │ ├── lora.py │ │ ├── model_loading_utils.py │ │ ├── modeling_flax_pytorch_utils.py │ │ ├── modeling_flax_utils.py │ │ ├── modeling_outputs.py │ │ ├── modeling_pytorch_flax_utils.py │ │ ├── modeling_utils.py │ │ ├── normalization.py │ │ ├── resnet.py │ │ ├── resnet_flax.py │ │ ├── transformers/ │ │ │ ├── __init__.py │ │ │ ├── auraflow_transformer_2d.py │ │ │ ├── cogvideox_transformer_3d.py │ │ │ ├── consisid_transformer_3d.py │ │ │ ├── dit_transformer_2d.py │ │ │ ├── dual_transformer_2d.py │ │ │ ├── hunyuan_transformer_2d.py │ │ │ ├── latte_transformer_3d.py │ │ │ ├── lumina_nextdit2d.py │ │ │ ├── pixart_transformer_2d.py │ │ │ ├── prior_transformer.py │ │ │ ├── sana_transformer.py │ │ │ ├── stable_audio_transformer.py │ │ │ ├── t5_film_transformer.py │ │ │ ├── transformer_2d.py │ │ │ ├── transformer_allegro.py │ │ │ ├── transformer_bria.py │ │ │ ├── transformer_bria_fibo.py │ │ │ ├── transformer_chroma.py │ │ │ ├── transformer_chronoedit.py │ │ │ ├── transformer_cogview3plus.py │ │ │ ├── transformer_cogview4.py │ │ │ ├── transformer_cosmos.py │ │ │ ├── transformer_easyanimate.py │ │ │ ├── transformer_flux.py │ │ │ ├── transformer_flux2.py │ │ │ ├── transformer_glm_image.py │ │ │ ├── transformer_helios.py │ │ │ ├── 
transformer_hidream_image.py │ │ │ ├── transformer_hunyuan_video.py │ │ │ ├── transformer_hunyuan_video15.py │ │ │ ├── transformer_hunyuan_video_framepack.py │ │ │ ├── transformer_hunyuanimage.py │ │ │ ├── transformer_kandinsky.py │ │ │ ├── transformer_longcat_image.py │ │ │ ├── transformer_ltx.py │ │ │ ├── transformer_ltx2.py │ │ │ ├── transformer_lumina2.py │ │ │ ├── transformer_mochi.py │ │ │ ├── transformer_omnigen.py │ │ │ ├── transformer_ovis_image.py │ │ │ ├── transformer_prx.py │ │ │ ├── transformer_qwenimage.py │ │ │ ├── transformer_sana_video.py │ │ │ ├── transformer_sd3.py │ │ │ ├── transformer_skyreels_v2.py │ │ │ ├── transformer_temporal.py │ │ │ ├── transformer_wan.py │ │ │ ├── transformer_wan_animate.py │ │ │ ├── transformer_wan_vace.py │ │ │ └── transformer_z_image.py │ │ ├── unets/ │ │ │ ├── __init__.py │ │ │ ├── unet_1d.py │ │ │ ├── unet_1d_blocks.py │ │ │ ├── unet_2d.py │ │ │ ├── unet_2d_blocks.py │ │ │ ├── unet_2d_blocks_flax.py │ │ │ ├── unet_2d_condition.py │ │ │ ├── unet_2d_condition_flax.py │ │ │ ├── unet_3d_blocks.py │ │ │ ├── unet_3d_condition.py │ │ │ ├── unet_i2vgen_xl.py │ │ │ ├── unet_kandinsky3.py │ │ │ ├── unet_motion_model.py │ │ │ ├── unet_spatio_temporal_condition.py │ │ │ ├── unet_stable_cascade.py │ │ │ └── uvit_2d.py │ │ ├── upsampling.py │ │ ├── vae_flax.py │ │ └── vq_model.py │ ├── modular_pipelines/ │ │ ├── __init__.py │ │ ├── components_manager.py │ │ ├── flux/ │ │ │ ├── __init__.py │ │ │ ├── before_denoise.py │ │ │ ├── decoders.py │ │ │ ├── denoise.py │ │ │ ├── encoders.py │ │ │ ├── inputs.py │ │ │ ├── modular_blocks_flux.py │ │ │ ├── modular_blocks_flux_kontext.py │ │ │ └── modular_pipeline.py │ │ ├── flux2/ │ │ │ ├── __init__.py │ │ │ ├── before_denoise.py │ │ │ ├── decoders.py │ │ │ ├── denoise.py │ │ │ ├── encoders.py │ │ │ ├── inputs.py │ │ │ ├── modular_blocks_flux2.py │ │ │ ├── modular_blocks_flux2_klein.py │ │ │ ├── modular_blocks_flux2_klein_base.py │ │ │ └── modular_pipeline.py │ │ ├── helios/ │ │ │ ├── 
__init__.py │ │ │ ├── before_denoise.py │ │ │ ├── decoders.py │ │ │ ├── denoise.py │ │ │ ├── encoders.py │ │ │ ├── modular_blocks_helios.py │ │ │ ├── modular_blocks_helios_pyramid.py │ │ │ ├── modular_blocks_helios_pyramid_distilled.py │ │ │ └── modular_pipeline.py │ │ ├── mellon_node_utils.py │ │ ├── modular_pipeline.py │ │ ├── modular_pipeline_utils.py │ │ ├── qwenimage/ │ │ │ ├── __init__.py │ │ │ ├── before_denoise.py │ │ │ ├── decoders.py │ │ │ ├── denoise.py │ │ │ ├── encoders.py │ │ │ ├── inputs.py │ │ │ ├── modular_blocks_qwenimage.py │ │ │ ├── modular_blocks_qwenimage_edit.py │ │ │ ├── modular_blocks_qwenimage_edit_plus.py │ │ │ ├── modular_blocks_qwenimage_layered.py │ │ │ ├── modular_pipeline.py │ │ │ └── prompt_templates.py │ │ ├── stable_diffusion_xl/ │ │ │ ├── __init__.py │ │ │ ├── before_denoise.py │ │ │ ├── decoders.py │ │ │ ├── denoise.py │ │ │ ├── encoders.py │ │ │ ├── modular_blocks_stable_diffusion_xl.py │ │ │ └── modular_pipeline.py │ │ ├── wan/ │ │ │ ├── __init__.py │ │ │ ├── before_denoise.py │ │ │ ├── decoders.py │ │ │ ├── denoise.py │ │ │ ├── encoders.py │ │ │ ├── modular_blocks_wan.py │ │ │ ├── modular_blocks_wan22.py │ │ │ ├── modular_blocks_wan22_i2v.py │ │ │ ├── modular_blocks_wan_i2v.py │ │ │ └── modular_pipeline.py │ │ └── z_image/ │ │ ├── __init__.py │ │ ├── before_denoise.py │ │ ├── decoders.py │ │ ├── denoise.py │ │ ├── encoders.py │ │ ├── modular_blocks_z_image.py │ │ └── modular_pipeline.py │ ├── optimization.py │ ├── pipelines/ │ │ ├── README.md │ │ ├── __init__.py │ │ ├── allegro/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_allegro.py │ │ │ └── pipeline_output.py │ │ ├── amused/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_amused.py │ │ │ ├── pipeline_amused_img2img.py │ │ │ └── pipeline_amused_inpaint.py │ │ ├── animatediff/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_animatediff.py │ │ │ ├── pipeline_animatediff_controlnet.py │ │ │ ├── pipeline_animatediff_sdxl.py │ │ │ ├── pipeline_animatediff_sparsectrl.py │ │ │ ├── 
pipeline_animatediff_video2video.py │ │ │ ├── pipeline_animatediff_video2video_controlnet.py │ │ │ └── pipeline_output.py │ │ ├── audioldm/ │ │ │ ├── __init__.py │ │ │ └── pipeline_audioldm.py │ │ ├── audioldm2/ │ │ │ ├── __init__.py │ │ │ ├── modeling_audioldm2.py │ │ │ └── pipeline_audioldm2.py │ │ ├── aura_flow/ │ │ │ ├── __init__.py │ │ │ └── pipeline_aura_flow.py │ │ ├── auto_pipeline.py │ │ ├── blip_diffusion/ │ │ │ ├── __init__.py │ │ │ ├── blip_image_processing.py │ │ │ ├── modeling_blip2.py │ │ │ ├── modeling_ctx_clip.py │ │ │ └── pipeline_blip_diffusion.py │ │ ├── bria/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_bria.py │ │ │ └── pipeline_output.py │ │ ├── bria_fibo/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_bria_fibo.py │ │ │ ├── pipeline_bria_fibo_edit.py │ │ │ └── pipeline_output.py │ │ ├── chroma/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_chroma.py │ │ │ ├── pipeline_chroma_img2img.py │ │ │ ├── pipeline_chroma_inpainting.py │ │ │ └── pipeline_output.py │ │ ├── chronoedit/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_chronoedit.py │ │ │ └── pipeline_output.py │ │ ├── cogvideo/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_cogvideox.py │ │ │ ├── pipeline_cogvideox_fun_control.py │ │ │ ├── pipeline_cogvideox_image2video.py │ │ │ ├── pipeline_cogvideox_video2video.py │ │ │ └── pipeline_output.py │ │ ├── cogview3/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_cogview3plus.py │ │ │ └── pipeline_output.py │ │ ├── cogview4/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_cogview4.py │ │ │ ├── pipeline_cogview4_control.py │ │ │ └── pipeline_output.py │ │ ├── consisid/ │ │ │ ├── __init__.py │ │ │ ├── consisid_utils.py │ │ │ ├── pipeline_consisid.py │ │ │ └── pipeline_output.py │ │ ├── consistency_models/ │ │ │ ├── __init__.py │ │ │ └── pipeline_consistency_models.py │ │ ├── controlnet/ │ │ │ ├── __init__.py │ │ │ ├── multicontrolnet.py │ │ │ ├── pipeline_controlnet.py │ │ │ ├── pipeline_controlnet_blip_diffusion.py │ │ │ ├── pipeline_controlnet_img2img.py │ │ │ ├── 
pipeline_controlnet_inpaint.py │ │ │ ├── pipeline_controlnet_inpaint_sd_xl.py │ │ │ ├── pipeline_controlnet_sd_xl.py │ │ │ ├── pipeline_controlnet_sd_xl_img2img.py │ │ │ ├── pipeline_controlnet_union_inpaint_sd_xl.py │ │ │ ├── pipeline_controlnet_union_sd_xl.py │ │ │ ├── pipeline_controlnet_union_sd_xl_img2img.py │ │ │ └── pipeline_flax_controlnet.py │ │ ├── controlnet_hunyuandit/ │ │ │ ├── __init__.py │ │ │ └── pipeline_hunyuandit_controlnet.py │ │ ├── controlnet_sd3/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_stable_diffusion_3_controlnet.py │ │ │ └── pipeline_stable_diffusion_3_controlnet_inpainting.py │ │ ├── controlnet_xs/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_controlnet_xs.py │ │ │ └── pipeline_controlnet_xs_sd_xl.py │ │ ├── cosmos/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_cosmos2_5_predict.py │ │ │ ├── pipeline_cosmos2_5_transfer.py │ │ │ ├── pipeline_cosmos2_text2image.py │ │ │ ├── pipeline_cosmos2_video2world.py │ │ │ ├── pipeline_cosmos_text2world.py │ │ │ ├── pipeline_cosmos_video2world.py │ │ │ └── pipeline_output.py │ │ ├── dance_diffusion/ │ │ │ ├── __init__.py │ │ │ └── pipeline_dance_diffusion.py │ │ ├── ddim/ │ │ │ ├── __init__.py │ │ │ └── pipeline_ddim.py │ │ ├── ddpm/ │ │ │ ├── __init__.py │ │ │ └── pipeline_ddpm.py │ │ ├── deepfloyd_if/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_if.py │ │ │ ├── pipeline_if_img2img.py │ │ │ ├── pipeline_if_img2img_superresolution.py │ │ │ ├── pipeline_if_inpainting.py │ │ │ ├── pipeline_if_inpainting_superresolution.py │ │ │ ├── pipeline_if_superresolution.py │ │ │ ├── pipeline_output.py │ │ │ ├── safety_checker.py │ │ │ ├── timesteps.py │ │ │ └── watermark.py │ │ ├── deprecated/ │ │ │ ├── README.md │ │ │ ├── __init__.py │ │ │ ├── alt_diffusion/ │ │ │ │ ├── __init__.py │ │ │ │ ├── modeling_roberta_series.py │ │ │ │ ├── pipeline_alt_diffusion.py │ │ │ │ ├── pipeline_alt_diffusion_img2img.py │ │ │ │ └── pipeline_output.py │ │ │ ├── audio_diffusion/ │ │ │ │ ├── __init__.py │ │ │ │ ├── mel.py │ │ │ │ └── 
pipeline_audio_diffusion.py │ │ │ ├── latent_diffusion_uncond/ │ │ │ │ ├── __init__.py │ │ │ │ └── pipeline_latent_diffusion_uncond.py │ │ │ ├── pndm/ │ │ │ │ ├── __init__.py │ │ │ │ └── pipeline_pndm.py │ │ │ ├── repaint/ │ │ │ │ ├── __init__.py │ │ │ │ └── pipeline_repaint.py │ │ │ ├── score_sde_ve/ │ │ │ │ ├── __init__.py │ │ │ │ └── pipeline_score_sde_ve.py │ │ │ ├── spectrogram_diffusion/ │ │ │ │ ├── __init__.py │ │ │ │ ├── continuous_encoder.py │ │ │ │ ├── midi_utils.py │ │ │ │ ├── notes_encoder.py │ │ │ │ └── pipeline_spectrogram_diffusion.py │ │ │ ├── stable_diffusion_variants/ │ │ │ │ ├── __init__.py │ │ │ │ ├── pipeline_cycle_diffusion.py │ │ │ │ ├── pipeline_onnx_stable_diffusion_inpaint_legacy.py │ │ │ │ ├── pipeline_stable_diffusion_inpaint_legacy.py │ │ │ │ ├── pipeline_stable_diffusion_model_editing.py │ │ │ │ ├── pipeline_stable_diffusion_paradigms.py │ │ │ │ └── pipeline_stable_diffusion_pix2pix_zero.py │ │ │ ├── stochastic_karras_ve/ │ │ │ │ ├── __init__.py │ │ │ │ └── pipeline_stochastic_karras_ve.py │ │ │ ├── versatile_diffusion/ │ │ │ │ ├── __init__.py │ │ │ │ ├── modeling_text_unet.py │ │ │ │ ├── pipeline_versatile_diffusion.py │ │ │ │ ├── pipeline_versatile_diffusion_dual_guided.py │ │ │ │ ├── pipeline_versatile_diffusion_image_variation.py │ │ │ │ └── pipeline_versatile_diffusion_text_to_image.py │ │ │ └── vq_diffusion/ │ │ │ ├── __init__.py │ │ │ └── pipeline_vq_diffusion.py │ │ ├── dit/ │ │ │ ├── __init__.py │ │ │ └── pipeline_dit.py │ │ ├── easyanimate/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_easyanimate.py │ │ │ ├── pipeline_easyanimate_control.py │ │ │ ├── pipeline_easyanimate_inpaint.py │ │ │ └── pipeline_output.py │ │ ├── flux/ │ │ │ ├── __init__.py │ │ │ ├── modeling_flux.py │ │ │ ├── pipeline_flux.py │ │ │ ├── pipeline_flux_control.py │ │ │ ├── pipeline_flux_control_img2img.py │ │ │ ├── pipeline_flux_control_inpaint.py │ │ │ ├── pipeline_flux_controlnet.py │ │ │ ├── pipeline_flux_controlnet_image_to_image.py │ │ │ ├── 
pipeline_flux_controlnet_inpainting.py │ │ │ ├── pipeline_flux_fill.py │ │ │ ├── pipeline_flux_img2img.py │ │ │ ├── pipeline_flux_inpaint.py │ │ │ ├── pipeline_flux_kontext.py │ │ │ ├── pipeline_flux_kontext_inpaint.py │ │ │ ├── pipeline_flux_prior_redux.py │ │ │ └── pipeline_output.py │ │ ├── flux2/ │ │ │ ├── __init__.py │ │ │ ├── image_processor.py │ │ │ ├── pipeline_flux2.py │ │ │ ├── pipeline_flux2_klein.py │ │ │ ├── pipeline_flux2_klein_kv.py │ │ │ ├── pipeline_output.py │ │ │ └── system_messages.py │ │ ├── free_init_utils.py │ │ ├── free_noise_utils.py │ │ ├── glm_image/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_glm_image.py │ │ │ └── pipeline_output.py │ │ ├── helios/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_helios.py │ │ │ ├── pipeline_helios_pyramid.py │ │ │ └── pipeline_output.py │ │ ├── hidream_image/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_hidream_image.py │ │ │ └── pipeline_output.py │ │ ├── hunyuan_image/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_hunyuanimage.py │ │ │ ├── pipeline_hunyuanimage_refiner.py │ │ │ └── pipeline_output.py │ │ ├── hunyuan_video/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_hunyuan_skyreels_image2video.py │ │ │ ├── pipeline_hunyuan_video.py │ │ │ ├── pipeline_hunyuan_video_framepack.py │ │ │ ├── pipeline_hunyuan_video_image2video.py │ │ │ └── pipeline_output.py │ │ ├── hunyuan_video1_5/ │ │ │ ├── __init__.py │ │ │ ├── image_processor.py │ │ │ ├── pipeline_hunyuan_video1_5.py │ │ │ ├── pipeline_hunyuan_video1_5_image2video.py │ │ │ └── pipeline_output.py │ │ ├── hunyuandit/ │ │ │ ├── __init__.py │ │ │ └── pipeline_hunyuandit.py │ │ ├── i2vgen_xl/ │ │ │ ├── __init__.py │ │ │ └── pipeline_i2vgen_xl.py │ │ ├── kandinsky/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_kandinsky.py │ │ │ ├── pipeline_kandinsky_combined.py │ │ │ ├── pipeline_kandinsky_img2img.py │ │ │ ├── pipeline_kandinsky_inpaint.py │ │ │ ├── pipeline_kandinsky_prior.py │ │ │ └── text_encoder.py │ │ ├── kandinsky2_2/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_kandinsky2_2.py │ 
│ │ ├── pipeline_kandinsky2_2_combined.py │ │ │ ├── pipeline_kandinsky2_2_controlnet.py │ │ │ ├── pipeline_kandinsky2_2_controlnet_img2img.py │ │ │ ├── pipeline_kandinsky2_2_img2img.py │ │ │ ├── pipeline_kandinsky2_2_inpainting.py │ │ │ ├── pipeline_kandinsky2_2_prior.py │ │ │ └── pipeline_kandinsky2_2_prior_emb2emb.py │ │ ├── kandinsky3/ │ │ │ ├── __init__.py │ │ │ ├── convert_kandinsky3_unet.py │ │ │ ├── pipeline_kandinsky3.py │ │ │ └── pipeline_kandinsky3_img2img.py │ │ ├── kandinsky5/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_kandinsky.py │ │ │ ├── pipeline_kandinsky_i2i.py │ │ │ ├── pipeline_kandinsky_i2v.py │ │ │ ├── pipeline_kandinsky_t2i.py │ │ │ └── pipeline_output.py │ │ ├── kolors/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_kolors.py │ │ │ ├── pipeline_kolors_img2img.py │ │ │ ├── pipeline_output.py │ │ │ ├── text_encoder.py │ │ │ └── tokenizer.py │ │ ├── latent_consistency_models/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_latent_consistency_img2img.py │ │ │ └── pipeline_latent_consistency_text2img.py │ │ ├── latent_diffusion/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_latent_diffusion.py │ │ │ └── pipeline_latent_diffusion_superresolution.py │ │ ├── latte/ │ │ │ ├── __init__.py │ │ │ └── pipeline_latte.py │ │ ├── ledits_pp/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_leditspp_stable_diffusion.py │ │ │ ├── pipeline_leditspp_stable_diffusion_xl.py │ │ │ └── pipeline_output.py │ │ ├── longcat_image/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_longcat_image.py │ │ │ ├── pipeline_longcat_image_edit.py │ │ │ ├── pipeline_output.py │ │ │ └── system_messages.py │ │ ├── ltx/ │ │ │ ├── __init__.py │ │ │ ├── modeling_latent_upsampler.py │ │ │ ├── pipeline_ltx.py │ │ │ ├── pipeline_ltx_condition.py │ │ │ ├── pipeline_ltx_i2v_long_multi_prompt.py │ │ │ ├── pipeline_ltx_image2video.py │ │ │ ├── pipeline_ltx_latent_upsample.py │ │ │ └── pipeline_output.py │ │ ├── ltx2/ │ │ │ ├── __init__.py │ │ │ ├── connectors.py │ │ │ ├── export_utils.py │ │ │ ├── latent_upsampler.py │ │ │ ├── 
pipeline_ltx2.py │ │ │ ├── pipeline_ltx2_condition.py │ │ │ ├── pipeline_ltx2_image2video.py │ │ │ ├── pipeline_ltx2_latent_upsample.py │ │ │ ├── pipeline_output.py │ │ │ ├── utils.py │ │ │ └── vocoder.py │ │ ├── lucy/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_lucy_edit.py │ │ │ └── pipeline_output.py │ │ ├── lumina/ │ │ │ ├── __init__.py │ │ │ └── pipeline_lumina.py │ │ ├── lumina2/ │ │ │ ├── __init__.py │ │ │ └── pipeline_lumina2.py │ │ ├── marigold/ │ │ │ ├── __init__.py │ │ │ ├── marigold_image_processing.py │ │ │ ├── pipeline_marigold_depth.py │ │ │ ├── pipeline_marigold_intrinsics.py │ │ │ └── pipeline_marigold_normals.py │ │ ├── mochi/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_mochi.py │ │ │ └── pipeline_output.py │ │ ├── musicldm/ │ │ │ ├── __init__.py │ │ │ └── pipeline_musicldm.py │ │ ├── omnigen/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_omnigen.py │ │ │ └── processor_omnigen.py │ │ ├── onnx_utils.py │ │ ├── ovis_image/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_output.py │ │ │ └── pipeline_ovis_image.py │ │ ├── pag/ │ │ │ ├── __init__.py │ │ │ ├── pag_utils.py │ │ │ ├── pipeline_pag_controlnet_sd.py │ │ │ ├── pipeline_pag_controlnet_sd_inpaint.py │ │ │ ├── pipeline_pag_controlnet_sd_xl.py │ │ │ ├── pipeline_pag_controlnet_sd_xl_img2img.py │ │ │ ├── pipeline_pag_hunyuandit.py │ │ │ ├── pipeline_pag_kolors.py │ │ │ ├── pipeline_pag_pixart_sigma.py │ │ │ ├── pipeline_pag_sana.py │ │ │ ├── pipeline_pag_sd.py │ │ │ ├── pipeline_pag_sd_3.py │ │ │ ├── pipeline_pag_sd_3_img2img.py │ │ │ ├── pipeline_pag_sd_animatediff.py │ │ │ ├── pipeline_pag_sd_img2img.py │ │ │ ├── pipeline_pag_sd_inpaint.py │ │ │ ├── pipeline_pag_sd_xl.py │ │ │ ├── pipeline_pag_sd_xl_img2img.py │ │ │ └── pipeline_pag_sd_xl_inpaint.py │ │ ├── paint_by_example/ │ │ │ ├── __init__.py │ │ │ ├── image_encoder.py │ │ │ └── pipeline_paint_by_example.py │ │ ├── pia/ │ │ │ ├── __init__.py │ │ │ └── pipeline_pia.py │ │ ├── pipeline_flax_utils.py │ │ ├── pipeline_loading_utils.py │ │ ├── pipeline_utils.py │ 
│ ├── pixart_alpha/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_pixart_alpha.py │ │ │ └── pipeline_pixart_sigma.py │ │ ├── prx/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_output.py │ │ │ └── pipeline_prx.py │ │ ├── qwenimage/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_output.py │ │ │ ├── pipeline_qwenimage.py │ │ │ ├── pipeline_qwenimage_controlnet.py │ │ │ ├── pipeline_qwenimage_controlnet_inpaint.py │ │ │ ├── pipeline_qwenimage_edit.py │ │ │ ├── pipeline_qwenimage_edit_inpaint.py │ │ │ ├── pipeline_qwenimage_edit_plus.py │ │ │ ├── pipeline_qwenimage_img2img.py │ │ │ ├── pipeline_qwenimage_inpaint.py │ │ │ └── pipeline_qwenimage_layered.py │ │ ├── sana/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_output.py │ │ │ ├── pipeline_sana.py │ │ │ ├── pipeline_sana_controlnet.py │ │ │ ├── pipeline_sana_sprint.py │ │ │ └── pipeline_sana_sprint_img2img.py │ │ ├── sana_video/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_output.py │ │ │ ├── pipeline_sana_video.py │ │ │ └── pipeline_sana_video_i2v.py │ │ ├── semantic_stable_diffusion/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_output.py │ │ │ └── pipeline_semantic_stable_diffusion.py │ │ ├── shap_e/ │ │ │ ├── __init__.py │ │ │ ├── camera.py │ │ │ ├── pipeline_shap_e.py │ │ │ ├── pipeline_shap_e_img2img.py │ │ │ └── renderer.py │ │ ├── skyreels_v2/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_output.py │ │ │ ├── pipeline_skyreels_v2.py │ │ │ ├── pipeline_skyreels_v2_diffusion_forcing.py │ │ │ ├── pipeline_skyreels_v2_diffusion_forcing_i2v.py │ │ │ ├── pipeline_skyreels_v2_diffusion_forcing_v2v.py │ │ │ └── pipeline_skyreels_v2_i2v.py │ │ ├── stable_audio/ │ │ │ ├── __init__.py │ │ │ ├── modeling_stable_audio.py │ │ │ └── pipeline_stable_audio.py │ │ ├── stable_cascade/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_stable_cascade.py │ │ │ ├── pipeline_stable_cascade_combined.py │ │ │ └── pipeline_stable_cascade_prior.py │ │ ├── stable_diffusion/ │ │ │ ├── README.md │ │ │ ├── __init__.py │ │ │ ├── clip_image_project_model.py │ │ │ ├── convert_from_ckpt.py 
│ │ │ ├── pipeline_flax_stable_diffusion.py │ │ │ ├── pipeline_flax_stable_diffusion_img2img.py │ │ │ ├── pipeline_flax_stable_diffusion_inpaint.py │ │ │ ├── pipeline_onnx_stable_diffusion.py │ │ │ ├── pipeline_onnx_stable_diffusion_img2img.py │ │ │ ├── pipeline_onnx_stable_diffusion_inpaint.py │ │ │ ├── pipeline_onnx_stable_diffusion_upscale.py │ │ │ ├── pipeline_output.py │ │ │ ├── pipeline_stable_diffusion.py │ │ │ ├── pipeline_stable_diffusion_depth2img.py │ │ │ ├── pipeline_stable_diffusion_image_variation.py │ │ │ ├── pipeline_stable_diffusion_img2img.py │ │ │ ├── pipeline_stable_diffusion_inpaint.py │ │ │ ├── pipeline_stable_diffusion_instruct_pix2pix.py │ │ │ ├── pipeline_stable_diffusion_latent_upscale.py │ │ │ ├── pipeline_stable_diffusion_upscale.py │ │ │ ├── pipeline_stable_unclip.py │ │ │ ├── pipeline_stable_unclip_img2img.py │ │ │ ├── safety_checker.py │ │ │ ├── safety_checker_flax.py │ │ │ └── stable_unclip_image_normalizer.py │ │ ├── stable_diffusion_3/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_output.py │ │ │ ├── pipeline_stable_diffusion_3.py │ │ │ ├── pipeline_stable_diffusion_3_img2img.py │ │ │ └── pipeline_stable_diffusion_3_inpaint.py │ │ ├── stable_diffusion_attend_and_excite/ │ │ │ ├── __init__.py │ │ │ └── pipeline_stable_diffusion_attend_and_excite.py │ │ ├── stable_diffusion_diffedit/ │ │ │ ├── __init__.py │ │ │ └── pipeline_stable_diffusion_diffedit.py │ │ ├── stable_diffusion_gligen/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_stable_diffusion_gligen.py │ │ │ └── pipeline_stable_diffusion_gligen_text_image.py │ │ ├── stable_diffusion_ldm3d/ │ │ │ ├── __init__.py │ │ │ └── pipeline_stable_diffusion_ldm3d.py │ │ ├── stable_diffusion_panorama/ │ │ │ ├── __init__.py │ │ │ └── pipeline_stable_diffusion_panorama.py │ │ ├── stable_diffusion_safe/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_output.py │ │ │ ├── pipeline_stable_diffusion_safe.py │ │ │ └── safety_checker.py │ │ ├── stable_diffusion_sag/ │ │ │ ├── __init__.py │ │ │ └── 
pipeline_stable_diffusion_sag.py │ │ ├── stable_diffusion_xl/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_flax_stable_diffusion_xl.py │ │ │ ├── pipeline_output.py │ │ │ ├── pipeline_stable_diffusion_xl.py │ │ │ ├── pipeline_stable_diffusion_xl_img2img.py │ │ │ ├── pipeline_stable_diffusion_xl_inpaint.py │ │ │ ├── pipeline_stable_diffusion_xl_instruct_pix2pix.py │ │ │ └── watermark.py │ │ ├── stable_video_diffusion/ │ │ │ ├── __init__.py │ │ │ └── pipeline_stable_video_diffusion.py │ │ ├── t2i_adapter/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_stable_diffusion_adapter.py │ │ │ └── pipeline_stable_diffusion_xl_adapter.py │ │ ├── text_to_video_synthesis/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_output.py │ │ │ ├── pipeline_text_to_video_synth.py │ │ │ ├── pipeline_text_to_video_synth_img2img.py │ │ │ ├── pipeline_text_to_video_zero.py │ │ │ └── pipeline_text_to_video_zero_sdxl.py │ │ ├── transformers_loading_utils.py │ │ ├── unclip/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_unclip.py │ │ │ ├── pipeline_unclip_image_variation.py │ │ │ └── text_proj.py │ │ ├── unidiffuser/ │ │ │ ├── __init__.py │ │ │ ├── modeling_text_decoder.py │ │ │ ├── modeling_uvit.py │ │ │ └── pipeline_unidiffuser.py │ │ ├── visualcloze/ │ │ │ ├── __init__.py │ │ │ ├── pipeline_visualcloze_combined.py │ │ │ ├── pipeline_visualcloze_generation.py │ │ │ └── visualcloze_utils.py │ │ ├── wan/ │ │ │ ├── __init__.py │ │ │ ├── image_processor.py │ │ │ ├── pipeline_output.py │ │ │ ├── pipeline_wan.py │ │ │ ├── pipeline_wan_animate.py │ │ │ ├── pipeline_wan_i2v.py │ │ │ ├── pipeline_wan_vace.py │ │ │ └── pipeline_wan_video2video.py │ │ ├── wuerstchen/ │ │ │ ├── __init__.py │ │ │ ├── modeling_paella_vq_model.py │ │ │ ├── modeling_wuerstchen_common.py │ │ │ ├── modeling_wuerstchen_diffnext.py │ │ │ ├── modeling_wuerstchen_prior.py │ │ │ ├── pipeline_wuerstchen.py │ │ │ ├── pipeline_wuerstchen_combined.py │ │ │ └── pipeline_wuerstchen_prior.py │ │ └── z_image/ │ │ ├── __init__.py │ │ ├── pipeline_output.py │ │ ├── 
pipeline_z_image.py │ │ ├── pipeline_z_image_controlnet.py │ │ ├── pipeline_z_image_controlnet_inpaint.py │ │ ├── pipeline_z_image_img2img.py │ │ ├── pipeline_z_image_inpaint.py │ │ └── pipeline_z_image_omni.py │ ├── py.typed │ ├── quantizers/ │ │ ├── __init__.py │ │ ├── auto.py │ │ ├── base.py │ │ ├── bitsandbytes/ │ │ │ ├── __init__.py │ │ │ ├── bnb_quantizer.py │ │ │ └── utils.py │ │ ├── gguf/ │ │ │ ├── __init__.py │ │ │ ├── gguf_quantizer.py │ │ │ └── utils.py │ │ ├── modelopt/ │ │ │ ├── __init__.py │ │ │ └── modelopt_quantizer.py │ │ ├── pipe_quant_config.py │ │ ├── quantization_config.py │ │ ├── quanto/ │ │ │ ├── __init__.py │ │ │ ├── quanto_quantizer.py │ │ │ └── utils.py │ │ └── torchao/ │ │ ├── __init__.py │ │ └── torchao_quantizer.py │ ├── schedulers/ │ │ ├── README.md │ │ ├── __init__.py │ │ ├── deprecated/ │ │ │ ├── __init__.py │ │ │ ├── scheduling_karras_ve.py │ │ │ └── scheduling_sde_vp.py │ │ ├── scheduling_amused.py │ │ ├── scheduling_consistency_decoder.py │ │ ├── scheduling_consistency_models.py │ │ ├── scheduling_cosine_dpmsolver_multistep.py │ │ ├── scheduling_ddim.py │ │ ├── scheduling_ddim_cogvideox.py │ │ ├── scheduling_ddim_flax.py │ │ ├── scheduling_ddim_inverse.py │ │ ├── scheduling_ddim_parallel.py │ │ ├── scheduling_ddpm.py │ │ ├── scheduling_ddpm_flax.py │ │ ├── scheduling_ddpm_parallel.py │ │ ├── scheduling_ddpm_wuerstchen.py │ │ ├── scheduling_deis_multistep.py │ │ ├── scheduling_dpm_cogvideox.py │ │ ├── scheduling_dpmsolver_multistep.py │ │ ├── scheduling_dpmsolver_multistep_flax.py │ │ ├── scheduling_dpmsolver_multistep_inverse.py │ │ ├── scheduling_dpmsolver_sde.py │ │ ├── scheduling_dpmsolver_singlestep.py │ │ ├── scheduling_edm_dpmsolver_multistep.py │ │ ├── scheduling_edm_euler.py │ │ ├── scheduling_euler_ancestral_discrete.py │ │ ├── scheduling_euler_discrete.py │ │ ├── scheduling_euler_discrete_flax.py │ │ ├── scheduling_flow_match_euler_discrete.py │ │ ├── scheduling_flow_match_heun_discrete.py │ │ ├── 
scheduling_flow_match_lcm.py │ │ ├── scheduling_helios.py │ │ ├── scheduling_helios_dmd.py │ │ ├── scheduling_heun_discrete.py │ │ ├── scheduling_ipndm.py │ │ ├── scheduling_k_dpm_2_ancestral_discrete.py │ │ ├── scheduling_k_dpm_2_discrete.py │ │ ├── scheduling_karras_ve_flax.py │ │ ├── scheduling_lcm.py │ │ ├── scheduling_lms_discrete.py │ │ ├── scheduling_lms_discrete_flax.py │ │ ├── scheduling_ltx_euler_ancestral_rf.py │ │ ├── scheduling_pndm.py │ │ ├── scheduling_pndm_flax.py │ │ ├── scheduling_repaint.py │ │ ├── scheduling_sasolver.py │ │ ├── scheduling_scm.py │ │ ├── scheduling_sde_ve.py │ │ ├── scheduling_sde_ve_flax.py │ │ ├── scheduling_tcd.py │ │ ├── scheduling_unclip.py │ │ ├── scheduling_unipc_multistep.py │ │ ├── scheduling_utils.py │ │ ├── scheduling_utils_flax.py │ │ └── scheduling_vq_diffusion.py │ ├── training_utils.py │ ├── utils/ │ │ ├── __init__.py │ │ ├── accelerate_utils.py │ │ ├── constants.py │ │ ├── deprecation_utils.py │ │ ├── distributed_utils.py │ │ ├── doc_utils.py │ │ ├── dummy_bitsandbytes_objects.py │ │ ├── dummy_flax_and_transformers_objects.py │ │ ├── dummy_flax_objects.py │ │ ├── dummy_gguf_objects.py │ │ ├── dummy_note_seq_objects.py │ │ ├── dummy_nvidia_modelopt_objects.py │ │ ├── dummy_onnx_objects.py │ │ ├── dummy_optimum_quanto_objects.py │ │ ├── dummy_pt_objects.py │ │ ├── dummy_torch_and_librosa_objects.py │ │ ├── dummy_torch_and_scipy_objects.py │ │ ├── dummy_torch_and_torchsde_objects.py │ │ ├── dummy_torch_and_transformers_and_onnx_objects.py │ │ ├── dummy_torch_and_transformers_and_opencv_objects.py │ │ ├── dummy_torch_and_transformers_and_sentencepiece_objects.py │ │ ├── dummy_torch_and_transformers_objects.py │ │ ├── dummy_torchao_objects.py │ │ ├── dummy_transformers_and_torch_and_note_seq_objects.py │ │ ├── dynamic_modules_utils.py │ │ ├── export_utils.py │ │ ├── hub_utils.py │ │ ├── import_utils.py │ │ ├── loading_utils.py │ │ ├── logging.py │ │ ├── model_card_template.md │ │ ├── outputs.py │ │ ├── peft_utils.py │ 
│ ├── pil_utils.py │ │ ├── remote_utils.py │ │ ├── source_code_parsing_utils.py │ │ ├── state_dict_utils.py │ │ ├── testing_utils.py │ │ ├── torch_utils.py │ │ ├── typing_utils.py │ │ └── versions.py │ └── video_processor.py ├── tests/ │ ├── __init__.py │ ├── conftest.py │ ├── fixtures/ │ │ ├── custom_pipeline/ │ │ │ ├── pipeline.py │ │ │ └── what_ever.py │ │ └── elise_format0.mid │ ├── hooks/ │ │ ├── __init__.py │ │ ├── test_group_offloading.py │ │ ├── test_hooks.py │ │ └── test_mag_cache.py │ ├── lora/ │ │ ├── __init__.py │ │ ├── test_lora_layers_auraflow.py │ │ ├── test_lora_layers_cogvideox.py │ │ ├── test_lora_layers_cogview4.py │ │ ├── test_lora_layers_flux.py │ │ ├── test_lora_layers_flux2.py │ │ ├── test_lora_layers_helios.py │ │ ├── test_lora_layers_hunyuanvideo.py │ │ ├── test_lora_layers_ltx2.py │ │ ├── test_lora_layers_ltx_video.py │ │ ├── test_lora_layers_lumina2.py │ │ ├── test_lora_layers_mochi.py │ │ ├── test_lora_layers_qwenimage.py │ │ ├── test_lora_layers_sana.py │ │ ├── test_lora_layers_sd.py │ │ ├── test_lora_layers_sd3.py │ │ ├── test_lora_layers_sdxl.py │ │ ├── test_lora_layers_wan.py │ │ ├── test_lora_layers_wanvace.py │ │ ├── test_lora_layers_z_image.py │ │ └── utils.py │ ├── models/ │ │ ├── __init__.py │ │ ├── autoencoders/ │ │ │ ├── __init__.py │ │ │ ├── test_models_asymmetric_autoencoder_kl.py │ │ │ ├── test_models_autoencoder_cosmos.py │ │ │ ├── test_models_autoencoder_dc.py │ │ │ ├── test_models_autoencoder_hunyuan_video.py │ │ │ ├── test_models_autoencoder_kl.py │ │ │ ├── test_models_autoencoder_kl_cogvideox.py │ │ │ ├── test_models_autoencoder_kl_ltx2_audio.py │ │ │ ├── test_models_autoencoder_kl_temporal_decoder.py │ │ │ ├── test_models_autoencoder_ltx2_video.py │ │ │ ├── test_models_autoencoder_ltx_video.py │ │ │ ├── test_models_autoencoder_magvit.py │ │ │ ├── test_models_autoencoder_mochi.py │ │ │ ├── test_models_autoencoder_oobleck.py │ │ │ ├── test_models_autoencoder_rae.py │ │ │ ├── test_models_autoencoder_tiny.py │ │ │ ├── 
test_models_autoencoder_vidtok.py │ │ │ ├── test_models_autoencoder_wan.py │ │ │ ├── test_models_consistency_decoder_vae.py │ │ │ ├── test_models_vq.py │ │ │ ├── testing_utils.py │ │ │ └── vae.py │ │ ├── controlnets/ │ │ │ ├── __init__.py │ │ │ └── test_models_controlnet_cosmos.py │ │ ├── test_activations.py │ │ ├── test_attention_processor.py │ │ ├── test_layers_utils.py │ │ ├── test_modeling_common.py │ │ ├── test_models_auto.py │ │ ├── testing_utils/ │ │ │ ├── __init__.py │ │ │ ├── attention.py │ │ │ ├── cache.py │ │ │ ├── common.py │ │ │ ├── compile.py │ │ │ ├── ip_adapter.py │ │ │ ├── lora.py │ │ │ ├── memory.py │ │ │ ├── parallelism.py │ │ │ ├── quantization.py │ │ │ ├── single_file.py │ │ │ └── training.py │ │ ├── transformers/ │ │ │ ├── __init__.py │ │ │ ├── test_models_dit_transformer2d.py │ │ │ ├── test_models_pixart_transformer2d.py │ │ │ ├── test_models_prior.py │ │ │ ├── test_models_transformer_allegro.py │ │ │ ├── test_models_transformer_aura_flow.py │ │ │ ├── test_models_transformer_bria.py │ │ │ ├── test_models_transformer_bria_fibo.py │ │ │ ├── test_models_transformer_chroma.py │ │ │ ├── test_models_transformer_cogvideox.py │ │ │ ├── test_models_transformer_cogview3plus.py │ │ │ ├── test_models_transformer_cogview4.py │ │ │ ├── test_models_transformer_consisid.py │ │ │ ├── test_models_transformer_cosmos.py │ │ │ ├── test_models_transformer_easyanimate.py │ │ │ ├── test_models_transformer_flux.py │ │ │ ├── test_models_transformer_flux2.py │ │ │ ├── test_models_transformer_helios.py │ │ │ ├── test_models_transformer_hidream.py │ │ │ ├── test_models_transformer_hunyuan_1_5.py │ │ │ ├── test_models_transformer_hunyuan_dit.py │ │ │ ├── test_models_transformer_hunyuan_video.py │ │ │ ├── test_models_transformer_hunyuan_video_framepack.py │ │ │ ├── test_models_transformer_latte.py │ │ │ ├── test_models_transformer_ltx.py │ │ │ ├── test_models_transformer_ltx2.py │ │ │ ├── test_models_transformer_lumina.py │ │ │ ├── test_models_transformer_lumina2.py │ │ │ 
├── test_models_transformer_mochi.py │ │ │ ├── test_models_transformer_omnigen.py │ │ │ ├── test_models_transformer_prx.py │ │ │ ├── test_models_transformer_qwenimage.py │ │ │ ├── test_models_transformer_sana.py │ │ │ ├── test_models_transformer_sana_video.py │ │ │ ├── test_models_transformer_sd3.py │ │ │ ├── test_models_transformer_skyreels_v2.py │ │ │ ├── test_models_transformer_temporal.py │ │ │ ├── test_models_transformer_wan.py │ │ │ ├── test_models_transformer_wan_animate.py │ │ │ ├── test_models_transformer_wan_vace.py │ │ │ └── test_models_transformer_z_image.py │ │ └── unets/ │ │ ├── __init__.py │ │ ├── test_models_unet_1d.py │ │ ├── test_models_unet_2d.py │ │ ├── test_models_unet_2d_condition.py │ │ ├── test_models_unet_3d_condition.py │ │ ├── test_models_unet_controlnetxs.py │ │ ├── test_models_unet_motion.py │ │ ├── test_models_unet_spatiotemporal.py │ │ ├── test_unet_2d_blocks.py │ │ └── test_unet_blocks_common.py │ ├── modular_pipelines/ │ │ ├── __init__.py │ │ ├── flux/ │ │ │ ├── __init__.py │ │ │ └── test_modular_pipeline_flux.py │ │ ├── flux2/ │ │ │ ├── __init__.py │ │ │ ├── test_modular_pipeline_flux2.py │ │ │ ├── test_modular_pipeline_flux2_klein.py │ │ │ └── test_modular_pipeline_flux2_klein_base.py │ │ ├── helios/ │ │ │ ├── __init__.py │ │ │ └── test_modular_pipeline_helios.py │ │ ├── qwen/ │ │ │ ├── __init__.py │ │ │ └── test_modular_pipeline_qwenimage.py │ │ ├── stable_diffusion_xl/ │ │ │ ├── __init__.py │ │ │ └── test_modular_pipeline_stable_diffusion_xl.py │ │ ├── test_modular_pipelines_common.py │ │ ├── test_modular_pipelines_custom_blocks.py │ │ ├── wan/ │ │ │ ├── __init__.py │ │ │ └── test_modular_pipeline_wan.py │ │ └── z_image/ │ │ ├── __init__.py │ │ └── test_modular_pipeline_z_image.py │ ├── others/ │ │ ├── __init__.py │ │ ├── test_attention_backends.py │ │ ├── test_check_copies.py │ │ ├── test_check_dummies.py │ │ ├── test_check_support_list.py │ │ ├── test_config.py │ │ ├── test_dependencies.py │ │ ├── test_ema.py │ │ ├── 
test_hub_utils.py │ │ ├── test_image_processor.py │ │ ├── test_outputs.py │ │ ├── test_training.py │ │ ├── test_utils.py │ │ └── test_video_processor.py │ ├── pipelines/ │ │ ├── __init__.py │ │ ├── allegro/ │ │ │ ├── __init__.py │ │ │ └── test_allegro.py │ │ ├── animatediff/ │ │ │ ├── __init__.py │ │ │ ├── test_animatediff.py │ │ │ ├── test_animatediff_controlnet.py │ │ │ ├── test_animatediff_sdxl.py │ │ │ ├── test_animatediff_sparsectrl.py │ │ │ ├── test_animatediff_video2video.py │ │ │ └── test_animatediff_video2video_controlnet.py │ │ ├── audioldm2/ │ │ │ ├── __init__.py │ │ │ └── test_audioldm2.py │ │ ├── aura_flow/ │ │ │ ├── __init__.py │ │ │ └── test_pipeline_aura_flow.py │ │ ├── bria/ │ │ │ ├── __init__.py │ │ │ └── test_pipeline_bria.py │ │ ├── bria_fibo/ │ │ │ ├── __init__.py │ │ │ └── test_pipeline_bria_fibo.py │ │ ├── bria_fibo_edit/ │ │ │ ├── __init__.py │ │ │ └── test_pipeline_bria_fibo_edit.py │ │ ├── chroma/ │ │ │ ├── __init__.py │ │ │ ├── test_pipeline_chroma.py │ │ │ └── test_pipeline_chroma_img2img.py │ │ ├── chronoedit/ │ │ │ ├── __init__.py │ │ │ └── test_chronoedit.py │ │ ├── cogvideo/ │ │ │ ├── __init__.py │ │ │ ├── test_cogvideox.py │ │ │ ├── test_cogvideox_fun_control.py │ │ │ ├── test_cogvideox_image2video.py │ │ │ └── test_cogvideox_video2video.py │ │ ├── cogview3/ │ │ │ ├── __init__.py │ │ │ └── test_cogview3plus.py │ │ ├── cogview4/ │ │ │ ├── __init__.py │ │ │ └── test_cogview4.py │ │ ├── consisid/ │ │ │ ├── __init__.py │ │ │ └── test_consisid.py │ │ ├── consistency_models/ │ │ │ ├── __init__.py │ │ │ └── test_consistency_models.py │ │ ├── controlnet/ │ │ │ ├── __init__.py │ │ │ ├── test_controlnet.py │ │ │ ├── test_controlnet_img2img.py │ │ │ ├── test_controlnet_inpaint.py │ │ │ ├── test_controlnet_inpaint_sdxl.py │ │ │ ├── test_controlnet_sdxl.py │ │ │ └── test_controlnet_sdxl_img2img.py │ │ ├── controlnet_flux/ │ │ │ ├── __init__.py │ │ │ ├── test_controlnet_flux.py │ │ │ ├── test_controlnet_flux_img2img.py │ │ │ └── 
test_controlnet_flux_inpaint.py │ │ ├── controlnet_hunyuandit/ │ │ │ ├── __init__.py │ │ │ └── test_controlnet_hunyuandit.py │ │ ├── controlnet_sd3/ │ │ │ ├── __init__.py │ │ │ ├── test_controlnet_inpaint_sd3.py │ │ │ └── test_controlnet_sd3.py │ │ ├── cosmos/ │ │ │ ├── __init__.py │ │ │ ├── cosmos_guardrail.py │ │ │ ├── test_cosmos.py │ │ │ ├── test_cosmos2_5_predict.py │ │ │ ├── test_cosmos2_5_transfer.py │ │ │ ├── test_cosmos2_text2image.py │ │ │ ├── test_cosmos2_video2world.py │ │ │ └── test_cosmos_video2world.py │ │ ├── ddim/ │ │ │ ├── __init__.py │ │ │ └── test_ddim.py │ │ ├── ddpm/ │ │ │ ├── __init__.py │ │ │ └── test_ddpm.py │ │ ├── deepfloyd_if/ │ │ │ ├── __init__.py │ │ │ ├── test_if.py │ │ │ ├── test_if_img2img.py │ │ │ ├── test_if_img2img_superresolution.py │ │ │ ├── test_if_inpainting.py │ │ │ ├── test_if_inpainting_superresolution.py │ │ │ └── test_if_superresolution.py │ │ ├── dit/ │ │ │ ├── __init__.py │ │ │ └── test_dit.py │ │ ├── easyanimate/ │ │ │ ├── __init__.py │ │ │ └── test_easyanimate.py │ │ ├── flux/ │ │ │ ├── __init__.py │ │ │ ├── test_pipeline_flux.py │ │ │ ├── test_pipeline_flux_control.py │ │ │ ├── test_pipeline_flux_control_img2img.py │ │ │ ├── test_pipeline_flux_control_inpaint.py │ │ │ ├── test_pipeline_flux_fill.py │ │ │ ├── test_pipeline_flux_img2img.py │ │ │ ├── test_pipeline_flux_inpaint.py │ │ │ ├── test_pipeline_flux_kontext.py │ │ │ ├── test_pipeline_flux_kontext_inpaint.py │ │ │ └── test_pipeline_flux_redux.py │ │ ├── flux2/ │ │ │ ├── __init__.py │ │ │ ├── test_pipeline_flux2.py │ │ │ ├── test_pipeline_flux2_klein.py │ │ │ └── test_pipeline_flux2_klein_kv.py │ │ ├── glm_image/ │ │ │ ├── __init__.py │ │ │ └── test_glm_image.py │ │ ├── helios/ │ │ │ ├── __init__.py │ │ │ └── test_helios.py │ │ ├── hidream_image/ │ │ │ ├── __init__.py │ │ │ └── test_pipeline_hidream.py │ │ ├── hunyuan_image_21/ │ │ │ ├── __init__.py │ │ │ └── test_hunyuanimage.py │ │ ├── hunyuan_video/ │ │ │ ├── __init__.py │ │ │ ├── test_hunyuan_image2video.py 
│ │ │ ├── test_hunyuan_skyreels_image2video.py │ │ │ ├── test_hunyuan_video.py │ │ │ └── test_hunyuan_video_framepack.py │ │ ├── hunyuan_video1_5/ │ │ │ ├── __init__.py │ │ │ └── test_hunyuan_1_5.py │ │ ├── hunyuandit/ │ │ │ ├── __init__.py │ │ │ └── test_hunyuan_dit.py │ │ ├── ip_adapters/ │ │ │ ├── __init__.py │ │ │ └── test_ip_adapter_stable_diffusion.py │ │ ├── kandinsky/ │ │ │ ├── __init__.py │ │ │ ├── test_kandinsky.py │ │ │ ├── test_kandinsky_combined.py │ │ │ ├── test_kandinsky_img2img.py │ │ │ ├── test_kandinsky_inpaint.py │ │ │ └── test_kandinsky_prior.py │ │ ├── kandinsky2_2/ │ │ │ ├── __init__.py │ │ │ ├── test_kandinsky.py │ │ │ ├── test_kandinsky_combined.py │ │ │ ├── test_kandinsky_controlnet.py │ │ │ ├── test_kandinsky_controlnet_img2img.py │ │ │ ├── test_kandinsky_img2img.py │ │ │ ├── test_kandinsky_inpaint.py │ │ │ ├── test_kandinsky_prior.py │ │ │ └── test_kandinsky_prior_emb2emb.py │ │ ├── kandinsky3/ │ │ │ ├── __init__.py │ │ │ ├── test_kandinsky3.py │ │ │ └── test_kandinsky3_img2img.py │ │ ├── kandinsky5/ │ │ │ ├── __init__.py │ │ │ ├── test_kandinsky5.py │ │ │ ├── test_kandinsky5_i2i.py │ │ │ ├── test_kandinsky5_i2v.py │ │ │ └── test_kandinsky5_t2i.py │ │ ├── kolors/ │ │ │ ├── __init__.py │ │ │ ├── test_kolors.py │ │ │ └── test_kolors_img2img.py │ │ ├── latent_consistency_models/ │ │ │ ├── __init__.py │ │ │ ├── test_latent_consistency_models.py │ │ │ └── test_latent_consistency_models_img2img.py │ │ ├── latent_diffusion/ │ │ │ ├── __init__.py │ │ │ ├── test_latent_diffusion.py │ │ │ └── test_latent_diffusion_superresolution.py │ │ ├── latte/ │ │ │ ├── __init__.py │ │ │ └── test_latte.py │ │ ├── ledits_pp/ │ │ │ ├── __init__.py │ │ │ ├── test_ledits_pp_stable_diffusion.py │ │ │ └── test_ledits_pp_stable_diffusion_xl.py │ │ ├── longcat_image/ │ │ │ └── __init__.py │ │ ├── ltx/ │ │ │ ├── __init__.py │ │ │ ├── test_ltx.py │ │ │ ├── test_ltx_condition.py │ │ │ ├── test_ltx_image2video.py │ │ │ └── test_ltx_latent_upsample.py │ │ ├── ltx2/ │ │ │ 
├── __init__.py │ │ │ ├── test_ltx2.py │ │ │ └── test_ltx2_image2video.py │ │ ├── lumina/ │ │ │ ├── __init__.py │ │ │ └── test_lumina_nextdit.py │ │ ├── lumina2/ │ │ │ ├── __init__.py │ │ │ └── test_pipeline_lumina2.py │ │ ├── marigold/ │ │ │ ├── __init__.py │ │ │ ├── test_marigold_depth.py │ │ │ ├── test_marigold_intrinsics.py │ │ │ └── test_marigold_normals.py │ │ ├── mochi/ │ │ │ ├── __init__.py │ │ │ └── test_mochi.py │ │ ├── omnigen/ │ │ │ ├── __init__.py │ │ │ └── test_pipeline_omnigen.py │ │ ├── ovis_image/ │ │ │ └── __init__.py │ │ ├── pag/ │ │ │ ├── __init__.py │ │ │ ├── test_pag_animatediff.py │ │ │ ├── test_pag_controlnet_sd.py │ │ │ ├── test_pag_controlnet_sd_inpaint.py │ │ │ ├── test_pag_controlnet_sdxl.py │ │ │ ├── test_pag_controlnet_sdxl_img2img.py │ │ │ ├── test_pag_hunyuan_dit.py │ │ │ ├── test_pag_kolors.py │ │ │ ├── test_pag_pixart_sigma.py │ │ │ ├── test_pag_sana.py │ │ │ ├── test_pag_sd.py │ │ │ ├── test_pag_sd3.py │ │ │ ├── test_pag_sd3_img2img.py │ │ │ ├── test_pag_sd_img2img.py │ │ │ ├── test_pag_sd_inpaint.py │ │ │ ├── test_pag_sdxl.py │ │ │ ├── test_pag_sdxl_img2img.py │ │ │ └── test_pag_sdxl_inpaint.py │ │ ├── pipeline_params.py │ │ ├── pixart_alpha/ │ │ │ ├── __init__.py │ │ │ └── test_pixart.py │ │ ├── pixart_sigma/ │ │ │ ├── __init__.py │ │ │ └── test_pixart.py │ │ ├── pndm/ │ │ │ ├── __init__.py │ │ │ └── test_pndm.py │ │ ├── prx/ │ │ │ ├── __init__.py │ │ │ └── test_pipeline_prx.py │ │ ├── qwenimage/ │ │ │ ├── __init__.py │ │ │ ├── test_qwenimage.py │ │ │ ├── test_qwenimage_controlnet.py │ │ │ ├── test_qwenimage_edit.py │ │ │ ├── test_qwenimage_edit_plus.py │ │ │ ├── test_qwenimage_img2img.py │ │ │ └── test_qwenimage_inpaint.py │ │ ├── sana/ │ │ │ ├── __init__.py │ │ │ ├── test_sana.py │ │ │ ├── test_sana_controlnet.py │ │ │ ├── test_sana_sprint.py │ │ │ └── test_sana_sprint_img2img.py │ │ ├── sana_video/ │ │ │ ├── __init__.py │ │ │ ├── test_sana_video.py │ │ │ └── test_sana_video_i2v.py │ │ ├── shap_e/ │ │ │ ├── __init__.py │ │ │ 
├── test_shap_e.py │ │ │ └── test_shap_e_img2img.py │ │ ├── skyreels_v2/ │ │ │ ├── __init__.py │ │ │ ├── test_skyreels_v2.py │ │ │ ├── test_skyreels_v2_df.py │ │ │ ├── test_skyreels_v2_df_image_to_video.py │ │ │ ├── test_skyreels_v2_df_video_to_video.py │ │ │ └── test_skyreels_v2_image_to_video.py │ │ ├── stable_audio/ │ │ │ ├── __init__.py │ │ │ └── test_stable_audio.py │ │ ├── stable_cascade/ │ │ │ ├── __init__.py │ │ │ ├── test_stable_cascade_combined.py │ │ │ ├── test_stable_cascade_decoder.py │ │ │ └── test_stable_cascade_prior.py │ │ ├── stable_diffusion/ │ │ │ ├── __init__.py │ │ │ ├── test_onnx_stable_diffusion.py │ │ │ ├── test_onnx_stable_diffusion_img2img.py │ │ │ ├── test_onnx_stable_diffusion_inpaint.py │ │ │ ├── test_onnx_stable_diffusion_upscale.py │ │ │ ├── test_stable_diffusion.py │ │ │ ├── test_stable_diffusion_img2img.py │ │ │ ├── test_stable_diffusion_inpaint.py │ │ │ └── test_stable_diffusion_instruction_pix2pix.py │ │ ├── stable_diffusion_2/ │ │ │ ├── __init__.py │ │ │ ├── test_stable_diffusion.py │ │ │ ├── test_stable_diffusion_depth.py │ │ │ ├── test_stable_diffusion_inpaint.py │ │ │ ├── test_stable_diffusion_latent_upscale.py │ │ │ ├── test_stable_diffusion_upscale.py │ │ │ └── test_stable_diffusion_v_pred.py │ │ ├── stable_diffusion_3/ │ │ │ ├── __init__.py │ │ │ ├── test_pipeline_stable_diffusion_3.py │ │ │ ├── test_pipeline_stable_diffusion_3_img2img.py │ │ │ └── test_pipeline_stable_diffusion_3_inpaint.py │ │ ├── stable_diffusion_adapter/ │ │ │ ├── __init__.py │ │ │ └── test_stable_diffusion_adapter.py │ │ ├── stable_diffusion_image_variation/ │ │ │ ├── __init__.py │ │ │ └── test_stable_diffusion_image_variation.py │ │ ├── stable_diffusion_xl/ │ │ │ ├── __init__.py │ │ │ ├── test_stable_diffusion_xl.py │ │ │ ├── test_stable_diffusion_xl_adapter.py │ │ │ ├── test_stable_diffusion_xl_img2img.py │ │ │ ├── test_stable_diffusion_xl_inpaint.py │ │ │ └── test_stable_diffusion_xl_instruction_pix2pix.py │ │ ├── stable_unclip/ │ │ │ ├── 
__init__.py │ │ │ ├── test_stable_unclip.py │ │ │ └── test_stable_unclip_img2img.py │ │ ├── stable_video_diffusion/ │ │ │ ├── __init__.py │ │ │ └── test_stable_video_diffusion.py │ │ ├── test_pipeline_utils.py │ │ ├── test_pipelines.py │ │ ├── test_pipelines_auto.py │ │ ├── test_pipelines_combined.py │ │ ├── test_pipelines_common.py │ │ ├── test_pipelines_onnx_common.py │ │ ├── visualcloze/ │ │ │ ├── __init__.py │ │ │ ├── test_pipeline_visualcloze_combined.py │ │ │ └── test_pipeline_visualcloze_generation.py │ │ ├── wan/ │ │ │ ├── __init__.py │ │ │ ├── test_wan.py │ │ │ ├── test_wan_22.py │ │ │ ├── test_wan_22_image_to_video.py │ │ │ ├── test_wan_animate.py │ │ │ ├── test_wan_image_to_video.py │ │ │ ├── test_wan_vace.py │ │ │ └── test_wan_video_to_video.py │ │ └── z_image/ │ │ ├── __init__.py │ │ ├── test_z_image.py │ │ ├── test_z_image_img2img.py │ │ └── test_z_image_inpaint.py │ ├── quantization/ │ │ ├── __init__.py │ │ ├── bnb/ │ │ │ ├── README.md │ │ │ ├── __init__.py │ │ │ ├── test_4bit.py │ │ │ └── test_mixed_int8.py │ │ ├── gguf/ │ │ │ ├── __init__.py │ │ │ └── test_gguf.py │ │ ├── modelopt/ │ │ │ ├── __init__.py │ │ │ └── test_modelopt.py │ │ ├── quanto/ │ │ │ ├── __init__.py │ │ │ └── test_quanto.py │ │ ├── test_pipeline_level_quantization.py │ │ ├── test_torch_compile_utils.py │ │ ├── torchao/ │ │ │ ├── README.md │ │ │ ├── __init__.py │ │ │ └── test_torchao.py │ │ └── utils.py │ ├── remote/ │ │ ├── __init__.py │ │ ├── test_remote_decode.py │ │ └── test_remote_encode.py │ ├── schedulers/ │ │ ├── __init__.py │ │ ├── test_scheduler_consistency_model.py │ │ ├── test_scheduler_ddim.py │ │ ├── test_scheduler_ddim_inverse.py │ │ ├── test_scheduler_ddim_parallel.py │ │ ├── test_scheduler_ddpm.py │ │ ├── test_scheduler_ddpm_parallel.py │ │ ├── test_scheduler_deis.py │ │ ├── test_scheduler_dpm_multi.py │ │ ├── test_scheduler_dpm_multi_inverse.py │ │ ├── test_scheduler_dpm_sde.py │ │ ├── test_scheduler_dpm_single.py │ │ ├── test_scheduler_edm_dpmsolver_multistep.py 
│ │ ├── test_scheduler_edm_euler.py │ │ ├── test_scheduler_euler.py │ │ ├── test_scheduler_euler_ancestral.py │ │ ├── test_scheduler_heun.py │ │ ├── test_scheduler_ipndm.py │ │ ├── test_scheduler_kdpm2_ancestral.py │ │ ├── test_scheduler_kdpm2_discrete.py │ │ ├── test_scheduler_lcm.py │ │ ├── test_scheduler_lms.py │ │ ├── test_scheduler_pndm.py │ │ ├── test_scheduler_sasolver.py │ │ ├── test_scheduler_score_sde_ve.py │ │ ├── test_scheduler_tcd.py │ │ ├── test_scheduler_unclip.py │ │ ├── test_scheduler_unipc.py │ │ ├── test_scheduler_vq_diffusion.py │ │ └── test_schedulers.py │ ├── single_file/ │ │ ├── __init__.py │ │ ├── single_file_testing_utils.py │ │ ├── test_lumina2_transformer.py │ │ ├── test_model_autoencoder_dc_single_file.py │ │ ├── test_model_controlnet_single_file.py │ │ ├── test_model_flux_transformer_single_file.py │ │ ├── test_model_motion_adapter_single_file.py │ │ ├── test_model_sd_cascade_unet_single_file.py │ │ ├── test_model_vae_single_file.py │ │ ├── test_model_wan_autoencoder_single_file.py │ │ ├── test_model_wan_transformer3d_single_file.py │ │ ├── test_sana_transformer.py │ │ ├── test_stable_diffusion_controlnet_img2img_single_file.py │ │ ├── test_stable_diffusion_controlnet_inpaint_single_file.py │ │ ├── test_stable_diffusion_controlnet_single_file.py │ │ ├── test_stable_diffusion_img2img_single_file.py │ │ ├── test_stable_diffusion_inpaint_single_file.py │ │ ├── test_stable_diffusion_single_file.py │ │ ├── test_stable_diffusion_upscale_single_file.py │ │ ├── test_stable_diffusion_xl_adapter_single_file.py │ │ ├── test_stable_diffusion_xl_controlnet_single_file.py │ │ ├── test_stable_diffusion_xl_img2img_single_file.py │ │ ├── test_stable_diffusion_xl_instruct_pix2pix.py │ │ └── test_stable_diffusion_xl_single_file.py │ └── testing_utils.py └── utils/ ├── check_config_docstrings.py ├── check_copies.py ├── check_doc_toc.py ├── check_dummies.py ├── check_inits.py ├── check_repo.py ├── check_support_list.py ├── check_table.py ├── 
consolidated_test_report.py ├── custom_init_isort.py ├── extract_tests_from_mixin.py ├── fetch_latest_release_branch.py ├── fetch_torch_cuda_pipeline_test_matrix.py ├── generate_model_tests.py ├── get_modified_files.py ├── log_reports.py ├── modular_auto_docstring.py ├── notify_benchmarking_status.py ├── notify_community_pipelines_mirror.py ├── notify_slack_about_release.py ├── overwrite_expected_slice.py ├── print_env.py ├── release.py ├── stale.py ├── tests_fetcher.py └── update_metadata.py ================================================ FILE CONTENTS ================================================ ================================================ FILE: .ai/AGENTS.md ================================================ # Diffusers — Agent Guide ## Coding style Strive to write code as simple and explicit as possible. - Minimize small helper/utility functions — inline the logic instead. A reader should be able to follow the full flow without jumping between functions. - No defensive code or unused code paths — do not add fallback paths, safety checks, or configuration options "just in case". When porting from a research repo, delete training-time code paths, experimental flags, and ablation branches entirely — only keep the inference path you are actually integrating. - Do not guess user intent and silently correct behavior. Make the expected inputs clear in the docstring, and raise a concise error for unsupported cases rather than adding complex fallback logic. --- ### Dependencies - No new mandatory dependency without discussion (e.g. 
`einops`) - Optional deps guarded with `is_X_available()` and a dummy in `utils/dummy_*.py` ## Code formatting - `make style` and `make fix-copies` should be run as the final step before opening a PR ### Copied Code - Many classes are kept in sync with a source via a `# Copied from ...` header comment - Do not edit a `# Copied from` block directly — run `make fix-copies` to propagate changes from the source - Remove the header to intentionally break the link ### Models - All layer calls should be visible directly in `forward` — avoid helper functions that hide `nn.Module` calls. - Avoid graph breaks for `torch.compile` compatibility — do not insert NumPy operations in forward implementations and any other patterns that can break `torch.compile` compatibility with `fullgraph=True`. - See the **model-integration** skill for the attention pattern, pipeline rules, test setup instructions, and other important details. ## Skills Task-specific guides live in `.ai/skills/` and are loaded on demand by AI agents. Available skills: **model-integration** (adding/converting pipelines), **parity-testing** (debugging numerical parity). ================================================ FILE: .ai/skills/model-integration/SKILL.md ================================================ --- name: integrating-models description: > Use when adding a new model or pipeline to diffusers, setting up file structure for a new model, converting a pipeline to modular format, or converting weights for a new version of an already-supported model. --- ## Goal Integrate a new model into diffusers end-to-end. The overall flow: 1. **Gather info** — ask the user for the reference repo, setup guide, a runnable inference script, and other objectives such as standard vs modular. 2. **Confirm the plan** — once you have everything, tell the user exactly what you'll do: e.g. "I'll integrate model X with pipeline Y into diffusers based on your script. 
I'll run parity tests (model-level and pipeline-level) using the `parity-testing` skill to verify numerical correctness against the reference." 3. **Implement** — write the diffusers code (model, pipeline, scheduler if needed), convert weights, register in `__init__.py`. 4. **Parity test** — use the `parity-testing` skill to verify component and e2e parity against the reference implementation. 5. **Deliver a unit test** — provide a self-contained test script that runs the diffusers implementation, checks numerical output (np allclose), and saves an image/video for visual verification. This is what the user runs to confirm everything works. Work one workflow at a time — get it to full parity before moving on. ## Setup — gather before starting Before writing any code, gather info in this order: 1. **Reference repo** — ask for the github link. If they've already set it up locally, ask for the path. Otherwise, ask what setup steps are needed (install deps, download checkpoints, set env vars, etc.) and run through them before proceeding. 2. **Inference script** — ask for a runnable end-to-end script for a basic workflow first (e.g. T2V). Then ask what other workflows they want to support (I2V, V2V, etc.) and agree on the full implementation order together. 3. **Standard vs modular** — standard pipelines, modular, or both? Use `AskUserQuestion` with structured choices for step 3 when the options are known. ## Standard Pipeline Integration ### File structure for a new model ``` src/diffusers/ models/transformers/transformer_.py # The core model schedulers/scheduling_.py # If model needs a custom scheduler pipelines// __init__.py pipeline_.py # Main pipeline pipeline__.py # Variant pipelines (e.g. 
pyramid, distilled) pipeline_output.py # Output dataclass loaders/lora_pipeline.py # LoRA mixin (add to existing file) tests/ models/transformers/test_models_transformer_.py pipelines//test_.py lora/test_lora_layers_.py docs/source/en/api/ pipelines/.md models/_transformer3d.md # or appropriate name ``` ### Integration checklist - [ ] Implement transformer model with `from_pretrained` support - [ ] Implement or reuse scheduler - [ ] Implement pipeline(s) with `__call__` method - [ ] Add LoRA support if applicable - [ ] Register all classes in `__init__.py` files (lazy imports) - [ ] Write unit tests (model, pipeline, LoRA) - [ ] Write docs - [ ] Run `make style` and `make quality` - [ ] Test parity with reference implementation (see `parity-testing` skill) ### Attention pattern Attention must follow the diffusers pattern: both the `Attention` class and its processor are defined in the model file. The processor's `__call__` handles the actual compute and must use `dispatch_attention_fn` rather than calling `F.scaled_dot_product_attention` directly. The attention class inherits `AttentionModuleMixin` and declares `_default_processor_cls` and `_available_processors`. ```python # transformer_mymodel.py class MyModelAttnProcessor: _attention_backend = None _parallel_config = None def __call__(self, attn, hidden_states, attention_mask=None, ...): query = attn.to_q(hidden_states) key = attn.to_k(hidden_states) value = attn.to_v(hidden_states) # reshape, apply rope, etc. 
hidden_states = dispatch_attention_fn( query, key, value, attn_mask=attention_mask, backend=self._attention_backend, parallel_config=self._parallel_config, ) hidden_states = hidden_states.flatten(2, 3) return attn.to_out[0](hidden_states) class MyModelAttention(nn.Module, AttentionModuleMixin): _default_processor_cls = MyModelAttnProcessor _available_processors = [MyModelAttnProcessor] def __init__(self, query_dim, heads=8, dim_head=64, ...): super().__init__() self.to_q = nn.Linear(query_dim, heads * dim_head, bias=False) self.to_k = nn.Linear(query_dim, heads * dim_head, bias=False) self.to_v = nn.Linear(query_dim, heads * dim_head, bias=False) self.to_out = nn.ModuleList([nn.Linear(heads * dim_head, query_dim), nn.Dropout(0.0)]) self.set_processor(MyModelAttnProcessor()) def forward(self, hidden_states, attention_mask=None, **kwargs): return self.processor(self, hidden_states, attention_mask, **kwargs) ``` Consult the implementations in `src/diffusers/models/transformers/` if you need further references. ### Implementation rules 1. **Don't combine structural changes with behavioral changes.** Restructuring code to fit diffusers APIs (ModelMixin, ConfigMixin, etc.) is unavoidable. But don't also "improve" the algorithm, refactor computation order, or rename internal variables for aesthetics. Keep numerical logic as close to the reference as possible, even if it looks unclean. For standard → modular, this is stricter: copy loop logic verbatim and only restructure into blocks. Clean up in a separate commit after parity is confirmed. 2. **Pipelines must inherit from `DiffusionPipeline`.** Consult implementations in `src/diffusers/pipelines` in case you need references. 3. **Don't subclass an existing pipeline for a variant.** DO NOT use an existing pipeline class (e.g., `FluxPipeline`) to override another pipeline (e.g., `FluxImg2ImgPipeline`) which will be a part of the core codebase (`src`). 
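Rule 3 above can be sketched as a class layout. This is a minimal illustration only: `DiffusionPipeline` here is a local stand-in so the snippet runs without the library, and `MyModelPipeline` / `MyModelImg2ImgPipeline` are hypothetical names, not classes from the codebase.

```python
class DiffusionPipeline:
    """Local stand-in for diffusers' DiffusionPipeline (assumption, for illustration)."""


class MyModelPipeline(DiffusionPipeline):
    """Hypothetical text-to-image pipeline."""

    def __call__(self, prompt):
        return f"text2img({prompt})"


class MyModelImg2ImgPipeline(DiffusionPipeline):
    """Hypothetical variant: a sibling of MyModelPipeline, not a subclass of it."""

    def __call__(self, prompt, image):
        return f"img2img({prompt}, {image})"
```

Both classes derive from the shared base, so the variant spells out its own `__call__` rather than overriding the one in `MyModelPipeline`.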
### Test setup - Slow tests gated with `@slow` and `RUN_SLOW=1` - All model-level tests must use the `BaseModelTesterConfig`, `ModelTesterMixin`, `MemoryTesterMixin`, `AttentionTesterMixin`, `LoraTesterMixin`, and `TrainingTesterMixin` classes initially to write the tests. Any additional tests should be added after discussions with the maintainers. Use `tests/models/transformers/test_models_transformer_flux.py` as a reference. ### Common diffusers conventions - Pipelines inherit from `DiffusionPipeline` - Models use `ModelMixin` with `register_to_config` for config serialization - Schedulers use `SchedulerMixin` with `ConfigMixin` - Use `@torch.no_grad()` on pipeline `__call__` - Support `output_type="latent"` for skipping VAE decode - Support `generator` parameter for reproducibility - Use `self.progress_bar(timesteps)` for progress tracking ## Gotchas 1. **Forgetting `__init__.py` lazy imports.** Every new class must be registered in the appropriate `__init__.py` with lazy imports. Missing this causes `ImportError` that only shows up when users try `from diffusers import YourNewClass`. 2. **Using `einops` or other non-PyTorch deps.** Reference implementations often use `einops.rearrange`. Always rewrite with native PyTorch (`reshape`, `permute`, `unflatten`). Don't add the dependency. If a dependency is truly unavoidable, guard its import: `if is_my_dependency_available(): import my_dependency`. 3. **Missing `make fix-copies` after `# Copied from`.** If you add `# Copied from` annotations, you must run `make fix-copies` to propagate them. CI will fail otherwise. 4. **Wrong `_supports_cache_class` / `_no_split_modules`.** These class attributes control KV cache and device placement. Copy from a similar model and verify -- wrong values cause silent correctness bugs or OOM errors. 5. **Missing `@torch.no_grad()` on pipeline `__call__`.** Forgetting this causes GPU OOM from gradient accumulation during inference. 6. 
**Config serialization gaps.** Every `__init__` parameter in a `ModelMixin` subclass must be captured by `register_to_config`. If you add a new param but forget to register it, `from_pretrained` will silently use the default instead of the saved value. 7. **Forgetting to update `_import_structure` and `_lazy_modules`.** The top-level `src/diffusers/__init__.py` has both -- missing either one causes partial import failures. 8. **Hardcoded dtype in model forward.** Don't hardcode `torch.float32` or `torch.bfloat16` in the model's forward pass. Use the dtype of the input tensors or `self.dtype` so the model works with any precision. --- ## Modular Pipeline Conversion See [modular-conversion.md](modular-conversion.md) for the full guide on converting standard pipelines to modular format, including block types, build order, guider abstraction, and conversion checklist. --- ## Weight Conversion Tips ================================================ FILE: .ai/skills/model-integration/modular-conversion.md ================================================ # Modular Pipeline Conversion Reference ## When to use Modular pipelines break a monolithic `__call__` into composable blocks. Convert when: - The model supports multiple workflows (T2V, I2V, V2V, etc.) - Users need to swap guidance strategies (CFG, CFG-Zero*, PAG) - You want to share blocks across pipeline variants ## File structure ``` src/diffusers/modular_pipelines// __init__.py # Lazy imports modular_pipeline.py # Pipeline class (tiny, mostly config) encoders.py # Text encoder + image/video VAE encoder blocks before_denoise.py # Pre-denoise setup blocks denoise.py # The denoising loop blocks decoders.py # VAE decode block modular_blocks_.py # Block assembly (AutoBlocks) ``` ## Block types decision tree ``` Is this a single operation? YES -> ModularPipelineBlocks (leaf block) Does it run multiple blocks in sequence? YES -> SequentialPipelineBlocks Does it iterate (e.g. chunk loop)? 
    YES -> LoopSequentialPipelineBlocks
Does it choose ONE block based on which input is present?
    Is the selection 1:1 with trigger inputs?
        YES -> AutoPipelineBlocks (simple trigger mapping)
        NO  -> ConditionalPipelineBlocks (custom select_block method)
```

## Build order (easiest first)

1. `decoders.py` -- Takes latents, runs VAE decode, returns images/videos
2. `encoders.py` -- Takes prompt, returns prompt_embeds. Add image/video VAE encoder if needed
3. `before_denoise.py` -- Timesteps, latent prep, noise setup. Each logical operation = one block
4. `denoise.py` -- The hardest. Convert guidance to guider abstraction

## Key pattern: Guider abstraction

Original pipeline has guidance baked in:

```python
for i, t in enumerate(timesteps):
    noise_pred = self.transformer(latents, prompt_embeds, ...)
    if self.do_classifier_free_guidance:
        noise_uncond = self.transformer(latents, negative_prompt_embeds, ...)
        noise_pred = noise_uncond + scale * (noise_pred - noise_uncond)
    latents = self.scheduler.step(noise_pred, t, latents).prev_sample
```

Modular pipeline separates concerns:

```python
guider_inputs = {
    "encoder_hidden_states": (prompt_embeds, negative_prompt_embeds),
}

for i, t in enumerate(timesteps):
    components.guider.set_state(step=i, num_inference_steps=num_steps, timestep=t)
    guider_state = components.guider.prepare_inputs(guider_inputs)

    for batch in guider_state:
        components.guider.prepare_models(components.transformer)
        cond_kwargs = {k: getattr(batch, k) for k in guider_inputs}

        context_name = getattr(batch, components.guider._identifier_key)
        with components.transformer.cache_context(context_name):
            batch.noise_pred = components.transformer(
                hidden_states=latents,
                timestep=t,  # loop variable; apply any model-specific expand/cast here
                return_dict=False,
                **cond_kwargs,
                **shared_kwargs,
            )[0]
        components.guider.cleanup_models(components.transformer)

    noise_pred = components.guider(guider_state)[0]
    latents = components.scheduler.step(noise_pred, t, latents, generator=generator)[0]
```

## Key pattern: Chunk loops for video models

Use `LoopSequentialPipelineBlocks` for the outer loop:

```python
class ChunkDenoiseStep(LoopSequentialPipelineBlocks):
    block_classes = [PrepareChunkStep, NoiseGenStep, DenoiseInnerStep, UpdateStep]
```

Note: blocks inside `LoopSequentialPipelineBlocks` receive `(components, block_state, k)` where `k` is the loop iteration index.

## Key pattern: Workflow selection

```python
class AutoDenoise(ConditionalPipelineBlocks):
    block_classes = [V2VDenoiseStep, I2VDenoiseStep, T2VDenoiseStep]
    block_trigger_inputs = ["video_latents", "image_latents"]
    default_block_name = "text2video"
```

## Standard InputParam/OutputParam templates

```python
# Inputs
InputParam.template("prompt")               # str, required
InputParam.template("negative_prompt")      # str, optional
InputParam.template("image")                # PIL.Image, optional
InputParam.template("generator")            # torch.Generator, optional
InputParam.template("num_inference_steps")  # int, default=50
InputParam.template("latents")              # torch.Tensor, optional

# Outputs
OutputParam.template("prompt_embeds")
OutputParam.template("negative_prompt_embeds")
OutputParam.template("image_latents")
OutputParam.template("latents")
OutputParam.template("videos")
OutputParam.template("images")
```

## ComponentSpec patterns

```python
# Heavy models - loaded from pretrained
ComponentSpec("transformer", YourTransformerModel)
ComponentSpec("vae", AutoencoderKL)

# Lightweight objects - created inline from config
ComponentSpec(
    "guider",
    ClassifierFreeGuidance,
    config=FrozenDict({"guidance_scale": 7.5}),
    default_creation_method="from_config",
)
```

## Conversion checklist

- [ ] Read original pipeline's `__call__` end-to-end, map stages
- [ ] Write test scripts (reference + target) with identical seeds
- [ ] Create file structure under `modular_pipelines//`
- [ ] Write decoder block (simplest)
- [ ] Write encoder blocks (text, image, video)
- [ ] Write before_denoise blocks (timesteps, latent prep, noise)
- [ ] Write denoise block with guider abstraction (hardest)
- [ ] Create pipeline class with `default_blocks_name`
- [ ] Assemble blocks in `modular_blocks_.py`
- [ ] Wire up `__init__.py` with lazy imports
- [ ] Run `make style` and `make quality`
- [ ] Test all workflows for parity with reference

================================================
FILE: .ai/skills/parity-testing/SKILL.md
================================================
---
name: testing-parity
description: >
  Use when debugging or verifying numerical parity between pipeline implementations
  (e.g., research repo vs diffusers, standard vs modular). Also relevant when outputs
  look wrong -- washed out, pixelated, or have visual artifacts -- as these are usually
  parity bugs.
---

## Setup -- gather before starting

Before writing any test code, gather:

1. **Which two implementations** are being compared (e.g. research repo → diffusers, standard → modular, or research → modular). Use `AskUserQuestion` with structured choices if not already clear.
2. **Two equivalent runnable scripts** -- one for each implementation, both expected to produce identical output given the same inputs. These scripts define what "parity" means concretely.

When invoked from the `model-integration` skill, you already have context: the reference script comes from step 2 of setup, and the diffusers script is the one you just wrote. You just need to make sure both scripts are runnable and use the same inputs/seed/params.

## Test strategy

**Component parity (CPU/float32) -- always run, as you build.**
Test each component before assembling the pipeline. This is the foundation -- if individual pieces are wrong, the pipeline can't be right. Test each component in isolation with a strict tolerance (max_diff < 1e-3).

Test both freshly converted checkpoints and saved checkpoints:
- **Fresh**: convert from checkpoint weights, compare against reference (catches conversion bugs)
- **Saved**: load from saved model on disk, compare against reference (catches stale saves)

Keep component test scripts around -- you will need to re-run them during pipeline debugging with different inputs or config values.

Template -- one self-contained script per component, reference and diffusers side-by-side:

```python
import torch


@torch.inference_mode()
def test_my_component(mode="fresh", model_path=None):
    # 1. Deterministic input
    gen = torch.Generator().manual_seed(42)
    x = torch.randn(1, 3, 64, 64, generator=gen, dtype=torch.float32)

    # 2. Reference: load from checkpoint, run, free
    ref_model = ReferenceModel.from_config(config)
    ref_model.load_state_dict(load_weights("prefix"), strict=True)
    ref_model = ref_model.float().eval()
    ref_out = ref_model(x).clone()
    del ref_model

    # 3. Diffusers: fresh (convert weights) or saved (from_pretrained)
    if mode == "fresh":
        diff_model = convert_my_component(load_weights("prefix"))
    else:
        diff_model = DiffusersModel.from_pretrained(model_path, torch_dtype=torch.float32)
    diff_model = diff_model.float().eval()
    diff_out = diff_model(x)
    del diff_model

    # 4. Compare in same script -- no saving to disk
    max_diff = (ref_out - diff_out).abs().max().item()
    assert max_diff < 1e-3, f"FAIL: max_diff={max_diff:.2e}"
```

Key points: (a) both reference and diffusers component in one script -- never split into separate scripts that save/load intermediates, (b) deterministic input via seeded generator, (c) load one model at a time to fit in CPU RAM, (d) `.clone()` the reference output before deleting the model.

**E2E visual (GPU/bfloat16) -- once the pipeline is assembled.**
Both pipelines generate independently with identical seeds/params. Save outputs and compare visually. If outputs look identical, you're done -- no need for deeper testing.
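An E2E comparison is only meaningful if both scripts really ran with the same seed and parameters. The sketch below is one way to check that before looking at any output. `diff_run_params` and the parameter names in the two dicts are hypothetical, chosen for illustration; they are not part of diffusers or of any reference repo.

```python
from typing import Any

_MISSING = "<missing>"


def diff_run_params(ref: dict[str, Any], target: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {key: (ref_value, target_value)} for every parameter that differs."""
    mismatches = {}
    for key in sorted(ref.keys() | target.keys()):
        ref_val = ref.get(key, _MISSING)
        tgt_val = target.get(key, _MISSING)
        if ref_val != tgt_val:
            mismatches[key] = (ref_val, tgt_val)
    return mismatches


# Hypothetical call parameters recorded by each script (illustrative values only).
ref_params = {"seed": 42, "num_inference_steps": 30, "guidance_scale": 4.0, "height": 512, "width": 768}
target_params = {"seed": 42, "num_inference_steps": 30, "guidance_scale": 4.5, "height": 512, "width": 768}

mismatches = diff_run_params(ref_params, target_params)
for key, (ref_val, tgt_val) in mismatches.items():
    print(f"  DIFF {key}: ref={ref_val}, target={tgt_val}")
```

An empty result means the two runs were configured identically, so any visual difference comes from the code rather than from the call arguments.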
**Pipeline stage tests -- only if E2E fails and you need to isolate the bug.**
If the user already suspects where divergence is, start there. Otherwise, work through stages in order.

First, **match noise generation**: the way initial noise/latents are constructed (seed handling, generator, randn call order) often differs between the two scripts. If the noise doesn't match, nothing downstream will match. Check how noise is initialized in the diffusers script -- if it doesn't match the reference, temporarily change it to match. Note what you changed so it can be reverted after parity is confirmed.

For small models, run on CPU/float32 for strict comparison. For large models (e.g. 22B params), CPU/float32 is impractical -- use GPU/bfloat16 with `enable_model_cpu_offload()` and relax tolerances (max_diff < 1e-1 for bfloat16 is typical for passing tests; cosine similarity > 0.9999 is a good secondary check).

Test encode and decode stages first -- they're simpler and bugs there are easier to fix. Only debug the denoising loop if encode and decode both pass.

The challenge: pipelines are monolithic `__call__` methods -- you can't just call "the encode part". See [checkpoint-mechanism.md](checkpoint-mechanism.md) for the checkpoint class that lets you stop, save, or inject tensors at named locations inside the pipeline.

**Stage test order -- encode, decode, then denoise:**

- **`encode`** (test first): Stop both pipelines at `"preloop"`. Compare **every single variable** that will be consumed by the denoising loop -- not just latents and sigmas, but also prompt embeddings, attention masks, positional coordinates, connector outputs, and any conditioning inputs.
- **`decode`** (test second, before denoise): Run the reference pipeline fully -- checkpoint the post-loop latents AND let it finish to get the decoded output. Then feed those same post-loop latents through the diffusers pipeline's decode path. Compare both numerically AND visually.
- **`denoise`** (test last): Run both pipelines with realistic `num_steps` (e.g. 30) so the scheduler computes correct sigmas/timesteps, but stop after 2 loop iterations using `after_step_1`. Don't set `num_steps=2` -- that produces unrealistic sigma schedules.

```python
# Encode stage -- stop before the loop, compare ALL inputs:
ref_ckpts = {"preloop": Checkpoint(save=True, stop=True)}
run_reference_pipeline(ref_ckpts)
ref_data = ref_ckpts["preloop"].data

diff_ckpts = {"preloop": Checkpoint(save=True, stop=True)}
run_diffusers_pipeline(diff_ckpts)
diff_data = diff_ckpts["preloop"].data

# Compare EVERY variable consumed by the denoise loop:
compare_tensors("latents", ref_data["latents"], diff_data["latents"])
compare_tensors("sigmas", ref_data["sigmas"], diff_data["sigmas"])
compare_tensors("prompt_embeds", ref_data["prompt_embeds"], diff_data["prompt_embeds"])
# ... every single tensor the transformer forward() will receive
```

**E2E-injected visual test**: Once you've identified a suspected root cause using stage tests, confirm it with an e2e-injected run -- inject the known-good tensor from reference and generate a full video. If the output looks identical to reference, you've confirmed the root cause.

## Debugging technique: Injection for root-cause isolation

When stage tests show divergence, **inject a known-good tensor from one pipeline into the other** to test whether the remaining code is correct.

The principle: if you suspect input X is the root cause of divergence in stage S:

1. Run the reference pipeline and capture X
2. Run the diffusers pipeline but **replace** its X with the reference's X (via checkpoint load)
3. Compare outputs of stage S

If outputs now match: X was the root cause. If they still diverge: the bug is in the stage logic itself, not in X.

| What you're testing | What you inject | Where you inject |
|---|---|---|
| Is the decode stage correct? | Post-loop latents from reference | Before decode |
| Is the denoise loop correct? | Pre-loop latents from reference | Before the loop |
| Is step N correct? | Post-step-(N-1) latents from reference | Before step N |

**Per-step accumulation tracing**: When injection confirms the loop is correct but you want to understand *how* a small initial difference compounds, capture `after_step_{i}` for every step and plot the max_diff curve. A healthy curve stays bounded; an exponential blowup in later steps points to an amplification mechanism (see Pitfall #13 in [pitfalls.md](pitfalls.md)).

## Debugging technique: Visual comparison via frame extraction

For video pipelines, numerical metrics alone can be misleading. Extract and view individual frames:

```python
import numpy as np
from PIL import Image


def extract_frames(video_np, frame_indices):
    """video_np: (frames, H, W, 3) float array in [0, 1]"""
    for idx in frame_indices:
        frame = (video_np[idx] * 255).clip(0, 255).astype(np.uint8)
        img = Image.fromarray(frame)
        img.save(f"frame_{idx}.png")


# Compare specific frames from both pipelines
extract_frames(ref_video, [0, 60, 120])
extract_frames(diff_video, [0, 60, 120])
```

## Testing rules

1. **Never use reference code in the diffusers test path.** Each side must use only its own code.
2. **Never monkey-patch model internals in tests.** Do not replace `model.forward` or patch internal methods.
3. **Debugging instrumentation must be non-destructive.** Checkpoint captures for debugging are fine, but must not alter control flow or outputs.
4. **Prefer CPU/float32 for numerical comparison when practical.** Float32 avoids bfloat16 precision noise that obscures real bugs. But for large models (22B+), GPU/bfloat16 with `enable_model_cpu_offload()` is necessary -- use relaxed tolerances and cosine similarity as a secondary metric.
5. **Test both fresh conversion AND saved model.** Fresh catches conversion logic bugs; saved catches stale/corrupted weights from previous runs.
6. **Diff configs before debugging.** Before investigating any divergence, dump and compare all config values. A 30-second config diff prevents hours of debugging based on wrong assumptions.
7. **Never modify cached/downloaded model configs directly.** Don't edit files in `~/.cache/huggingface/`. Instead, save to a local directory or open a PR on the upstream repo.
8. **Compare ALL loop inputs in the encode test.** The preloop checkpoint must capture every single tensor the transformer forward() will receive.

## Comparison utilities

```python
import torch


def compare_tensors(name: str, a: torch.Tensor, b: torch.Tensor, tol: float = 1e-3) -> bool:
    if a.shape != b.shape:
        print(f"  FAIL {name}: shape mismatch {a.shape} vs {b.shape}")
        return False

    diff = (a.float() - b.float()).abs()
    max_diff = diff.max().item()
    mean_diff = diff.mean().item()
    cos = torch.nn.functional.cosine_similarity(
        a.float().flatten().unsqueeze(0), b.float().flatten().unsqueeze(0)
    ).item()
    passed = max_diff < tol
    print(f"  {'PASS' if passed else 'FAIL'} {name}: max={max_diff:.2e}, mean={mean_diff:.2e}, cos={cos:.5f}")
    return passed
```

Cosine similarity is especially useful for GPU/bfloat16 tests where max_diff can be noisy -- `cos > 0.9999` is a strong signal even when max_diff exceeds tolerance.

## Gotchas

See [pitfalls.md](pitfalls.md) for the full list of gotchas to watch for during parity testing.

================================================
FILE: .ai/skills/parity-testing/checkpoint-mechanism.md
================================================
# Checkpoint Mechanism for Stage Testing

## Overview

Pipelines are monolithic `__call__` methods -- you can't just call "the encode part". The checkpoint mechanism lets you stop, save, or inject tensors at named locations inside the pipeline.

## The Checkpoint class

Add a `_checkpoints` argument to both the diffusers pipeline and the reference implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    save: bool = False   # capture variables into ckpt.data
    stop: bool = False   # halt pipeline after this point
    load: bool = False   # inject ckpt.data into local variables
    data: dict = field(default_factory=dict)
```

## Pipeline instrumentation

The pipeline accepts an optional `dict[str, Checkpoint]`. Place checkpoint calls at boundaries between pipeline stages -- after each encoder, before the denoising loop (capture all loop inputs), after each loop iteration, after the loop (capture final latents before decode).

```python
def __call__(self, prompt, ..., _checkpoints=None):
    # --- text encoding ---
    prompt_embeds = self.text_encoder(prompt)
    _maybe_checkpoint(_checkpoints, "text_encoding", {
        "prompt_embeds": prompt_embeds,
    })

    # --- prepare latents, sigmas, positions ---
    latents = self.prepare_latents(...)
    sigmas = self.scheduler.sigmas
    # ...

    _maybe_checkpoint(_checkpoints, "preloop", {
        "latents": latents,
        "sigmas": sigmas,
        "prompt_embeds": prompt_embeds,
        "prompt_attention_mask": prompt_attention_mask,
        "video_coords": video_coords,
        # capture EVERYTHING the loop needs -- every tensor the transformer
        # forward() receives. Missing even one variable here means you can't
        # tell if it's the source of divergence during denoise debugging.
    })

    # --- denoising loop ---
    for i, t in enumerate(timesteps):
        noise_pred = self.transformer(latents, t, prompt_embeds, ...)
        latents = self.scheduler.step(noise_pred, t, latents)[0]

        _maybe_checkpoint(_checkpoints, f"after_step_{i}", {
            "latents": latents,
        })

    _maybe_checkpoint(_checkpoints, "post_loop", {
        "latents": latents,
    })

    # --- decode ---
    video = self.vae.decode(latents)
    return video
```

## The helper function

Each `_maybe_checkpoint` call acts on the Checkpoint's flags: `save` captures the given variables into `ckpt.data`, and `stop` halts execution (raises an exception caught at the top level). The third flag, `load`, injects pre-populated `ckpt.data` back into local variables; a helper function cannot rebind its caller's locals, so `load` is handled at the call site (see "Injection support").

```python
class PipelineStop(Exception):
    """Raised to halt the pipeline at a checkpoint."""


def _maybe_checkpoint(checkpoints, name, data):
    if not checkpoints:
        return
    ckpt = checkpoints.get(name)
    if ckpt is None:
        return
    if ckpt.save:
        ckpt.data.update(data)
    if ckpt.stop:
        raise PipelineStop  # caught at __call__ level, returns None
```

## Injection support

Add `load` support at each checkpoint where you might want to inject:

```python
_maybe_checkpoint(_checkpoints, "preloop", {"latents": latents, ...})

# Load support: replace local variables with injected data
if _checkpoints:
    ckpt = _checkpoints.get("preloop")
    if ckpt is not None and ckpt.load:
        latents = ckpt.data["latents"].to(device=device, dtype=latents.dtype)
```

## Key insight

The checkpoint dict is passed into the pipeline and mutated in-place. After the pipeline returns (or stops early), you read back `ckpt.data` to get the captured tensors. Both pipelines save under their own key names, so the test maps between them (e.g. reference `"video_state.latent"` -> diffusers `"latents"`).

## Memory management for large models

For large models, free the source pipeline's GPU memory before loading the target pipeline. Clone injected tensors to CPU, delete everything else, then run the target with `enable_model_cpu_offload()`.

================================================
FILE: .ai/skills/parity-testing/pitfalls.md
================================================
# Complete Pitfalls Reference

## 1. Global CPU RNG

`MultivariateNormal.sample()` uses the global CPU RNG, not `torch.Generator`. Must call `torch.manual_seed(seed)` before each pipeline run. A `generator=` kwarg won't help.

## 2. Timestep dtype

Many transformers expect `int64` timesteps. `get_timestep_embedding` casts to float, so `745.3` and `745` produce different embeddings. Match the reference's casting.

## 3. Guidance parameter mapping

Parameter names may differ: reference `zero_steps=1` (meaning `i <= 1`, 2 steps) vs target `zero_init_steps=2` (meaning `step < 2`, same thing). Check exact semantics.

## 4. `patch_size` in noise generation

If noise generation depends on `patch_size` (e.g. `sample_block_noise`), it must be passed through. Missing it changes noise spatial structure.

## 5. Variable shadowing in nested loops

Nested loops (stages -> chunks -> timesteps) can shadow variable names. If outer loop uses `latents` and inner loop also assigns to `latents`, scoping must match the reference.

## 6. Float precision differences -- don't dismiss them

Target may compute in float32 where reference used bfloat16. Small per-element diffs (1e-3 to 1e-2) *look* harmless but can compound catastrophically over iterative processes like denoising loops (see Pitfalls #11 and #13). Before dismissing a precision difference: (a) check whether it feeds into an iterative process, (b) if so, trace the accumulation curve over all iterations to see if it stays bounded or grows exponentially. Only truly non-iterative precision diffs (e.g. in a single-pass encoder) are safe to accept.

## 7. Scheduler state reset between stages

Some schedulers accumulate state (e.g. `model_outputs` in UniPC) that must be cleared between stages.

## 8. Component access

Standard: `self.transformer`. Modular: `components.transformer`. Using the wrong one raises `AttributeError`.

## 9. Guider state across stages

In multi-stage denoising, the guider's internal state (e.g. `zero_init_steps`) may need save/restore between stages.

## 10. Model storage location

NEVER store converted models in `/tmp/` -- temporary directories get wiped on restart. Always save converted checkpoints under a persistent path in the project repo (e.g. `models/ltx23-diffusers/`).

## 11. Noise dtype mismatch (causes washed-out output)

Reference code often generates noise in float32 then casts to model dtype (bfloat16) before storing:

```python
noise = torch.randn(..., dtype=torch.float32, generator=gen)
noise = noise.to(dtype=model_dtype)  # bfloat16 -- values get quantized
```

Diffusers pipelines may keep latents in float32 throughout the loop. The per-element difference is only ~1.5e-02, but this compounds over 30 denoising steps via 1/sigma amplification (Pitfall #13) and produces completely washed-out output.

**Fix**: Match the reference -- generate noise in the model's working dtype:

```python
latent_dtype = self.transformer.dtype  # e.g. bfloat16
latents = self.prepare_latents(..., dtype=latent_dtype, ...)
```

**Detection**: Encode stage test shows an initial latent max_diff of ~1.5e-02. This specific magnitude is the signature of float32->bfloat16 quantization error.

## 12. RoPE position dtype

RoPE cosine/sine values are sensitive to position coordinate dtype. If reference uses bfloat16 positions but diffusers uses float32, the RoPE output diverges significantly (max_diff up to 2.0). Different modalities may use different position dtypes (e.g. video bfloat16, audio float32) -- check the reference carefully.

## 13. 1/sigma error amplification in Euler denoising

In Euler/flow-matching, the velocity formula divides by sigma: `v = (latents - pred_x0) / sigma`. As sigma shrinks from ~1.0 (step 0) to ~0.001 (step 29), errors are amplified up to 1000x. A 1.5e-02 init difference grows linearly through mid-steps, then exponentially in final steps, reaching max_diff ~6.0. This is why dtype mismatches (Pitfalls #11, #12) that seem tiny at init produce visually broken output. Use per-step accumulation tracing to diagnose.

## 14. Config value assumptions -- always diff, never assume

When debugging parity, don't assume config values match code defaults. The published model checkpoint may override defaults with different values. A wrong assumption about a single config field can send you down hours of debugging in the wrong direction.

**The pattern that goes wrong:**

1. You see `param_x` has default `1` in the code
2. The reference code also uses `param_x` with a default of `1`
3. You assume both sides use `1` and apply a "fix" based on that
4. But the actual checkpoint config has `param_x: 1000`, and so does the published diffusers config
5. Your "fix" now *creates* divergence instead of fixing it

**Prevention -- config diff first:**

```python
# Reference: read from checkpoint metadata (no model loading needed)
import json

from safetensors import safe_open

with safe_open(checkpoint_path, framework="pt") as f:
    ref_config = json.loads(f.metadata()["config"])

# Diffusers: read from model config
from diffusers import MyModel

diff_model = MyModel.from_pretrained(model_path, subfolder="transformer")
diff_config = dict(diff_model.config)

# Compare all values
ref_transformer_config = ref_config.get("transformer", {})
for key in sorted(set(ref_transformer_config) | set(diff_config)):
    ref_val = ref_transformer_config.get(key, "MISSING")
    diff_val = diff_config.get(key, "MISSING")
    if ref_val != diff_val:
        print(f"  DIFF {key}: ref={ref_val}, diff={diff_val}")
```

Run this **before** writing any hooks, analysis code, or fixes. It takes 30 seconds and catches wrong assumptions immediately.

**When debugging divergence -- trace values, don't reason about them:**
If two implementations diverge, hook the actual intermediate values at the point of divergence rather than reading code to figure out what the values "should" be. Code analysis builds on assumptions; value tracing reveals facts.

## 15. Decoder config mismatch (causes pixelated artifacts)

The upstream model config may have wrong values for decoder-specific parameters (e.g. `upsample_residual`, `upsample_type`). These control whether the decoder uses skip connections in upsampling -- getting them wrong produces severe pixelation or blocky artifacts.

**Detection**: Feed identical post-loop latents through both decoders. If max pixel diff is large (PSNR < 40 dB) on CPU/float32, it's a real bug, not precision noise. Trace through decoder blocks (conv_in -> mid_block -> up_blocks) to find where divergence starts.

**Fix**: Correct the config value. Don't edit cached files in `~/.cache/huggingface/` -- either save to a local model directory or open a PR on the upstream repo (see Testing Rule #7).

## 16. Incomplete injection tests -- inject ALL variables or the test is invalid

When doing injection tests (feeding reference tensors into the diffusers pipeline), you must inject **every** divergent input, including sigmas/timesteps. A common mistake: the preloop checkpoint saves sigmas but the injection code only loads latents and embeddings. The test then runs with different sigma schedules, making it impossible to isolate the real cause.

**Prevention**: After writing injection code, verify by listing every variable the injected stage consumes and checking each one is either (a) injected from reference, or (b) confirmed identical between pipelines.

## 17. bf16 connector/encoder divergence -- don't chase it

When running on GPU/bfloat16, multi-layer encoders (e.g. 8-layer connector transformers) accumulate bf16 rounding noise that looks alarming (max_diff 0.3-2.7). Before investigating, re-run the component test on CPU/float32. If it passes (max_diff < 1e-4), the divergence is pure precision noise, not a code bug. Don't spend hours tracing through layers -- confirm on CPU/float32 and move on.

## 18. Stale test fixtures

When using saved tensors for cross-pipeline comparison, always ensure both sets of tensors were captured from the same run configuration (same seed, same config, same code version). Mixing fixtures from different runs (e.g. reference tensors from yesterday, diffusers tensors from today after a code change) creates phantom divergence that wastes debugging time. Regenerate both sides in a single test script execution.
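The ~1.5e-02 signature quoted in Pitfall #11 can be reproduced without a GPU or any model. The sketch below emulates float32 -> bfloat16 rounding with the standard library only; `to_bfloat16` is a hypothetical helper written for this illustration (it is not a torch or diffusers function), and the choice of the [4, 8) range assumes the noise tensor is large enough for its extreme samples to land there.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Round a float to the nearest bfloat16 value (ties to even). NaN/inf are not handled."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits of a float32; round on the 16 dropped bits.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Sweep a fine grid over [4, 8). Every grid point is exactly representable in float32.
samples = [4.0 + k / 4096 for k in range(4 * 4096)]
max_err = max(abs(x - to_bfloat16(x)) for x in samples)
print(f"max quantization error in [4, 8): {max_err:.4e}")
```

In [4, 8) adjacent bfloat16 values are 2^-5 apart, so the worst-case rounding error is half of that, 2^-6 = 0.015625. Smaller magnitudes quantize more finely, which is why the observed max_diff is set by the largest values in the tensor.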
================================================ FILE: .github/ISSUE_TEMPLATE/bug-report.yml ================================================ name: "\U0001F41B Bug Report" description: Report a bug on Diffusers labels: [ "bug" ] body: - type: markdown attributes: value: | Thanks a lot for taking the time to file this issue 🤗. Issues do not only help to improve the library, but also publicly document common problems, questions, and workflows for the whole community! Thus, issues are of the same importance as pull requests when contributing to this library ❤️. In order to make your issue as **useful for the community as possible**, let's try to stick to some simple guidelines: - 1. Please try to be as precise and concise as possible. *Give your issue a fitting title. Assume that someone with very limited knowledge of Diffusers can understand your issue. Add links to the source code, documentation, other issues, pull requests, etc.* - 2. If your issue is about something not working, **always** provide a reproducible code snippet. The reader should be able to reproduce your issue by **only copy-pasting your code snippet into a Python shell**. *The community cannot solve your issue if it cannot reproduce it. If your bug is related to training, add your training script and make everything needed to train public. Otherwise, just add a simple Python code snippet.* - 3. Add the **minimum** amount of code / context that is needed to understand, reproduce your issue. *Make the life of maintainers easy. `diffusers` is getting many issues every day. Make sure your issue is about one bug and one bug only. Make sure you add only the context and code needed to understand your issue - nothing more. Generally, every issue is a way of documenting this library; try to make it a good documentation entry.* - 4.
For issues related to community pipelines (i.e., the pipelines located in the `examples/community` folder), please tag the author of the pipeline in your issue thread as those pipelines are not maintained. - type: markdown attributes: value: | For more in-detail information on how to write good issues you can have a look [here](https://huggingface.co/course/chapter8/5?fw=pt). - type: textarea id: bug-description attributes: label: Describe the bug description: A clear and concise description of what the bug is. If you intend to submit a pull request for this issue, tell us in the description. Thanks! placeholder: Bug description validations: required: true - type: textarea id: reproduction attributes: label: Reproduction description: Please provide a minimal reproducible code which we can copy/paste and reproduce the issue. placeholder: Reproduction validations: required: true - type: textarea id: logs attributes: label: Logs description: "Please include the Python logs if you can." render: shell - type: textarea id: system-info attributes: label: System Info description: Please share your system info with us. You can run the command `diffusers-cli env` and copy-paste its output below. placeholder: Diffusers version, platform, Python version, ... validations: required: true - type: textarea id: who-can-help attributes: label: Who can help? description: | Your issue will be replied to more quickly if you can figure out the right person to tag with @. If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of **who to tag**. All issues are read by one of the core maintainers, so if you don't know who to tag, just leave this blank and a core maintainer will ping the right person. Please tag a maximum of 2 people. 
Questions on DiffusionPipeline (Saving, Loading, From pretrained, ...): @sayakpaul @DN6 Questions on pipelines: - Stable Diffusion @yiyixuxu @asomoza - Stable Diffusion XL @yiyixuxu @sayakpaul @DN6 - Stable Diffusion 3: @yiyixuxu @sayakpaul @DN6 @asomoza - Kandinsky @yiyixuxu - ControlNet @sayakpaul @yiyixuxu @DN6 - T2I Adapter @sayakpaul @yiyixuxu @DN6 - IF @DN6 - Text-to-Video / Video-to-Video @DN6 @a-r-r-o-w - Wuerstchen @DN6 - Other: @yiyixuxu @DN6 - Improving generation quality: @asomoza Questions on models: - UNet @DN6 @yiyixuxu @sayakpaul - VAE @sayakpaul @DN6 @yiyixuxu - Transformers/Attention @DN6 @yiyixuxu @sayakpaul Questions on single file checkpoints: @DN6 Questions on Schedulers: @yiyixuxu Questions on LoRA: @sayakpaul Questions on Textual Inversion: @sayakpaul Questions on Training: - DreamBooth @sayakpaul - Text-to-Image Fine-tuning @sayakpaul - Textual Inversion @sayakpaul - ControlNet @sayakpaul Questions on Tests: @DN6 @sayakpaul @yiyixuxu Questions on Documentation: @stevhliu Questions on JAX- and MPS-related things: @pcuenca Questions on audio pipelines: @sanchit-gandhi placeholder: "@Username ..." ================================================ FILE: .github/ISSUE_TEMPLATE/config.yml ================================================ contact_links: - name: Questions / Discussions url: https://github.com/huggingface/diffusers/discussions about: General usage questions and community discussions ================================================ FILE: .github/ISSUE_TEMPLATE/feature_request.md ================================================ --- name: "\U0001F680 Feature Request" about: Suggest an idea for this project title: '' labels: '' assignees: '' --- **Is your feature request related to a problem? Please describe.** A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]. **Describe the solution you'd like.** A clear and concise description of what you want to happen. 
**Describe alternatives you've considered.** A clear and concise description of any alternative solutions or features you've considered. **Additional context.** Add any other context or screenshots about the feature request here. ================================================ FILE: .github/ISSUE_TEMPLATE/feedback.md ================================================ --- name: "💬 Feedback about API Design" about: Give feedback about the current API design title: '' labels: '' assignees: '' --- **What API design would you like to have changed or added to the library? Why?** **What use case would this enable or better enable? Can you give us a code example?** ================================================ FILE: .github/ISSUE_TEMPLATE/new-model-addition.yml ================================================ name: "\U0001F31F New Model/Pipeline/Scheduler Addition" description: Submit a proposal/request to implement a new diffusion model/pipeline/scheduler labels: [ "New model/pipeline/scheduler" ] body: - type: textarea id: description-request validations: required: true attributes: label: Model/Pipeline/Scheduler description description: | Put any and all important information relative to the model/pipeline/scheduler - type: checkboxes id: information-tasks attributes: label: Open source status description: | Please note that if the model implementation isn't available or if the weights aren't open-source, we are less likely to implement it in `diffusers`. options: - label: "The model implementation is available." - label: "The model weights are available (Only relevant if addition is not a scheduler)." - type: textarea id: additional-info attributes: label: Provide useful links for the implementation description: | Please provide information regarding the implementation, the weights, and the authors. Please mention the authors by @gh-username if you're aware of their usernames. 
================================================ FILE: .github/ISSUE_TEMPLATE/remote-vae-pilot-feedback.yml ================================================ name: "\U0001F31F Remote VAE" description: Feedback for remote VAE pilot labels: [ "Remote VAE" ] body: - type: textarea id: positive validations: required: true attributes: label: Did you like the remote VAE solution? description: | If you liked it, we would appreciate it if you could elaborate on what you liked. - type: textarea id: feedback validations: required: true attributes: label: What can be improved about the current solution? description: | Let us know the things you would like to see improved. Note that we will work on optimizing the solution once the pilot is over and we have usage data. - type: textarea id: others validations: required: true attributes: label: What other VAEs would you like to see if the pilot goes well? description: | Provide a list of the VAEs you would like to see in the future if the pilot goes well. - type: textarea id: additional-info attributes: label: Notify the members of the team description: | Tag the following folks when submitting this feedback: @hlky @sayakpaul ================================================ FILE: .github/ISSUE_TEMPLATE/translate.md ================================================ --- name: 🌐 Translating a New Language? about: Start a new translation effort in your language title: '[] Translating docs to ' labels: WIP assignees: '' --- Hi! Let's bring the documentation to all the -speaking community 🌐. Who would want to translate? Please follow the 🤗 [TRANSLATING guide](https://github.com/huggingface/diffusers/blob/main/docs/TRANSLATING.md). Here is a list of the files ready for translation. Let us know in this issue if you'd like to translate any, and we'll add your name to the list. Some notes: * Please translate using an informal tone (imagine you are talking with a friend about Diffusers 🤗). * Please translate in a gender-neutral way. 
* Add your translations to the folder called `` inside the [source folder](https://github.com/huggingface/diffusers/tree/main/docs/source). * Register your translation in `/_toctree.yml`; please follow the order of the [English version](https://github.com/huggingface/diffusers/blob/main/docs/source/en/_toctree.yml). * Once you're finished, open a pull request and tag this issue by including #issue-number in the description, where issue-number is the number of this issue. Please ping @stevhliu for review. * 🙋 If you'd like others to help you with the translation, you can also post in the 🤗 [forums](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63). Thank you so much for your help! 🤗 ================================================ FILE: .github/PULL_REQUEST_TEMPLATE.md ================================================ # What does this PR do? Fixes # (issue) ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/diffusers/blob/main/CONTRIBUTING.md)? - [ ] Did you read our [philosophy doc](https://github.com/huggingface/diffusers/blob/main/PHILOSOPHY.md) (important for complex PRs)? - [ ] Was this discussed/approved via a GitHub issue or the [forum](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/diffusers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/diffusers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? ## Who can review? Anyone in the community is free to review the PR once the tests have passed. 
Feel free to tag members/contributors who may be interested in your PR. ================================================ FILE: .github/actions/setup-miniconda/action.yml ================================================ name: Set up conda environment for testing description: Sets up miniconda in your ${RUNNER_TEMP} environment and gives you the ${CONDA_RUN} environment variable so you don't have to worry about polluting non-ephemeral runners anymore inputs: python-version: description: Python version to install in the conda environment required: false type: string default: "3.9" miniconda-version: description: Miniconda version to install required: false type: string default: "4.12.0" environment-file: description: Environment file to install dependencies from required: false type: string default: "" runs: using: composite steps: # Use the same trick from https://github.com/marketplace/actions/setup-miniconda # to refresh the cache daily. This is kind of optional though - name: Get date id: get-date shell: bash run: echo "today=$(/bin/date -u '+%Y%m%d')d" >> $GITHUB_OUTPUT - name: Setup miniconda cache id: miniconda-cache uses: actions/cache@v2 with: path: ${{ runner.temp }}/miniconda key: miniconda-${{ runner.os }}-${{ runner.arch }}-${{ inputs.python-version }}-${{ steps.get-date.outputs.today }} - name: Install miniconda (${{ inputs.miniconda-version }}) if: steps.miniconda-cache.outputs.cache-hit != 'true' env: MINICONDA_VERSION: ${{ inputs.miniconda-version }} shell: bash -l {0} run: | MINICONDA_INSTALL_PATH="${RUNNER_TEMP}/miniconda" mkdir -p "${MINICONDA_INSTALL_PATH}" case ${RUNNER_OS}-${RUNNER_ARCH} in Linux-X64) MINICONDA_ARCH="Linux-x86_64" ;; macOS-ARM64) MINICONDA_ARCH="MacOSX-arm64" ;; macOS-X64) MINICONDA_ARCH="MacOSX-x86_64" ;; *) echo "::error::Platform ${RUNNER_OS}-${RUNNER_ARCH} currently unsupported using this action" exit 1 ;; esac 
MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-py39_${MINICONDA_VERSION}-${MINICONDA_ARCH}.sh" curl -fsSL "${MINICONDA_URL}" -o "${MINICONDA_INSTALL_PATH}/miniconda.sh" bash "${MINICONDA_INSTALL_PATH}/miniconda.sh" -b -u -p "${MINICONDA_INSTALL_PATH}" rm -rf "${MINICONDA_INSTALL_PATH}/miniconda.sh" - name: Update GitHub path to include miniconda install shell: bash run: | MINICONDA_INSTALL_PATH="${RUNNER_TEMP}/miniconda" echo "${MINICONDA_INSTALL_PATH}/bin" >> $GITHUB_PATH - name: Setup miniconda env cache (with env file) id: miniconda-env-cache-env-file if: ${{ runner.os == 'macOS' && inputs.environment-file != '' }} uses: actions/cache@v2 with: path: ${{ runner.temp }}/conda-python-${{ inputs.python-version }} key: miniconda-env-${{ runner.os }}-${{ runner.arch }}-${{ inputs.python-version }}-${{ steps.get-date.outputs.today }}-${{ hashFiles(inputs.environment-file) }} - name: Setup miniconda env cache (without env file) id: miniconda-env-cache if: ${{ runner.os == 'macOS' && inputs.environment-file == '' }} uses: actions/cache@v2 with: path: ${{ runner.temp }}/conda-python-${{ inputs.python-version }} key: miniconda-env-${{ runner.os }}-${{ runner.arch }}-${{ inputs.python-version }}-${{ steps.get-date.outputs.today }} - name: Setup conda environment with python (v${{ inputs.python-version }}) if: steps.miniconda-env-cache-env-file.outputs.cache-hit != 'true' && steps.miniconda-env-cache.outputs.cache-hit != 'true' shell: bash env: PYTHON_VERSION: ${{ inputs.python-version }} ENV_FILE: ${{ inputs.environment-file }} run: | CONDA_BASE_ENV="${RUNNER_TEMP}/conda-python-${PYTHON_VERSION}" ENV_FILE_FLAG="" if [[ -f "${ENV_FILE}" ]]; then ENV_FILE_FLAG="--file ${ENV_FILE}" elif [[ -n "${ENV_FILE}" ]]; then echo "::warning::Specified env file (${ENV_FILE}) not found, not going to include it" fi conda create \ --yes \ --prefix "${CONDA_BASE_ENV}" \ "python=${PYTHON_VERSION}" \ ${ENV_FILE_FLAG} \ cmake=3.22 \ conda-build=3.21 \ ninja=1.10 \ 
pkg-config=0.29 \ wheel=0.37 - name: Clone the base conda environment and update GitHub env shell: bash env: PYTHON_VERSION: ${{ inputs.python-version }} CONDA_BASE_ENV: ${{ runner.temp }}/conda-python-${{ inputs.python-version }} run: | CONDA_ENV="${RUNNER_TEMP}/conda_environment_${GITHUB_RUN_ID}" conda create \ --yes \ --prefix "${CONDA_ENV}" \ --clone "${CONDA_BASE_ENV}" # TODO: conda-build could not be cloned because it hardcodes the path, so it # could not be cached conda install --yes -p ${CONDA_ENV} conda-build=3.21 echo "CONDA_ENV=${CONDA_ENV}" >> "${GITHUB_ENV}" echo "CONDA_RUN=conda run -p ${CONDA_ENV} --no-capture-output" >> "${GITHUB_ENV}" echo "CONDA_BUILD=conda run -p ${CONDA_ENV} conda-build" >> "${GITHUB_ENV}" echo "CONDA_INSTALL=conda install -p ${CONDA_ENV}" >> "${GITHUB_ENV}" - name: Get disk space usage and throw an error for low disk space shell: bash run: | echo "Print the available disk space for manual inspection" df -h # Set the minimum requirement space to 4GB MINIMUM_AVAILABLE_SPACE_IN_GB=4 MINIMUM_AVAILABLE_SPACE_IN_KB=$(($MINIMUM_AVAILABLE_SPACE_IN_GB * 1024 * 1024)) # Use KB to avoid floating point warning like 3.1GB df -k | tr -s ' ' | cut -d' ' -f 4,9 | while read -r LINE; do AVAIL=$(echo $LINE | cut -f1 -d' ') MOUNT=$(echo $LINE | cut -f2 -d' ') if [ "$MOUNT" = "/" ]; then if [ "$AVAIL" -lt "$MINIMUM_AVAILABLE_SPACE_IN_KB" ]; then echo "There is only ${AVAIL}KB free space left in $MOUNT, which is less than the minimum requirement of ${MINIMUM_AVAILABLE_SPACE_IN_KB}KB. Please help create an issue to PyTorch Release Engineering via https://github.com/pytorch/test-infra/issues and provide the link to the workflow run." 
exit 1; else echo "There is ${AVAIL}KB free space left in $MOUNT, continue" fi fi done ================================================ FILE: .github/workflows/benchmark.yml ================================================ name: Benchmarking tests on: workflow_dispatch: schedule: - cron: "30 1 1,15 * *" # every 2 weeks on the 1st and the 15th of every month at 1:30 AM env: DIFFUSERS_IS_CI: yes HF_XET_HIGH_PERFORMANCE: 1 HF_HOME: /mnt/cache OMP_NUM_THREADS: 8 MKL_NUM_THREADS: 8 BASE_PATH: benchmark_outputs jobs: torch_models_cuda_benchmark_tests: env: SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL_BENCHMARK }} name: Torch Core Models CUDA Benchmarking Tests strategy: fail-fast: false max-parallel: 1 runs-on: group: aws-g6e-4xlarge container: image: diffusers/diffusers-pytorch-cuda options: --shm-size "16gb" --ipc host --gpus all steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: NVIDIA-SMI run: | nvidia-smi - name: Install dependencies run: | apt update apt install -y libpq-dev postgresql-client uv pip install -e ".[quality]" uv pip install -r benchmarks/requirements.txt - name: Environment run: | python utils/print_env.py - name: Diffusers Benchmarking env: HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} run: | cd benchmarks && python run_all.py - name: Push results to the Hub env: HF_TOKEN: ${{ secrets.DIFFUSERS_BOT_TOKEN }} run: | cd benchmarks && python push_results.py mkdir $BASE_PATH && cp *.csv $BASE_PATH - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: benchmark_test_reports path: benchmarks/${{ env.BASE_PATH }} - name: Report success status if: ${{ success() }} run: | pip install requests && python utils/notify_benchmarking_status.py --status=success - name: Report failure status if: ${{ failure() }} run: | pip install requests && python utils/notify_benchmarking_status.py --status=failure ================================================ FILE: 
.github/workflows/build_docker_images.yml ================================================ name: Test, build, and push Docker images on: pull_request: # During PRs, we just check if the changed Dockerfiles can be successfully built branches: - main paths: - "docker/**" workflow_dispatch: schedule: - cron: "0 0 * * *" # every day at midnight concurrency: group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }} cancel-in-progress: true env: REGISTRY: diffusers CI_SLACK_CHANNEL: ${{ secrets.CI_DOCKER_CHANNEL }} jobs: test-build-docker-images: runs-on: group: aws-general-8-plus if: github.event_name == 'pull_request' steps: - name: Set up Docker Buildx uses: docker/setup-buildx-action@v3 - name: Check out code uses: actions/checkout@v6 - name: Find Changed Dockerfiles id: file_changes uses: jitterbit/get-changed-files@v1 with: format: "space-delimited" token: ${{ secrets.GITHUB_TOKEN }} - name: Build Changed Docker Images env: CHANGED_FILES: ${{ steps.file_changes.outputs.all }} run: | echo "$CHANGED_FILES" ALLOWED_IMAGES=( diffusers-pytorch-cpu diffusers-pytorch-cuda diffusers-pytorch-xformers-cuda diffusers-pytorch-minimum-cuda diffusers-doc-builder ) declare -A IMAGES_TO_BUILD=() for FILE in $CHANGED_FILES; do # skip anything that isn't still on disk if [[ ! -e "$FILE" ]]; then echo "Skipping removed file $FILE" continue fi for IMAGE in "${ALLOWED_IMAGES[@]}"; do if [[ "$FILE" == docker/${IMAGE}/* ]]; then IMAGES_TO_BUILD["$IMAGE"]=1 fi done done if [[ ${#IMAGES_TO_BUILD[@]} -eq 0 ]]; then echo "No relevant Docker changes detected." 
exit 0 fi for IMAGE in "${!IMAGES_TO_BUILD[@]}"; do DOCKER_PATH="docker/${IMAGE}" echo "Building Docker image for $IMAGE" docker build -t "$IMAGE" "$DOCKER_PATH" done if: steps.file_changes.outputs.all != '' build-and-push-docker-images: runs-on: group: aws-general-8-plus if: github.event_name != 'pull_request' permissions: contents: read packages: write strategy: fail-fast: false matrix: image-name: - diffusers-pytorch-cpu - diffusers-pytorch-cuda - diffusers-pytorch-xformers-cuda - diffusers-pytorch-minimum-cuda - diffusers-doc-builder steps: - name: Checkout repository uses: actions/checkout@v6 - name: Set up Docker Buildx uses: docker/setup-buildx-action@v3 - name: Login to Docker Hub uses: docker/login-action@v3 with: username: ${{ env.REGISTRY }} password: ${{ secrets.DOCKERHUB_TOKEN }} - name: Build and push uses: docker/build-push-action@v6 with: no-cache: true context: ./docker/${{ matrix.image-name }} push: true tags: ${{ env.REGISTRY }}/${{ matrix.image-name }}:latest - name: Post to a Slack channel id: slack uses: huggingface/hf-workflows/.github/actions/post-slack@main with: # Slack channel id, channel name, or user id to post message. 
# See also: https://api.slack.com/methods/chat.postMessage#channels slack_channel: ${{ env.CI_SLACK_CHANNEL }} title: "🤗 Results of the ${{ matrix.image-name }} Docker Image build" status: ${{ job.status }} slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }} ================================================ FILE: .github/workflows/build_documentation.yml ================================================ name: Build documentation on: push: branches: - main - doc-builder* - v*-release - v*-patch paths: - "src/diffusers/**.py" - "examples/**" - "docs/**" jobs: build: uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main with: commit_sha: ${{ github.sha }} install_libgl1: true package: diffusers notebook_folder: diffusers_doc languages: en ko zh ja pt custom_container: diffusers/diffusers-doc-builder secrets: token: ${{ secrets.HUGGINGFACE_PUSH }} hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }} ================================================ FILE: .github/workflows/build_pr_documentation.yml ================================================ name: Build PR Documentation on: pull_request: paths: - "src/diffusers/**.py" - "examples/**" - "docs/**" concurrency: group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }} cancel-in-progress: true jobs: check-links: runs-on: ubuntu-latest steps: - name: Checkout repository uses: actions/checkout@v6 - name: Set up Python uses: actions/setup-python@v6 with: python-version: '3.10' - name: Install uv run: | curl -LsSf https://astral.sh/uv/install.sh | sh echo "$HOME/.cargo/bin" >> $GITHUB_PATH - name: Install doc-builder run: | uv pip install --system git+https://github.com/huggingface/doc-builder.git@main - name: Check documentation links run: | uv run doc-builder check-links docs/source/en build: needs: check-links uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main with: commit_sha: ${{ github.event.pull_request.head.sha }} pr_number: ${{ github.event.number }} 
install_libgl1: true package: diffusers languages: en ko zh ja pt custom_container: diffusers/diffusers-doc-builder ================================================ FILE: .github/workflows/codeql.yml ================================================ --- name: CodeQL Security Analysis For Github Actions on: push: branches: ["main"] workflow_dispatch: # pull_request: jobs: codeql: name: CodeQL Analysis uses: huggingface/security-workflows/.github/workflows/codeql-reusable.yml@v1 permissions: security-events: write packages: read actions: read contents: read with: languages: '["actions","python"]' queries: 'security-extended,security-and-quality' runner: 'ubuntu-latest' #optional if need custom runner ================================================ FILE: .github/workflows/mirror_community_pipeline.yml ================================================ name: Mirror Community Pipeline on: # Push changes on the main branch push: branches: - main paths: - 'examples/community/**.py' # And on tag creation (e.g. `v0.28.1`) tags: - '*' # Manual trigger with ref input workflow_dispatch: inputs: ref: description: "Either 'main' or a tag ref" required: true default: 'main' jobs: mirror_community_pipeline: env: SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL_COMMUNITY_MIRROR }} runs-on: ubuntu-22.04 steps: # Checkout to correct ref # If workflow dispatch # If ref is 'main', set: # CHECKOUT_REF=refs/heads/main # PATH_IN_REPO=main # Else it must be a tag. 
Set: # CHECKOUT_REF=refs/tags/{tag} # PATH_IN_REPO={tag} # If not workflow dispatch # If ref is 'refs/heads/main' => set 'main' # Else it must be a tag => set {tag} - name: Set checkout_ref and path_in_repo env: EVENT_NAME: ${{ github.event_name }} EVENT_INPUT_REF: ${{ github.event.inputs.ref }} GITHUB_REF: ${{ github.ref }} run: | if [ "$EVENT_NAME" == "workflow_dispatch" ]; then if [ -z "$EVENT_INPUT_REF" ]; then echo "Error: Missing ref input" exit 1 elif [ "$EVENT_INPUT_REF" == "main" ]; then echo "CHECKOUT_REF=refs/heads/main" >> $GITHUB_ENV echo "PATH_IN_REPO=main" >> $GITHUB_ENV else echo "CHECKOUT_REF=refs/tags/$EVENT_INPUT_REF" >> $GITHUB_ENV echo "PATH_IN_REPO=$EVENT_INPUT_REF" >> $GITHUB_ENV fi elif [ "$GITHUB_REF" == "refs/heads/main" ]; then echo "CHECKOUT_REF=$GITHUB_REF" >> $GITHUB_ENV echo "PATH_IN_REPO=main" >> $GITHUB_ENV else # e.g. refs/tags/v0.28.1 -> v0.28.1 echo "CHECKOUT_REF=$GITHUB_REF" >> $GITHUB_ENV echo "PATH_IN_REPO=$(echo $GITHUB_REF | sed 's/^refs\/tags\///')" >> $GITHUB_ENV fi - name: Print env vars run: | echo "CHECKOUT_REF: ${{ env.CHECKOUT_REF }}" echo "PATH_IN_REPO: ${{ env.PATH_IN_REPO }}" - uses: actions/checkout@v6 with: ref: ${{ env.CHECKOUT_REF }} # Setup + install dependencies - name: Set up Python uses: actions/setup-python@v6 with: python-version: "3.10" - name: Install dependencies run: | pip install --upgrade pip pip install --upgrade huggingface_hub # Check secret is set - name: whoami run: hf auth whoami env: HF_TOKEN: ${{ secrets.HF_TOKEN_MIRROR_COMMUNITY_PIPELINES }} # Push to HF! 
(under subfolder based on checkout ref) # https://huggingface.co/datasets/diffusers/community-pipelines-mirror - name: Mirror community pipeline to HF run: hf upload diffusers/community-pipelines-mirror ./examples/community ${PATH_IN_REPO} --repo-type dataset env: PATH_IN_REPO: ${{ env.PATH_IN_REPO }} HF_TOKEN: ${{ secrets.HF_TOKEN_MIRROR_COMMUNITY_PIPELINES }} - name: Report success status if: ${{ success() }} run: | pip install requests && python utils/notify_community_pipelines_mirror.py --status=success - name: Report failure status if: ${{ failure() }} run: | pip install requests && python utils/notify_community_pipelines_mirror.py --status=failure ================================================ FILE: .github/workflows/nightly_tests.yml ================================================ name: Nightly and release tests on main/release branch on: workflow_dispatch: schedule: - cron: "0 0 * * *" # every day at midnight env: DIFFUSERS_IS_CI: yes HF_XET_HIGH_PERFORMANCE: 1 OMP_NUM_THREADS: 8 MKL_NUM_THREADS: 8 PYTEST_TIMEOUT: 600 RUN_SLOW: yes RUN_NIGHTLY: yes PIPELINE_USAGE_CUTOFF: 0 SLACK_API_TOKEN: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }} CONSOLIDATED_REPORT_PATH: consolidated_test_report.md jobs: setup_torch_cuda_pipeline_matrix: name: Setup Torch Pipelines CUDA Slow Tests Matrix runs-on: group: aws-general-8-plus container: image: diffusers/diffusers-pytorch-cpu outputs: pipeline_test_matrix: ${{ steps.fetch_pipeline_matrix.outputs.pipeline_test_matrix }} steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: Install dependencies run: | pip install -e .[test] pip install huggingface_hub - name: Fetch Pipeline Matrix id: fetch_pipeline_matrix run: | matrix=$(python utils/fetch_torch_cuda_pipeline_test_matrix.py) echo $matrix echo "pipeline_test_matrix=$matrix" >> $GITHUB_OUTPUT - name: Pipeline Tests Artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: test-pipelines.json path: reports 
run_nightly_tests_for_torch_pipelines: name: Nightly Torch Pipelines CUDA Tests needs: setup_torch_cuda_pipeline_matrix strategy: fail-fast: false max-parallel: 8 matrix: module: ${{ fromJson(needs.setup_torch_cuda_pipeline_matrix.outputs.pipeline_test_matrix) }} runs-on: group: aws-g4dn-2xlarge container: image: diffusers/diffusers-pytorch-cuda options: --shm-size "16gb" --ipc host --gpus all steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: NVIDIA-SMI run: nvidia-smi - name: Install dependencies run: | uv pip install -e ".[quality]" uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git #uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git uv pip uninstall transformers huggingface_hub && uv pip install transformers==4.57.1 uv pip install pytest-reportlog - name: Environment run: | python utils/print_env.py - name: Pipeline CUDA Test env: HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms CUBLAS_WORKSPACE_CONFIG: :16:8 run: | pytest -n 1 --max-worker-restart=0 --dist=loadfile \ -k "not Flax and not Onnx" \ --make-reports=tests_pipeline_${{ matrix.module }}_cuda \ --report-log=tests_pipeline_${{ matrix.module }}_cuda.log \ tests/pipelines/${{ matrix.module }} - name: Failure short reports if: ${{ failure() }} run: | cat reports/tests_pipeline_${{ matrix.module }}_cuda_stats.txt cat reports/tests_pipeline_${{ matrix.module }}_cuda_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: pipeline_${{ matrix.module }}_test_reports path: reports run_nightly_tests_for_other_torch_modules: name: Nightly Torch CUDA Tests runs-on: group: aws-g4dn-2xlarge container: image: diffusers/diffusers-pytorch-cuda options: --shm-size 
"16gb" --ipc host --gpus all defaults: run: shell: bash strategy: fail-fast: false max-parallel: 2 matrix: module: [models, schedulers, lora, others, single_file, examples] steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: Install dependencies run: | uv pip install -e ".[quality]" uv pip install peft@git+https://github.com/huggingface/peft.git uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git #uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git uv pip uninstall transformers huggingface_hub && uv pip install transformers==4.57.1 uv pip install pytest-reportlog - name: Environment run: python utils/print_env.py - name: Run nightly PyTorch CUDA tests for non-pipeline modules if: ${{ matrix.module != 'examples'}} env: HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms CUBLAS_WORKSPACE_CONFIG: :16:8 run: | pytest -n 1 --max-worker-restart=0 --dist=loadfile \ -k "not Flax and not Onnx" \ --make-reports=tests_torch_${{ matrix.module }}_cuda \ --report-log=tests_torch_${{ matrix.module }}_cuda.log \ tests/${{ matrix.module }} - name: Run nightly example tests with Torch if: ${{ matrix.module == 'examples' }} env: HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms CUBLAS_WORKSPACE_CONFIG: :16:8 run: | pytest -n 1 --max-worker-restart=0 --dist=loadfile \ --make-reports=examples_torch_cuda \ --report-log=examples_torch_cuda.log \ examples/ - name: Failure short reports if: ${{ failure() }} run: | cat reports/tests_torch_${{ matrix.module }}_cuda_stats.txt cat reports/tests_torch_${{ matrix.module }}_cuda_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: 
actions/upload-artifact@v6 with: name: torch_${{ matrix.module }}_cuda_test_reports path: reports run_torch_compile_tests: name: PyTorch Compile CUDA tests runs-on: group: aws-g4dn-2xlarge container: image: diffusers/diffusers-pytorch-cuda options: --gpus all --shm-size "16gb" --ipc host steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: NVIDIA-SMI run: | nvidia-smi - name: Install dependencies run: | uv pip install -e ".[quality,training]" #uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git uv pip uninstall transformers huggingface_hub && uv pip install transformers==4.57.1 - name: Environment run: | python utils/print_env.py - name: Run torch compile tests on GPU env: HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} RUN_COMPILE: yes run: | pytest -n 1 --max-worker-restart=0 --dist=loadfile -k "compile" --make-reports=tests_torch_compile_cuda tests/ - name: Failure short reports if: ${{ failure() }} run: cat reports/tests_torch_compile_cuda_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: torch_compile_test_reports path: reports run_big_gpu_torch_tests: name: Torch tests on big GPU strategy: fail-fast: false max-parallel: 2 runs-on: group: aws-g6e-xlarge-plus container: image: diffusers/diffusers-pytorch-cuda options: --shm-size "16gb" --ipc host --gpus all steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: NVIDIA-SMI run: nvidia-smi - name: Install dependencies run: | uv pip install -e ".[quality]" uv pip install peft@git+https://github.com/huggingface/peft.git uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git #uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U 
transformers@git+https://github.com/huggingface/transformers.git uv pip uninstall transformers huggingface_hub && uv pip install transformers==4.57.1 uv pip install pytest-reportlog - name: Environment run: | python utils/print_env.py - name: Selected Torch CUDA Test on big GPU env: HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms CUBLAS_WORKSPACE_CONFIG: :16:8 BIG_GPU_MEMORY: 40 run: | pytest -n 1 --max-worker-restart=0 --dist=loadfile \ -m "big_accelerator" \ --make-reports=tests_big_gpu_torch_cuda \ --report-log=tests_big_gpu_torch_cuda.log \ tests/ - name: Failure short reports if: ${{ failure() }} run: | cat reports/tests_big_gpu_torch_cuda_stats.txt cat reports/tests_big_gpu_torch_cuda_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: torch_cuda_big_gpu_test_reports path: reports torch_minimum_version_cuda_tests: name: Torch Minimum Version CUDA Tests runs-on: group: aws-g4dn-2xlarge container: image: diffusers/diffusers-pytorch-minimum-cuda options: --shm-size "16gb" --ipc host --gpus all defaults: run: shell: bash steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: Install dependencies run: | uv pip install -e ".[quality]" uv pip install peft@git+https://github.com/huggingface/peft.git uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git #uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git uv pip uninstall transformers huggingface_hub && uv pip install transformers==4.57.1 - name: Environment run: | python utils/print_env.py - name: Run PyTorch CUDA tests env: HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms 
CUBLAS_WORKSPACE_CONFIG: :16:8 run: | pytest -n 1 --max-worker-restart=0 --dist=loadfile \ -k "not Flax and not Onnx" \ --make-reports=tests_torch_minimum_version_cuda \ tests/models/test_modeling_common.py \ tests/pipelines/test_pipelines_common.py \ tests/pipelines/test_pipeline_utils.py \ tests/pipelines/test_pipelines.py \ tests/pipelines/test_pipelines_auto.py \ tests/schedulers/test_schedulers.py \ tests/others - name: Failure short reports if: ${{ failure() }} run: | cat reports/tests_torch_minimum_version_cuda_stats.txt cat reports/tests_torch_minimum_version_cuda_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: torch_minimum_version_cuda_test_reports path: reports run_nightly_quantization_tests: name: Torch quantization nightly tests strategy: fail-fast: false max-parallel: 2 matrix: config: - backend: "bitsandbytes" test_location: "bnb" additional_deps: ["peft"] - backend: "gguf" test_location: "gguf" additional_deps: ["peft", "kernels"] - backend: "torchao" test_location: "torchao" additional_deps: [] - backend: "optimum_quanto" test_location: "quanto" additional_deps: [] - backend: "nvidia_modelopt" test_location: "modelopt" additional_deps: [] runs-on: group: aws-g6e-xlarge-plus container: image: diffusers/diffusers-pytorch-cuda options: --shm-size "20gb" --ipc host --gpus all steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: NVIDIA-SMI run: nvidia-smi - name: Install dependencies run: | uv pip install -e ".[quality]" uv pip install -U ${{ matrix.config.backend }} if [ "${{ join(matrix.config.additional_deps, ' ') }}" != "" ]; then uv pip install ${{ join(matrix.config.additional_deps, ' ') }} fi uv pip install pytest-reportlog #uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git uv pip uninstall transformers huggingface_hub && uv pip install 
transformers==4.57.1 - name: Environment run: | python utils/print_env.py - name: ${{ matrix.config.backend }} quantization tests on GPU env: HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms CUBLAS_WORKSPACE_CONFIG: :16:8 BIG_GPU_MEMORY: 40 run: | pytest -n 1 --max-worker-restart=0 --dist=loadfile \ --make-reports=tests_${{ matrix.config.backend }}_torch_cuda \ --report-log=tests_${{ matrix.config.backend }}_torch_cuda.log \ tests/quantization/${{ matrix.config.test_location }} - name: Failure short reports if: ${{ failure() }} run: | cat reports/tests_${{ matrix.config.backend }}_torch_cuda_stats.txt cat reports/tests_${{ matrix.config.backend }}_torch_cuda_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: torch_cuda_${{ matrix.config.backend }}_reports path: reports run_nightly_pipeline_level_quantization_tests: name: Torch quantization nightly tests strategy: fail-fast: false max-parallel: 2 runs-on: group: aws-g6e-xlarge-plus container: image: diffusers/diffusers-pytorch-cuda options: --shm-size "20gb" --ipc host --gpus all steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: NVIDIA-SMI run: nvidia-smi - name: Install dependencies run: | uv pip install -e ".[quality]" uv pip install -U bitsandbytes optimum_quanto #uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git uv pip uninstall transformers huggingface_hub && uv pip install transformers==4.57.1 uv pip install pytest-reportlog - name: Environment run: | python utils/print_env.py - name: Pipeline-level quantization tests on GPU env: HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms CUBLAS_WORKSPACE_CONFIG: :16:8 
BIG_GPU_MEMORY: 40 run: | pytest -n 1 --max-worker-restart=0 --dist=loadfile \ --make-reports=tests_pipeline_level_quant_torch_cuda \ --report-log=tests_pipeline_level_quant_torch_cuda.log \ tests/quantization/test_pipeline_level_quantization.py - name: Failure short reports if: ${{ failure() }} run: | cat reports/tests_pipeline_level_quant_torch_cuda_stats.txt cat reports/tests_pipeline_level_quant_torch_cuda_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: torch_cuda_pipeline_level_quant_reports path: reports generate_consolidated_report: name: Generate Consolidated Test Report needs: [ run_nightly_tests_for_torch_pipelines, run_nightly_tests_for_other_torch_modules, run_torch_compile_tests, run_big_gpu_torch_tests, run_nightly_quantization_tests, run_nightly_pipeline_level_quantization_tests, # run_nightly_onnx_tests, torch_minimum_version_cuda_tests, # run_flax_tpu_tests ] if: always() runs-on: group: aws-general-8-plus container: image: diffusers/diffusers-pytorch-cpu steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: Create reports directory run: mkdir -p combined_reports - name: Download all test reports uses: actions/download-artifact@v7 with: path: artifacts - name: Prepare reports run: | # Move all report files to a single directory for processing find artifacts -name "*.txt" -exec cp {} combined_reports/ \; - name: Install dependencies run: | pip install -e .[test] pip install slack_sdk tabulate - name: Generate consolidated report run: | python utils/consolidated_test_report.py \ --reports_dir combined_reports \ --output_file $CONSOLIDATED_REPORT_PATH \ --slack_channel_name diffusers-ci-nightly - name: Show consolidated report run: | cat $CONSOLIDATED_REPORT_PATH >> $GITHUB_STEP_SUMMARY - name: Upload consolidated report uses: actions/upload-artifact@v6 with: name: consolidated_test_report path: ${{ env.CONSOLIDATED_REPORT_PATH }} # M1 runner 
currently not well supported # TODO: (Dhruv) add these back when we setup better testing for Apple Silicon # run_nightly_tests_apple_m1: # name: Nightly PyTorch MPS tests on MacOS # runs-on: [ self-hosted, apple-m1 ] # if: github.event_name == 'schedule' # # steps: # - name: Checkout diffusers # uses: actions/checkout@v6 # with: # fetch-depth: 2 # # - name: Clean checkout # shell: arch -arch arm64 bash {0} # run: | # git clean -fxd # - name: Setup miniconda # uses: ./.github/actions/setup-miniconda # with: # python-version: 3.9 # # - name: Install dependencies # shell: arch -arch arm64 bash {0} # run: | # ${CONDA_RUN} pip install --upgrade pip uv # ${CONDA_RUN} uv pip install -e ".[quality]" # ${CONDA_RUN} uv pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu # ${CONDA_RUN} uv pip install accelerate@git+https://github.com/huggingface/accelerate # ${CONDA_RUN} uv pip install pytest-reportlog # - name: Environment # shell: arch -arch arm64 bash {0} # run: | # ${CONDA_RUN} python utils/print_env.py # - name: Run nightly PyTorch tests on M1 (MPS) # shell: arch -arch arm64 bash {0} # env: # HF_HOME: /System/Volumes/Data/mnt/cache # HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} # run: | # ${CONDA_RUN} pytest -n 1 --make-reports=tests_torch_mps \ # --report-log=tests_torch_mps.log \ # tests/ # - name: Failure short reports # if: ${{ failure() }} # run: cat reports/tests_torch_mps_failures_short.txt # # - name: Test suite reports artifacts # if: ${{ always() }} # uses: actions/upload-artifact@v6 # with: # name: torch_mps_test_reports # path: reports # # - name: Generate Report and Notify Channel # if: always() # run: | # pip install slack_sdk tabulate # python utils/log_reports.py >> $GITHUB_STEP_SUMMARY
================================================ FILE: .github/workflows/notify_slack_about_release.yml ================================================ name: Notify Slack about a release on: workflow_dispatch: release: types: [published] jobs: build: runs-on: ubuntu-22.04 steps: - uses: actions/checkout@v6 - name: Setup Python uses: actions/setup-python@v6 with: python-version: '3.10' - name: Notify Slack about the release env: SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }} run: pip install requests && python
utils/notify_slack_about_release.py ================================================ FILE: .github/workflows/pr_dependency_test.yml ================================================ name: Run dependency tests on: pull_request: branches: - main paths: - "src/diffusers/**.py" push: branches: - main concurrency: group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }} cancel-in-progress: true jobs: check_dependencies: runs-on: ubuntu-22.04 steps: - uses: actions/checkout@v6 - name: Set up Python uses: actions/setup-python@v6 with: python-version: "3.10" - name: Install dependencies run: | pip install -e . pip install pytest - name: Check for soft dependencies run: | pytest tests/others/test_dependencies.py ================================================ FILE: .github/workflows/pr_modular_tests.yml ================================================ name: Fast PR tests for Modular on: pull_request: branches: [main] paths: - "src/diffusers/modular_pipelines/**.py" - "src/diffusers/models/modeling_utils.py" - "src/diffusers/models/model_loading_utils.py" - "src/diffusers/pipelines/pipeline_utils.py" - "src/diffusers/pipeline_loading_utils.py" - "src/diffusers/loaders/lora_base.py" - "src/diffusers/loaders/lora_pipeline.py" - "src/diffusers/loaders/peft.py" - "tests/modular_pipelines/**.py" - ".github/**.yml" - "utils/**.py" - "setup.py" push: branches: - ci-* concurrency: group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }} cancel-in-progress: true env: DIFFUSERS_IS_CI: yes HF_XET_HIGH_PERFORMANCE: 1 OMP_NUM_THREADS: 4 MKL_NUM_THREADS: 4 PYTEST_TIMEOUT: 60 jobs: check_code_quality: runs-on: ubuntu-22.04 steps: - uses: actions/checkout@v6 - name: Set up Python uses: actions/setup-python@v6 with: python-version: "3.10" - name: Install dependencies run: | pip install --upgrade pip pip install .[quality] - name: Check quality run: make quality - name: Check if failure if: ${{ failure() }} run: | echo "Quality check failed. 
Please ensure the right dependency versions are installed with 'pip install -e .[quality]' and run 'make style && make quality'" >> $GITHUB_STEP_SUMMARY check_repository_consistency: needs: check_code_quality runs-on: ubuntu-22.04 steps: - uses: actions/checkout@v6 - name: Set up Python uses: actions/setup-python@v6 with: python-version: "3.10" - name: Install dependencies run: | pip install --upgrade pip pip install .[quality] - name: Check repo consistency run: | python utils/check_copies.py python utils/check_dummies.py python utils/check_support_list.py make deps_table_check_updated - name: Check if failure if: ${{ failure() }} run: | echo "Repo consistency check failed. Please ensure the right dependency versions are installed with 'pip install -e .[quality]' and run 'make fix-copies'" >> $GITHUB_STEP_SUMMARY check_auto_docs: runs-on: ubuntu-22.04 steps: - uses: actions/checkout@v6 - name: Set up Python uses: actions/setup-python@v6 with: python-version: "3.10" - name: Install dependencies run: | pip install --upgrade pip pip install .[quality] - name: Check auto docs run: make modular-autodoctrings - name: Check if failure if: ${{ failure() }} run: | echo "Auto docstring checks failed. Please run 'python utils/modular_auto_docstring.py --fix_and_overwrite'."
>> $GITHUB_STEP_SUMMARY run_fast_tests: needs: [check_code_quality, check_repository_consistency, check_auto_docs] name: Fast PyTorch Modular Pipeline CPU tests runs-on: group: aws-highmemory-32-plus container: image: diffusers/diffusers-pytorch-cpu options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ defaults: run: shell: bash steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: Install dependencies run: | uv pip install -e ".[quality]" #uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git uv pip uninstall transformers huggingface_hub && uv pip install transformers==4.57.1 uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git --no-deps - name: Environment run: | python utils/print_env.py - name: Run fast PyTorch Pipeline CPU tests run: | pytest -n 8 --max-worker-restart=0 --dist=loadfile \ -k "not Flax and not Onnx" \ --make-reports=tests_torch_cpu_modular_pipelines \ tests/modular_pipelines - name: Failure short reports if: ${{ failure() }} run: cat reports/tests_torch_cpu_modular_pipelines_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: pr_pytorch_pipelines_torch_cpu_modular_pipelines_test_reports path: reports ================================================ FILE: .github/workflows/pr_style_bot.yml ================================================ name: PR Style Bot on: issue_comment: types: [created] permissions: contents: write pull-requests: write jobs: style: uses: huggingface/huggingface_hub/.github/workflows/style-bot-action.yml@main with: python_quality_dependencies: "[quality]" secrets: bot_token: ${{ secrets.HF_STYLE_BOT_ACTION }} ================================================ FILE: .github/workflows/pr_test_fetcher.yml ================================================ name: Fast 
tests for PRs - Test Fetcher on: workflow_dispatch env: DIFFUSERS_IS_CI: yes OMP_NUM_THREADS: 4 MKL_NUM_THREADS: 4 PYTEST_TIMEOUT: 60 concurrency: group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }} cancel-in-progress: true jobs: setup_pr_tests: name: Setup PR Tests runs-on: group: aws-general-8-plus container: image: diffusers/diffusers-pytorch-cpu options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ defaults: run: shell: bash outputs: matrix: ${{ steps.set_matrix.outputs.matrix }} test_map: ${{ steps.set_matrix.outputs.test_map }} steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 0 - name: Install dependencies run: | uv pip install -e ".[quality]" - name: Environment run: | python utils/print_env.py echo $(git --version) - name: Fetch Tests run: | python utils/tests_fetcher.py | tee test_preparation.txt - name: Report fetched tests uses: actions/upload-artifact@v6 with: name: test_fetched path: test_preparation.txt - id: set_matrix name: Create Test Matrix # The `keys` is used as GitHub actions matrix for jobs, i.e. `models`, `pipelines`, etc. # The `test_map` is used to get the actual identified test files under each key. 
# If no test to run (so no `test_map.json` file), create a dummy map (empty matrix will fail) run: | if [ -f test_map.json ]; then keys=$(python3 -c 'import json; fp = open("test_map.json"); test_map = json.load(fp); fp.close(); d = list(test_map.keys()); print(json.dumps(d))') test_map=$(python3 -c 'import json; fp = open("test_map.json"); test_map = json.load(fp); fp.close(); print(json.dumps(test_map))') else keys=$(python3 -c 'import json; print(json.dumps(["dummy"]))') test_map=$(python3 -c 'import json; print(json.dumps({"dummy": []}))') fi echo $keys echo $test_map echo "matrix=$keys" >> $GITHUB_OUTPUT echo "test_map=$test_map" >> $GITHUB_OUTPUT run_pr_tests: name: Run PR Tests needs: setup_pr_tests if: contains(fromJson(needs.setup_pr_tests.outputs.matrix), 'dummy') != true strategy: fail-fast: false max-parallel: 2 matrix: modules: ${{ fromJson(needs.setup_pr_tests.outputs.matrix) }} runs-on: group: aws-general-8-plus container: image: diffusers/diffusers-pytorch-cpu options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ defaults: run: shell: bash steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: Install dependencies run: | uv pip install -e ".[quality]" uv pip install accelerate - name: Environment run: | python utils/print_env.py - name: Run all selected tests on CPU run: | pytest -n 2 --dist=loadfile -v --make-reports=${{ matrix.modules }}_tests_cpu ${{ fromJson(needs.setup_pr_tests.outputs.test_map)[matrix.modules] }} - name: Failure short reports if: ${{ failure() }} continue-on-error: true run: | cat reports/${{ matrix.modules }}_tests_cpu_stats.txt cat reports/${{ matrix.modules }}_tests_cpu_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: ${{ matrix.modules }}_test_reports path: reports run_staging_tests: strategy: fail-fast: false matrix: config: - name: Hub tests for models, schedulers, and pipelines framework: hub_tests_pytorch runner:
aws-general-8-plus image: diffusers/diffusers-pytorch-cpu report: torch_hub name: ${{ matrix.config.name }} runs-on: group: ${{ matrix.config.runner }} container: image: ${{ matrix.config.image }} options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ defaults: run: shell: bash steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: Install dependencies run: | pip install -e ".[quality]" - name: Environment run: | python utils/print_env.py - name: Run Hub tests for models, schedulers, and pipelines on a staging env if: ${{ matrix.config.framework == 'hub_tests_pytorch' }} run: | HUGGINGFACE_CO_STAGING=true pytest \ -m "is_staging_test" \ --make-reports=tests_${{ matrix.config.report }} \ tests - name: Failure short reports if: ${{ failure() }} run: cat reports/tests_${{ matrix.config.report }}_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: pr_${{ matrix.config.report }}_test_reports path: reports ================================================ FILE: .github/workflows/pr_tests.yml ================================================ name: Fast tests for PRs on: pull_request: branches: [main] paths: - "src/diffusers/**.py" - "benchmarks/**.py" - "examples/**.py" - "scripts/**.py" - "tests/**.py" - ".github/**.yml" - "utils/**.py" - "setup.py" push: branches: - ci-* permissions: contents: read concurrency: group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }} cancel-in-progress: true env: DIFFUSERS_IS_CI: yes HF_XET_HIGH_PERFORMANCE: 1 OMP_NUM_THREADS: 4 MKL_NUM_THREADS: 4 PYTEST_TIMEOUT: 60 jobs: check_code_quality: runs-on: ubuntu-22.04 steps: - uses: actions/checkout@v6 - name: Set up Python uses: actions/setup-python@v6 with: python-version: "3.10" - name: Install dependencies run: | pip install --upgrade pip pip install .[quality] - name: Check quality run: make quality - name: Check if failure if: ${{ failure() }} run: | echo "Quality
check failed. Please ensure the right dependency versions are installed with 'pip install -e .[quality]' and run 'make style && make quality'" >> $GITHUB_STEP_SUMMARY check_repository_consistency: needs: check_code_quality runs-on: ubuntu-22.04 steps: - uses: actions/checkout@v6 - name: Set up Python uses: actions/setup-python@v6 with: python-version: "3.10" - name: Install dependencies run: | pip install --upgrade pip pip install .[quality] - name: Check repo consistency run: | python utils/check_copies.py python utils/check_dummies.py python utils/check_support_list.py make deps_table_check_updated - name: Check if failure if: ${{ failure() }} run: | echo "Repo consistency check failed. Please ensure the right dependency versions are installed with 'pip install -e .[quality]' and run 'make fix-copies'" >> $GITHUB_STEP_SUMMARY run_fast_tests: needs: [check_code_quality, check_repository_consistency] strategy: fail-fast: false matrix: config: - name: Fast PyTorch Pipeline CPU tests framework: pytorch_pipelines runner: aws-highmemory-32-plus image: diffusers/diffusers-pytorch-cpu report: torch_cpu_pipelines - name: Fast PyTorch Models & Schedulers CPU tests framework: pytorch_models runner: aws-general-8-plus image: diffusers/diffusers-pytorch-cpu report: torch_cpu_models_schedulers - name: PyTorch Example CPU tests framework: pytorch_examples runner: aws-general-8-plus image: diffusers/diffusers-pytorch-cpu report: torch_example_cpu name: ${{ matrix.config.name }} runs-on: group: ${{ matrix.config.runner }} container: image: ${{ matrix.config.image }} options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ defaults: run: shell: bash steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: Install dependencies run: | uv pip install -e ".[quality]" uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git uv pip uninstall accelerate 
&& uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git --no-deps - name: Environment run: | python utils/print_env.py - name: Run fast PyTorch Pipeline CPU tests if: ${{ matrix.config.framework == 'pytorch_pipelines' }} run: | pytest -n 8 --max-worker-restart=0 --dist=loadfile \ -k "not Flax and not Onnx" \ --make-reports=tests_${{ matrix.config.report }} \ tests/pipelines - name: Run fast PyTorch Model Scheduler CPU tests if: ${{ matrix.config.framework == 'pytorch_models' }} run: | pytest -n 4 --max-worker-restart=0 --dist=loadfile \ -k "not Flax and not Onnx and not Dependency" \ --make-reports=tests_${{ matrix.config.report }} \ tests/models tests/schedulers tests/others - name: Run example PyTorch CPU tests if: ${{ matrix.config.framework == 'pytorch_examples' }} run: | uv pip install ".[training]" pytest -n 4 --max-worker-restart=0 --dist=loadfile \ --make-reports=tests_${{ matrix.config.report }} \ examples - name: Failure short reports if: ${{ failure() }} run: cat reports/tests_${{ matrix.config.report }}_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: pr_${{ matrix.config.framework }}_${{ matrix.config.report }}_test_reports path: reports run_staging_tests: needs: [check_code_quality, check_repository_consistency] strategy: fail-fast: false matrix: config: - name: Hub tests for models, schedulers, and pipelines framework: hub_tests_pytorch runner: group: aws-general-8-plus image: diffusers/diffusers-pytorch-cpu report: torch_hub name: ${{ matrix.config.name }} runs-on: ${{ matrix.config.runner }} container: image: ${{ matrix.config.image }} options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ defaults: run: shell: bash steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: Install dependencies run: | uv pip install -e ".[quality]" - name: Environment run: | python utils/print_env.py - name: Run Hub tests for 
models, schedulers, and pipelines on a staging env if: ${{ matrix.config.framework == 'hub_tests_pytorch' }} run: | HUGGINGFACE_CO_STAGING=true pytest \ -m "is_staging_test" \ --make-reports=tests_${{ matrix.config.report }} \ tests - name: Failure short reports if: ${{ failure() }} run: cat reports/tests_${{ matrix.config.report }}_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: pr_${{ matrix.config.report }}_test_reports path: reports run_lora_tests: needs: [check_code_quality, check_repository_consistency] name: LoRA tests with PEFT main runs-on: group: aws-general-8-plus container: image: diffusers/diffusers-pytorch-cpu options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ defaults: run: shell: bash steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: Install dependencies run: | uv pip install -e ".[quality]" # TODO (sayakpaul, DN6): revisit `--no-deps` uv pip install -U peft@git+https://github.com/huggingface/peft.git --no-deps uv pip install -U tokenizers uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git --no-deps uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git - name: Environment run: | python utils/print_env.py - name: Run fast PyTorch LoRA tests with PEFT run: | pytest -n 4 --max-worker-restart=0 --dist=loadfile \ \ --make-reports=tests_peft_main \ tests/lora/ pytest -n 4 --max-worker-restart=0 --dist=loadfile \ \ --make-reports=tests_models_lora_peft_main \ tests/models/ -k "lora" - name: Failure short reports if: ${{ failure() }} run: | cat reports/tests_peft_main_failures_short.txt cat reports/tests_models_lora_peft_main_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: pr_lora_test_reports path: reports 
================================================ FILE: .github/workflows/pr_tests_gpu.yml ================================================ name: Fast GPU Tests on PR permissions: contents: read on: pull_request: branches: main paths: - "src/diffusers/models/modeling_utils.py" - "src/diffusers/models/model_loading_utils.py" - "src/diffusers/pipelines/pipeline_utils.py" - "src/diffusers/pipeline_loading_utils.py" - "src/diffusers/loaders/lora_base.py" - "src/diffusers/loaders/lora_pipeline.py" - "src/diffusers/loaders/peft.py" - "tests/pipelines/test_pipelines_common.py" - "tests/models/test_modeling_common.py" - "examples/**/*.py" workflow_dispatch: concurrency: group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }} cancel-in-progress: true env: DIFFUSERS_IS_CI: yes OMP_NUM_THREADS: 8 MKL_NUM_THREADS: 8 HF_XET_HIGH_PERFORMANCE: 1 PYTEST_TIMEOUT: 600 PIPELINE_USAGE_CUTOFF: 1000000000 # set high cutoff so that only always-test pipelines run jobs: check_code_quality: runs-on: ubuntu-22.04 steps: - uses: actions/checkout@v6 - name: Set up Python uses: actions/setup-python@v6 with: python-version: "3.10" - name: Install dependencies run: | pip install --upgrade pip pip install .[quality] - name: Check quality run: make quality - name: Check if failure if: ${{ failure() }} run: | echo "Quality check failed. 
Please ensure the right dependency versions are installed with 'pip install -e .[quality]' and run 'make style && make quality'" >> $GITHUB_STEP_SUMMARY check_repository_consistency: needs: check_code_quality runs-on: ubuntu-22.04 steps: - uses: actions/checkout@v6 - name: Set up Python uses: actions/setup-python@v6 with: python-version: "3.10" - name: Install dependencies run: | pip install --upgrade pip pip install .[quality] - name: Check repo consistency run: | python utils/check_copies.py python utils/check_dummies.py python utils/check_support_list.py make deps_table_check_updated - name: Check if failure if: ${{ failure() }} run: | echo "Repo consistency check failed. Please ensure the right dependency versions are installed with 'pip install -e .[quality]' and run 'make fix-copies'" >> $GITHUB_STEP_SUMMARY setup_torch_cuda_pipeline_matrix: needs: [check_code_quality, check_repository_consistency] name: Setup Torch Pipelines CUDA Slow Tests Matrix runs-on: group: aws-general-8-plus container: image: diffusers/diffusers-pytorch-cpu outputs: pipeline_test_matrix: ${{ steps.fetch_pipeline_matrix.outputs.pipeline_test_matrix }} steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: Install dependencies run: | uv pip install -e ".[quality]" - name: Environment run: | python utils/print_env.py - name: Fetch Pipeline Matrix id: fetch_pipeline_matrix run: | matrix=$(python utils/fetch_torch_cuda_pipeline_test_matrix.py) echo $matrix echo "pipeline_test_matrix=$matrix" >> $GITHUB_OUTPUT - name: Pipeline Tests Artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: test-pipelines.json path: reports torch_pipelines_cuda_tests: name: Torch Pipelines CUDA Tests needs: setup_torch_cuda_pipeline_matrix strategy: fail-fast: false max-parallel: 8 matrix: module: ${{ fromJson(needs.setup_torch_cuda_pipeline_matrix.outputs.pipeline_test_matrix) }} runs-on: group: aws-g4dn-2xlarge container: image: 
diffusers/diffusers-pytorch-cuda options: --shm-size "16gb" --ipc host --gpus all steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: NVIDIA-SMI run: | nvidia-smi - name: Install dependencies run: | uv pip install -e ".[quality]" uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git - name: Environment run: | python utils/print_env.py - name: Extract tests id: extract_tests run: | pattern=$(python utils/extract_tests_from_mixin.py --type pipeline) echo "$pattern" > /tmp/test_pattern.txt echo "pattern_file=/tmp/test_pattern.txt" >> $GITHUB_OUTPUT - name: PyTorch CUDA checkpoint tests on Ubuntu env: HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms CUBLAS_WORKSPACE_CONFIG: :16:8 run: | if [ "${{ matrix.module }}" = "ip_adapters" ]; then pytest -n 1 --max-worker-restart=0 --dist=loadfile \ -k "not Flax and not Onnx" \ --make-reports=tests_pipeline_${{ matrix.module }}_cuda \ tests/pipelines/${{ matrix.module }} else pattern=$(cat ${{ steps.extract_tests.outputs.pattern_file }}) pytest -n 1 --max-worker-restart=0 --dist=loadfile \ -k "not Flax and not Onnx and $pattern" \ --make-reports=tests_pipeline_${{ matrix.module }}_cuda \ tests/pipelines/${{ matrix.module }} fi - name: Failure short reports if: ${{ failure() }} run: | cat reports/tests_pipeline_${{ matrix.module }}_cuda_stats.txt cat reports/tests_pipeline_${{ matrix.module }}_cuda_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: pipeline_${{ matrix.module }}_test_reports path: reports torch_cuda_tests: name: Torch CUDA Tests needs: [check_code_quality, check_repository_consistency] runs-on: group: 
aws-g4dn-2xlarge container: image: diffusers/diffusers-pytorch-cuda options: --shm-size "16gb" --ipc host --gpus all defaults: run: shell: bash strategy: fail-fast: false max-parallel: 4 matrix: module: [models, schedulers, lora, others] steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: Install dependencies run: | uv pip install -e ".[quality]" uv pip install peft@git+https://github.com/huggingface/peft.git uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git - name: Environment run: | python utils/print_env.py - name: Extract tests id: extract_tests run: | pattern=$(python utils/extract_tests_from_mixin.py --type ${{ matrix.module }}) echo "$pattern" > /tmp/test_pattern.txt echo "pattern_file=/tmp/test_pattern.txt" >> $GITHUB_OUTPUT - name: Run PyTorch CUDA tests env: HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms CUBLAS_WORKSPACE_CONFIG: :16:8 run: | pattern=$(cat ${{ steps.extract_tests.outputs.pattern_file }}) if [ -z "$pattern" ]; then pytest -n 1 --max-worker-restart=0 --dist=loadfile -k "not Flax and not Onnx" tests/${{ matrix.module }} \ --make-reports=tests_torch_cuda_${{ matrix.module }} else pytest -n 1 --max-worker-restart=0 --dist=loadfile -k "not Flax and not Onnx and $pattern" tests/${{ matrix.module }} \ --make-reports=tests_torch_cuda_${{ matrix.module }} fi - name: Failure short reports if: ${{ failure() }} run: | cat reports/tests_torch_cuda_${{ matrix.module }}_stats.txt cat reports/tests_torch_cuda_${{ matrix.module }}_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: torch_cuda_test_reports_${{ matrix.module }} path: reports 
run_examples_tests: name: Examples PyTorch CUDA tests on Ubuntu needs: [check_code_quality, check_repository_consistency] runs-on: group: aws-g4dn-2xlarge container: image: diffusers/diffusers-pytorch-cuda options: --gpus all --shm-size "16gb" --ipc host steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: NVIDIA-SMI run: | nvidia-smi - name: Install dependencies run: | uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git uv pip install -e ".[quality,training]" - name: Environment run: | python utils/print_env.py - name: Run example tests on GPU env: HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} run: | uv pip install ".[training]" pytest -n 1 --max-worker-restart=0 --dist=loadfile --make-reports=examples_torch_cuda examples/ - name: Failure short reports if: ${{ failure() }} run: | cat reports/examples_torch_cuda_stats.txt cat reports/examples_torch_cuda_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: examples_test_reports path: reports ================================================ FILE: .github/workflows/pr_torch_dependency_test.yml ================================================ name: Run Torch dependency tests on: pull_request: branches: - main paths: - "src/diffusers/**.py" push: branches: - main concurrency: group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }} cancel-in-progress: true jobs: check_torch_dependencies: runs-on: ubuntu-22.04 steps: - uses: actions/checkout@v6 - name: Set up Python uses: actions/setup-python@v6 with: python-version: "3.10" - name: Install dependencies run: | pip install -e . 
pip install torch torchvision torchaudio pytest - name: Check for soft dependencies run: | pytest tests/others/test_dependencies.py ================================================ FILE: .github/workflows/push_tests.yml ================================================ name: Fast GPU Tests on main on: workflow_dispatch: push: branches: - main paths: - "src/diffusers/**.py" - "examples/**.py" - "tests/**.py" env: DIFFUSERS_IS_CI: yes OMP_NUM_THREADS: 8 MKL_NUM_THREADS: 8 HF_XET_HIGH_PERFORMANCE: 1 PYTEST_TIMEOUT: 600 PIPELINE_USAGE_CUTOFF: 50000 jobs: setup_torch_cuda_pipeline_matrix: name: Setup Torch Pipelines CUDA Slow Tests Matrix runs-on: group: aws-general-8-plus container: image: diffusers/diffusers-pytorch-cpu outputs: pipeline_test_matrix: ${{ steps.fetch_pipeline_matrix.outputs.pipeline_test_matrix }} steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: Install dependencies run: | uv pip install -e ".[quality]" - name: Environment run: | python utils/print_env.py - name: Fetch Pipeline Matrix id: fetch_pipeline_matrix run: | matrix=$(python utils/fetch_torch_cuda_pipeline_test_matrix.py) echo $matrix echo "pipeline_test_matrix=$matrix" >> $GITHUB_OUTPUT - name: Pipeline Tests Artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: test-pipelines.json path: reports torch_pipelines_cuda_tests: name: Torch Pipelines CUDA Tests needs: setup_torch_cuda_pipeline_matrix strategy: fail-fast: false max-parallel: 8 matrix: module: ${{ fromJson(needs.setup_torch_cuda_pipeline_matrix.outputs.pipeline_test_matrix) }} runs-on: group: aws-g4dn-2xlarge container: image: diffusers/diffusers-pytorch-cuda options: --shm-size "16gb" --ipc host --gpus all steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: NVIDIA-SMI run: | nvidia-smi - name: Install dependencies run: | uv pip install -e ".[quality]" uv pip uninstall accelerate && uv pip install -U 
accelerate@git+https://github.com/huggingface/accelerate.git uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git - name: Environment run: | python utils/print_env.py - name: PyTorch CUDA checkpoint tests on Ubuntu env: HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms CUBLAS_WORKSPACE_CONFIG: :16:8 run: | pytest -n 1 --max-worker-restart=0 --dist=loadfile \ -k "not Flax and not Onnx" \ --make-reports=tests_pipeline_${{ matrix.module }}_cuda \ tests/pipelines/${{ matrix.module }} - name: Failure short reports if: ${{ failure() }} run: | cat reports/tests_pipeline_${{ matrix.module }}_cuda_stats.txt cat reports/tests_pipeline_${{ matrix.module }}_cuda_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: pipeline_${{ matrix.module }}_test_reports path: reports torch_cuda_tests: name: Torch CUDA Tests runs-on: group: aws-g4dn-2xlarge container: image: diffusers/diffusers-pytorch-cuda options: --shm-size "16gb" --ipc host --gpus all defaults: run: shell: bash strategy: fail-fast: false max-parallel: 2 matrix: module: [models, schedulers, lora, others, single_file] steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: Install dependencies run: | uv pip install -e ".[quality]" uv pip install peft@git+https://github.com/huggingface/peft.git uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git - name: Environment run: | python utils/print_env.py - name: Run PyTorch CUDA tests env: HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} # 
https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms CUBLAS_WORKSPACE_CONFIG: :16:8 run: | pytest -n 1 --max-worker-restart=0 --dist=loadfile \ -k "not Flax and not Onnx" \ --make-reports=tests_torch_cuda_${{ matrix.module }} \ tests/${{ matrix.module }} - name: Failure short reports if: ${{ failure() }} run: | cat reports/tests_torch_cuda_${{ matrix.module }}_stats.txt cat reports/tests_torch_cuda_${{ matrix.module }}_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: torch_cuda_test_reports_${{ matrix.module }} path: reports run_torch_compile_tests: name: PyTorch Compile CUDA tests runs-on: group: aws-g4dn-2xlarge container: image: diffusers/diffusers-pytorch-cuda options: --gpus all --shm-size "16gb" --ipc host steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: NVIDIA-SMI run: | nvidia-smi - name: Install dependencies run: | uv pip install -e ".[quality,training]" uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git - name: Environment run: | python utils/print_env.py - name: Run example tests on GPU env: HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} RUN_COMPILE: yes run: | pytest -n 1 --max-worker-restart=0 --dist=loadfile -k "compile" --make-reports=tests_torch_compile_cuda tests/ - name: Failure short reports if: ${{ failure() }} run: cat reports/tests_torch_compile_cuda_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: torch_compile_test_reports path: reports run_xformers_tests: name: PyTorch xformers CUDA tests runs-on: group: aws-g4dn-2xlarge container: image: diffusers/diffusers-pytorch-xformers-cuda options: --gpus all --shm-size "16gb" --ipc host steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: 
NVIDIA-SMI run: | nvidia-smi - name: Install dependencies run: | uv pip install -e ".[quality,training]" - name: Environment run: | python utils/print_env.py - name: Run example tests on GPU env: HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} run: | pytest -n 1 --max-worker-restart=0 --dist=loadfile -k "xformers" --make-reports=tests_torch_xformers_cuda tests/ - name: Failure short reports if: ${{ failure() }} run: cat reports/tests_torch_xformers_cuda_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: torch_xformers_test_reports path: reports run_examples_tests: name: Examples PyTorch CUDA tests on Ubuntu runs-on: group: aws-g4dn-2xlarge container: image: diffusers/diffusers-pytorch-cuda options: --gpus all --shm-size "16gb" --ipc host steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: NVIDIA-SMI run: | nvidia-smi - name: Install dependencies run: | uv pip install -e ".[quality,training]" - name: Environment run: | python utils/print_env.py - name: Run example tests on GPU env: HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} run: | uv pip install ".[training]" pytest -n 1 --max-worker-restart=0 --dist=loadfile --make-reports=examples_torch_cuda examples/ - name: Failure short reports if: ${{ failure() }} run: | cat reports/examples_torch_cuda_stats.txt cat reports/examples_torch_cuda_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: examples_test_reports path: reports ================================================ FILE: .github/workflows/push_tests_fast.yml ================================================ name: Fast tests on main on: push: branches: - main paths: - "src/diffusers/**.py" - "examples/**.py" - "tests/**.py" concurrency: group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }} cancel-in-progress: true env: DIFFUSERS_IS_CI: yes HF_HOME: /mnt/cache 
OMP_NUM_THREADS: 8 MKL_NUM_THREADS: 8 HF_XET_HIGH_PERFORMANCE: 1 PYTEST_TIMEOUT: 600 RUN_SLOW: no jobs: run_fast_tests: strategy: fail-fast: false matrix: config: - name: Fast PyTorch CPU tests on Ubuntu framework: pytorch runner: aws-general-8-plus image: diffusers/diffusers-pytorch-cpu report: torch_cpu - name: PyTorch Example CPU tests on Ubuntu framework: pytorch_examples runner: aws-general-8-plus image: diffusers/diffusers-pytorch-cpu report: torch_example_cpu name: ${{ matrix.config.name }} runs-on: group: ${{ matrix.config.runner }} container: image: ${{ matrix.config.image }} options: --shm-size "16gb" --ipc host -v /mnt/hf_cache:/mnt/cache/ defaults: run: shell: bash steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: Install dependencies run: | uv pip install -e ".[quality]" - name: Environment run: | python utils/print_env.py - name: Run fast PyTorch CPU tests if: ${{ matrix.config.framework == 'pytorch' }} run: | pytest -n 4 --max-worker-restart=0 --dist=loadfile \ -k "not Flax and not Onnx" \ --make-reports=tests_${{ matrix.config.report }} \ tests/ - name: Run example PyTorch CPU tests if: ${{ matrix.config.framework == 'pytorch_examples' }} run: | uv pip install ".[training]" pytest -n 4 --max-worker-restart=0 --dist=loadfile \ --make-reports=tests_${{ matrix.config.report }} \ examples - name: Failure short reports if: ${{ failure() }} run: cat reports/tests_${{ matrix.config.report }}_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: pr_${{ matrix.config.report }}_test_reports path: reports ================================================ FILE: .github/workflows/push_tests_mps.yml ================================================ name: Fast mps tests on main on: workflow_dispatch: env: DIFFUSERS_IS_CI: yes HF_HOME: /mnt/cache OMP_NUM_THREADS: 8 MKL_NUM_THREADS: 8 HF_XET_HIGH_PERFORMANCE: 1 PYTEST_TIMEOUT: 600 RUN_SLOW: no concurrency: group: 
${{ github.workflow }}-${{ github.head_ref || github.run_id }} cancel-in-progress: true jobs: run_fast_tests_apple_m1: name: Fast PyTorch MPS tests on MacOS runs-on: macos-13-xlarge steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: Clean checkout shell: arch -arch arm64 bash {0} run: | git clean -fxd - name: Setup miniconda uses: ./.github/actions/setup-miniconda with: python-version: 3.9 - name: Install dependencies shell: arch -arch arm64 bash {0} run: | ${CONDA_RUN} python -m pip install --upgrade pip uv ${CONDA_RUN} python -m uv pip install -e ".[quality]" ${CONDA_RUN} python -m uv pip install torch torchvision torchaudio ${CONDA_RUN} python -m uv pip install accelerate@git+https://github.com/huggingface/accelerate.git ${CONDA_RUN} python -m uv pip install transformers --upgrade - name: Environment shell: arch -arch arm64 bash {0} run: | ${CONDA_RUN} python utils/print_env.py - name: Run fast PyTorch tests on M1 (MPS) shell: arch -arch arm64 bash {0} env: HF_HOME: /System/Volumes/Data/mnt/cache HF_TOKEN: ${{ secrets.HF_TOKEN }} run: | ${CONDA_RUN} python -m pytest -n 0 --make-reports=tests_torch_mps tests/ - name: Failure short reports if: ${{ failure() }} run: cat reports/tests_torch_mps_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: pr_torch_mps_test_reports path: reports ================================================ FILE: .github/workflows/pypi_publish.yaml ================================================ # Adapted from https://blog.deepjyoti30.dev/pypi-release-github-action name: PyPI release on: workflow_dispatch: push: tags: - "*" jobs: find-and-checkout-latest-branch: runs-on: ubuntu-22.04 outputs: latest_branch: ${{ steps.set_latest_branch.outputs.latest_branch }} steps: - name: Checkout Repo uses: actions/checkout@v6 - name: Set up Python uses: actions/setup-python@v6 with: python-version: '3.10' - name: Fetch latest branch id: 
fetch_latest_branch run: | pip install -U requests packaging LATEST_BRANCH=$(python utils/fetch_latest_release_branch.py) echo "Latest branch: $LATEST_BRANCH" echo "latest_branch=$LATEST_BRANCH" >> $GITHUB_ENV - name: Set latest branch output id: set_latest_branch run: echo "latest_branch=${{ env.latest_branch }}" >> $GITHUB_OUTPUT release: needs: find-and-checkout-latest-branch runs-on: ubuntu-22.04 steps: - name: Checkout Repo uses: actions/checkout@v6 with: ref: ${{ needs.find-and-checkout-latest-branch.outputs.latest_branch }} - name: Setup Python uses: actions/setup-python@v6 with: python-version: "3.10" - name: Install dependencies run: | python -m pip install --upgrade pip pip install -U setuptools wheel twine pip install -U torch --index-url https://download.pytorch.org/whl/cpu - name: Build the dist files run: python setup.py bdist_wheel && python setup.py sdist - name: Publish to the test PyPI env: TWINE_USERNAME: ${{ secrets.TEST_PYPI_USERNAME }} TWINE_PASSWORD: ${{ secrets.TEST_PYPI_PASSWORD }} run: twine upload dist/* -r pypitest --repository-url=https://test.pypi.org/legacy/ - name: Test installing diffusers and importing run: | pip install diffusers && pip uninstall diffusers -y pip install -i https://test.pypi.org/simple/ diffusers pip install -U transformers python utils/print_env.py python -c "from diffusers import __version__; print(__version__)" python -c "from diffusers import DiffusionPipeline; pipe = DiffusionPipeline.from_pretrained('fusing/unet-ldm-dummy-update'); pipe()" python -c "from diffusers import DiffusionPipeline; pipe = DiffusionPipeline.from_pretrained('hf-internal-testing/tiny-stable-diffusion-pipe', safety_checker=None); pipe('ah suh du')" python -c "from diffusers import *" - name: Publish to PyPI env: TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }} TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }} run: twine upload dist/* -r pypi ================================================ FILE: .github/workflows/release_tests_fast.yml
================================================ # Duplicate workflow to push_tests.yml that is meant to run on release/patch branches as a final check # Creating a duplicate workflow here is simpler than adding complex path/branch parsing logic to push_tests.yml # Needs to be updated if push_tests.yml updated name: (Release) Fast GPU Tests on main on: workflow_dispatch: push: branches: - "v*.*.*-release" - "v*.*.*-patch" env: DIFFUSERS_IS_CI: yes OMP_NUM_THREADS: 8 MKL_NUM_THREADS: 8 PYTEST_TIMEOUT: 600 PIPELINE_USAGE_CUTOFF: 50000 jobs: setup_torch_cuda_pipeline_matrix: name: Setup Torch Pipelines CUDA Slow Tests Matrix runs-on: group: aws-general-8-plus container: image: diffusers/diffusers-pytorch-cpu outputs: pipeline_test_matrix: ${{ steps.fetch_pipeline_matrix.outputs.pipeline_test_matrix }} steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: Install dependencies run: | uv pip install -e ".[quality]" uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git - name: Environment run: | python utils/print_env.py - name: Fetch Pipeline Matrix id: fetch_pipeline_matrix run: | matrix=$(python utils/fetch_torch_cuda_pipeline_test_matrix.py) echo $matrix echo "pipeline_test_matrix=$matrix" >> $GITHUB_OUTPUT - name: Pipeline Tests Artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: test-pipelines.json path: reports torch_pipelines_cuda_tests: name: Torch Pipelines CUDA Tests needs: setup_torch_cuda_pipeline_matrix strategy: fail-fast: false max-parallel: 8 matrix: module: ${{ fromJson(needs.setup_torch_cuda_pipeline_matrix.outputs.pipeline_test_matrix) }} runs-on: group: aws-g4dn-2xlarge container: image: diffusers/diffusers-pytorch-cuda options: --shm-size "16gb" --ipc host --gpus all steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: NVIDIA-SMI run: | nvidia-smi - name: Install 
dependencies run: | uv pip install -e ".[quality]" uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git - name: Environment run: | python utils/print_env.py - name: Slow PyTorch CUDA checkpoint tests on Ubuntu env: HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms CUBLAS_WORKSPACE_CONFIG: :16:8 run: | pytest -n 1 --max-worker-restart=0 --dist=loadfile \ -k "not Flax and not Onnx" \ --make-reports=tests_pipeline_${{ matrix.module }}_cuda \ tests/pipelines/${{ matrix.module }} - name: Failure short reports if: ${{ failure() }} run: | cat reports/tests_pipeline_${{ matrix.module }}_cuda_stats.txt cat reports/tests_pipeline_${{ matrix.module }}_cuda_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: pipeline_${{ matrix.module }}_test_reports path: reports torch_cuda_tests: name: Torch CUDA Tests runs-on: group: aws-g4dn-2xlarge container: image: diffusers/diffusers-pytorch-cuda options: --shm-size "16gb" --ipc host --gpus all defaults: run: shell: bash strategy: fail-fast: false max-parallel: 2 matrix: module: [models, schedulers, lora, others, single_file] steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: Install dependencies run: | uv pip install -e ".[quality]" uv pip install peft@git+https://github.com/huggingface/peft.git uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git - name: Environment run: | python utils/print_env.py - name: Run PyTorch CUDA 
tests env: HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms CUBLAS_WORKSPACE_CONFIG: :16:8 run: | pytest -n 1 --max-worker-restart=0 --dist=loadfile \ -k "not Flax and not Onnx" \ --make-reports=tests_torch_${{ matrix.module }}_cuda \ tests/${{ matrix.module }} - name: Failure short reports if: ${{ failure() }} run: | cat reports/tests_torch_${{ matrix.module }}_cuda_stats.txt cat reports/tests_torch_${{ matrix.module }}_cuda_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: torch_cuda_${{ matrix.module }}_test_reports path: reports torch_minimum_version_cuda_tests: name: Torch Minimum Version CUDA Tests runs-on: group: aws-g4dn-2xlarge container: image: diffusers/diffusers-pytorch-minimum-cuda options: --shm-size "16gb" --ipc host --gpus all defaults: run: shell: bash steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: Install dependencies run: | uv pip install -e ".[quality]" uv pip install peft@git+https://github.com/huggingface/peft.git uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git - name: Environment run: | python utils/print_env.py - name: Run PyTorch CUDA tests env: HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms CUBLAS_WORKSPACE_CONFIG: :16:8 run: | pytest -n 1 --max-worker-restart=0 --dist=loadfile \ -k "not Flax and not Onnx" \ --make-reports=tests_torch_minimum_cuda \ tests/models/test_modeling_common.py \ tests/pipelines/test_pipelines_common.py \ tests/pipelines/test_pipeline_utils.py \ tests/pipelines/test_pipelines.py \ 
tests/pipelines/test_pipelines_auto.py \ tests/schedulers/test_schedulers.py \ tests/others - name: Failure short reports if: ${{ failure() }} run: | cat reports/tests_torch_minimum_cuda_stats.txt cat reports/tests_torch_minimum_cuda_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: torch_minimum_version_cuda_test_reports path: reports run_torch_compile_tests: name: PyTorch Compile CUDA tests runs-on: group: aws-g4dn-2xlarge container: image: diffusers/diffusers-pytorch-cuda options: --gpus all --shm-size "16gb" --ipc host steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: NVIDIA-SMI run: | nvidia-smi - name: Install dependencies run: | uv pip install -e ".[quality,training]" uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git - name: Environment run: | python utils/print_env.py - name: Run torch compile tests on GPU env: HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} RUN_COMPILE: yes run: | pytest -n 1 --max-worker-restart=0 --dist=loadfile -k "compile" --make-reports=tests_torch_compile_cuda tests/ - name: Failure short reports if: ${{ failure() }} run: cat reports/tests_torch_compile_cuda_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: torch_compile_test_reports path: reports run_xformers_tests: name: PyTorch xformers CUDA tests runs-on: group: aws-g4dn-2xlarge container: image: diffusers/diffusers-pytorch-xformers-cuda options: --gpus all --shm-size "16gb" --ipc host steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: NVIDIA-SMI run: | nvidia-smi - name: Install dependencies run: | uv pip install -e ".[quality,training]" uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U
transformers@git+https://github.com/huggingface/transformers.git - name: Environment run: | python utils/print_env.py - name: Run example tests on GPU env: HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} run: | pytest -n 1 --max-worker-restart=0 --dist=loadfile -k "xformers" --make-reports=tests_torch_xformers_cuda tests/ - name: Failure short reports if: ${{ failure() }} run: cat reports/tests_torch_xformers_cuda_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: torch_xformers_test_reports path: reports run_examples_tests: name: Examples PyTorch CUDA tests on Ubuntu runs-on: group: aws-g4dn-2xlarge container: image: diffusers/diffusers-pytorch-cuda options: --gpus all --shm-size "16gb" --ipc host steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: NVIDIA-SMI run: | nvidia-smi - name: Install dependencies run: | uv pip install -e ".[quality,training]" uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git - name: Environment run: | python utils/print_env.py - name: Run example tests on GPU env: HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }} run: | uv pip install ".[training]" pytest -n 1 --max-worker-restart=0 --dist=loadfile --make-reports=examples_torch_cuda examples/ - name: Failure short reports if: ${{ failure() }} run: | cat reports/examples_torch_cuda_stats.txt cat reports/examples_torch_cuda_failures_short.txt - name: Test suite reports artifacts if: ${{ always() }} uses: actions/upload-artifact@v6 with: name: examples_test_reports path: reports ================================================ FILE: .github/workflows/run_tests_from_a_pr.yml ================================================ name: Check running SLOW tests from a PR (only GPU) on: workflow_dispatch: inputs: docker_image: default: 'diffusers/diffusers-pytorch-cuda' description: 'Name 
of the Docker image' required: true pr_number: description: 'PR number to test on' required: true test: description: 'Tests to run (e.g.: `tests/models`).' required: true env: DIFFUSERS_IS_CI: yes IS_GITHUB_CI: "1" HF_HOME: /mnt/cache OMP_NUM_THREADS: 8 MKL_NUM_THREADS: 8 PYTEST_TIMEOUT: 600 RUN_SLOW: yes jobs: run_tests: name: "Run a test on our runner from a PR" runs-on: group: aws-g4dn-2xlarge container: image: ${{ github.event.inputs.docker_image }} options: --gpus all --privileged --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ steps: - name: Validate test files input id: validate_test_files env: PY_TEST: ${{ github.event.inputs.test }} run: | if [[ ! "$PY_TEST" =~ ^tests/ ]]; then echo "Error: The input string must start with 'tests/'." exit 1 fi if [[ ! "$PY_TEST" =~ ^tests/(models|pipelines|lora) ]]; then echo "Error: The input string must contain either 'models', 'pipelines', or 'lora' after 'tests/'." exit 1 fi if [[ "$PY_TEST" == *";"* ]]; then echo "Error: The input string must not contain ';'." 
exit 1 fi echo "$PY_TEST" shell: bash -e {0} - name: Checkout PR branch uses: actions/checkout@v6 with: ref: refs/pull/${{ inputs.pr_number }}/head - name: Install pytest run: | uv pip install -e ".[quality]" uv pip install peft - name: Run tests env: PY_TEST: ${{ github.event.inputs.test }} run: | pytest "$PY_TEST" ================================================ FILE: .github/workflows/ssh-pr-runner.yml ================================================ name: SSH into PR runners on: workflow_dispatch: inputs: docker_image: description: 'Name of the Docker image' required: true env: IS_GITHUB_CI: "1" HF_HUB_READ_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }} HF_HOME: /mnt/cache DIFFUSERS_IS_CI: yes OMP_NUM_THREADS: 8 MKL_NUM_THREADS: 8 RUN_SLOW: yes jobs: ssh_runner: name: "SSH" runs-on: group: aws-highmemory-32-plus container: image: ${{ github.event.inputs.docker_image }} options: --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface/diffusers:/mnt/cache/ --privileged steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: Tailscale # In order to be able to SSH when a test fails uses: huggingface/tailscale-action@main with: authkey: ${{ secrets.TAILSCALE_SSH_AUTHKEY }} slackChannel: ${{ secrets.SLACK_CIFEEDBACK_CHANNEL }} slackToken: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }} waitForSSH: true ================================================ FILE: .github/workflows/ssh-runner.yml ================================================ name: SSH into GPU runners on: workflow_dispatch: inputs: runner_type: description: 'Type of runner to test (aws-g6-4xlarge-plus: a10, aws-g4dn-2xlarge: t4, aws-g6e-xlarge-plus: L40)' type: choice required: true options: - aws-g6-4xlarge-plus - aws-g4dn-2xlarge - aws-g6e-xlarge-plus docker_image: description: 'Name of the Docker image' required: true env: IS_GITHUB_CI: "1" HF_HUB_READ_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }} HF_HOME: /mnt/cache DIFFUSERS_IS_CI: yes OMP_NUM_THREADS: 8 MKL_NUM_THREADS: 8 
RUN_SLOW: yes jobs: ssh_runner: name: "SSH" runs-on: group: "${{ github.event.inputs.runner_type }}" container: image: ${{ github.event.inputs.docker_image }} options: --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface/diffusers:/mnt/cache/ --gpus all --privileged steps: - name: Checkout diffusers uses: actions/checkout@v6 with: fetch-depth: 2 - name: NVIDIA-SMI run: | nvidia-smi - name: Tailscale # In order to be able to SSH when a test fails uses: huggingface/tailscale-action@main with: authkey: ${{ secrets.TAILSCALE_SSH_AUTHKEY }} slackChannel: ${{ secrets.SLACK_CIFEEDBACK_CHANNEL }} slackToken: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }} waitForSSH: true ================================================ FILE: .github/workflows/stale.yml ================================================ name: Stale Bot on: schedule: - cron: "0 15 * * *" jobs: close_stale_issues: name: Close Stale Issues if: github.repository == 'huggingface/diffusers' runs-on: ubuntu-22.04 permissions: issues: write pull-requests: write env: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} steps: - uses: actions/checkout@v6 - name: Setup Python uses: actions/setup-python@v6 with: python-version: "3.10" - name: Install requirements run: | pip install PyGithub - name: Close stale issues run: | python utils/stale.py ================================================ FILE: .github/workflows/trufflehog.yml ================================================ on: push: name: Secret Leaks jobs: trufflehog: runs-on: ubuntu-22.04 steps: - name: Checkout code uses: actions/checkout@v6 with: fetch-depth: 0 - name: Secret Scanning uses: trufflesecurity/trufflehog@main with: extra_args: --results=verified,unknown ================================================ FILE: .github/workflows/typos.yml ================================================ name: Check typos on: workflow_dispatch: jobs: build: runs-on: ubuntu-22.04 steps: - uses: actions/checkout@v6 - name: typos-action uses: crate-ci/typos@v1.42.1
================================================ FILE: .github/workflows/update_metadata.yml ================================================ name: Update Diffusers metadata on: workflow_dispatch: push: branches: - main - update_diffusers_metadata* jobs: update_metadata: runs-on: ubuntu-22.04 defaults: run: shell: bash -l {0} steps: - uses: actions/checkout@v6 - name: Setup environment run: | pip install --upgrade pip pip install datasets pandas pip install .[torch] - name: Update metadata env: HF_TOKEN: ${{ secrets.SAYAK_HF_TOKEN }} run: | python utils/update_metadata.py --commit_sha ${{ github.sha }} ================================================ FILE: .github/workflows/upload_pr_documentation.yml ================================================ name: Upload PR Documentation on: workflow_run: workflows: ["Build PR Documentation"] types: - completed jobs: build: uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main with: package_name: diffusers secrets: hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }} comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }} ================================================ FILE: .gitignore ================================================ # Initially taken from GitHub's Python gitignore file # Byte-compiled / optimized / DLL files __pycache__/ *.py[cod] *$py.class # C extensions *.so # tests and logs tests/fixtures/cached_*_text.txt logs/ lightning_logs/ lang_code_data/ # Distribution / packaging .Python build/ develop-eggs/ dist/ downloads/ eggs/ .eggs/ lib/ lib64/ parts/ sdist/ var/ wheels/ *.egg-info/ .installed.cfg *.egg MANIFEST # PyInstaller # Usually these files are written by a Python script from a template # before PyInstaller builds the exe, so as to inject date/other infos into it. 
*.manifest *.spec # Installer logs pip-log.txt pip-delete-this-directory.txt # Unit test / coverage reports htmlcov/ .tox/ .nox/ .coverage .coverage.* .cache nosetests.xml coverage.xml *.cover .hypothesis/ .pytest_cache/ # Translations *.mo *.pot # Django stuff: *.log local_settings.py db.sqlite3 # Flask stuff: instance/ .webassets-cache # Scrapy stuff: .scrapy # Sphinx documentation docs/_build/ # PyBuilder target/ # Jupyter Notebook .ipynb_checkpoints # IPython profile_default/ ipython_config.py # pyenv .python-version # celery beat schedule file celerybeat-schedule # SageMath parsed files *.sage.py # Environments .env .venv env/ venv/ ENV/ env.bak/ venv.bak/ # Spyder project settings .spyderproject .spyproject # Rope project settings .ropeproject # mkdocs documentation /site # mypy .mypy_cache/ .dmypy.json dmypy.json # Pyre type checker .pyre/ # vscode .vs .vscode # Cursor .cursor # Pycharm .idea # TF code tensorflow_code # Models proc_data # examples runs /runs_old /wandb /examples/runs /examples/**/*.args /examples/rag/sweep # data /data serialization_dir # emacs *.*~ debug.env # vim .*.swp # ctags tags # pre-commit .pre-commit* # .lock *.lock # DS_Store (MacOS) .DS_Store # RL pipelines may produce mp4 outputs *.mp4 # dependencies /transformers # ruff .ruff_cache # wandb wandb # AI agent generated symlinks /AGENTS.md /CLAUDE.md /.agents/skills /.claude/skills ================================================ FILE: CITATION.cff ================================================ cff-version: 1.2.0 title: 'Diffusers: State-of-the-art diffusion models' message: >- If you use this software, please cite it using the metadata from this file. 
type: software authors: - given-names: Patrick family-names: von Platen - given-names: Suraj family-names: Patil - given-names: Anton family-names: Lozhkov - given-names: Pedro family-names: Cuenca - given-names: Nathan family-names: Lambert - given-names: Kashif family-names: Rasul - given-names: Mishig family-names: Davaadorj - given-names: Dhruv family-names: Nair - given-names: Sayak family-names: Paul - given-names: Steven family-names: Liu - given-names: William family-names: Berman - given-names: Yiyi family-names: Xu - given-names: Thomas family-names: Wolf repository-code: 'https://github.com/huggingface/diffusers' abstract: >- Diffusers provides pretrained diffusion models across multiple modalities, such as vision and audio, and serves as a modular toolbox for inference and training of diffusion models. keywords: - deep-learning - pytorch - image-generation - hacktoberfest - diffusion - text2image - image2image - score-based-generative-modeling - stable-diffusion - stable-diffusion-diffusers license: Apache-2.0 version: 0.12.1 ================================================ FILE: CODE_OF_CONDUCT.md ================================================ # Contributor Covenant Code of Conduct ## Our Pledge We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, caste, color, religion, or sexual identity and orientation. We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community. 
## Our Standards Examples of behavior that contributes to a positive environment for our community include: * Demonstrating empathy and kindness toward other people * Being respectful of differing opinions, viewpoints, and experiences * Giving and gracefully accepting constructive feedback * Accepting responsibility and apologizing to those affected by our mistakes, and learning from the experience * Focusing on what is best not just for us as individuals, but for the overall Diffusers community Examples of unacceptable behavior include: * The use of sexualized language or imagery, and sexual attention or advances of any kind * Trolling, insulting or derogatory comments, and personal or political attacks * Public or private harassment * Publishing others' private information, such as a physical or email address, without their explicit permission * Spamming issues or PRs with links to projects unrelated to this library * Other conduct which could reasonably be considered inappropriate in a professional setting ## Enforcement Responsibilities Community leaders are responsible for clarifying and enforcing our standards of acceptable behavior and will take appropriate and fair corrective action in response to any behavior that they deem inappropriate, threatening, offensive, or harmful. Community leaders have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, and will communicate reasons for moderation decisions when appropriate. ## Scope This Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public spaces. Examples of representing our community include using an official e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. 
## Enforcement Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement at feedback@huggingface.co. All complaints will be reviewed and investigated promptly and fairly. All community leaders are obligated to respect the privacy and security of the reporter of any incident. ## Enforcement Guidelines Community leaders will follow these Community Impact Guidelines in determining the consequences for any action they deem in violation of this Code of Conduct: ### 1. Correction **Community Impact**: Use of inappropriate language or other behavior deemed unprofessional or unwelcome in the community. **Consequence**: A private, written warning from community leaders, providing clarity around the nature of the violation and an explanation of why the behavior was inappropriate. A public apology may be requested. ### 2. Warning **Community Impact**: A violation through a single incident or series of actions. **Consequence**: A warning with consequences for continued behavior. No interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, for a specified period of time. This includes avoiding interactions in community spaces as well as external channels like social media. Violating these terms may lead to a temporary or permanent ban. ### 3. Temporary Ban **Community Impact**: A serious violation of community standards, including sustained inappropriate behavior. **Consequence**: A temporary ban from any sort of interaction or public communication with the community for a specified period of time. No public or private interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, is allowed during this period. Violating these terms may lead to a permanent ban. ### 4. 
Permanent Ban **Community Impact**: Demonstrating a pattern of violation of community standards, including sustained inappropriate behavior, harassment of an individual, or aggression toward or disparagement of classes of individuals. **Consequence**: A permanent ban from any sort of public interaction within the community. ## Attribution This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 2.1, available at https://www.contributor-covenant.org/version/2/1/code_of_conduct.html. Community Impact Guidelines were inspired by [Mozilla's code of conduct enforcement ladder](https://github.com/mozilla/diversity). [homepage]: https://www.contributor-covenant.org For answers to common questions about this code of conduct, see the FAQ at https://www.contributor-covenant.org/faq. Translations are available at https://www.contributor-covenant.org/translations. ================================================ FILE: LICENSE ================================================ Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. 
"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." 
"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. 
You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. 
Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, Any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. 
In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work. To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives. 
Copyright [yyyy] [name of copyright owner] Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. ================================================ FILE: MANIFEST.in ================================================ include LICENSE include src/diffusers/utils/model_card_template.md ================================================ FILE: Makefile ================================================ .PHONY: deps_table_update modified_only_fixup extra_style_checks quality style fixup fix-copies test test-examples codex claude clean-ai # make sure to test the local checkout in scripts and not the pre-installed one (don't use quotes!) 
export PYTHONPATH = src check_dirs := examples scripts src tests utils benchmarks modified_only_fixup: $(eval modified_py_files := $(shell python utils/get_modified_files.py $(check_dirs))) @if test -n "$(modified_py_files)"; then \ echo "Checking/fixing $(modified_py_files)"; \ ruff check $(modified_py_files) --fix; \ ruff format $(modified_py_files);\ else \ echo "No library .py files were modified"; \ fi # Update src/diffusers/dependency_versions_table.py deps_table_update: @python setup.py deps_table_update deps_table_check_updated: @md5sum src/diffusers/dependency_versions_table.py > md5sum.saved @python setup.py deps_table_update @md5sum -c --quiet md5sum.saved || (printf "\nError: the version dependency table is outdated.\nPlease run 'make fixup' or 'make style' and commit the changes.\n\n" && exit 1) @rm md5sum.saved # autogenerating code autogenerate_code: deps_table_update # Check that the repo is in a good state repo-consistency: python utils/check_dummies.py python utils/check_repo.py python utils/check_inits.py # this target runs checks on all files quality: ruff check $(check_dirs) setup.py ruff format --check $(check_dirs) setup.py doc-builder style src/diffusers docs/source --max_len 119 --check_only python utils/check_doc_toc.py # Format source code automatically and check if there are any problems left that need manual fixing extra_style_checks: python utils/custom_init_isort.py python utils/check_doc_toc.py --fix_and_overwrite # this target runs checks on all files and potentially modifies some of them style: ruff check $(check_dirs) setup.py --fix ruff format $(check_dirs) setup.py doc-builder style src/diffusers docs/source --max_len 119 ${MAKE} autogenerate_code ${MAKE} extra_style_checks # Super fast fix and check target that only works on relevant modified files since the branch was made fixup: modified_only_fixup extra_style_checks autogenerate_code repo-consistency # Make marked copies of snippets of code conform to the original
fix-copies: python utils/check_copies.py --fix_and_overwrite python utils/check_dummies.py --fix_and_overwrite # Auto docstrings in modular blocks modular-autodoctrings: python utils/modular_auto_docstring.py # Run tests for the library test: python -m pytest -n auto --dist=loadfile -s -v ./tests/ # Run tests for examples test-examples: python -m pytest -n auto --dist=loadfile -s -v ./examples/ # Release stuff pre-release: python utils/release.py pre-patch: python utils/release.py --patch post-release: python utils/release.py --post_release post-patch: python utils/release.py --post_release --patch # AI agent symlinks codex: ln -snf .ai/AGENTS.md AGENTS.md mkdir -p .agents rm -rf .agents/skills ln -snf ../.ai/skills .agents/skills claude: ln -snf .ai/AGENTS.md CLAUDE.md mkdir -p .claude rm -rf .claude/skills ln -snf ../.ai/skills .claude/skills clean-ai: rm -f AGENTS.md CLAUDE.md rm -rf .agents/skills .claude/skills ================================================ FILE: PHILOSOPHY.md ================================================ # Philosophy 🧨 Diffusers provides **state-of-the-art** pretrained diffusion models across multiple modalities. Its purpose is to serve as a **modular toolbox** for both inference and training. We aim to build a library that stands the test of time and therefore take API design very seriously. In a nutshell, Diffusers is built to be a natural extension of PyTorch. Therefore, most of our design choices are based on [PyTorch's Design Principles](https://pytorch.org/docs/stable/community/design.html#pytorch-design-philosophy). Let's go over the most important ones: ## Usability over Performance - While Diffusers has many built-in performance-enhancing features (see [Memory and Speed](https://huggingface.co/docs/diffusers/optimization/fp16)), models are always loaded with the highest precision and lowest optimization. 
Therefore, by default diffusion pipelines are always instantiated on CPU with float32 precision if not otherwise defined by the user. This ensures usability across different platforms and accelerators and means that no complex installations are required to run the library. - Diffusers aims to be a **light-weight** package and therefore has very few required dependencies, but many soft dependencies that can improve performance (such as `accelerate`, `safetensors`, `onnx`, etc...). We strive to keep the library as lightweight as possible so that it can be added without much concern as a dependency on other packages. - Diffusers prefers simple, self-explanatory code over condensed, magic code. This means that shorthand syntax, such as lambda functions and advanced PyTorch operators, is often not desired. ## Simple over easy As PyTorch states, **explicit is better than implicit** and **simple is better than complex**. This design philosophy is reflected in multiple parts of the library: - We follow PyTorch's API with methods like [`DiffusionPipeline.to`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.to) to let the user handle device management. - Raising concise error messages is preferred to silently correcting erroneous input. Diffusers aims at teaching the user, rather than making the library as easy to use as possible. - Complex model vs. scheduler logic is exposed instead of magically handled inside. Schedulers/Samplers are separated from diffusion models with minimal dependencies on each other. This forces the user to write the unrolled denoising loop. However, the separation allows for easier debugging and gives the user more control over adapting the denoising process or switching out diffusion models or schedulers. - Separately trained components of the diffusion pipeline, *e.g.* the text encoder, the UNet, and the variational autoencoder, each have their own model class.
This forces the user to handle the interaction between the different model components, and the serialization format separates the model components into different files. However, this allows for easier debugging and customization. DreamBooth or Textual Inversion training is very simple thanks to Diffusers' ability to separate single components of the diffusion pipeline. ## Tweakable, contributor-friendly over abstraction For large parts of the library, Diffusers adopts an important design principle of the [Transformers library](https://github.com/huggingface/transformers), which is to prefer copy-pasted code over hasty abstractions. This design principle is very opinionated and stands in stark contrast to popular design principles such as [Don't repeat yourself (DRY)](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself). In short, just like Transformers does for modeling files, Diffusers prefers to keep an extremely low level of abstraction and very self-contained code for pipelines and schedulers. Functions, long code blocks, and even classes can be copied across multiple files which at first can look like a bad, sloppy design choice that makes the library unmaintainable. **However**, this design has proven to be extremely successful for Transformers and makes a lot of sense for community-driven, open-source machine learning libraries because: - Machine Learning is an extremely fast-moving field in which paradigms, model architectures, and algorithms are changing rapidly, which therefore makes it very difficult to define long-lasting code abstractions. - Machine Learning practitioners like to be able to quickly tweak existing code for ideation and research and therefore prefer self-contained code over one that contains many abstractions. - Open-source libraries rely on community contributions and therefore must build a library that is easy to contribute to. The more abstract the code, the more dependencies, the harder to read, and the harder to contribute to. 
Contributors simply stop contributing to very abstract libraries out of fear of breaking vital functionality. If contributing to a library cannot break other fundamental code, not only is it more inviting for potential new contributors, but it is also easier to review and contribute to multiple parts in parallel. At Hugging Face, we call this design the **single-file policy** which means that almost all of the code of a certain class should be written in a single, self-contained file. To read more about the philosophy, you can have a look at [this blog post](https://huggingface.co/blog/transformers-design-philosophy). In Diffusers, we follow this philosophy for both pipelines and schedulers, but only partly for diffusion models. The reason we don't follow this design fully for diffusion models is because almost all diffusion pipelines, such as [DDPM](https://huggingface.co/docs/diffusers/api/pipelines/ddpm), [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview#stable-diffusion-pipelines), [unCLIP (DALL·E 2)](https://huggingface.co/docs/diffusers/api/pipelines/unclip) and [Imagen](https://imagen.research.google/) all rely on the same diffusion model, the [UNet](https://huggingface.co/docs/diffusers/api/models/unet2d-cond). Great, now you should have generally understood why 🧨 Diffusers is designed the way it is 🤗. We try to apply these design principles consistently across the library. Nevertheless, there are some minor exceptions to the philosophy or some unlucky design choices. If you have feedback regarding the design, we would ❤️ to hear it [directly on GitHub](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feedback.md&title=). ## Design Philosophy in Details Now, let's look a bit into the nitty-gritty details of the design philosophy. 
Diffusers essentially consists of three major classes: [pipelines](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines), [models](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models), and [schedulers](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers). Let's walk through more detailed design decisions for each class. ### Pipelines Pipelines are designed to be easy to use (and therefore do not follow [*Simple over easy*](#simple-over-easy) 100%), are not feature complete, and should loosely be seen as examples of how to use [models](#models) and [schedulers](#schedulers) for inference. The following design principles are followed: - Pipelines follow the single-file policy. All pipelines can be found in individual directories under `src/diffusers/pipelines`. One pipeline folder corresponds to one diffusion paper/project/release. Multiple pipeline files can be gathered in one pipeline folder, as is done for [`src/diffusers/pipelines/stable_diffusion`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/stable_diffusion). If pipelines share similar functionality, one can make use of the [# Copied from mechanism](https://github.com/huggingface/diffusers/blob/125d783076e5bd9785beb05367a2d2566843a271/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py#L251). - Pipelines all inherit from [`DiffusionPipeline`]. - Every pipeline consists of different model and scheduler components that are documented in the [`model_index.json` file](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/blob/main/model_index.json), are accessible under the same name as attributes of the pipeline, and can be shared between pipelines with the [`DiffusionPipeline.components`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.components) property.
- Every pipeline should be loadable via the [`DiffusionPipeline.from_pretrained`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained) function. - Pipelines should be used **only** for inference. - Pipelines should be very readable, self-explanatory, and easy to tweak. - Pipelines should be designed to build on top of each other and be easy to integrate into higher-level APIs. - Pipelines are **not** intended to be feature-complete user interfaces. For feature-complete user interfaces one should rather have a look at [InvokeAI](https://github.com/invoke-ai/InvokeAI), [Diffuzers](https://github.com/abhishekkrthakur/diffuzers), and [lama-cleaner](https://github.com/Sanster/lama-cleaner). - Every pipeline should have one and only one way to run it: via a `__call__` method. The naming of the `__call__` arguments should be shared across all pipelines. - Pipelines should be named after the task they are intended to solve. - In almost all cases, novel diffusion pipelines shall be implemented in a new pipeline folder/file. ### Models Models are designed as configurable toolboxes that are natural extensions of [PyTorch's Module class](https://pytorch.org/docs/stable/generated/torch.nn.Module.html). They only partly follow the **single-file policy**. The following design principles are followed: - Models correspond to **a type of model architecture**. *E.g.* the [`UNet2DConditionModel`] class is used for all UNet variations that expect 2D image inputs and are conditioned on some context. - All models can be found in [`src/diffusers/models`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models) and every model architecture shall be defined in its own file, e.g.
[`unets/unet_2d_condition.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unets/unet_2d_condition.py), [`transformers/transformer_2d.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformers/transformer_2d.py), etc... - Models **do not** follow the single-file policy and should make use of smaller model building blocks, such as [`attention.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention.py), [`resnet.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/resnet.py), [`embeddings.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/embeddings.py), etc... **Note**: This is in stark contrast to Transformers' modeling files and shows that models do not really follow the single-file policy. - Models intend to expose complexity, just like PyTorch's `Module` class, and give clear error messages. - Models all inherit from `ModelMixin` and `ConfigMixin`. - Models can be optimized for performance when the optimization doesn’t demand major code changes, keeps backward compatibility, and gives a significant memory or compute gain. - Models should by default have the highest precision and lowest performance setting. - To integrate new model checkpoints whose general architecture can be classified as an architecture that already exists in Diffusers, the existing model architecture shall be adapted to make it work with the new checkpoint. One should only create a new file if the model architecture is fundamentally different. - Models should be designed to be easily extendable to future changes. This can be achieved by limiting public function arguments, configuration arguments, and "foreseeing" future changes, *e.g.* it is usually better to add `string` "...type" arguments that can easily be extended to new future types instead of boolean `is_..._type` arguments.
Only the minimum number of changes shall be made to existing architectures to make a new model checkpoint work. - The model design is a difficult trade-off between keeping code readable and concise and supporting many model checkpoints. For most parts of the modeling code, classes shall be adapted for new model checkpoints, while there are some exceptions where it is preferred to add new classes to make sure the code is kept concise and readable long-term, such as [UNet blocks](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unets/unet_2d_blocks.py) and [Attention processors](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py). ### Schedulers Schedulers are responsible for guiding the denoising process during inference as well as for defining a noise schedule for training. They are designed as individual classes with loadable configuration files and strongly follow the **single-file policy**. The following design principles are followed: - All schedulers are found in [`src/diffusers/schedulers`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers). - Schedulers are **not** allowed to import from large utils files and shall be kept very self-contained. - One scheduler Python file corresponds to one scheduler algorithm (as might be defined in a paper). - If schedulers share similar functionalities, we can make use of the `# Copied from` mechanism. - Schedulers all inherit from `SchedulerMixin` and `ConfigMixin`. - Schedulers can be easily swapped out with the [`ConfigMixin.from_config`](https://huggingface.co/docs/diffusers/main/en/api/configuration#diffusers.ConfigMixin.from_config) method as explained in detail [here](./docs/source/en/using-diffusers/schedulers.md). - Every scheduler has to have a `set_timesteps` and a `step` function. `set_timesteps(...)` has to be called before every denoising process, *i.e.* before `step(...)` is called.
- Every scheduler exposes the timesteps to be "looped over" via a `timesteps` attribute, which is an array of timesteps the model will be called upon. - The `step(...)` function takes a predicted model output and the "current" sample (x_t) and returns the "previous", slightly more denoised sample (x_t-1). - Given the complexity of diffusion schedulers, the `step` function does not expose all the complexity and can be a bit of a "black box". - In almost all cases, novel schedulers shall be implemented in a new scheduling file. ================================================ FILE: README.md ================================================




🤗 Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. Whether you're looking for a simple inference solution or training your own diffusion models, 🤗 Diffusers is a modular toolbox that supports both. Our library is designed with a focus on [usability over performance](https://huggingface.co/docs/diffusers/conceptual/philosophy#usability-over-performance), [simple over easy](https://huggingface.co/docs/diffusers/conceptual/philosophy#simple-over-easy), and [customizability over abstractions](https://huggingface.co/docs/diffusers/conceptual/philosophy#tweakable-contributorfriendly-over-abstraction).

🤗 Diffusers offers three core components:

- State-of-the-art [diffusion pipelines](https://huggingface.co/docs/diffusers/api/pipelines/overview) that can be run in inference with just a few lines of code.
- Interchangeable noise [schedulers](https://huggingface.co/docs/diffusers/api/schedulers/overview) for different diffusion speeds and output quality.
- Pretrained [models](https://huggingface.co/docs/diffusers/api/models/overview) that can be used as building blocks, and combined with schedulers, for creating your own end-to-end diffusion systems.

## Installation

We recommend installing 🤗 Diffusers in a virtual environment from PyPI or Conda. For more details about installing [PyTorch](https://pytorch.org/get-started/locally/), please refer to their official documentation.

### PyTorch

With `pip` (official package):

```bash
pip install --upgrade diffusers[torch]
```

With `conda` (maintained by the community):

```sh
conda install -c conda-forge diffusers
```

### Apple Silicon (M1/M2) support

Please refer to the [How to use Stable Diffusion in Apple Silicon](https://huggingface.co/docs/diffusers/optimization/mps) guide.

## Quickstart

Generating outputs is super easy with 🤗 Diffusers. To generate an image from text, use the `from_pretrained` method to load any pretrained diffusion model (browse the [Hub](https://huggingface.co/models?library=diffusers&sort=downloads) for 30,000+ checkpoints):

```python
from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipeline.to("cuda")
pipeline("An image of a squirrel in Picasso style").images[0]
```

You can also dig into the models and schedulers toolbox to build your own diffusion system:

```python
from diffusers import DDPMScheduler, UNet2DModel
from PIL import Image
import torch

scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256")
model = UNet2DModel.from_pretrained("google/ddpm-cat-256").to("cuda")
scheduler.set_timesteps(50)

sample_size = model.config.sample_size
noise = torch.randn((1, 3, sample_size, sample_size), device="cuda")
input = noise

for t in scheduler.timesteps:
    with torch.no_grad():
        noisy_residual = model(input, t).sample
        prev_noisy_sample = scheduler.step(noisy_residual, t, input).prev_sample
        input = prev_noisy_sample

image = (input / 2 + 0.5).clamp(0, 1)
image = image.cpu().permute(0, 2, 3, 1).numpy()[0]
image = Image.fromarray((image * 255).round().astype("uint8"))
image
```

Check out the [Quickstart](https://huggingface.co/docs/diffusers/quicktour) to launch your diffusion journey today!
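The denoising loop above leans on a small scheduler contract: call `set_timesteps(...)` once, iterate over the `timesteps` attribute, and pass each model output through `step(...)` to get the previous sample. The sketch below illustrates only that call pattern, with no GPU or checkpoint required. `ToyScheduler`, `StepOutput`, `toy_model`, and the update rule are hypothetical stand-ins invented for this illustration; they are not part of Diffusers and do not implement DDPM or any real sampler.

```python
from dataclasses import dataclass, field


@dataclass
class StepOutput:
    # Mirrors the `.prev_sample` attribute used in the loop above.
    prev_sample: float


@dataclass
class ToyScheduler:
    """Hypothetical stand-in that mimics the scheduler interface only."""

    num_train_timesteps: int = 1000
    timesteps: list = field(default_factory=list)

    def set_timesteps(self, num_inference_steps: int) -> None:
        # Evenly spaced timesteps, ordered from most noisy to least noisy.
        stride = self.num_train_timesteps // num_inference_steps
        self.timesteps = [i * stride for i in range(num_inference_steps - 1, -1, -1)]

    def step(self, model_output: float, timestep: int, sample: float) -> StepOutput:
        # Toy update rule (an assumption for illustration): remove an equal
        # share of the predicted residual at every step. Calling this before
        # set_timesteps() fails, which matches the required call order.
        return StepOutput(prev_sample=sample - model_output / len(self.timesteps))


def toy_model(sample: float, timestep: int) -> float:
    # Pretend the clean value is 0.0, so the predicted residual is the sample itself.
    return sample


scheduler = ToyScheduler()
scheduler.set_timesteps(4)

sample = 1.0  # stands in for the initial noise tensor
for t in scheduler.timesteps:
    residual = toy_model(sample, t)
    sample = scheduler.step(residual, t, sample).prev_sample

print(scheduler.timesteps)
print(sample)
```

Because the loop only touches `set_timesteps`, `timesteps`, and `step`, any object exposing those three members can be dropped into it, which is the property that lets real schedulers be swapped without rewriting the loop.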
## How to navigate the documentation

| **Documentation** | **What can I learn?** |
|---------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [Tutorial](https://huggingface.co/docs/diffusers/tutorials/tutorial_overview) | A basic crash course for learning how to use the library's most important features like using models and schedulers to build your own diffusion system, and training your own diffusion model. |
| [Loading](https://huggingface.co/docs/diffusers/using-diffusers/loading) | Guides for how to load and configure all the components (pipelines, models, and schedulers) of the library, as well as how to use different schedulers. |
| [Pipelines for inference](https://huggingface.co/docs/diffusers/using-diffusers/overview_techniques) | Guides for how to use pipelines for different inference tasks, batched generation, controlling generated outputs and randomness, and how to contribute a pipeline to the library. |
| [Optimization](https://huggingface.co/docs/diffusers/optimization/fp16) | Guides for how to optimize your diffusion model to run faster and consume less memory. |
| [Training](https://huggingface.co/docs/diffusers/training/overview) | Guides for how to train a diffusion model for different tasks with different training techniques. |

## Contribution

We ❤️ contributions from the open-source community! If you want to contribute to this library, please check out our [Contribution guide](https://github.com/huggingface/diffusers/blob/main/CONTRIBUTING.md). You can look out for [issues](https://github.com/huggingface/diffusers/issues) you'd like to tackle to contribute to the library.

- See [Good first issues](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) for general opportunities to contribute
- See [New model/pipeline](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+pipeline%2Fmodel%22) to contribute exciting new diffusion models / diffusion pipelines
- See [New scheduler](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+scheduler%22)

Also, say 👋 in our public Discord channel. We discuss the hottest trends about diffusion models, help each other with contributions, personal projects or just hang out ☕.

## Popular Tasks & Pipelines

| **Task** | **Pipeline** | **🤗 Hub** |
|----------|--------------|------------|
| Unconditional Image Generation | DDPM | google/ddpm-ema-church-256 |
| Text-to-Image | Stable Diffusion Text-to-Image | stable-diffusion-v1-5/stable-diffusion-v1-5 |
| Text-to-Image | unCLIP | kakaobrain/karlo-v1-alpha |
| Text-to-Image | DeepFloyd IF | DeepFloyd/IF-I-XL-v1.0 |
| Text-to-Image | Kandinsky | kandinsky-community/kandinsky-2-2-decoder |
| Text-guided Image-to-Image | ControlNet | lllyasviel/sd-controlnet-canny |
| Text-guided Image-to-Image | InstructPix2Pix | timbrooks/instruct-pix2pix |
| Text-guided Image-to-Image | Stable Diffusion Image-to-Image | stable-diffusion-v1-5/stable-diffusion-v1-5 |
| Text-guided Image Inpainting | Stable Diffusion Inpainting | stable-diffusion-v1-5/stable-diffusion-inpainting |
| Image Variation | Stable Diffusion Image Variation | lambdalabs/sd-image-variations-diffusers |
| Super Resolution | Stable Diffusion Upscale | stabilityai/stable-diffusion-x4-upscaler |
| Super Resolution | Stable Diffusion Latent Upscale | stabilityai/sd-x2-latent-upscaler |

## Popular libraries using 🧨 Diffusers

- https://github.com/microsoft/TaskMatrix
- https://github.com/invoke-ai/InvokeAI
- https://github.com/InstantID/InstantID
- https://github.com/apple/ml-stable-diffusion
- https://github.com/Sanster/lama-cleaner
- https://github.com/IDEA-Research/Grounded-Segment-Anything
- https://github.com/ashawkey/stable-dreamfusion
- https://github.com/deep-floyd/IF
- https://github.com/bentoml/BentoML
- https://github.com/bmaltais/kohya_ss
- +14,000 other amazing GitHub repositories 💪

Thank you for using us ❤️.

## Credits

This library concretizes previous work by many different authors and would not have been possible without their great research and implementations. We'd like to thank, in particular, the following implementations which have helped us in our development and without which the API could not have been as polished today:

- @CompVis' latent diffusion models library, available [here](https://github.com/CompVis/latent-diffusion)
- @hojonathanho original DDPM implementation, available [here](https://github.com/hojonathanho/diffusion) as well as the extremely useful translation into PyTorch by @pesser, available [here](https://github.com/pesser/pytorch_diffusion)
- @ermongroup's DDIM implementation, available [here](https://github.com/ermongroup/ddim)
- @yang-song's Score-VE and Score-VP implementations, available [here](https://github.com/yang-song/score_sde_pytorch)

We also want to thank @heejkoo for the very helpful overview of papers, code and resources on diffusion models, available [here](https://github.com/heejkoo/Awesome-Diffusion-Models) as well as @crowsonkb and @rromb for useful discussions and insights.

## Citation

```bibtex
@misc{von-platen-etal-2022-diffusers,
  author = {Patrick von Platen and Suraj Patil and Anton Lozhkov and Pedro Cuenca and Nathan Lambert and Kashif Rasul and Mishig Davaadorj and Dhruv Nair and Sayak Paul and William Berman and Yiyi Xu and Steven Liu and Thomas Wolf},
  title = {Diffusers: State-of-the-art diffusion models},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/diffusers}}
}
```

================================================
FILE: _typos.toml
================================================
# Files for typos
# Instruction: https://github.com/marketplace/actions/typos-action#getting-started

[default.extend-identifiers]

[default.extend-words]
NIN="NIN" # NIN is used in scripts/convert_ncsnpp_original_checkpoint_to_diffusers.py
nd="np" # nd may be np (numpy)
parms="parms" # parms is used in scripts/convert_original_stable_diffusion_to_diffusers.py

[files]
extend-exclude = ["_typos.toml"]

================================================
FILE: benchmarks/README.md
================================================
# Diffusers Benchmarks

Welcome to Diffusers Benchmarks. These benchmarks are used to obtain latency and memory information of the most popular models across different scenarios such as:

* Base case i.e., when using `torch.bfloat16` and `torch.nn.functional.scaled_dot_product_attention`.
* Base + `torch.compile()`
* NF4 quantization
* Layerwise upcasting

Instead of full diffusion pipelines, only the forward pass of the respective model classes (such as `FluxTransformer2DModel`) is tested with the real checkpoints (such as `"black-forest-labs/FLUX.1-dev"`).

The entrypoint to running all the currently available benchmarks is in `run_all.py`. However, one can run the individual benchmarks, too, e.g., `python benchmarking_flux.py`. It should produce a CSV file containing various information about the benchmarks run.

The benchmarks are run on a weekly basis and the CI is defined in [benchmark.yml](../.github/workflows/benchmark.yml).

## Running the benchmarks manually

First set up `torch` and install `diffusers` from the root of the directory:

```sh
pip install -e ".[quality,test]"
```

Then make sure the other dependencies are installed:

```sh
cd benchmarks/
pip install -r requirements.txt
```

We need to be authenticated to access some of the checkpoints used during benchmarking:

```sh
hf auth login
```

We use an L40 GPU with 128GB RAM to run the benchmark CI. As such, the benchmarks are configured to run on NVIDIA GPUs. So, make sure you have access to a similar machine (or modify the benchmarking scripts accordingly).

Then you can either launch the entire benchmarking suite by running:

```sh
python run_all.py
```

Or, you can run the individual benchmarks.

## Customizing the benchmarks

We define "scenarios" to cover the most common ways in which these models are used. You can define a new scenario by modifying an existing benchmark file:

```py
BenchmarkScenario(
    name=f"{CKPT_ID}-bnb-8bit",
    model_cls=FluxTransformer2DModel,
    model_init_kwargs={
        "pretrained_model_name_or_path": CKPT_ID,
        "torch_dtype": torch.bfloat16,
        "subfolder": "transformer",
        "quantization_config": BitsAndBytesConfig(load_in_8bit=True),
    },
    get_model_input_dict=partial(get_input_dict, device=torch_device, dtype=torch.bfloat16),
    model_init_fn=model_init_fn,
)
```

You can also configure a new model-level benchmark and add it to the existing suite. To do so, defining a valid benchmarking file like `benchmarking_flux.py` should be enough.

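When adding a new benchmarking file, its name matters: `run_all.py` discovers scripts through the `benchmarking_*.py` pattern and derives the partial CSV name it cleans up from the last underscore-separated token of the file stem. The sketch below reproduces that derivation in isolation. `partial_csv_name` is a hypothetical helper written for this illustration (the expression lives inline in `run_all.py`), and `benchmarking_hunyuan_video.py` is a made-up file name.

```python
def partial_csv_name(script_path: str) -> str:
    """Mirror how run_all.py names the per-script CSV it removes before a run."""
    # Same expression as in run_all.py: drop ".py", keep the last "_"-separated token.
    script_name = script_path.split(".py")[0].split("_")[-1]
    return f"{script_name}.csv"


print(partial_csv_name("benchmarking_flux.py"))           # flux.csv
print(partial_csv_name("benchmarking_hunyuan_video.py"))  # video.csv (hypothetical file)
```

Since only the last token is used, a multi-word file name maps to a CSV named after its final word. Keeping the script's `RESULT_FILENAME` in line with that token, as `flux.csv` is for `benchmarking_flux.py`, should let the stale-file cleanup target the file the script actually writes.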
Happy benchmarking 🧨 ================================================ FILE: benchmarks/__init__.py ================================================ ================================================ FILE: benchmarks/benchmarking_flux.py ================================================ from functools import partial import torch from benchmarking_utils import BenchmarkMixin, BenchmarkScenario, model_init_fn from diffusers import BitsAndBytesConfig, FluxTransformer2DModel from diffusers.utils.testing_utils import torch_device CKPT_ID = "black-forest-labs/FLUX.1-dev" RESULT_FILENAME = "flux.csv" def get_input_dict(**device_dtype_kwargs): # resolution: 1024x1024 # maximum sequence length 512 hidden_states = torch.randn(1, 4096, 64, **device_dtype_kwargs) encoder_hidden_states = torch.randn(1, 512, 4096, **device_dtype_kwargs) pooled_prompt_embeds = torch.randn(1, 768, **device_dtype_kwargs) image_ids = torch.ones(512, 3, **device_dtype_kwargs) text_ids = torch.ones(4096, 3, **device_dtype_kwargs) timestep = torch.tensor([1.0], **device_dtype_kwargs) guidance = torch.tensor([1.0], **device_dtype_kwargs) return { "hidden_states": hidden_states, "encoder_hidden_states": encoder_hidden_states, "img_ids": image_ids, "txt_ids": text_ids, "pooled_projections": pooled_prompt_embeds, "timestep": timestep, "guidance": guidance, } if __name__ == "__main__": scenarios = [ BenchmarkScenario( name=f"{CKPT_ID}-bf16", model_cls=FluxTransformer2DModel, model_init_kwargs={ "pretrained_model_name_or_path": CKPT_ID, "torch_dtype": torch.bfloat16, "subfolder": "transformer", }, get_model_input_dict=partial(get_input_dict, device=torch_device, dtype=torch.bfloat16), model_init_fn=model_init_fn, compile_kwargs={"fullgraph": True}, ), BenchmarkScenario( name=f"{CKPT_ID}-bnb-nf4", model_cls=FluxTransformer2DModel, model_init_kwargs={ "pretrained_model_name_or_path": CKPT_ID, "torch_dtype": torch.bfloat16, "subfolder": "transformer", "quantization_config": BitsAndBytesConfig( load_in_4bit=True, 
bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_quant_type="nf4" ), }, get_model_input_dict=partial(get_input_dict, device=torch_device, dtype=torch.bfloat16), model_init_fn=model_init_fn, ), BenchmarkScenario( name=f"{CKPT_ID}-layerwise-upcasting", model_cls=FluxTransformer2DModel, model_init_kwargs={ "pretrained_model_name_or_path": CKPT_ID, "torch_dtype": torch.bfloat16, "subfolder": "transformer", }, get_model_input_dict=partial(get_input_dict, device=torch_device, dtype=torch.bfloat16), model_init_fn=partial(model_init_fn, layerwise_upcasting=True), ), BenchmarkScenario( name=f"{CKPT_ID}-group-offload-leaf", model_cls=FluxTransformer2DModel, model_init_kwargs={ "pretrained_model_name_or_path": CKPT_ID, "torch_dtype": torch.bfloat16, "subfolder": "transformer", }, get_model_input_dict=partial(get_input_dict, device=torch_device, dtype=torch.bfloat16), model_init_fn=partial( model_init_fn, group_offload_kwargs={ "onload_device": torch_device, "offload_device": torch.device("cpu"), "offload_type": "leaf_level", "use_stream": True, "non_blocking": True, }, ), ), ] runner = BenchmarkMixin() runner.run_bencmarks_and_collate(scenarios, filename=RESULT_FILENAME) ================================================ FILE: benchmarks/benchmarking_ltx.py ================================================ from functools import partial import torch from benchmarking_utils import BenchmarkMixin, BenchmarkScenario, model_init_fn from diffusers import LTXVideoTransformer3DModel from diffusers.utils.testing_utils import torch_device CKPT_ID = "Lightricks/LTX-Video-0.9.7-dev" RESULT_FILENAME = "ltx.csv" def get_input_dict(**device_dtype_kwargs): # 512x704 (161 frames) # `max_sequence_length`: 256 hidden_states = torch.randn(1, 7392, 128, **device_dtype_kwargs) encoder_hidden_states = torch.randn(1, 256, 4096, **device_dtype_kwargs) encoder_attention_mask = torch.ones(1, 256, **device_dtype_kwargs) timestep = torch.tensor([1.0], **device_dtype_kwargs) video_coords = torch.randn(1, 3, 
7392, **device_dtype_kwargs) return { "hidden_states": hidden_states, "encoder_hidden_states": encoder_hidden_states, "encoder_attention_mask": encoder_attention_mask, "timestep": timestep, "video_coords": video_coords, } if __name__ == "__main__": scenarios = [ BenchmarkScenario( name=f"{CKPT_ID}-bf16", model_cls=LTXVideoTransformer3DModel, model_init_kwargs={ "pretrained_model_name_or_path": CKPT_ID, "torch_dtype": torch.bfloat16, "subfolder": "transformer", }, get_model_input_dict=partial(get_input_dict, device=torch_device, dtype=torch.bfloat16), model_init_fn=model_init_fn, compile_kwargs={"fullgraph": True}, ), BenchmarkScenario( name=f"{CKPT_ID}-layerwise-upcasting", model_cls=LTXVideoTransformer3DModel, model_init_kwargs={ "pretrained_model_name_or_path": CKPT_ID, "torch_dtype": torch.bfloat16, "subfolder": "transformer", }, get_model_input_dict=partial(get_input_dict, device=torch_device, dtype=torch.bfloat16), model_init_fn=partial(model_init_fn, layerwise_upcasting=True), ), BenchmarkScenario( name=f"{CKPT_ID}-group-offload-leaf", model_cls=LTXVideoTransformer3DModel, model_init_kwargs={ "pretrained_model_name_or_path": CKPT_ID, "torch_dtype": torch.bfloat16, "subfolder": "transformer", }, get_model_input_dict=partial(get_input_dict, device=torch_device, dtype=torch.bfloat16), model_init_fn=partial( model_init_fn, group_offload_kwargs={ "onload_device": torch_device, "offload_device": torch.device("cpu"), "offload_type": "leaf_level", "use_stream": True, "non_blocking": True, }, ), ), ] runner = BenchmarkMixin() runner.run_bencmarks_and_collate(scenarios, filename=RESULT_FILENAME) ================================================ FILE: benchmarks/benchmarking_sdxl.py ================================================ from functools import partial import torch from benchmarking_utils import BenchmarkMixin, BenchmarkScenario, model_init_fn from diffusers import UNet2DConditionModel from diffusers.utils.testing_utils import torch_device CKPT_ID = 
"stabilityai/stable-diffusion-xl-base-1.0" RESULT_FILENAME = "sdxl.csv" def get_input_dict(**device_dtype_kwargs): # height: 1024 # width: 1024 # max_sequence_length: 77 hidden_states = torch.randn(1, 4, 128, 128, **device_dtype_kwargs) encoder_hidden_states = torch.randn(1, 77, 2048, **device_dtype_kwargs) timestep = torch.tensor([1.0], **device_dtype_kwargs) added_cond_kwargs = { "text_embeds": torch.randn(1, 1280, **device_dtype_kwargs), "time_ids": torch.ones(1, 6, **device_dtype_kwargs), } return { "sample": hidden_states, "encoder_hidden_states": encoder_hidden_states, "timestep": timestep, "added_cond_kwargs": added_cond_kwargs, } if __name__ == "__main__": scenarios = [ BenchmarkScenario( name=f"{CKPT_ID}-bf16", model_cls=UNet2DConditionModel, model_init_kwargs={ "pretrained_model_name_or_path": CKPT_ID, "torch_dtype": torch.bfloat16, "subfolder": "unet", }, get_model_input_dict=partial(get_input_dict, device=torch_device, dtype=torch.bfloat16), model_init_fn=model_init_fn, compile_kwargs={"fullgraph": True}, ), BenchmarkScenario( name=f"{CKPT_ID}-layerwise-upcasting", model_cls=UNet2DConditionModel, model_init_kwargs={ "pretrained_model_name_or_path": CKPT_ID, "torch_dtype": torch.bfloat16, "subfolder": "unet", }, get_model_input_dict=partial(get_input_dict, device=torch_device, dtype=torch.bfloat16), model_init_fn=partial(model_init_fn, layerwise_upcasting=True), ), BenchmarkScenario( name=f"{CKPT_ID}-group-offload-leaf", model_cls=UNet2DConditionModel, model_init_kwargs={ "pretrained_model_name_or_path": CKPT_ID, "torch_dtype": torch.bfloat16, "subfolder": "unet", }, get_model_input_dict=partial(get_input_dict, device=torch_device, dtype=torch.bfloat16), model_init_fn=partial( model_init_fn, group_offload_kwargs={ "onload_device": torch_device, "offload_device": torch.device("cpu"), "offload_type": "leaf_level", "use_stream": True, "non_blocking": True, }, ), ), ] runner = BenchmarkMixin() runner.run_bencmarks_and_collate(scenarios, 
filename=RESULT_FILENAME) ================================================ FILE: benchmarks/benchmarking_utils.py ================================================ import gc import inspect import logging import os import queue import threading from contextlib import nullcontext from dataclasses import dataclass from typing import Any, Callable import pandas as pd import torch import torch.utils.benchmark as benchmark from diffusers.models.modeling_utils import ModelMixin from diffusers.utils.testing_utils import require_torch_gpu, torch_device logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s") logger = logging.getLogger(__name__) NUM_WARMUP_ROUNDS = 5 def benchmark_fn(f, *args, **kwargs): t0 = benchmark.Timer( stmt="f(*args, **kwargs)", globals={"args": args, "kwargs": kwargs, "f": f}, num_threads=1, ) return float(f"{(t0.blocked_autorange().mean):.3f}") def flush(): gc.collect() torch.cuda.empty_cache() torch.cuda.reset_max_memory_allocated() torch.cuda.reset_peak_memory_stats() # Adapted from https://github.com/lucasb-eyer/cnn_vit_benchmarks/blob/15b665ff758e8062131353076153905cae00a71f/main.py def calculate_flops(model, input_dict): try: from torchprofile import profile_macs except ModuleNotFoundError: raise # This is a hacky way to convert the kwargs to args as `profile_macs` cries about kwargs. sig = inspect.signature(model.forward) param_names = [ p.name for p in sig.parameters.values() if p.kind in ( inspect.Parameter.POSITIONAL_ONLY, inspect.Parameter.POSITIONAL_OR_KEYWORD, ) and p.name != "self" ] bound = sig.bind_partial(**input_dict) bound.apply_defaults() args = tuple(bound.arguments[name] for name in param_names) model.eval() with torch.no_grad(): macs = profile_macs(model, args) flops = 2 * macs # 1 MAC operation = 2 FLOPs (1 multiplication + 1 addition) return flops def calculate_params(model): return sum(p.numel() for p in model.parameters()) # Users can define their own in case this doesn't suffice. 
For most cases, # it should be sufficient. def model_init_fn(model_cls, group_offload_kwargs=None, layerwise_upcasting=False, **init_kwargs): model = model_cls.from_pretrained(**init_kwargs).eval() if group_offload_kwargs and isinstance(group_offload_kwargs, dict): model.enable_group_offload(**group_offload_kwargs) else: model.to(torch_device) if layerwise_upcasting: model.enable_layerwise_casting( storage_dtype=torch.float8_e4m3fn, compute_dtype=init_kwargs.get("torch_dtype", torch.bfloat16) ) return model @dataclass class BenchmarkScenario: name: str model_cls: ModelMixin model_init_kwargs: dict[str, Any] model_init_fn: Callable get_model_input_dict: Callable compile_kwargs: dict[str, Any] | None = None @require_torch_gpu class BenchmarkMixin: def pre_benchmark(self): flush() torch.compiler.reset() def post_benchmark(self, model): model.cpu() flush() torch.compiler.reset() @torch.no_grad() def run_benchmark(self, scenario: BenchmarkScenario): # 0) Basic stats logger.info(f"Running scenario: {scenario.name}.") try: model = model_init_fn(scenario.model_cls, **scenario.model_init_kwargs) num_params = round(calculate_params(model) / 1e9, 2) try: flops = round(calculate_flops(model, input_dict=scenario.get_model_input_dict()) / 1e9, 2) except Exception as e: logger.info(f"Problem in calculating FLOPs:\n{e}") flops = None model.cpu() del model except Exception as e: logger.info(f"Error while initializing the model and calculating FLOPs:\n{e}") return {} self.pre_benchmark() # 1) plain stats results = {} plain = None try: plain = self._run_phase( model_cls=scenario.model_cls, init_fn=scenario.model_init_fn, init_kwargs=scenario.model_init_kwargs, get_input_fn=scenario.get_model_input_dict, compile_kwargs=None, ) except Exception as e: logger.info(f"Benchmark could not be run with the following error:\n{e}") return results # 2) compiled stats (if any) compiled = {"time": None, "memory": None} if scenario.compile_kwargs: try: compiled = self._run_phase( 
model_cls=scenario.model_cls, init_fn=scenario.model_init_fn, init_kwargs=scenario.model_init_kwargs, get_input_fn=scenario.get_model_input_dict, compile_kwargs=scenario.compile_kwargs, ) except Exception as e: logger.info(f"Compilation benchmark could not be run with the following error\n: {e}") if plain is None: return results # 3) merge result = { "scenario": scenario.name, "model_cls": scenario.model_cls.__name__, "num_params_B": num_params, "flops_G": flops, "time_plain_s": plain["time"], "mem_plain_GB": plain["memory"], "time_compile_s": compiled["time"], "mem_compile_GB": compiled["memory"], } if scenario.compile_kwargs: result["fullgraph"] = scenario.compile_kwargs.get("fullgraph", False) result["mode"] = scenario.compile_kwargs.get("mode", "default") else: result["fullgraph"], result["mode"] = None, None return result def run_bencmarks_and_collate(self, scenarios: BenchmarkScenario | list[BenchmarkScenario], filename: str): if not isinstance(scenarios, list): scenarios = [scenarios] record_queue = queue.Queue() stop_signal = object() def _writer_thread(): while True: item = record_queue.get() if item is stop_signal: break df_row = pd.DataFrame([item]) write_header = not os.path.exists(filename) df_row.to_csv(filename, mode="a", header=write_header, index=False) record_queue.task_done() record_queue.task_done() writer = threading.Thread(target=_writer_thread, daemon=True) writer.start() for s in scenarios: try: record = self.run_benchmark(s) if record: record_queue.put(record) else: logger.info(f"Record empty from scenario: {s.name}.") except Exception as e: logger.info(f"Running scenario ({s.name}) led to error:\n{e}") record_queue.put(stop_signal) logger.info(f"Results serialized to {filename=}.") def _run_phase( self, *, model_cls: ModelMixin, init_fn: Callable, init_kwargs: dict[str, Any], get_input_fn: Callable, compile_kwargs: dict[str, Any] | None = None, ) -> dict[str, float]: # setup self.pre_benchmark() # init & (optional) compile model = 
init_fn(model_cls, **init_kwargs) if compile_kwargs: model.compile(**compile_kwargs) # build inputs inp = get_input_fn() # measure run_ctx = torch._inductor.utils.fresh_inductor_cache() if compile_kwargs else nullcontext() with run_ctx: for _ in range(NUM_WARMUP_ROUNDS): _ = model(**inp) time_s = benchmark_fn(lambda m, d: m(**d), model, inp) mem_gb = torch.cuda.max_memory_allocated() / (1024**3) mem_gb = round(mem_gb, 2) # teardown self.post_benchmark(model) del model return {"time": time_s, "memory": mem_gb} ================================================ FILE: benchmarks/benchmarking_wan.py ================================================ from functools import partial import torch from benchmarking_utils import BenchmarkMixin, BenchmarkScenario, model_init_fn from diffusers import WanTransformer3DModel from diffusers.utils.testing_utils import torch_device CKPT_ID = "Wan-AI/Wan2.1-T2V-14B-Diffusers" RESULT_FILENAME = "wan.csv" def get_input_dict(**device_dtype_kwargs): # height: 480 # width: 832 # num_frames: 81 # max_sequence_length: 512 hidden_states = torch.randn(1, 16, 21, 60, 104, **device_dtype_kwargs) encoder_hidden_states = torch.randn(1, 512, 4096, **device_dtype_kwargs) timestep = torch.tensor([1.0], **device_dtype_kwargs) return {"hidden_states": hidden_states, "encoder_hidden_states": encoder_hidden_states, "timestep": timestep} if __name__ == "__main__": scenarios = [ BenchmarkScenario( name=f"{CKPT_ID}-bf16", model_cls=WanTransformer3DModel, model_init_kwargs={ "pretrained_model_name_or_path": CKPT_ID, "torch_dtype": torch.bfloat16, "subfolder": "transformer", }, get_model_input_dict=partial(get_input_dict, device=torch_device, dtype=torch.bfloat16), model_init_fn=model_init_fn, compile_kwargs={"fullgraph": True}, ), BenchmarkScenario( name=f"{CKPT_ID}-layerwise-upcasting", model_cls=WanTransformer3DModel, model_init_kwargs={ "pretrained_model_name_or_path": CKPT_ID, "torch_dtype": torch.bfloat16, "subfolder": "transformer", }, 
get_model_input_dict=partial(get_input_dict, device=torch_device, dtype=torch.bfloat16), model_init_fn=partial(model_init_fn, layerwise_upcasting=True), ), BenchmarkScenario( name=f"{CKPT_ID}-group-offload-leaf", model_cls=WanTransformer3DModel, model_init_kwargs={ "pretrained_model_name_or_path": CKPT_ID, "torch_dtype": torch.bfloat16, "subfolder": "transformer", }, get_model_input_dict=partial(get_input_dict, device=torch_device, dtype=torch.bfloat16), model_init_fn=partial( model_init_fn, group_offload_kwargs={ "onload_device": torch_device, "offload_device": torch.device("cpu"), "offload_type": "leaf_level", "use_stream": True, "non_blocking": True, }, ), ), ] runner = BenchmarkMixin() runner.run_bencmarks_and_collate(scenarios, filename=RESULT_FILENAME) ================================================ FILE: benchmarks/push_results.py ================================================ import os import pandas as pd from huggingface_hub import hf_hub_download, upload_file from huggingface_hub.utils import EntryNotFoundError REPO_ID = "diffusers/benchmarks" def has_previous_benchmark() -> str: from run_all import FINAL_CSV_FILENAME csv_path = None try: csv_path = hf_hub_download(repo_id=REPO_ID, repo_type="dataset", filename=FINAL_CSV_FILENAME) except EntryNotFoundError: csv_path = None return csv_path def filter_float(value): if isinstance(value, str): return float(value.split()[0]) return value def push_to_hf_dataset(): from run_all import FINAL_CSV_FILENAME, GITHUB_SHA csv_path = has_previous_benchmark() if csv_path is not None: current_results = pd.read_csv(FINAL_CSV_FILENAME) previous_results = pd.read_csv(csv_path) numeric_columns = current_results.select_dtypes(include=["float64", "int64"]).columns for column in numeric_columns: # get previous values as floats, aligned to current index prev_vals = previous_results[column].map(filter_float).reindex(current_results.index) # get current values as floats curr_vals = current_results[column].astype(float) # 
stringify the current values curr_str = curr_vals.map(str) # build an appendage only when prev exists and differs append_str = prev_vals.where(prev_vals.notnull() & (prev_vals != curr_vals), other=pd.NA).map( lambda x: f" ({x})" if pd.notnull(x) else "" ) # combine current_results[column] = curr_str + append_str os.remove(FINAL_CSV_FILENAME) current_results.to_csv(FINAL_CSV_FILENAME, index=False) commit_message = f"upload from sha: {GITHUB_SHA}" if GITHUB_SHA is not None else "upload benchmark results" upload_file( repo_id=REPO_ID, path_in_repo=FINAL_CSV_FILENAME, path_or_fileobj=FINAL_CSV_FILENAME, repo_type="dataset", commit_message=commit_message, ) upload_file( repo_id="diffusers/benchmark-analyzer", path_in_repo=FINAL_CSV_FILENAME, path_or_fileobj=FINAL_CSV_FILENAME, repo_type="space", commit_message=commit_message, ) if __name__ == "__main__": push_to_hf_dataset() ================================================ FILE: benchmarks/requirements.txt ================================================ pandas psutil gpustat torchprofile bitsandbytes psycopg2==2.9.9 ================================================ FILE: benchmarks/run_all.py ================================================ import glob import logging import os import subprocess import pandas as pd logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s") logger = logging.getLogger(__name__) PATTERN = "benchmarking_*.py" FINAL_CSV_FILENAME = "collated_results.csv" GITHUB_SHA = os.getenv("GITHUB_SHA", None) class SubprocessCallException(Exception): pass def run_command(command: list[str], return_stdout=False): try: output = subprocess.check_output(command, stderr=subprocess.STDOUT) if return_stdout and hasattr(output, "decode"): return output.decode("utf-8") except subprocess.CalledProcessError as e: raise SubprocessCallException(f"Command `{' '.join(command)}` failed with:\n{e.output.decode()}") from e def merge_csvs(final_csv: str = "collated_results.csv"): 
all_csvs = glob.glob("*.csv") all_csvs = [f for f in all_csvs if f != final_csv] if not all_csvs: logger.info("No result CSVs found to merge.") return df_list = [] for f in all_csvs: try: d = pd.read_csv(f) except pd.errors.EmptyDataError: # If a file existed but was zero‐bytes or corrupted, skip it continue df_list.append(d) if not df_list: logger.info("All result CSVs were empty or invalid; nothing to merge.") return final_df = pd.concat(df_list, ignore_index=True) if GITHUB_SHA is not None: final_df["github_sha"] = GITHUB_SHA final_df.to_csv(final_csv, index=False) logger.info(f"Merged {len(all_csvs)} partial CSVs → {final_csv}.") def run_scripts(): python_files = sorted(glob.glob(PATTERN)) python_files = [f for f in python_files if f != "benchmarking_utils.py"] for file in python_files: script_name = file.split(".py")[0].split("_")[-1] # example: benchmarking_foo.py -> foo logger.info(f"\n****** Running file: {file} ******") partial_csv = f"{script_name}.csv" if os.path.exists(partial_csv): logger.info(f"Found {partial_csv}. Removing for safer numbers and duplication.") os.remove(partial_csv) command = ["python", file] try: run_command(command) logger.info(f"→ {file} finished normally.") except SubprocessCallException as e: logger.info(f"Error running {file}:\n{e}") finally: logger.info(f"→ Merging partial CSVs after {file} …") merge_csvs(final_csv=FINAL_CSV_FILENAME) logger.info(f"\nAll scripts attempted. 
Final collated CSV: {FINAL_CSV_FILENAME}") if __name__ == "__main__": run_scripts() ================================================ FILE: docker/diffusers-doc-builder/Dockerfile ================================================ FROM python:3.10-slim ENV PYTHONDONTWRITEBYTECODE=1 LABEL maintainer="Hugging Face" LABEL repository="diffusers" ENV DEBIAN_FRONTEND=noninteractive RUN apt-get -y update && apt-get install -y bash \ build-essential \ git \ git-lfs \ curl \ ca-certificates \ libglib2.0-0 \ libsndfile1-dev \ libgl1 \ zip \ wget ENV UV_PYTHON=/usr/local/bin/python # pre-install the heavy dependencies (these can later be overridden by the deps from setup.py) RUN pip install uv RUN uv pip install --no-cache-dir \ torch \ torchvision \ torchaudio \ --extra-index-url https://download.pytorch.org/whl/cpu RUN uv pip install --no-cache-dir "git+https://github.com/huggingface/diffusers.git@main#egg=diffusers[test]" # Extra dependencies RUN uv pip install --no-cache-dir \ accelerate \ numpy==1.26.4 \ hf_xet \ setuptools==69.5.1 \ bitsandbytes \ torchao \ gguf \ optimum-quanto RUN apt-get clean && rm -rf /var/lib/apt/lists/* && apt-get autoremove && apt-get autoclean CMD ["/bin/bash"] ================================================ FILE: docker/diffusers-onnxruntime-cpu/Dockerfile ================================================ FROM ubuntu:20.04 LABEL maintainer="Hugging Face" LABEL repository="diffusers" ENV DEBIAN_FRONTEND=noninteractive RUN apt-get -y update \ && apt-get install -y software-properties-common \ && add-apt-repository ppa:deadsnakes/ppa RUN apt install -y bash \ build-essential \ git \ git-lfs \ curl \ ca-certificates \ libsndfile1-dev \ libgl1 \ python3.10 \ python3-pip \ python3.10-venv && \ rm -rf /var/lib/apt/lists # make sure to use venv RUN python3.10 -m venv /opt/venv ENV PATH="/opt/venv/bin:$PATH" # pre-install the heavy dependencies (these can later be overridden by the deps from setup.py) RUN python3 -m pip install --no-cache-dir --upgrade 
pip uv==0.1.11 && \ python3 -m uv pip install --no-cache-dir \ torch \ torchvision \ torchaudio\ onnxruntime \ --extra-index-url https://download.pytorch.org/whl/cpu && \ python3 -m uv pip install --no-cache-dir \ accelerate \ datasets \ hf-doc-builder \ huggingface-hub \ Jinja2 \ librosa \ numpy==1.26.4 \ scipy \ tensorboard \ transformers \ hf_xet CMD ["/bin/bash"] ================================================ FILE: docker/diffusers-onnxruntime-cuda/Dockerfile ================================================ FROM nvidia/cuda:12.1.0-runtime-ubuntu20.04 LABEL maintainer="Hugging Face" LABEL repository="diffusers" ENV DEBIAN_FRONTEND=noninteractive RUN apt-get -y update \ && apt-get install -y software-properties-common \ && add-apt-repository ppa:deadsnakes/ppa RUN apt install -y bash \ build-essential \ git \ git-lfs \ curl \ ca-certificates \ libsndfile1-dev \ libgl1 \ python3.10 \ python3-pip \ python3.10-venv && \ rm -rf /var/lib/apt/lists # make sure to use venv RUN python3.10 -m venv /opt/venv ENV PATH="/opt/venv/bin:$PATH" # pre-install the heavy dependencies (these can later be overridden by the deps from setup.py) RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \ python3.10 -m uv pip install --no-cache-dir \ torch \ torchvision \ torchaudio \ "onnxruntime-gpu>=1.13.1" \ --extra-index-url https://download.pytorch.org/whl/cu117 && \ python3.10 -m uv pip install --no-cache-dir \ accelerate \ datasets \ hf-doc-builder \ huggingface-hub \ hf_xet \ Jinja2 \ librosa \ numpy==1.26.4 \ scipy \ tensorboard \ transformers CMD ["/bin/bash"] ================================================ FILE: docker/diffusers-pytorch-cpu/Dockerfile ================================================ FROM python:3.10-slim ENV PYTHONDONTWRITEBYTECODE=1 LABEL maintainer="Hugging Face" LABEL repository="diffusers" ENV DEBIAN_FRONTEND=noninteractive RUN apt-get -y update && apt-get install -y bash \ build-essential \ git \ git-lfs \ curl \ ca-certificates \ 
libglib2.0-0 \ libsndfile1-dev \ libgl1 ENV UV_PYTHON=/usr/local/bin/python # pre-install the heavy dependencies (these can later be overridden by the deps from setup.py) RUN pip install uv RUN uv pip install --no-cache-dir \ torch \ torchvision \ torchaudio \ --extra-index-url https://download.pytorch.org/whl/cpu RUN uv pip install --no-cache-dir "git+https://github.com/huggingface/diffusers.git@main#egg=diffusers[test]" # Extra dependencies RUN uv pip install --no-cache-dir \ accelerate \ numpy==1.26.4 \ hf_xet RUN apt-get clean && rm -rf /var/lib/apt/lists/* && apt-get autoremove && apt-get autoclean CMD ["/bin/bash"] ================================================ FILE: docker/diffusers-pytorch-cuda/Dockerfile ================================================ FROM nvidia/cuda:12.9.1-runtime-ubuntu24.04 LABEL maintainer="Hugging Face" LABEL repository="diffusers" ARG PYTHON_VERSION=3.10 ENV DEBIAN_FRONTEND=noninteractive RUN apt-get -y update \ && apt-get install -y software-properties-common \ && add-apt-repository ppa:deadsnakes/ppa && \ apt-get update RUN apt install -y bash \ build-essential \ git \ git-lfs \ curl \ ca-certificates \ libglib2.0-0 \ libsndfile1-dev \ libgl1 \ python3 \ python3-pip \ && apt-get clean \ && rm -rf /var/lib/apt/lists/* RUN curl -LsSf https://astral.sh/uv/install.sh | sh ENV PATH="/root/.local/bin:$PATH" ENV VIRTUAL_ENV="/opt/venv" ENV UV_PYTHON_INSTALL_DIR=/opt/uv/python RUN uv venv --python ${PYTHON_VERSION} --seed ${VIRTUAL_ENV} ENV PATH="$VIRTUAL_ENV/bin:$PATH" # pre-install the heavy dependencies (these can later be overridden by the deps from setup.py) # Install torch, torchvision, and torchaudio together to ensure compatibility RUN uv pip install --no-cache-dir \ torch \ torchvision \ torchaudio \ --index-url https://download.pytorch.org/whl/cu129 # Install compatible versions of numba/llvmlite for Python 3.10+ RUN uv pip install --no-cache-dir \ "llvmlite>=0.40.0" \ "numba>=0.57.0" RUN uv pip install --no-cache-dir 
"git+https://github.com/huggingface/diffusers.git@main#egg=diffusers[test]" # Extra dependencies RUN uv pip install --no-cache-dir \ accelerate \ numpy==1.26.4 \ pytorch-lightning \ hf_xet CMD ["/bin/bash"] ================================================ FILE: docker/diffusers-pytorch-minimum-cuda/Dockerfile ================================================ FROM nvidia/cuda:12.1.0-runtime-ubuntu20.04 LABEL maintainer="Hugging Face" LABEL repository="diffusers" ARG PYTHON_VERSION=3.10 ENV DEBIAN_FRONTEND=noninteractive ENV MINIMUM_SUPPORTED_TORCH_VERSION="2.1.0" ENV MINIMUM_SUPPORTED_TORCHVISION_VERSION="0.16.0" ENV MINIMUM_SUPPORTED_TORCHAUDIO_VERSION="2.1.0" RUN apt-get -y update \ && apt-get install -y software-properties-common \ && add-apt-repository ppa:deadsnakes/ppa && \ apt-get update RUN apt install -y bash \ build-essential \ git \ git-lfs \ curl \ ca-certificates \ libglib2.0-0 \ libsndfile1-dev \ libgl1 \ python3 \ python3-pip \ && apt-get clean \ && rm -rf /var/lib/apt/lists/* RUN curl -LsSf https://astral.sh/uv/install.sh | sh ENV PATH="/root/.local/bin:$PATH" ENV VIRTUAL_ENV="/opt/venv" ENV UV_PYTHON_INSTALL_DIR=/opt/uv/python RUN uv venv --python ${PYTHON_VERSION} --seed ${VIRTUAL_ENV} ENV PATH="$VIRTUAL_ENV/bin:$PATH" # pre-install the heavy dependencies (these can later be overridden by the deps from setup.py) RUN uv pip install --no-cache-dir \ torch==$MINIMUM_SUPPORTED_TORCH_VERSION \ torchvision==$MINIMUM_SUPPORTED_TORCHVISION_VERSION \ torchaudio==$MINIMUM_SUPPORTED_TORCHAUDIO_VERSION RUN uv pip install --no-cache-dir "git+https://github.com/huggingface/diffusers.git@main#egg=diffusers[test]" # Extra dependencies RUN uv pip install --no-cache-dir \ accelerate \ numpy==1.26.4 \ pytorch-lightning \ hf_xet CMD ["/bin/bash"] ================================================ FILE: docker/diffusers-pytorch-xformers-cuda/Dockerfile ================================================ FROM nvidia/cuda:12.9.1-runtime-ubuntu24.04 LABEL maintainer="Hugging 
Face" LABEL repository="diffusers" ARG PYTHON_VERSION=3.10 ENV DEBIAN_FRONTEND=noninteractive RUN apt-get -y update \ && apt-get install -y software-properties-common \ && add-apt-repository ppa:deadsnakes/ppa && \ apt-get update RUN apt install -y bash \ build-essential \ git \ git-lfs \ curl \ ca-certificates \ libglib2.0-0 \ libsndfile1-dev \ libgl1 \ python3 \ python3-pip \ && apt-get clean \ && rm -rf /var/lib/apt/lists/* RUN curl -LsSf https://astral.sh/uv/install.sh | sh ENV PATH="/root/.local/bin:$PATH" ENV VIRTUAL_ENV="/opt/venv" ENV UV_PYTHON_INSTALL_DIR=/opt/uv/python RUN uv venv --python ${PYTHON_VERSION} --seed ${VIRTUAL_ENV} ENV PATH="$VIRTUAL_ENV/bin:$PATH" # pre-install the heavy dependencies (these can later be overridden by the deps from setup.py) # Install torch, torchvision, and torchaudio together to ensure compatibility RUN uv pip install --no-cache-dir \ torch \ torchvision \ torchaudio \ --index-url https://download.pytorch.org/whl/cu129 # Install compatible versions of numba/llvmlite for Python 3.10+ RUN uv pip install --no-cache-dir \ "llvmlite>=0.40.0" \ "numba>=0.57.0" RUN uv pip install --no-cache-dir "git+https://github.com/huggingface/diffusers.git@main#egg=diffusers[test]" # Extra dependencies RUN uv pip install --no-cache-dir \ accelerate \ numpy==1.26.4 \ pytorch-lightning \ hf_xet \ xformers CMD ["/bin/bash"] ================================================ FILE: docs/README.md ================================================ # Generating the documentation To generate the documentation, you first have to build it. 
Several packages are necessary to build the doc; you can install them with the following command, at the root of the code repository:

```bash
pip install -e ".[docs]"
```

Then you need to install our open source documentation builder tool:

```bash
pip install git+https://github.com/huggingface/doc-builder
```

---

**NOTE**

You only need to generate the documentation to inspect it locally (if you're planning changes and want to check how they look before committing for instance). You don't have to commit the built documentation.

---

## Previewing the documentation

To preview the docs, first install the `watchdog` module with:

```bash
pip install watchdog
```

Then run the following command:

```bash
doc-builder preview {package_name} {path_to_docs}
```

For example:

```bash
doc-builder preview diffusers docs/source/en
```

The docs will be viewable at [http://localhost:3000](http://localhost:3000). You can also preview the docs once you have opened a PR. A bot will add a comment with a link to where the documentation with your changes lives.

---

**NOTE**

The `preview` command only works with existing doc files. When you add a completely new file, you need to update `_toctree.yml` and restart the `preview` command (`ctrl-c` to stop it, then call `doc-builder preview ...` again).

---

## Adding a new element to the navigation bar

Accepted files are Markdown (.md).

Create a file with its extension and put it in the source directory. You can then link it to the toc-tree by putting the filename without the extension in the [`_toctree.yml`](https://github.com/huggingface/diffusers/blob/main/docs/source/en/_toctree.yml) file.

## Renaming section headers and moving sections

It helps to keep the old links working when renaming the section header and/or moving sections from one document to another.
This is because the old links are likely to be used in Issues, Forums, and Social media, and it makes for a much better user experience if users reading those months later can still easily navigate to the originally intended information.

Therefore, we simply keep a little map of moved sections at the end of the document where the original section was. The key is to preserve the original anchor.

So if you renamed a section from: "Section A" to "Section B", then you can add at the end of the file:

```md
Sections that were moved:

[ Section A ]
```

and of course, if you moved it to another file, then:

```md
Sections that were moved:

[ Section A ]
```

Use the relative style to link to the new file so that the versioned docs continue to work.

For an example of a rich moved section set please see the very end of [the transformers Trainer doc](https://github.com/huggingface/transformers/blob/main/docs/source/en/main_classes/trainer.md).

## Writing Documentation - Specification

The `huggingface/diffusers` documentation follows the [Google documentation](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html) style for docstrings, although we can write them directly in Markdown.

### Adding a new tutorial

Adding a new tutorial or section is done in two steps:

- Add a new Markdown (.md) file under `docs/source/`.
- Link that file in `docs/source//_toctree.yml` on the correct toc-tree.

Make sure to put your new file under the proper section. It's unlikely to go in the first section (*Get Started*), so depending on the intended targets (beginners, more advanced users, or researchers) it should go in sections two, three, or four.

### Adding a new pipeline/scheduler

When adding a new pipeline:

- Create a file `xxx.md` under `docs/source//api/pipelines` (don't hesitate to copy an existing file as template).
- Link that file in the (*Diffusers Summary*) section in `docs/source/api/pipelines/overview.md`, along with the link to the paper, and a colab notebook (if available).
- Write a short overview of the diffusion model:
    - Overview with paper & authors
    - Paper abstract
    - Tips and tricks and how to use it best
    - Possibly an end-to-end example of how to use it
- Add all the pipeline classes that should be linked in the diffusion model. These classes should be added using our Markdown syntax. By default as follows:

```
[[autodoc]] XXXPipeline
    - all
    - __call__
```

This will include every public method of the pipeline that is documented, as well as the `__call__` method that is not documented by default. If you just want to add additional methods that are not documented, you can put the list of all methods to add in a list that contains `all`.

```
[[autodoc]] XXXPipeline
    - all
    - __call__
    - enable_attention_slicing
    - disable_attention_slicing
    - enable_xformers_memory_efficient_attention
    - disable_xformers_memory_efficient_attention
```

You can follow the same process to create a new scheduler under the `docs/source//api/schedulers` folder.

### Writing source documentation

Values that should be put in `code` should be surrounded by backticks: \`like so\`. Note that argument names and objects like True, None, or any strings should usually be put in `code`.

When mentioning a class, function, or method, it is recommended to use our syntax for internal links so that our tool adds a link to its documentation with this syntax: \[\`XXXClass\`\] or \[\`function\`\]. This requires the class or function to be in the main package.

If you want to create a link to some internal class or function, you need to provide its path. For instance: \[\`pipelines.ImagePipelineOutput\`\]. This will be converted into a link with `pipelines.ImagePipelineOutput` in the description.
To get rid of the path and only keep the name of the object you are linking to in the description, add a ~: \[\`~pipelines.ImagePipelineOutput\`\] will generate a link with `ImagePipelineOutput` in the description.

The same works for methods so you can either use \[\`XXXClass.method\`\] or \[\`~XXXClass.method\`\].

#### Defining arguments in a method

Arguments should be defined with the `Args:` (or `Arguments:` or `Parameters:`) prefix, followed by a line return and an indentation. The argument should be followed by its type, with its shape if it is a tensor, a colon, and its description:

```
    Args:
        n_layers (`int`): The number of layers of the model.
```

If the description is too long to fit in one line, another indentation is necessary before writing the description after the argument.

Here's an example showcasing everything so far:

```
    Args:
        input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
            Indices of input sequence tokens in the vocabulary.

            Indices can be obtained using [`AlbertTokenizer`]. See [`~PreTrainedTokenizer.encode`] and
            [`~PreTrainedTokenizer.__call__`] for details.

            [What are input IDs?](../glossary#input-ids)
```

For optional arguments or arguments with defaults, we use the following syntax. Imagine we have a function with the following signature:

```py
def my_function(x: str=None, a: float=3.14):
```

then its documentation should look like this:

```
    Args:
        x (`str`, *optional*):
            This argument controls ...
        a (`float`, *optional*, defaults to `3.14`):
            This argument is used to ...
```

Note that we always omit the "defaults to \`None\`" when None is the default for any argument. Also note that even if the first line describing your argument type and its default gets long, you can't break it on several lines. You can however write as many lines as you want in the indented description (see the example above with `input_ids`).

#### Writing a multi-line code block

Multi-line code blocks can be useful for displaying examples.
They are done between two lines of three backticks as usual in Markdown:

````
```
# first line of code
# second line
# etc
```
````

#### Writing a return block

The return block should be introduced with the `Returns:` prefix, followed by a line return and an indentation. The first line should be the type of the return, followed by a line return. No need to indent further for the elements building the return.

Here's an example of a single value return:

```
    Returns:
        `List[int]`: A list of integers in the range [0, 1] --- 1 for a special token, 0 for a sequence token.
```

Here's an example of a tuple return, comprising several objects:

```
    Returns:
        `tuple(torch.Tensor)` comprising various elements depending on the configuration ([`BertConfig`]) and inputs:
        - **loss** (*optional*, returned when `masked_lm_labels` is provided) `torch.Tensor` of shape `(1,)` --
          Total loss is the sum of the masked language modeling loss and the next sequence prediction (classification) loss.
        - **prediction_scores** (`torch.Tensor` of shape `(batch_size, sequence_length, config.vocab_size)`) --
          Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
```

#### Adding an image

Due to the rapidly growing repository, it is important to make sure that no files that would significantly weigh down the repository are added. This includes images, videos, and other non-text files. We prefer to leverage a hf.co hosted `dataset` like the ones hosted on [`hf-internal-testing`](https://huggingface.co/hf-internal-testing) in which to place these files and reference them by URL. We recommend putting them in the following dataset: [huggingface/documentation-images](https://huggingface.co/datasets/huggingface/documentation-images). If you are an external contributor, feel free to add the images to your PR and ask a Hugging Face member to migrate your images to this dataset.
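
#### Putting the docstring conventions together

As a recap of the docstring conventions described in this guide (the `Args:` block, optional arguments with defaults, and the `Returns:` block), here is a minimal sketch of a fully documented function. The function `repeat_value` and its parameters are hypothetical and exist only for illustration; they are not part of the Diffusers API.

```python
from typing import List, Optional


def repeat_value(value: float = 7.5, count: Optional[int] = None) -> List[float]:
    r"""
    Hypothetical helper, used only to illustrate the docstring layout.

    Args:
        value (`float`, *optional*, defaults to `7.5`):
            The value to repeat.
        count (`int`, *optional*):
            How many times to repeat `value`. When omitted, `value` is returned once.

    Returns:
        `List[float]`:
            A list containing `value` repeated `count` times.
    """
    # `count` defaults to None, so its docstring entry carries no "defaults to" note.
    return [value] * (1 if count is None else count)
```

In this sketch, the entry for `count` follows the rule of omitting the "defaults to" note when `None` is the default, while `value` spells out its default explicitly.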
## Styling the docstring We have an automatic script running with the `make style` command that will make sure that: - the docstrings fully take advantage of the line width - all code examples are formatted using black, like the code of the Transformers library This script may have some weird failures if you made a syntax mistake or if you uncover a bug. Therefore, it's recommended to commit your changes before running `make style`, so you can revert the changes done by that script easily. ================================================ FILE: docs/TRANSLATING.md ================================================ ### Translating the Diffusers documentation into your language As part of our mission to democratize machine learning, we'd love to make the Diffusers library available in many more languages! Follow the steps below if you want to help translate the documentation into your language 🙏. **🗞️ Open an issue** To get started, navigate to the [Issues](https://github.com/huggingface/diffusers/issues) page of this repo and check if anyone else has opened an issue for your language. If not, open a new issue by selecting the "🌐 Translating a New Language?" from the "New issue" button. Once an issue exists, post a comment to indicate which chapters you'd like to work on, and we'll add your name to the list. **🍴 Fork the repository** First, you'll need to [fork the Diffusers repo](https://docs.github.com/en/get-started/quickstart/fork-a-repo). You can do this by clicking on the **Fork** button on the top-right corner of this repo's page. Once you've forked the repo, you'll want to get the files on your local machine for editing. 
You can do that by cloning the fork with Git as follows: ```bash git clone https://github.com//diffusers.git ``` **📋 Copy-paste the English version with a new language code** The documentation files are in one leading directory: - [`docs/source`](https://github.com/huggingface/diffusers/tree/main/docs/source): All the documentation materials are organized here by language. You'll only need to copy the files in the [`docs/source/en`](https://github.com/huggingface/diffusers/tree/main/docs/source/en) directory, so first navigate to your fork of the repo and run the following: ```bash cd ~/path/to/diffusers/docs cp -r source/en source/ ``` Here, `` should be one of the ISO 639-1 or ISO 639-2 language codes -- see [here](https://www.loc.gov/standards/iso639-2/php/code_list.php) for a handy table. **✍️ Start translating** The fun part comes - translating the text! The first thing we recommend is translating the part of the `_toctree.yml` file that corresponds to your doc chapter. This file is used to render the table of contents on the website. > 🙋 If the `_toctree.yml` file doesn't yet exist for your language, you can create one by copy-pasting from the English version and deleting the sections unrelated to your chapter. Just make sure it exists in the `docs/source//` directory! The fields you should add are `local` (with the name of the file containing the translation; e.g. `autoclass_tutorial`), and `title` (with the title of the doc in your language; e.g. `Load pretrained instances with an AutoClass`) -- as a reference, here is the `_toctree.yml` for [English](https://github.com/huggingface/diffusers/blob/main/docs/source/en/_toctree.yml): ```yaml - sections: - local: pipeline_tutorial # Do not change this! Use the same name for your .md file title: Pipelines for inference # Translate this! ... title: Tutorials # Translate this! 
``` Once you have translated the `_toctree.yml` file, you can start translating the [MDX](https://mdxjs.com/) files associated with your docs chapter. > 🙋 If you'd like others to help you with the translation, you should [open an issue](https://github.com/huggingface/diffusers/issues) and tag @patrickvonplaten. ================================================ FILE: docs/source/_config.py ================================================ # docstyle-ignore INSTALL_CONTENT = """ # Diffusers installation ! pip install diffusers transformers datasets accelerate # To install from source instead of the last release, comment the command above and uncomment the following one. # ! pip install git+https://github.com/huggingface/diffusers.git """ notebook_first_cells = [{"type": "code", "content": INSTALL_CONTENT}] ================================================ FILE: docs/source/en/_toctree.yml ================================================ - sections: - local: index title: Diffusers - local: installation title: Installation - local: quicktour title: Quickstart - local: stable_diffusion title: Basic performance title: Get started - isExpanded: false sections: - local: using-diffusers/loading title: DiffusionPipeline - local: tutorials/autopipeline title: AutoPipeline - local: using-diffusers/custom_pipeline_overview title: Community pipelines and components - local: using-diffusers/callback title: Pipeline callbacks - local: using-diffusers/reusing_seeds title: Reproducibility - local: using-diffusers/schedulers title: Schedulers - local: using-diffusers/guiders title: Guiders - local: using-diffusers/automodel title: AutoModel - local: using-diffusers/other-formats title: Model formats - local: using-diffusers/push_to_hub title: Sharing pipelines and models title: Pipelines - isExpanded: false sections: - local: tutorials/using_peft_for_inference title: LoRA - local: using-diffusers/ip_adapter title: IP-Adapter - local: using-diffusers/controlnet title: ControlNet - local: 
using-diffusers/t2i_adapter title: T2I-Adapter - local: using-diffusers/dreambooth title: DreamBooth - local: using-diffusers/textual_inversion_inference title: Textual inversion title: Adapters - isExpanded: false sections: - local: using-diffusers/weighted_prompts title: Prompting - local: using-diffusers/create_a_server title: Create a server - local: using-diffusers/batched_inference title: Batch inference - local: training/distributed_inference title: Distributed inference - local: hybrid_inference/overview title: Remote inference title: Inference - isExpanded: false sections: - local: optimization/fp16 title: Accelerate inference - local: optimization/cache title: Caching - local: optimization/attention_backends title: Attention backends - local: optimization/memory title: Reduce memory usage - local: optimization/speed-memory-optims title: Compiling and offloading quantized models - sections: - local: optimization/pruna title: Pruna - local: optimization/xformers title: xFormers - local: optimization/tome title: Token merging - local: optimization/deepcache title: DeepCache - local: optimization/cache_dit title: CacheDiT - local: optimization/tgate title: TGATE - local: optimization/xdit title: xDiT - local: optimization/para_attn title: ParaAttention - local: using-diffusers/image_quality title: FreeU title: Community optimizations title: Inference optimization - isExpanded: false sections: - local: modular_diffusers/overview title: Overview - local: modular_diffusers/quickstart title: Quickstart - local: modular_diffusers/modular_diffusers_states title: States - local: modular_diffusers/pipeline_block title: ModularPipelineBlocks - local: modular_diffusers/sequential_pipeline_blocks title: SequentialPipelineBlocks - local: modular_diffusers/loop_sequential_pipeline_blocks title: LoopSequentialPipelineBlocks - local: modular_diffusers/auto_pipeline_blocks title: AutoPipelineBlocks - local: modular_diffusers/modular_pipeline title: ModularPipeline - local: 
modular_diffusers/components_manager title: ComponentsManager - local: modular_diffusers/custom_blocks title: Building Custom Blocks - local: modular_diffusers/mellon title: Using Custom Blocks with Mellon title: Modular Diffusers - isExpanded: false sections: - local: training/overview title: Overview - local: training/create_dataset title: Create a dataset for training - local: training/adapt_a_model title: Adapt a model to a new task - local: tutorials/basic_training title: Train a diffusion model - sections: - local: training/unconditional_training title: Unconditional image generation - local: training/text2image title: Text-to-image - local: training/sdxl title: Stable Diffusion XL - local: training/kandinsky title: Kandinsky 2.2 - local: training/wuerstchen title: Wuerstchen - local: training/controlnet title: ControlNet - local: training/t2i_adapters title: T2I-Adapters - local: training/instructpix2pix title: InstructPix2Pix - local: training/cogvideox title: CogVideoX title: Models - sections: - local: training/text_inversion title: Textual Inversion - local: training/dreambooth title: DreamBooth - local: training/lora title: LoRA - local: training/custom_diffusion title: Custom Diffusion - local: training/lcm_distill title: Latent Consistency Distillation - local: training/ddpo title: Reinforcement learning training with DDPO title: Methods title: Training - isExpanded: false sections: - local: quantization/overview title: Getting started - local: quantization/bitsandbytes title: bitsandbytes - local: quantization/gguf title: gguf - local: quantization/torchao title: torchao - local: quantization/quanto title: quanto - local: quantization/modelopt title: NVIDIA ModelOpt title: Quantization - isExpanded: false sections: - local: optimization/onnx title: ONNX - local: optimization/open_vino title: OpenVINO - local: optimization/coreml title: Core ML - local: optimization/mps title: Metal Performance Shaders (MPS) - local: optimization/habana title: Intel 
Gaudi - local: optimization/neuron title: AWS Neuron title: Model accelerators and hardware - isExpanded: false sections: - local: using-diffusers/helios title: Helios - local: using-diffusers/consisid title: ConsisID - local: using-diffusers/sdxl title: Stable Diffusion XL - local: using-diffusers/sdxl_turbo title: SDXL Turbo - local: using-diffusers/kandinsky title: Kandinsky - local: using-diffusers/omnigen title: OmniGen - local: using-diffusers/pag title: PAG - local: using-diffusers/inference_with_lcm title: Latent Consistency Model - local: using-diffusers/shap-e title: Shap-E - local: using-diffusers/diffedit title: DiffEdit - local: using-diffusers/inference_with_tcd_lora title: Trajectory Consistency Distillation-LoRA - local: using-diffusers/svd title: Stable Video Diffusion - local: using-diffusers/marigold_usage title: Marigold Computer Vision title: Specific pipeline examples - isExpanded: false sections: - sections: - local: using-diffusers/unconditional_image_generation title: Unconditional image generation - local: using-diffusers/conditional_image_generation title: Text-to-image - local: using-diffusers/img2img title: Image-to-image - local: using-diffusers/inpaint title: Inpainting - local: advanced_inference/outpaint title: Outpainting - local: using-diffusers/text-img2vid title: Video generation - local: using-diffusers/depth2img title: Depth-to-image title: Task recipes - local: using-diffusers/write_own_pipeline title: Understanding pipelines, models and schedulers - local: community_projects title: Projects built with Diffusers - local: conceptual/philosophy title: Philosophy - local: using-diffusers/controlling_generation title: Controlled generation - local: conceptual/contribution title: How to contribute? 
- local: conceptual/ethical_guidelines title: Diffusers' Ethical Guidelines - local: conceptual/evaluation title: Evaluating Diffusion Models title: Resources - isExpanded: false sections: - sections: - local: api/configuration title: Configuration - local: api/logging title: Logging - local: api/outputs title: Outputs - local: api/quantization title: Quantization - local: hybrid_inference/api_reference title: Remote inference - local: api/parallel title: Parallel inference title: Main Classes - sections: - local: api/modular_diffusers/pipeline title: Pipeline - local: api/modular_diffusers/pipeline_blocks title: Blocks - local: api/modular_diffusers/pipeline_states title: States - local: api/modular_diffusers/pipeline_components title: Components and configs - local: api/modular_diffusers/guiders title: Guiders title: Modular - sections: - local: api/loaders/ip_adapter title: IP-Adapter - local: api/loaders/lora title: LoRA - local: api/loaders/single_file title: Single files - local: api/loaders/textual_inversion title: Textual Inversion - local: api/loaders/unet title: UNet - local: api/loaders/transformer_sd3 title: SD3Transformer2D - local: api/loaders/peft title: PEFT title: Loaders - sections: - local: api/models/overview title: Overview - local: api/models/auto_model title: AutoModel - sections: - local: api/models/controlnet title: ControlNetModel - local: api/models/controlnet_union title: ControlNetUnionModel - local: api/models/controlnet_flux title: FluxControlNetModel - local: api/models/controlnet_hunyuandit title: HunyuanDiT2DControlNetModel - local: api/models/controlnet_sana title: SanaControlNetModel - local: api/models/controlnet_sd3 title: SD3ControlNetModel - local: api/models/controlnet_sparsectrl title: SparseControlNetModel title: ControlNets - sections: - local: api/models/allegro_transformer3d title: AllegroTransformer3DModel - local: api/models/aura_flow_transformer2d title: AuraFlowTransformer2DModel - local: 
api/models/transformer_bria_fibo title: BriaFiboTransformer2DModel - local: api/models/bria_transformer title: BriaTransformer2DModel - local: api/models/chroma_transformer title: ChromaTransformer2DModel - local: api/models/chronoedit_transformer_3d title: ChronoEditTransformer3DModel - local: api/models/cogvideox_transformer3d title: CogVideoXTransformer3DModel - local: api/models/cogview3plus_transformer2d title: CogView3PlusTransformer2DModel - local: api/models/cogview4_transformer2d title: CogView4Transformer2DModel - local: api/models/consisid_transformer3d title: ConsisIDTransformer3DModel - local: api/models/cosmos_transformer3d title: CosmosTransformer3DModel - local: api/models/dit_transformer2d title: DiTTransformer2DModel - local: api/models/easyanimate_transformer3d title: EasyAnimateTransformer3DModel - local: api/models/flux2_transformer title: Flux2Transformer2DModel - local: api/models/flux_transformer title: FluxTransformer2DModel - local: api/models/glm_image_transformer2d title: GlmImageTransformer2DModel - local: api/models/helios_transformer3d title: HeliosTransformer3DModel - local: api/models/hidream_image_transformer title: HiDreamImageTransformer2DModel - local: api/models/hunyuan_transformer2d title: HunyuanDiT2DModel - local: api/models/hunyuanimage_transformer_2d title: HunyuanImageTransformer2DModel - local: api/models/hunyuan_video15_transformer_3d title: HunyuanVideo15Transformer3DModel - local: api/models/hunyuan_video_transformer_3d title: HunyuanVideoTransformer3DModel - local: api/models/latte_transformer3d title: LatteTransformer3DModel - local: api/models/longcat_image_transformer2d title: LongCatImageTransformer2DModel - local: api/models/ltx2_video_transformer3d title: LTX2VideoTransformer3DModel - local: api/models/ltx_video_transformer3d title: LTXVideoTransformer3DModel - local: api/models/lumina2_transformer2d title: Lumina2Transformer2DModel - local: api/models/lumina_nextdit2d title: LuminaNextDiT2DModel - local: 
api/models/mochi_transformer3d title: MochiTransformer3DModel - local: api/models/omnigen_transformer title: OmniGenTransformer2DModel - local: api/models/ovisimage_transformer2d title: OvisImageTransformer2DModel - local: api/models/pixart_transformer2d title: PixArtTransformer2DModel - local: api/models/prior_transformer title: PriorTransformer - local: api/models/qwenimage_transformer2d title: QwenImageTransformer2DModel - local: api/models/sana_transformer2d title: SanaTransformer2DModel - local: api/models/sana_video_transformer3d title: SanaVideoTransformer3DModel - local: api/models/sd3_transformer2d title: SD3Transformer2DModel - local: api/models/skyreels_v2_transformer_3d title: SkyReelsV2Transformer3DModel - local: api/models/stable_audio_transformer title: StableAudioDiTModel - local: api/models/transformer2d title: Transformer2DModel - local: api/models/transformer_temporal title: TransformerTemporalModel - local: api/models/wan_animate_transformer_3d title: WanAnimateTransformer3DModel - local: api/models/wan_transformer_3d title: WanTransformer3DModel - local: api/models/z_image_transformer2d title: ZImageTransformer2DModel title: Transformers - sections: - local: api/models/stable_cascade_unet title: StableCascadeUNet - local: api/models/unet title: UNet1DModel - local: api/models/unet2d-cond title: UNet2DConditionModel - local: api/models/unet2d title: UNet2DModel - local: api/models/unet3d-cond title: UNet3DConditionModel - local: api/models/unet-motion title: UNetMotionModel - local: api/models/uvit2d title: UViT2DModel title: UNets - sections: - local: api/models/asymmetricautoencoderkl title: AsymmetricAutoencoderKL - local: api/models/autoencoder_dc title: AutoencoderDC - local: api/models/autoencoderkl title: AutoencoderKL - local: api/models/autoencoderkl_allegro title: AutoencoderKLAllegro - local: api/models/autoencoderkl_cogvideox title: AutoencoderKLCogVideoX - local: api/models/autoencoderkl_cosmos title: AutoencoderKLCosmos - local: 
api/models/autoencoder_kl_hunyuanimage title: AutoencoderKLHunyuanImage - local: api/models/autoencoder_kl_hunyuanimage_refiner title: AutoencoderKLHunyuanImageRefiner - local: api/models/autoencoder_kl_hunyuan_video title: AutoencoderKLHunyuanVideo - local: api/models/autoencoder_kl_hunyuan_video15 title: AutoencoderKLHunyuanVideo15 - local: api/models/autoencoderkl_audio_ltx_2 title: AutoencoderKLLTX2Audio - local: api/models/autoencoderkl_ltx_2 title: AutoencoderKLLTX2Video - local: api/models/autoencoderkl_ltx_video title: AutoencoderKLLTXVideo - local: api/models/autoencoderkl_magvit title: AutoencoderKLMagvit - local: api/models/autoencoderkl_mochi title: AutoencoderKLMochi - local: api/models/autoencoderkl_qwenimage title: AutoencoderKLQwenImage - local: api/models/autoencoder_kl_wan title: AutoencoderKLWan - local: api/models/autoencoder_rae title: AutoencoderRAE - local: api/models/consistency_decoder_vae title: ConsistencyDecoderVAE - local: api/models/autoencoder_oobleck title: Oobleck AutoEncoder - local: api/models/autoencoder_tiny title: Tiny AutoEncoder - local: api/models/vq title: VQModel title: VAEs title: Models - sections: - local: api/pipelines/overview title: Overview - local: api/pipelines/auto_pipeline title: AutoPipeline - sections: - local: api/pipelines/audioldm title: AudioLDM - local: api/pipelines/audioldm2 title: AudioLDM 2 - local: api/pipelines/dance_diffusion title: Dance Diffusion - local: api/pipelines/musicldm title: MusicLDM - local: api/pipelines/stable_audio title: Stable Audio title: Audio - sections: - local: api/pipelines/amused title: aMUSEd - local: api/pipelines/animatediff title: AnimateDiff - local: api/pipelines/attend_and_excite title: Attend-and-Excite - local: api/pipelines/aura_flow title: AuraFlow - local: api/pipelines/blip_diffusion title: BLIP-Diffusion - local: api/pipelines/bria_3_2 title: Bria 3.2 - local: api/pipelines/bria_fibo title: Bria Fibo - local: api/pipelines/bria_fibo_edit title: Bria Fibo Edit 
- local: api/pipelines/chroma title: Chroma - local: api/pipelines/cogview3 title: CogView3 - local: api/pipelines/cogview4 title: CogView4 - local: api/pipelines/consistency_models title: Consistency Models - local: api/pipelines/controlnet title: ControlNet - local: api/pipelines/controlnet_flux title: ControlNet with Flux.1 - local: api/pipelines/controlnet_hunyuandit title: ControlNet with Hunyuan-DiT - local: api/pipelines/controlnet_sd3 title: ControlNet with Stable Diffusion 3 - local: api/pipelines/controlnet_sdxl title: ControlNet with Stable Diffusion XL - local: api/pipelines/controlnet_sana title: ControlNet-Sana - local: api/pipelines/controlnetxs title: ControlNet-XS - local: api/pipelines/controlnetxs_sdxl title: ControlNet-XS with Stable Diffusion XL - local: api/pipelines/controlnet_union title: ControlNetUnion - local: api/pipelines/ddim title: DDIM - local: api/pipelines/ddpm title: DDPM - local: api/pipelines/deepfloyd_if title: DeepFloyd IF - local: api/pipelines/diffedit title: DiffEdit - local: api/pipelines/dit title: DiT - local: api/pipelines/easyanimate title: EasyAnimate - local: api/pipelines/flux title: Flux - local: api/pipelines/flux2 title: Flux2 - local: api/pipelines/control_flux_inpaint title: FluxControlInpaint - local: api/pipelines/glm_image title: GLM-Image - local: api/pipelines/hidream title: HiDream-I1 - local: api/pipelines/hunyuandit title: Hunyuan-DiT - local: api/pipelines/hunyuanimage21 title: HunyuanImage2.1 - local: api/pipelines/pix2pix title: InstructPix2Pix - local: api/pipelines/kandinsky title: Kandinsky 2.1 - local: api/pipelines/kandinsky_v22 title: Kandinsky 2.2 - local: api/pipelines/kandinsky3 title: Kandinsky 3 - local: api/pipelines/kandinsky5_image title: Kandinsky 5.0 Image - local: api/pipelines/kolors title: Kolors - local: api/pipelines/latent_consistency_models title: Latent Consistency Models - local: api/pipelines/latent_diffusion title: Latent Diffusion - local: api/pipelines/ledits_pp title: 
LEDITS++ - local: api/pipelines/longcat_image title: LongCat-Image - local: api/pipelines/lumina2 title: Lumina 2.0 - local: api/pipelines/lumina title: Lumina-T2X - local: api/pipelines/marigold title: Marigold - local: api/pipelines/panorama title: MultiDiffusion - local: api/pipelines/omnigen title: OmniGen - local: api/pipelines/ovis_image title: Ovis-Image - local: api/pipelines/pag title: PAG - local: api/pipelines/paint_by_example title: Paint by Example - local: api/pipelines/pixart title: PixArt-α - local: api/pipelines/pixart_sigma title: PixArt-Σ - local: api/pipelines/prx title: PRX - local: api/pipelines/qwenimage title: QwenImage - local: api/pipelines/sana title: Sana - local: api/pipelines/sana_sprint title: Sana Sprint - local: api/pipelines/sana_video title: Sana Video - local: api/pipelines/self_attention_guidance title: Self-Attention Guidance - local: api/pipelines/semantic_stable_diffusion title: Semantic Guidance - local: api/pipelines/shap_e title: Shap-E - local: api/pipelines/stable_cascade title: Stable Cascade - sections: - local: api/pipelines/stable_diffusion/overview title: Overview - local: api/pipelines/stable_diffusion/depth2img title: Depth-to-image - local: api/pipelines/stable_diffusion/gligen title: GLIGEN (Grounded Language-to-Image Generation) - local: api/pipelines/stable_diffusion/image_variation title: Image variation - local: api/pipelines/stable_diffusion/img2img title: Image-to-image - local: api/pipelines/stable_diffusion/inpaint title: Inpainting - local: api/pipelines/stable_diffusion/latent_upscale title: Latent upscaler - local: api/pipelines/stable_diffusion/ldm3d_diffusion title: LDM3D Text-to-(RGB, Depth), Text-to-(RGB-pano, Depth-pano), LDM3D Upscaler - local: api/pipelines/stable_diffusion/stable_diffusion_safe title: Safe Stable Diffusion - local: api/pipelines/stable_diffusion/sdxl_turbo title: SDXL Turbo - local: api/pipelines/stable_diffusion/stable_diffusion_2 title: Stable Diffusion 2 - local: 
api/pipelines/stable_diffusion/stable_diffusion_3 title: Stable Diffusion 3 - local: api/pipelines/stable_diffusion/stable_diffusion_xl title: Stable Diffusion XL - local: api/pipelines/stable_diffusion/upscale title: Super-resolution - local: api/pipelines/stable_diffusion/adapter title: T2I-Adapter - local: api/pipelines/stable_diffusion/text2img title: Text-to-image title: Stable Diffusion - local: api/pipelines/stable_unclip title: Stable unCLIP - local: api/pipelines/unclip title: unCLIP - local: api/pipelines/unidiffuser title: UniDiffuser - local: api/pipelines/value_guided_sampling title: Value-guided sampling - local: api/pipelines/visualcloze title: VisualCloze - local: api/pipelines/wuerstchen title: Wuerstchen - local: api/pipelines/z_image title: Z-Image title: Image - sections: - local: api/pipelines/allegro title: Allegro - local: api/pipelines/chronoedit title: ChronoEdit - local: api/pipelines/cogvideox title: CogVideoX - local: api/pipelines/consisid title: ConsisID - local: api/pipelines/cosmos title: Cosmos - local: api/pipelines/framepack title: Framepack - local: api/pipelines/helios title: Helios - local: api/pipelines/hunyuan_video title: HunyuanVideo - local: api/pipelines/hunyuan_video15 title: HunyuanVideo1.5 - local: api/pipelines/i2vgenxl title: I2VGen-XL - local: api/pipelines/kandinsky5_video title: Kandinsky 5.0 Video - local: api/pipelines/latte title: Latte - local: api/pipelines/ltx2 title: LTX-2 - local: api/pipelines/ltx_video title: LTXVideo - local: api/pipelines/mochi title: Mochi - local: api/pipelines/pia title: Personalized Image Animator (PIA) - local: api/pipelines/skyreels_v2 title: SkyReels-V2 - local: api/pipelines/stable_diffusion/svd title: Stable Video Diffusion - local: api/pipelines/text_to_video title: Text-to-video - local: api/pipelines/text_to_video_zero title: Text2Video-Zero - local: api/pipelines/wan title: Wan title: Video title: Pipelines - sections: - local: api/schedulers/overview title: Overview - 
local: api/schedulers/cm_stochastic_iterative title: CMStochasticIterativeScheduler - local: api/schedulers/ddim_cogvideox title: CogVideoXDDIMScheduler - local: api/schedulers/multistep_dpm_solver_cogvideox title: CogVideoXDPMScheduler - local: api/schedulers/consistency_decoder title: ConsistencyDecoderScheduler - local: api/schedulers/cosine_dpm title: CosineDPMSolverMultistepScheduler - local: api/schedulers/ddim_inverse title: DDIMInverseScheduler - local: api/schedulers/ddim title: DDIMScheduler - local: api/schedulers/ddpm title: DDPMScheduler - local: api/schedulers/deis title: DEISMultistepScheduler - local: api/schedulers/multistep_dpm_solver_inverse title: DPMSolverMultistepInverse - local: api/schedulers/multistep_dpm_solver title: DPMSolverMultistepScheduler - local: api/schedulers/dpm_sde title: DPMSolverSDEScheduler - local: api/schedulers/singlestep_dpm_solver title: DPMSolverSinglestepScheduler - local: api/schedulers/edm_multistep_dpm_solver title: EDMDPMSolverMultistepScheduler - local: api/schedulers/edm_euler title: EDMEulerScheduler - local: api/schedulers/euler_ancestral title: EulerAncestralDiscreteScheduler - local: api/schedulers/euler title: EulerDiscreteScheduler - local: api/schedulers/flow_match_euler_discrete title: FlowMatchEulerDiscreteScheduler - local: api/schedulers/flow_match_heun_discrete title: FlowMatchHeunDiscreteScheduler - local: api/schedulers/helios_dmd title: HeliosDMDScheduler - local: api/schedulers/helios title: HeliosScheduler - local: api/schedulers/heun title: HeunDiscreteScheduler - local: api/schedulers/ipndm title: IPNDMScheduler - local: api/schedulers/stochastic_karras_ve title: KarrasVeScheduler - local: api/schedulers/dpm_discrete_ancestral title: KDPM2AncestralDiscreteScheduler - local: api/schedulers/dpm_discrete title: KDPM2DiscreteScheduler - local: api/schedulers/lcm title: LCMScheduler - local: api/schedulers/lms_discrete title: LMSDiscreteScheduler - local: api/schedulers/pndm title: PNDMScheduler - 
local: api/schedulers/repaint title: RePaintScheduler - local: api/schedulers/score_sde_ve title: ScoreSdeVeScheduler - local: api/schedulers/score_sde_vp title: ScoreSdeVpScheduler - local: api/schedulers/tcd title: TCDScheduler - local: api/schedulers/unipc title: UniPCMultistepScheduler - local: api/schedulers/vq_diffusion title: VQDiffusionScheduler title: Schedulers - sections: - local: api/internal_classes_overview title: Overview - local: api/attnprocessor title: Attention Processor - local: api/activations title: Custom activation functions - local: api/cache title: Caching methods - local: api/normalization title: Custom normalization layers - local: api/utilities title: Utilities - local: api/image_processor title: VAE Image Processor - local: api/video_processor title: Video Processor title: Internal classes title: API ================================================ FILE: docs/source/en/advanced_inference/outpaint.md ================================================ # Outpainting Outpainting extends an image beyond its original boundaries, allowing you to add, replace, or modify visual elements in an image while preserving the original image. Like [inpainting](../using-diffusers/inpaint), you want to fill the white area (in this case, the area outside of the original image) with new visual elements while keeping the original image (represented by a mask of black pixels). There are a couple of ways to outpaint, such as with a [ControlNet](https://hf.co/blog/OzzyGT/outpainting-controlnet) or with [Differential Diffusion](https://hf.co/blog/OzzyGT/outpainting-differential-diffusion). This guide will show you how to outpaint with an inpainting model, ControlNet, and a ZoeDepth estimator. Before you begin, make sure you have the [controlnet_aux](https://github.com/huggingface/controlnet_aux) library installed so you can use the ZoeDepth estimator. 
```py
!pip install -q controlnet_aux
```

## Image preparation

Start by picking an image to outpaint and removing its background with a Space like [BRIA-RMBG-1.4](https://hf.co/spaces/briaai/BRIA-RMBG-1.4).

For example, remove the background from this image of a pair of shoes.
original image
background removed
[Stable Diffusion XL (SDXL)](../using-diffusers/sdxl) models work best with 1024x1024 images, but you can resize the image to any size as long as your hardware has enough memory to support it. The transparent background in the image should also be replaced with a white background. Create a function (like the one below) that scales and pastes the image onto a white background.

```py
import random

import requests
import torch
from controlnet_aux import ZoeDetector
from PIL import Image, ImageOps

from diffusers import (
    AutoencoderKL,
    ControlNetModel,
    StableDiffusionXLControlNetPipeline,
    StableDiffusionXLInpaintPipeline,
)


def scale_and_paste(original_image):
    aspect_ratio = original_image.width / original_image.height

    if original_image.width > original_image.height:
        new_width = 1024
        new_height = round(new_width / aspect_ratio)
    else:
        new_height = 1024
        new_width = round(new_height * aspect_ratio)

    resized_original = original_image.resize((new_width, new_height), Image.LANCZOS)
    white_background = Image.new("RGBA", (1024, 1024), "white")
    x = (1024 - new_width) // 2
    y = (1024 - new_height) // 2
    white_background.paste(resized_original, (x, y), resized_original)

    return resized_original, white_background


original_image = Image.open(
    requests.get(
        "https://huggingface.co/datasets/stevhliu/testing-images/resolve/main/no-background-jordan.png",
        stream=True,
    ).raw
).convert("RGBA")
resized_img, white_bg_image = scale_and_paste(original_image)
```

To avoid adding unwanted extra details, use the ZoeDepth estimator to provide additional guidance during generation and to ensure the shoes remain consistent with the original image.

```py
zoe = ZoeDetector.from_pretrained("lllyasviel/Annotators")
image_zoe = zoe(white_bg_image, detect_resolution=512, image_resolution=1024)
image_zoe
```
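The sizing arithmetic inside `scale_and_paste` can be checked on its own, without loading an image. The sketch below is a minimal standalone restatement of that arithmetic. `fit_and_center` is a hypothetical helper name introduced only for illustration, and the 1024-pixel canvas is taken from the function above.

```python
def fit_and_center(width, height, canvas=1024):
    """Return (new_width, new_height, x, y) for an image scaled to fit a square canvas.

    Hypothetical helper for illustration; it mirrors the arithmetic in scale_and_paste.
    """
    aspect_ratio = width / height
    if width > height:
        # Landscape: the width fills the canvas, the height shrinks proportionally.
        new_width = canvas
        new_height = round(new_width / aspect_ratio)
    else:
        # Portrait or square: the height fills the canvas.
        new_height = canvas
        new_width = round(new_height * aspect_ratio)
    # Offsets that center the resized image on the canvas.
    x = (canvas - new_width) // 2
    y = (canvas - new_height) // 2
    return new_width, new_height, x, y


print(fit_and_center(2000, 1000))  # (1024, 512, 0, 256)
print(fit_and_center(600, 800))    # (768, 1024, 128, 0)
```

The same `(x, y)` offsets are recomputed later in the guide when the original image is pasted back over the generated background.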
## Outpaint

Once your image is ready, you can generate content in the white area around the shoes with [controlnet-inpaint-dreamer-sdxl](https://hf.co/destitech/controlnet-inpaint-dreamer-sdxl), an SDXL ControlNet trained for inpainting.

Load the inpainting ControlNet, the ZoeDepth ControlNet, and the VAE, and pass them to the [`StableDiffusionXLControlNetPipeline`]. Then you can create an optional `generate_image` function (for convenience) to outpaint an initial image.

```py
controlnets = [
    ControlNetModel.from_pretrained(
        "destitech/controlnet-inpaint-dreamer-sdxl", torch_dtype=torch.float16, variant="fp16"
    ),
    ControlNetModel.from_pretrained(
        "diffusers/controlnet-zoe-depth-sdxl-1.0", torch_dtype=torch.float16
    ),
]
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16).to("cuda")
pipeline = StableDiffusionXLControlNetPipeline.from_pretrained(
    "SG161222/RealVisXL_V4.0", torch_dtype=torch.float16, variant="fp16", controlnet=controlnets, vae=vae
).to("cuda")


def generate_image(prompt, negative_prompt, inpaint_image, zoe_image, seed: int = None):
    if seed is None:
        seed = random.randint(0, 2**32 - 1)

    generator = torch.Generator(device="cpu").manual_seed(seed)

    image = pipeline(
        prompt,
        negative_prompt=negative_prompt,
        image=[inpaint_image, zoe_image],
        guidance_scale=6.5,
        num_inference_steps=25,
        generator=generator,
        controlnet_conditioning_scale=[0.5, 0.8],
        control_guidance_end=[0.9, 0.6],
    ).images[0]

    return image


prompt = "nike air jordans on a basketball court"
negative_prompt = ""

temp_image = generate_image(prompt, negative_prompt, white_bg_image, image_zoe, 908097)
```

Paste the original image over the initial outpainted image. You'll improve the outpainted background in a later step.

```py
x = (1024 - resized_img.width) // 2
y = (1024 - resized_img.height) // 2
temp_image.paste(resized_img, (x, y), resized_img)
temp_image
```
> [!TIP]
> Now is a good time to free up some memory if you're running low!
>
> ```py
> pipeline = None
> torch.cuda.empty_cache()
> ```

Now that you have an initial outpainted image, load the [`StableDiffusionXLInpaintPipeline`] with the [RealVisXL](https://hf.co/SG161222/RealVisXL_V4.0) model to generate the final outpainted image with better quality.

```py
pipeline = StableDiffusionXLInpaintPipeline.from_pretrained(
    "OzzyGT/RealVisXL_V4.0_inpainting",
    torch_dtype=torch.float16,
    variant="fp16",
    vae=vae,
).to("cuda")
```

Prepare a mask for the final outpainted image. To create a more natural transition between the original image and the outpainted background, blur the mask to help it blend better.

```py
mask = Image.new("L", temp_image.size)
mask.paste(resized_img.split()[3], (x, y))
mask = ImageOps.invert(mask)
final_mask = mask.point(lambda p: p > 128 and 255)
mask_blurred = pipeline.mask_processor.blur(final_mask, blur_factor=20)
mask_blurred
```
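The `lambda p: p > 128 and 255` passed to `mask.point` above relies on Python's short-circuit `and`: when the comparison fails the expression evaluates to `False` (treated as 0), and otherwise it evaluates to the right-hand operand, 255. The sketch below applies the same expression to a few sample pixel values in plain Python, without PIL. The `binarize` name and the sample values are only for illustration.

```python
def binarize(p, threshold=128):
    # Same expression as the lambda passed to mask.point above:
    # `and` returns False when the comparison fails, otherwise its right operand.
    return p > threshold and 255


samples = [0, 127, 128, 129, 255]
print([int(binarize(p)) for p in samples])  # [0, 0, 0, 255, 255]
```

Values at or below the threshold become black and everything else becomes white, so the mask is strictly binary before it is blurred.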
Create a better prompt and pass it to the `generate_outpaint` function to generate the final outpainted image. Again, paste the original image over the final outpainted background.

```py
def generate_outpaint(prompt, negative_prompt, image, mask, seed: int = None):
    if seed is None:
        seed = random.randint(0, 2**32 - 1)

    generator = torch.Generator(device="cpu").manual_seed(seed)

    image = pipeline(
        prompt,
        negative_prompt=negative_prompt,
        image=image,
        mask_image=mask,
        guidance_scale=10.0,
        strength=0.8,
        num_inference_steps=30,
        generator=generator,
    ).images[0]

    return image


prompt = "high quality photo of nike air jordans on a basketball court, highly detailed"
negative_prompt = ""

final_image = generate_outpaint(prompt, negative_prompt, temp_image, mask_blurred, 7688778)
x = (1024 - resized_img.width) // 2
y = (1024 - resized_img.height) // 2
final_image.paste(resized_img, (x, y), resized_img)
final_image
```
================================================ FILE: docs/source/en/api/activations.md ================================================ # Activation functions Customized activation functions for supporting various models in 🤗 Diffusers. ## GELU [[autodoc]] models.activations.GELU ## GEGLU [[autodoc]] models.activations.GEGLU ## ApproximateGELU [[autodoc]] models.activations.ApproximateGELU ## SwiGLU [[autodoc]] models.activations.SwiGLU ## FP32SiLU [[autodoc]] models.activations.FP32SiLU ## LinearActivation [[autodoc]] models.activations.LinearActivation ================================================ FILE: docs/source/en/api/attnprocessor.md ================================================ # Attention Processor An attention processor is a class for applying different types of attention mechanisms. ## AttnProcessor [[autodoc]] models.attention_processor.AttnProcessor [[autodoc]] models.attention_processor.AttnProcessor2_0 [[autodoc]] models.attention_processor.AttnAddedKVProcessor [[autodoc]] models.attention_processor.AttnAddedKVProcessor2_0 [[autodoc]] models.attention_processor.AttnProcessorNPU [[autodoc]] models.attention_processor.FusedAttnProcessor2_0 ## Allegro [[autodoc]] models.attention_processor.AllegroAttnProcessor2_0 ## AuraFlow [[autodoc]] models.attention_processor.AuraFlowAttnProcessor2_0 [[autodoc]] models.attention_processor.FusedAuraFlowAttnProcessor2_0 ## CogVideoX [[autodoc]] models.attention_processor.CogVideoXAttnProcessor2_0 [[autodoc]] models.attention_processor.FusedCogVideoXAttnProcessor2_0 ## CrossFrameAttnProcessor [[autodoc]] pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.CrossFrameAttnProcessor ## Custom Diffusion [[autodoc]] models.attention_processor.CustomDiffusionAttnProcessor [[autodoc]] models.attention_processor.CustomDiffusionAttnProcessor2_0 [[autodoc]] models.attention_processor.CustomDiffusionXFormersAttnProcessor ## Flux [[autodoc]] models.attention_processor.FluxAttnProcessor2_0 [[autodoc]] 
models.attention_processor.FusedFluxAttnProcessor2_0 [[autodoc]] models.attention_processor.FluxSingleAttnProcessor2_0 ## Hunyuan [[autodoc]] models.attention_processor.HunyuanAttnProcessor2_0 [[autodoc]] models.attention_processor.FusedHunyuanAttnProcessor2_0 [[autodoc]] models.attention_processor.PAGHunyuanAttnProcessor2_0 [[autodoc]] models.attention_processor.PAGCFGHunyuanAttnProcessor2_0 ## IdentitySelfAttnProcessor2_0 [[autodoc]] models.attention_processor.PAGIdentitySelfAttnProcessor2_0 [[autodoc]] models.attention_processor.PAGCFGIdentitySelfAttnProcessor2_0 ## IP-Adapter [[autodoc]] models.attention_processor.IPAdapterAttnProcessor [[autodoc]] models.attention_processor.IPAdapterAttnProcessor2_0 [[autodoc]] models.attention_processor.SD3IPAdapterJointAttnProcessor2_0 ## JointAttnProcessor2_0 [[autodoc]] models.attention_processor.JointAttnProcessor2_0 [[autodoc]] models.attention_processor.PAGJointAttnProcessor2_0 [[autodoc]] models.attention_processor.PAGCFGJointAttnProcessor2_0 [[autodoc]] models.attention_processor.FusedJointAttnProcessor2_0 ## LoRA [[autodoc]] models.attention_processor.LoRAAttnProcessor [[autodoc]] models.attention_processor.LoRAAttnProcessor2_0 [[autodoc]] models.attention_processor.LoRAAttnAddedKVProcessor [[autodoc]] models.attention_processor.LoRAXFormersAttnProcessor ## Lumina-T2X [[autodoc]] models.attention_processor.LuminaAttnProcessor2_0 ## Mochi [[autodoc]] models.attention_processor.MochiAttnProcessor2_0 [[autodoc]] models.attention_processor.MochiVaeAttnProcessor2_0 ## Sana [[autodoc]] models.attention_processor.SanaLinearAttnProcessor2_0 [[autodoc]] models.attention_processor.SanaMultiscaleAttnProcessor2_0 [[autodoc]] models.attention_processor.PAGCFGSanaLinearAttnProcessor2_0 [[autodoc]] models.attention_processor.PAGIdentitySanaLinearAttnProcessor2_0 ## Stable Audio [[autodoc]] models.attention_processor.StableAudioAttnProcessor2_0 ## SlicedAttnProcessor [[autodoc]] models.attention_processor.SlicedAttnProcessor 

[[autodoc]] models.attention_processor.SlicedAttnAddedKVProcessor

## XFormersAttnProcessor

[[autodoc]] models.attention_processor.XFormersAttnProcessor

[[autodoc]] models.attention_processor.XFormersAttnAddedKVProcessor

## XLAFlashAttnProcessor2_0

[[autodoc]] models.attention_processor.XLAFlashAttnProcessor2_0

## XFormersJointAttnProcessor

[[autodoc]] models.attention_processor.XFormersJointAttnProcessor

## IPAdapterXFormersAttnProcessor

[[autodoc]] models.attention_processor.IPAdapterXFormersAttnProcessor

## FluxIPAdapterJointAttnProcessor2_0

[[autodoc]] models.attention_processor.FluxIPAdapterJointAttnProcessor2_0

## XLAFluxFlashAttnProcessor2_0

[[autodoc]] models.attention_processor.XLAFluxFlashAttnProcessor2_0

================================================
FILE: docs/source/en/api/cache.md
================================================
# Caching methods

Cache methods speed up diffusion transformers by storing and reusing intermediate outputs of specific layers, such as attention and feedforward layers, instead of recalculating them at each inference step.

## CacheMixin

[[autodoc]] CacheMixin

## PyramidAttentionBroadcastConfig

[[autodoc]] PyramidAttentionBroadcastConfig

[[autodoc]] apply_pyramid_attention_broadcast

## FasterCacheConfig

[[autodoc]] FasterCacheConfig

[[autodoc]] apply_faster_cache

## FirstBlockCacheConfig

[[autodoc]] FirstBlockCacheConfig

[[autodoc]] apply_first_block_cache

## TaylorSeerCacheConfig

[[autodoc]] TaylorSeerCacheConfig

[[autodoc]] apply_taylorseer_cache

================================================
FILE: docs/source/en/api/configuration.md
================================================
# Configuration

Schedulers from [`~schedulers.scheduling_utils.SchedulerMixin`] and models from [`ModelMixin`] inherit from [`ConfigMixin`], which stores all the parameters that are passed to their respective `__init__` methods in a JSON configuration file.
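As a rough illustration of the idea only, the sketch below records `__init__` arguments and round-trips them through JSON using the standard library. It is not how [`ConfigMixin`] is implemented: `TinyConfigMixin`, `ToyScheduler`, `init_config`, and `from_json_string` are hypothetical names invented for this example, and the parameter values are arbitrary.

```python
import json


class TinyConfigMixin:
    """Hypothetical stand-in for illustration; NOT the diffusers ConfigMixin."""

    def init_config(self, **kwargs):
        # Remember the arguments the object was constructed with.
        self._config = dict(kwargs)

    def to_json_string(self):
        return json.dumps(self._config, indent=2, sort_keys=True)

    @classmethod
    def from_json_string(cls, text):
        # Rebuild the object by passing the stored arguments back to __init__.
        return cls(**json.loads(text))


class ToyScheduler(TinyConfigMixin):
    def __init__(self, num_train_timesteps=1000, beta_start=0.0001):
        self.init_config(num_train_timesteps=num_train_timesteps, beta_start=beta_start)
        self.num_train_timesteps = num_train_timesteps
        self.beta_start = beta_start


scheduler = ToyScheduler(num_train_timesteps=50)
restored = ToyScheduler.from_json_string(scheduler.to_json_string())
print(restored.num_train_timesteps, restored.beta_start)  # 50 0.0001
```

Because the configuration is plain JSON, an object can be rebuilt from the file alone, which is the property the real `save_config` and `from_config` methods documented below are built around.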

> [!TIP]
> To use private or [gated](https://huggingface.co/docs/hub/models-gated#gated-models) models, log in with `hf auth login`.

## ConfigMixin

[[autodoc]] ConfigMixin
    - load_config
    - from_config
    - save_config
    - to_json_file
    - to_json_string

================================================
FILE: docs/source/en/api/image_processor.md
================================================
# VAE Image Processor

The [`VaeImageProcessor`] provides a unified API for [`StableDiffusionPipeline`]s to prepare image inputs for VAE encoding and to post-process outputs once they're decoded. This includes transformations such as resizing, normalization, and conversion between PIL images, PyTorch tensors, and NumPy arrays.

All pipelines with [`VaeImageProcessor`] accept PIL images, PyTorch tensors, or NumPy arrays as image inputs and return outputs based on the `output_type` argument set by the user. You can pass encoded image latents directly to the pipeline and return latents from the pipeline as a specific output with the `output_type` argument (for example `output_type="latent"`). This allows you to take the generated latents from one pipeline and pass them to another pipeline as input without leaving the latent space. It also makes it much easier to use multiple pipelines together by passing PyTorch tensors directly between different pipelines.

## VaeImageProcessor

[[autodoc]] image_processor.VaeImageProcessor

## InpaintProcessor

The [`InpaintProcessor`] accepts `mask` and `image` inputs and processes them together. Optionally, it can accept `padding_mask_crop` and apply a mask overlay.

[[autodoc]] image_processor.InpaintProcessor

## VaeImageProcessorLDM3D

The [`VaeImageProcessorLDM3D`] accepts RGB and depth inputs and returns RGB and depth outputs.

[[autodoc]] image_processor.VaeImageProcessorLDM3D

## PixArtImageProcessor

[[autodoc]] image_processor.PixArtImageProcessor

## IPAdapterMaskProcessor

[[autodoc]] image_processor.IPAdapterMaskProcessor

================================================
FILE: docs/source/en/api/internal_classes_overview.md
================================================
# Overview

The APIs in this section are more experimental and prone to breaking changes. Most of them are used internally for development, but they may also be useful to you if you're interested in building a diffusion model with some custom parts or if you're interested in some of our helper utilities for working with 🤗 Diffusers.

================================================
FILE: docs/source/en/api/loaders/ip_adapter.md
================================================
# IP-Adapter

[IP-Adapter](https://hf.co/papers/2308.06721) is a lightweight adapter that enables prompting a diffusion model with an image. This method decouples the cross-attention layers of the image and text features. The image features are generated from an image encoder.

> [!TIP]
> Learn how to load and use an IP-Adapter checkpoint and image in the [IP-Adapter](../../using-diffusers/ip_adapter) guide.

## IPAdapterMixin

[[autodoc]] loaders.ip_adapter.IPAdapterMixin

## SD3IPAdapterMixin

[[autodoc]] loaders.ip_adapter.SD3IPAdapterMixin
    - all
    - is_ip_adapter_active

## IPAdapterMaskProcessor

[[autodoc]] image_processor.IPAdapterMaskProcessor

================================================
FILE: docs/source/en/api/loaders/lora.md
================================================
# LoRA

LoRA is a fast and lightweight training method that inserts and trains a significantly smaller number of parameters instead of all the model parameters. This produces a smaller file (~100 MB) and makes it easier to quickly train a model to learn a new concept. LoRA weights are typically loaded into the denoiser, text encoder or both.
The denoiser usually corresponds to a UNet ([`UNet2DConditionModel`], for example) or a Transformer ([`SD3Transformer2DModel`], for example). There are several classes for loading LoRA weights:

- [`StableDiffusionLoraLoaderMixin`] provides functions for loading and unloading, fusing and unfusing, enabling and disabling, and more functions for managing LoRA weights. This class can be used with any model.
- [`StableDiffusionXLLoraLoaderMixin`] is a [Stable Diffusion XL (SDXL)](../../api/pipelines/stable_diffusion/stable_diffusion_xl) version of the [`StableDiffusionLoraLoaderMixin`] class for loading and saving LoRA weights. It can only be used with the SDXL model.
- [`SD3LoraLoaderMixin`] provides similar functions for [Stable Diffusion 3](https://huggingface.co/blog/sd3).
- [`FluxLoraLoaderMixin`] provides similar functions for [Flux](https://huggingface.co/docs/diffusers/main/en/api/pipelines/flux).
- [`CogVideoXLoraLoaderMixin`] provides similar functions for [CogVideoX](https://huggingface.co/docs/diffusers/main/en/api/pipelines/cogvideox).
- [`Mochi1LoraLoaderMixin`] provides similar functions for [Mochi](https://huggingface.co/docs/diffusers/main/en/api/pipelines/mochi).
- [`AuraFlowLoraLoaderMixin`] provides similar functions for [AuraFlow](https://huggingface.co/fal/AuraFlow).
- [`LTXVideoLoraLoaderMixin`] provides similar functions for [LTX-Video](https://huggingface.co/docs/diffusers/main/en/api/pipelines/ltx_video).
- [`SanaLoraLoaderMixin`] provides similar functions for [Sana](https://huggingface.co/docs/diffusers/main/en/api/pipelines/sana).
- [`HeliosLoraLoaderMixin`] provides similar functions for [Helios](https://huggingface.co/docs/diffusers/main/en/api/pipelines/helios).
- [`HunyuanVideoLoraLoaderMixin`] provides similar functions for [HunyuanVideo](https://huggingface.co/docs/diffusers/main/en/api/pipelines/hunyuan_video).
- [`Lumina2LoraLoaderMixin`] provides similar functions for [Lumina2](https://huggingface.co/docs/diffusers/main/en/api/pipelines/lumina2). - [`WanLoraLoaderMixin`] provides similar functions for [Wan](https://huggingface.co/docs/diffusers/main/en/api/pipelines/wan). - [`SkyReelsV2LoraLoaderMixin`] provides similar functions for [SkyReels-V2](https://huggingface.co/docs/diffusers/main/en/api/pipelines/skyreels_v2). - [`CogView4LoraLoaderMixin`] provides similar functions for [CogView4](https://huggingface.co/docs/diffusers/main/en/api/pipelines/cogview4). - [`AmusedLoraLoaderMixin`] is for the [`AmusedPipeline`]. - [`HiDreamImageLoraLoaderMixin`] provides similar functions for [HiDream Image](https://huggingface.co/docs/diffusers/main/en/api/pipelines/hidream). - [`QwenImageLoraLoaderMixin`] provides similar functions for [Qwen Image](https://huggingface.co/docs/diffusers/main/en/api/pipelines/qwen). - [`ZImageLoraLoaderMixin`] provides similar functions for [Z-Image](https://huggingface.co/docs/diffusers/main/en/api/pipelines/zimage). - [`Flux2LoraLoaderMixin`] provides similar functions for [Flux2](https://huggingface.co/docs/diffusers/main/en/api/pipelines/flux2). - [`LTX2LoraLoaderMixin`] provides similar functions for [LTX-2](https://huggingface.co/docs/diffusers/main/en/api/pipelines/ltx2). - [`LoraBaseMixin`] provides a base class with several utility methods to fuse, unfuse, and unload LoRAs, and more. > [!TIP] > To learn more about how to load LoRA weights, see the [LoRA](../../tutorials/using_peft_for_inference) loading guide.
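The loader mixins above are normally reached through a pipeline rather than instantiated directly. The following is a minimal sketch of that flow for SDXL, assuming a CUDA device; `run_with_lora`, the `lora_repo` and `weight_name` arguments, and the adapter name `"example"` are hypothetical placeholders for a LoRA checkpoint of your choice.

```python
def run_with_lora(lora_repo, weight_name, prompt):
    """Load a LoRA into an SDXL pipeline, generate one image, then unload the LoRA.

    `lora_repo` and `weight_name` are placeholders (assumptions), not real checkpoints.
    """
    # Imported lazily so that defining this function needs neither a GPU nor a download.
    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")

    # Provided by StableDiffusionXLLoraLoaderMixin through the pipeline.
    pipe.load_lora_weights(lora_repo, weight_name=weight_name, adapter_name="example")
    image = pipe(prompt).images[0]

    # Remove the LoRA layers again so the pipeline returns to its base weights.
    pipe.unload_lora_weights()
    return image
```

Calling `run_with_lora(...)` downloads the base model and the LoRA file, so it is best tried on a machine with a GPU and a network connection.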
## LoraBaseMixin [[autodoc]] loaders.lora_base.LoraBaseMixin ## StableDiffusionLoraLoaderMixin [[autodoc]] loaders.lora_pipeline.StableDiffusionLoraLoaderMixin ## StableDiffusionXLLoraLoaderMixin [[autodoc]] loaders.lora_pipeline.StableDiffusionXLLoraLoaderMixin ## SD3LoraLoaderMixin [[autodoc]] loaders.lora_pipeline.SD3LoraLoaderMixin ## FluxLoraLoaderMixin [[autodoc]] loaders.lora_pipeline.FluxLoraLoaderMixin ## Flux2LoraLoaderMixin [[autodoc]] loaders.lora_pipeline.Flux2LoraLoaderMixin ## LTX2LoraLoaderMixin [[autodoc]] loaders.lora_pipeline.LTX2LoraLoaderMixin ## CogVideoXLoraLoaderMixin [[autodoc]] loaders.lora_pipeline.CogVideoXLoraLoaderMixin ## Mochi1LoraLoaderMixin [[autodoc]] loaders.lora_pipeline.Mochi1LoraLoaderMixin ## AuraFlowLoraLoaderMixin [[autodoc]] loaders.lora_pipeline.AuraFlowLoraLoaderMixin ## LTXVideoLoraLoaderMixin [[autodoc]] loaders.lora_pipeline.LTXVideoLoraLoaderMixin ## SanaLoraLoaderMixin [[autodoc]] loaders.lora_pipeline.SanaLoraLoaderMixin ## HeliosLoraLoaderMixin [[autodoc]] loaders.lora_pipeline.HeliosLoraLoaderMixin ## HunyuanVideoLoraLoaderMixin [[autodoc]] loaders.lora_pipeline.HunyuanVideoLoraLoaderMixin ## Lumina2LoraLoaderMixin [[autodoc]] loaders.lora_pipeline.Lumina2LoraLoaderMixin ## CogView4LoraLoaderMixin [[autodoc]] loaders.lora_pipeline.CogView4LoraLoaderMixin ## WanLoraLoaderMixin [[autodoc]] loaders.lora_pipeline.WanLoraLoaderMixin ## SkyReelsV2LoraLoaderMixin [[autodoc]] loaders.lora_pipeline.SkyReelsV2LoraLoaderMixin ## AmusedLoraLoaderMixin [[autodoc]] loaders.lora_pipeline.AmusedLoraLoaderMixin ## HiDreamImageLoraLoaderMixin [[autodoc]] loaders.lora_pipeline.HiDreamImageLoraLoaderMixin ## QwenImageLoraLoaderMixin [[autodoc]] loaders.lora_pipeline.QwenImageLoraLoaderMixin ## ZImageLoraLoaderMixin [[autodoc]] loaders.lora_pipeline.ZImageLoraLoaderMixin ## KandinskyLoraLoaderMixin [[autodoc]] loaders.lora_pipeline.KandinskyLoraLoaderMixin
================================================ FILE: docs/source/en/api/loaders/peft.md ================================================ # PEFT Diffusers supports loading adapters such as [LoRA](../../tutorials/using_peft_for_inference) with the [PEFT](https://huggingface.co/docs/peft/index) library with the [`~loaders.peft.PeftAdapterMixin`] class. This allows modeling classes in Diffusers like [`UNet2DConditionModel`], [`SD3Transformer2DModel`] to operate with an adapter. > [!TIP] > Refer to the [Inference with PEFT](../../tutorials/using_peft_for_inference.md) tutorial for an overview of how to use PEFT in Diffusers for inference. ## PeftAdapterMixin [[autodoc]] loaders.peft.PeftAdapterMixin ================================================ FILE: docs/source/en/api/loaders/single_file.md ================================================ # Single files The [`~loaders.FromSingleFileMixin.from_single_file`] method allows you to load: * a model stored in a single file, which is useful if you're working with models from the diffusion ecosystem, like Automatic1111, and commonly rely on a single-file layout to store and share models * a model stored in their originally distributed layout, which is useful if you're working with models finetuned with other services, and want to load it directly into Diffusers model objects and pipelines > [!TIP] > Read the [Model files and layouts](../../using-diffusers/other-formats) guide to learn more about the Diffusers-multifolder layout versus the single-file layout, and how to load models stored in these different layouts. 
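As a rough illustration of the single-file path described above, the sketch below loads an SDXL checkpoint from one `.safetensors` file. The checkpoint URL and the helper name `load_from_single_file` are assumptions chosen for the example; a local file path should work the same way.

```python
# Assumed example checkpoint: the single-file SDXL base weights on the Hub.
SDXL_CKPT = (
    "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0"
    "/blob/main/sd_xl_base_1.0.safetensors"
)


def load_from_single_file(ckpt=SDXL_CKPT):
    """Build an SDXL pipeline from one checkpoint file instead of a multifolder repo."""
    # Imported lazily so that defining this function does not trigger any download.
    import torch
    from diffusers import StableDiffusionXLPipeline

    return StableDiffusionXLPipeline.from_single_file(ckpt, torch_dtype=torch.float16)
```

The returned object is a regular pipeline, so it can be moved to a device and called like one created with `from_pretrained`.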
## Supported pipelines - [`StableDiffusionPipeline`] - [`StableDiffusionImg2ImgPipeline`] - [`StableDiffusionInpaintPipeline`] - [`StableDiffusionControlNetPipeline`] - [`StableDiffusionControlNetImg2ImgPipeline`] - [`StableDiffusionControlNetInpaintPipeline`] - [`StableDiffusionUpscalePipeline`] - [`StableDiffusionXLPipeline`] - [`StableDiffusionXLImg2ImgPipeline`] - [`StableDiffusionXLInpaintPipeline`] - [`StableDiffusionXLInstructPix2PixPipeline`] - [`StableDiffusionXLControlNetPipeline`] - [`StableDiffusionXLKDiffusionPipeline`] - [`StableDiffusion3Pipeline`] - [`LatentConsistencyModelPipeline`] - [`LatentConsistencyModelImg2ImgPipeline`] - [`StableDiffusionControlNetXSPipeline`] - [`StableDiffusionXLControlNetXSPipeline`] - [`LEditsPPPipelineStableDiffusion`] - [`LEditsPPPipelineStableDiffusionXL`] - [`PIAPipeline`] ## Supported models - [`UNet2DConditionModel`] - [`StableCascadeUNet`] - [`AutoencoderKL`] - [`ControlNetModel`] - [`SD3Transformer2DModel`] - [`FluxTransformer2DModel`] ## FromSingleFileMixin [[autodoc]] loaders.single_file.FromSingleFileMixin ## FromOriginalModelMixin [[autodoc]] loaders.single_file_model.FromOriginalModelMixin ================================================ FILE: docs/source/en/api/loaders/textual_inversion.md ================================================ # Textual Inversion Textual Inversion is a training method for personalizing models by learning new text embeddings from a few example images. The file produced from training is extremely small (a few KBs) and the new embeddings can be loaded into the text encoder. [`TextualInversionLoaderMixin`] provides a function for loading Textual Inversion embeddings from Diffusers and Automatic1111 into the text encoder and loading a special token to activate the embeddings. > [!TIP] > To learn more about how to load Textual Inversion embeddings, see the [Textual Inversion](../../using-diffusers/textual_inversion_inference) loading guide. 
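A minimal sketch of loading a Textual Inversion embedding into a pipeline is shown below. The helper name `run_with_textual_inversion`, the base model ID, and the `sd-concepts-library/cat-toy` embedding (activated by the `<cat-toy>` token in the prompt) are assumptions picked for illustration; substitute your own embedding and token.

```python
def run_with_textual_inversion(prompt, embedding_repo="sd-concepts-library/cat-toy"):
    """Load a Textual Inversion embedding and generate one image with it.

    The prompt is expected to contain the embedding's special token,
    for example "<cat-toy>" for the assumed default embedding.
    """
    # Imported lazily so that defining this function needs neither a GPU nor a download.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # Provided by TextualInversionLoaderMixin: adds the new token and its embedding
    # to the tokenizer and text encoder.
    pipe.load_textual_inversion(embedding_repo)
    return pipe(prompt).images[0]
```

For example, `run_with_textual_inversion("a <cat-toy> on a beach")` would use the learned concept wherever the token appears.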
## TextualInversionLoaderMixin [[autodoc]] loaders.textual_inversion.TextualInversionLoaderMixin ================================================ FILE: docs/source/en/api/loaders/transformer_sd3.md ================================================ # SD3Transformer2D This class is useful when *only* loading weights into a [`SD3Transformer2DModel`]. If you need to load weights into the text encoder or a text encoder and SD3Transformer2DModel, check [`SD3LoraLoaderMixin`](lora#diffusers.loaders.SD3LoraLoaderMixin) class instead. The [`SD3Transformer2DLoadersMixin`] class currently only loads IP-Adapter weights, but will be used in the future to save weights and load LoRAs. > [!TIP] > To learn more about how to load LoRA weights, see the [LoRA](../../tutorials/using_peft_for_inference) loading guide. ## SD3Transformer2DLoadersMixin [[autodoc]] loaders.transformer_sd3.SD3Transformer2DLoadersMixin - all - _load_ip_adapter_weights ================================================ FILE: docs/source/en/api/loaders/unet.md ================================================ # UNet Some training methods - like LoRA and Custom Diffusion - typically target the UNet's attention layers, but these training methods can also target other non-attention layers. Instead of training all of a model's parameters, only a subset of the parameters are trained, which is faster and more efficient. This class is useful if you're *only* loading weights into a UNet. If you need to load weights into the text encoder or a text encoder and UNet, try using the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] function instead. The [`UNet2DConditionLoadersMixin`] class provides functions for loading and saving weights, fusing and unfusing LoRAs, disabling and enabling LoRAs, and setting and deleting adapters. > [!TIP] > To learn more about how to load LoRA weights, see the [LoRA](../../tutorials/using_peft_for_inference) guide. 
## UNet2DConditionLoadersMixin [[autodoc]] loaders.unet.UNet2DConditionLoadersMixin ================================================ FILE: docs/source/en/api/logging.md ================================================ # Logging 🤗 Diffusers has a centralized logging system to easily manage the verbosity of the library. The default verbosity is set to `WARNING`. To change the verbosity level, use one of the direct setters. For instance, to change the verbosity to the `INFO` level. ```python import diffusers diffusers.logging.set_verbosity_info() ``` You can also use the environment variable `DIFFUSERS_VERBOSITY` to override the default verbosity. You can set it to one of the following: `debug`, `info`, `warning`, `error`, `critical`. For example: ```bash DIFFUSERS_VERBOSITY=error ./myprogram.py ``` Additionally, some `warnings` can be disabled by setting the environment variable `DIFFUSERS_NO_ADVISORY_WARNINGS` to a true value, like `1`. This disables any warning logged by [`logger.warning_advice`]. For example: ```bash DIFFUSERS_NO_ADVISORY_WARNINGS=1 ./myprogram.py ``` Here is an example of how to use the same logger as the library in your own module or script: ```python from diffusers.utils import logging logging.set_verbosity_info() logger = logging.get_logger("diffusers") logger.info("INFO") logger.warning("WARN") ``` All methods of the logging module are documented below. The main methods are [`logging.get_verbosity`] to get the current level of verbosity in the logger and [`logging.set_verbosity`] to set the verbosity to the level of your choice. 
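To make the `DIFFUSERS_VERBOSITY` names concrete, here is a small sketch that resolves a name to an integer level using only Python's standard `logging` constants, which use the same integer values as the levels documented here. It is not the library's implementation; `resolve_verbosity` and `LEVELS` are hypothetical names.

```python
import logging
import os

# Names accepted by DIFFUSERS_VERBOSITY, mapped to the standard integer levels.
LEVELS = {
    "debug": logging.DEBUG,        # 10
    "info": logging.INFO,          # 20
    "warning": logging.WARNING,    # 30
    "error": logging.ERROR,        # 40
    "critical": logging.CRITICAL,  # 50
}


def resolve_verbosity(env=None, default=logging.WARNING):
    """Hypothetical helper: return the integer level named by DIFFUSERS_VERBOSITY.

    Falls back to `default` (WARNING) when the variable is unset or unrecognized.
    """
    env = os.environ if env is None else env
    name = env.get("DIFFUSERS_VERBOSITY", "").strip().lower()
    return LEVELS.get(name, default)
```

With such a mapping, `resolve_verbosity({"DIFFUSERS_VERBOSITY": "error"})` gives the integer you could pass to `set_verbosity`.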
In order from the least verbose to the most verbose: | Method | Integer value | Description | |----------------------------------------------------------:|--------------:|----------------------------------------------------:| | `diffusers.logging.CRITICAL` or `diffusers.logging.FATAL` | 50 | only report the most critical errors | | `diffusers.logging.ERROR` | 40 | only report errors | | `diffusers.logging.WARNING` or `diffusers.logging.WARN` | 30 | only report errors and warnings (default) | | `diffusers.logging.INFO` | 20 | only report errors, warnings, and basic information | | `diffusers.logging.DEBUG` | 10 | report all information | By default, `tqdm` progress bars are displayed during model download. [`logging.disable_progress_bar`] and [`logging.enable_progress_bar`] are used to enable or disable this behavior. ## Base setters [[autodoc]] utils.logging.set_verbosity_error [[autodoc]] utils.logging.set_verbosity_warning [[autodoc]] utils.logging.set_verbosity_info [[autodoc]] utils.logging.set_verbosity_debug ## Other functions [[autodoc]] utils.logging.get_verbosity [[autodoc]] utils.logging.set_verbosity [[autodoc]] utils.logging.get_logger [[autodoc]] utils.logging.enable_default_handler [[autodoc]] utils.logging.disable_default_handler [[autodoc]] utils.logging.enable_explicit_format [[autodoc]] utils.logging.reset_format [[autodoc]] utils.logging.enable_progress_bar [[autodoc]] utils.logging.disable_progress_bar ================================================ FILE: docs/source/en/api/models/allegro_transformer3d.md ================================================ # AllegroTransformer3DModel A Diffusion Transformer model for 3D data from [Allegro](https://github.com/rhymes-ai/Allegro) was introduced in [Allegro: Open the Black Box of Commercial-Level Video Generation Model](https://huggingface.co/papers/2410.15458) by RhymesAI. The model can be loaded with the following code snippet. 
```python import torch from diffusers import AllegroTransformer3DModel transformer = AllegroTransformer3DModel.from_pretrained("rhymes-ai/Allegro", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda") ``` ## AllegroTransformer3DModel [[autodoc]] AllegroTransformer3DModel ## Transformer2DModelOutput [[autodoc]] models.modeling_outputs.Transformer2DModelOutput ================================================ FILE: docs/source/en/api/models/asymmetricautoencoderkl.md ================================================ # AsymmetricAutoencoderKL Improved larger variational autoencoder (VAE) model with KL loss for inpainting task: [Designing a Better Asymmetric VQGAN for StableDiffusion](https://huggingface.co/papers/2306.04632) by Zixin Zhu, Xuelu Feng, Dongdong Chen, Jianmin Bao, Le Wang, Yinpeng Chen, Lu Yuan, Gang Hua. The abstract from the paper is: *StableDiffusion is a revolutionary text-to-image generator that is causing a stir in the world of image generation and editing. Unlike traditional methods that learn a diffusion model in pixel space, StableDiffusion learns a diffusion model in the latent space via a VQGAN, ensuring both efficiency and quality. It not only supports image generation tasks, but also enables image editing for real images, such as image inpainting and local editing. However, we have observed that the vanilla VQGAN used in StableDiffusion leads to significant information loss, causing distortion artifacts even in non-edited image regions. To this end, we propose a new asymmetric VQGAN with two simple designs. Firstly, in addition to the input from the encoder, the decoder contains a conditional branch that incorporates information from task-specific priors, such as the unmasked image region in inpainting. Secondly, the decoder is much heavier than the encoder, allowing for more detailed recovery while only slightly increasing the total inference cost.
The training cost of our asymmetric VQGAN is cheap, and we only need to retrain a new asymmetric decoder while keeping the vanilla VQGAN encoder and StableDiffusion unchanged. Our asymmetric VQGAN can be widely used in StableDiffusion-based inpainting and local editing methods. Extensive experiments demonstrate that it can significantly improve the inpainting and editing performance, while maintaining the original text-to-image capability. The code is available at https://github.com/buxiangzhiren/Asymmetric_VQGAN* Evaluation results can be found in section 4.1 of the original paper. ## Available checkpoints * [https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-1-5](https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-1-5) * [https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-2](https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-2) ## Example Usage ```python from diffusers import AsymmetricAutoencoderKL, StableDiffusionInpaintPipeline from diffusers.utils import load_image, make_image_grid prompt = "a photo of a person with beard" img_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/celeba_hq_256.png" mask_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/mask_256.png" original_image = load_image(img_url).resize((512, 512)) mask_image = load_image(mask_url).resize((512, 512)) pipe = StableDiffusionInpaintPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-inpainting") pipe.vae = AsymmetricAutoencoderKL.from_pretrained("cross-attention/asymmetric-autoencoder-kl-x-1-5") pipe.to("cuda") image = pipe(prompt=prompt, image=original_image, mask_image=mask_image).images[0] make_image_grid([original_image, mask_image, image], rows=1, cols=3) ``` ## AsymmetricAutoencoderKL [[autodoc]] models.autoencoders.autoencoder_asym_kl.AsymmetricAutoencoderKL ## AutoencoderKLOutput [[autodoc]] 
models.autoencoders.autoencoder_kl.AutoencoderKLOutput ## DecoderOutput [[autodoc]] models.autoencoders.vae.DecoderOutput ================================================ FILE: docs/source/en/api/models/aura_flow_transformer2d.md ================================================ # AuraFlowTransformer2DModel A Transformer model for image-like data from [AuraFlow](https://blog.fal.ai/auraflow/). ## AuraFlowTransformer2DModel [[autodoc]] AuraFlowTransformer2DModel ================================================ FILE: docs/source/en/api/models/auto_model.md ================================================ # AutoModel [`AutoModel`] automatically retrieves the correct model class from the checkpoint `config.json` file. ## AutoModel [[autodoc]] AutoModel - all - from_pretrained ================================================ FILE: docs/source/en/api/models/autoencoder_dc.md ================================================ # AutoencoderDC The 2D Autoencoder model used in [SANA](https://huggingface.co/papers/2410.10629) and introduced in [DCAE](https://huggingface.co/papers/2410.10733) by authors Junyu Chen\*, Han Cai\*, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, Song Han from MIT HAN Lab. The abstract from the paper is: *We present Deep Compression Autoencoder (DC-AE), a new family of autoencoder models for accelerating high-resolution diffusion models. Existing autoencoder models have demonstrated impressive results at a moderate spatial compression ratio (e.g., 8x), but fail to maintain satisfactory reconstruction accuracy for high spatial compression ratios (e.g., 64x). 
We address this challenge by introducing two key techniques: (1) Residual Autoencoding, where we design our models to learn residuals based on the space-to-channel transformed features to alleviate the optimization difficulty of high spatial-compression autoencoders; (2) Decoupled High-Resolution Adaptation, an efficient decoupled three-phases training strategy for mitigating the generalization penalty of high spatial-compression autoencoders. With these designs, we improve the autoencoder's spatial compression ratio up to 128 while maintaining the reconstruction quality. Applying our DC-AE to latent diffusion models, we achieve significant speedup without accuracy drop. For example, on ImageNet 512x512, our DC-AE provides 19.1x inference speedup and 17.9x training speedup on H100 GPU for UViT-H while achieving a better FID, compared with the widely used SD-VAE-f8 autoencoder. Our code is available at [this https URL](https://github.com/mit-han-lab/efficientvit).* The following DCAE models are released and supported in Diffusers. 
| Diffusers format | Original format | |:----------------:|:---------------:| | [`mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers`](https://huggingface.co/mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers) | [`mit-han-lab/dc-ae-f32c32-sana-1.0`](https://huggingface.co/mit-han-lab/dc-ae-f32c32-sana-1.0) | [`mit-han-lab/dc-ae-f32c32-in-1.0-diffusers`](https://huggingface.co/mit-han-lab/dc-ae-f32c32-in-1.0-diffusers) | [`mit-han-lab/dc-ae-f32c32-in-1.0`](https://huggingface.co/mit-han-lab/dc-ae-f32c32-in-1.0) | [`mit-han-lab/dc-ae-f32c32-mix-1.0-diffusers`](https://huggingface.co/mit-han-lab/dc-ae-f32c32-mix-1.0-diffusers) | [`mit-han-lab/dc-ae-f32c32-mix-1.0`](https://huggingface.co/mit-han-lab/dc-ae-f32c32-mix-1.0) | [`mit-han-lab/dc-ae-f64c128-in-1.0-diffusers`](https://huggingface.co/mit-han-lab/dc-ae-f64c128-in-1.0-diffusers) | [`mit-han-lab/dc-ae-f64c128-in-1.0`](https://huggingface.co/mit-han-lab/dc-ae-f64c128-in-1.0) | [`mit-han-lab/dc-ae-f64c128-mix-1.0-diffusers`](https://huggingface.co/mit-han-lab/dc-ae-f64c128-mix-1.0-diffusers) | [`mit-han-lab/dc-ae-f64c128-mix-1.0`](https://huggingface.co/mit-han-lab/dc-ae-f64c128-mix-1.0) | [`mit-han-lab/dc-ae-f128c512-in-1.0-diffusers`](https://huggingface.co/mit-han-lab/dc-ae-f128c512-in-1.0-diffusers) | [`mit-han-lab/dc-ae-f128c512-in-1.0`](https://huggingface.co/mit-han-lab/dc-ae-f128c512-in-1.0) | [`mit-han-lab/dc-ae-f128c512-mix-1.0-diffusers`](https://huggingface.co/mit-han-lab/dc-ae-f128c512-mix-1.0-diffusers) | [`mit-han-lab/dc-ae-f128c512-mix-1.0`](https://huggingface.co/mit-han-lab/dc-ae-f128c512-mix-1.0) This model was contributed by [lawrence-cj](https://github.com/lawrence-cj). Load a model in Diffusers format with [`~ModelMixin.from_pretrained`]. 
```python import torch from diffusers import AutoencoderDC ae = AutoencoderDC.from_pretrained("mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers", torch_dtype=torch.float32).to("cuda") ``` ## Load a model in Diffusers via `from_single_file` ```python from diffusers import AutoencoderDC ckpt_path = "https://huggingface.co/mit-han-lab/dc-ae-f32c32-sana-1.0/blob/main/model.safetensors" model = AutoencoderDC.from_single_file(ckpt_path) ``` The `AutoencoderDC` model has `in` and `mix` single file checkpoint variants that have matching checkpoint keys, but use different scaling factors. Diffusers cannot automatically infer the correct config file from the checkpoint alone, so it defaults to configuring the model with the `mix` variant config file. To override the automatically determined config, pass the `config` argument when using single file loading with `in` variant checkpoints. ```python from diffusers import AutoencoderDC ckpt_path = "https://huggingface.co/mit-han-lab/dc-ae-f128c512-in-1.0/blob/main/model.safetensors" model = AutoencoderDC.from_single_file(ckpt_path, config="mit-han-lab/dc-ae-f128c512-in-1.0-diffusers") ``` ## AutoencoderDC [[autodoc]] AutoencoderDC - encode - decode - all ## DecoderOutput [[autodoc]] models.autoencoders.vae.DecoderOutput ================================================ FILE: docs/source/en/api/models/autoencoder_kl_hunyuan_video.md ================================================ # AutoencoderKLHunyuanVideo The 3D variational autoencoder (VAE) model with KL loss used in [HunyuanVideo](https://github.com/Tencent/HunyuanVideo/), which was introduced in [HunyuanVideo: A Systematic Framework For Large Video Generative Models](https://huggingface.co/papers/2412.03603) by Tencent. The model can be loaded with the following code snippet.
```python import torch from diffusers import AutoencoderKLHunyuanVideo vae = AutoencoderKLHunyuanVideo.from_pretrained("hunyuanvideo-community/HunyuanVideo", subfolder="vae", torch_dtype=torch.float16) ``` ## AutoencoderKLHunyuanVideo [[autodoc]] AutoencoderKLHunyuanVideo - decode - all ## DecoderOutput [[autodoc]] models.autoencoders.vae.DecoderOutput ================================================ FILE: docs/source/en/api/models/autoencoder_kl_hunyuan_video15.md ================================================ # AutoencoderKLHunyuanVideo15 The 3D variational autoencoder (VAE) model with KL loss used in [HunyuanVideo1.5](https://github.com/Tencent/HunyuanVideo1-1.5) by Tencent. The model can be loaded with the following code snippet. ```python import torch from diffusers import AutoencoderKLHunyuanVideo15 vae = AutoencoderKLHunyuanVideo15.from_pretrained("hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v", subfolder="vae", torch_dtype=torch.float32) # make sure to enable tiling to avoid OOM vae.enable_tiling() ``` ## AutoencoderKLHunyuanVideo15 [[autodoc]] AutoencoderKLHunyuanVideo15 - decode - encode - all ## DecoderOutput [[autodoc]] models.autoencoders.vae.DecoderOutput ================================================ FILE: docs/source/en/api/models/autoencoder_kl_hunyuanimage.md ================================================ # AutoencoderKLHunyuanImage The 2D variational autoencoder (VAE) model with KL loss used in [HunyuanImage2.1](https://github.com/Tencent-Hunyuan/HunyuanImage-2.1). The model can be loaded with the following code snippet.
```python import torch from diffusers import AutoencoderKLHunyuanImage vae = AutoencoderKLHunyuanImage.from_pretrained("hunyuanvideo-community/HunyuanImage-2.1-Diffusers", subfolder="vae", torch_dtype=torch.bfloat16) ``` ## AutoencoderKLHunyuanImage [[autodoc]] AutoencoderKLHunyuanImage - decode - all ## DecoderOutput [[autodoc]] models.autoencoders.vae.DecoderOutput ================================================ FILE: docs/source/en/api/models/autoencoder_kl_hunyuanimage_refiner.md ================================================ # AutoencoderKLHunyuanImageRefiner The 3D variational autoencoder (VAE) model with KL loss used in [HunyuanImage2.1](https://github.com/Tencent-Hunyuan/HunyuanImage-2.1) for its refiner pipeline. The model can be loaded with the following code snippet. ```python import torch from diffusers import AutoencoderKLHunyuanImageRefiner vae = AutoencoderKLHunyuanImageRefiner.from_pretrained("hunyuanvideo-community/HunyuanImage-2.1-Refiner-Diffusers", subfolder="vae", torch_dtype=torch.bfloat16) ``` ## AutoencoderKLHunyuanImageRefiner [[autodoc]] AutoencoderKLHunyuanImageRefiner - decode - all ## DecoderOutput [[autodoc]] models.autoencoders.vae.DecoderOutput ================================================ FILE: docs/source/en/api/models/autoencoder_kl_wan.md ================================================ # AutoencoderKLWan The 3D variational autoencoder (VAE) model with KL loss used in [Wan 2.1](https://github.com/Wan-Video/Wan2.1) by the Alibaba Wan Team. The model can be loaded with the following code snippet.
```python import torch from diffusers import AutoencoderKLWan vae = AutoencoderKLWan.from_pretrained("Wan-AI/Wan2.1-T2V-1.3B-Diffusers", subfolder="vae", torch_dtype=torch.float32) ``` ## AutoencoderKLWan [[autodoc]] AutoencoderKLWan - decode - all ## DecoderOutput [[autodoc]] models.autoencoders.vae.DecoderOutput ================================================ FILE: docs/source/en/api/models/autoencoder_oobleck.md ================================================ # AutoencoderOobleck The Oobleck variational autoencoder (VAE) model with KL loss was introduced in [Stability-AI/stable-audio-tools](https://github.com/Stability-AI/stable-audio-tools) and [Stable Audio Open](https://huggingface.co/papers/2407.14358) by Stability AI. The model is used in 🤗 Diffusers to encode audio waveforms into latents and to decode latent representations into audio waveforms. The abstract from the paper is: *Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics.
Notably, the reported FDopenl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1kHz.* ## AutoencoderOobleck [[autodoc]] AutoencoderOobleck - decode - encode - all ## OobleckDecoderOutput [[autodoc]] models.autoencoders.autoencoder_oobleck.OobleckDecoderOutput ## AutoencoderOobleckOutput [[autodoc]] models.autoencoders.autoencoder_oobleck.AutoencoderOobleckOutput ================================================ FILE: docs/source/en/api/models/autoencoder_rae.md ================================================ # AutoencoderRAE The Representation Autoencoder (RAE) model introduced in [Diffusion Transformers with Representation Autoencoders](https://huggingface.co/papers/2510.11690) by Boyang Zheng, Nanye Ma, Shengbang Tong, Saining Xie from NYU VISIONx. RAE combines a frozen pretrained vision encoder (DINOv2, SigLIP2, or MAE) with a trainable ViT-MAE-style decoder. In the two-stage RAE training recipe, the autoencoder is trained in stage 1 (reconstruction), and then a diffusion model is trained on the resulting latent space in stage 2 (generation).
The following RAE models are released and supported in Diffusers: | Model | Encoder | Latent shape (224px input) | |:------|:--------|:---------------------------| | [`nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08`](https://huggingface.co/nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08) | DINOv2-base | 768 x 16 x 16 | | [`nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08-i512`](https://huggingface.co/nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08-i512) | DINOv2-base (512px) | 768 x 32 x 32 | | [`nyu-visionx/RAE-dinov2-wReg-small-ViTXL-n08`](https://huggingface.co/nyu-visionx/RAE-dinov2-wReg-small-ViTXL-n08) | DINOv2-small | 384 x 16 x 16 | | [`nyu-visionx/RAE-dinov2-wReg-large-ViTXL-n08`](https://huggingface.co/nyu-visionx/RAE-dinov2-wReg-large-ViTXL-n08) | DINOv2-large | 1024 x 16 x 16 | | [`nyu-visionx/RAE-siglip2-base-p16-i256-ViTXL-n08`](https://huggingface.co/nyu-visionx/RAE-siglip2-base-p16-i256-ViTXL-n08) | SigLIP2-base | 768 x 16 x 16 | | [`nyu-visionx/RAE-mae-base-p16-ViTXL-n08`](https://huggingface.co/nyu-visionx/RAE-mae-base-p16-ViTXL-n08) | MAE-base | 768 x 16 x 16 | ## Loading a pretrained model ```python from diffusers import AutoencoderRAE model = AutoencoderRAE.from_pretrained( "nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08" ).to("cuda").eval() ``` ## Encoding and decoding a real image ```python import torch from diffusers import AutoencoderRAE from diffusers.utils import load_image from torchvision.transforms.functional import to_tensor, to_pil_image model = AutoencoderRAE.from_pretrained( "nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08" ).to("cuda").eval() image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png") image = image.convert("RGB").resize((224, 224)) x = to_tensor(image).unsqueeze(0).to("cuda") # (1, 3, 224, 224), values in [0, 1] with torch.no_grad(): latents = model.encode(x).latent # (1, 768, 16, 16) recon = model.decode(latents).sample # (1, 3, 256, 256) recon_image = to_pil_image(recon[0].clamp(0, 
1).cpu()) recon_image.save("recon.png") ``` ## Latent normalization Some pretrained checkpoints include per-channel `latents_mean` and `latents_std` statistics for normalizing the latent space. When present, `encode` and `decode` automatically apply the normalization and denormalization, respectively. ```python model = AutoencoderRAE.from_pretrained( "nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08" ).to("cuda").eval() # Latent normalization is handled automatically inside encode/decode # when the checkpoint config includes latents_mean/latents_std. with torch.no_grad(): latents = model.encode(x).latent # normalized latents recon = model.decode(latents).sample ``` ## AutoencoderRAE [[autodoc]] AutoencoderRAE - encode - decode - all ## DecoderOutput [[autodoc]] models.autoencoders.vae.DecoderOutput ================================================ FILE: docs/source/en/api/models/autoencoder_tiny.md ================================================ # Tiny AutoEncoder Tiny AutoEncoder for Stable Diffusion (TAESD) was introduced in [madebyollin/taesd](https://github.com/madebyollin/taesd) by Ollin Boer Bohan. It is a tiny distilled version of Stable Diffusion's VAE that can decode the latents in a [`StableDiffusionPipeline`] or [`StableDiffusionXLPipeline`] almost instantly.
To use with Stable Diffusion v2.1: ```python import torch from diffusers import DiffusionPipeline, AutoencoderTiny pipe = DiffusionPipeline.from_pretrained( "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float16 ) pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd", torch_dtype=torch.float16) pipe = pipe.to("cuda") prompt = "slice of delicious New York-style berry cheesecake" image = pipe(prompt, num_inference_steps=25).images[0] image ``` To use with Stable Diffusion XL 1.0: ```python import torch from diffusers import DiffusionPipeline, AutoencoderTiny pipe = DiffusionPipeline.from_pretrained( "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16 ) pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesdxl", torch_dtype=torch.float16) pipe = pipe.to("cuda") prompt = "slice of delicious New York-style berry cheesecake" image = pipe(prompt, num_inference_steps=25).images[0] image ``` ## AutoencoderTiny [[autodoc]] AutoencoderTiny ## AutoencoderTinyOutput [[autodoc]] models.autoencoders.autoencoder_tiny.AutoencoderTinyOutput ================================================ FILE: docs/source/en/api/models/autoencoderkl.md ================================================ # AutoencoderKL The variational autoencoder (VAE) model with KL loss was introduced in [Auto-Encoding Variational Bayes](https://huggingface.co/papers/1312.6114v11) by Diederik P. Kingma and Max Welling. The model is used in 🤗 Diffusers to encode images into latents and to decode latent representations into images. The abstract from the paper is: *How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contributions are two-fold.
First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. Theoretical advantages are reflected in experimental results.* ## Loading from the original format By default the [`AutoencoderKL`] should be loaded with [`~ModelMixin.from_pretrained`], but it can also be loaded from the original format using [`FromOriginalModelMixin.from_single_file`] as follows: ```py from diffusers import AutoencoderKL url = "https://huggingface.co/stabilityai/sd-vae-ft-mse-original/blob/main/vae-ft-mse-840000-ema-pruned.safetensors" # can also be a local file model = AutoencoderKL.from_single_file(url) ``` ## AutoencoderKL [[autodoc]] AutoencoderKL - decode - encode - all ## AutoencoderKLOutput [[autodoc]] models.autoencoders.autoencoder_kl.AutoencoderKLOutput ## DecoderOutput [[autodoc]] models.autoencoders.vae.DecoderOutput ================================================ FILE: docs/source/en/api/models/autoencoderkl_allegro.md ================================================ # AutoencoderKLAllegro The 3D variational autoencoder (VAE) model with KL loss used in [Allegro](https://github.com/rhymes-ai/Allegro) was introduced in [Allegro: Open the Black Box of Commercial-Level Video Generation Model](https://huggingface.co/papers/2410.15458) by RhymesAI. The model can be loaded with the following code snippet. 
```python
import torch
from diffusers import AutoencoderKLAllegro

vae = AutoencoderKLAllegro.from_pretrained("rhymes-ai/Allegro", subfolder="vae", torch_dtype=torch.float32).to("cuda")
```

## AutoencoderKLAllegro

[[autodoc]] AutoencoderKLAllegro
    - decode
    - encode
    - all

## AutoencoderKLOutput

[[autodoc]] models.autoencoders.autoencoder_kl.AutoencoderKLOutput

## DecoderOutput

[[autodoc]] models.autoencoders.vae.DecoderOutput

================================================
FILE: docs/source/en/api/models/autoencoderkl_audio_ltx_2.md
================================================
# AutoencoderKLLTX2Audio

The 3D variational autoencoder (VAE) model with KL loss used in [LTX-2](https://huggingface.co/Lightricks/LTX-2) was introduced by Lightricks. It encodes and decodes audio latent representations.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import AutoencoderKLLTX2Audio

vae = AutoencoderKLLTX2Audio.from_pretrained("Lightricks/LTX-2", subfolder="vae", torch_dtype=torch.float32).to("cuda")
```

## AutoencoderKLLTX2Audio

[[autodoc]] AutoencoderKLLTX2Audio
    - encode
    - decode
    - all

================================================
FILE: docs/source/en/api/models/autoencoderkl_cogvideox.md
================================================
# AutoencoderKLCogVideoX

The 3D variational autoencoder (VAE) model with KL loss used in [CogVideoX](https://github.com/THUDM/CogVideo) was introduced in [CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf) by Tsinghua University & ZhipuAI.

The model can be loaded with the following code snippet.
```python
import torch
from diffusers import AutoencoderKLCogVideoX

vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX-2b", subfolder="vae", torch_dtype=torch.float16).to("cuda")
```

## AutoencoderKLCogVideoX

[[autodoc]] AutoencoderKLCogVideoX
    - decode
    - encode
    - all

## AutoencoderKLOutput

[[autodoc]] models.autoencoders.autoencoder_kl.AutoencoderKLOutput

## DecoderOutput

[[autodoc]] models.autoencoders.vae.DecoderOutput

================================================
FILE: docs/source/en/api/models/autoencoderkl_cosmos.md
================================================
# AutoencoderKLCosmos

The autoencoder model from [Cosmos Tokenizers](https://github.com/NVIDIA/Cosmos-Tokenizer).

Supported models:

- [nvidia/Cosmos-1.0-Tokenizer-CV8x8x8](https://huggingface.co/nvidia/Cosmos-1.0-Tokenizer-CV8x8x8)

The model can be loaded with the following code snippet.

```python
from diffusers import AutoencoderKLCosmos

vae = AutoencoderKLCosmos.from_pretrained("nvidia/Cosmos-1.0-Tokenizer-CV8x8x8", subfolder="vae")
```

## AutoencoderKLCosmos

[[autodoc]] AutoencoderKLCosmos
    - decode
    - encode
    - all

## AutoencoderKLOutput

[[autodoc]] models.autoencoders.autoencoder_kl.AutoencoderKLOutput

## DecoderOutput

[[autodoc]] models.autoencoders.vae.DecoderOutput

================================================
FILE: docs/source/en/api/models/autoencoderkl_ltx_2.md
================================================
# AutoencoderKLLTX2Video

The 3D variational autoencoder (VAE) model with KL loss used in [LTX-2](https://huggingface.co/Lightricks/LTX-2) was introduced by Lightricks.

The model can be loaded with the following code snippet.
```python
import torch
from diffusers import AutoencoderKLLTX2Video

vae = AutoencoderKLLTX2Video.from_pretrained("Lightricks/LTX-2", subfolder="vae", torch_dtype=torch.float32).to("cuda")
```

## AutoencoderKLLTX2Video

[[autodoc]] AutoencoderKLLTX2Video
    - decode
    - encode
    - all

================================================
FILE: docs/source/en/api/models/autoencoderkl_ltx_video.md
================================================
# AutoencoderKLLTXVideo

The 3D variational autoencoder (VAE) model with KL loss used in [LTX](https://huggingface.co/Lightricks/LTX-Video) was introduced by Lightricks.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import AutoencoderKLLTXVideo

vae = AutoencoderKLLTXVideo.from_pretrained("Lightricks/LTX-Video", subfolder="vae", torch_dtype=torch.float32).to("cuda")
```

## AutoencoderKLLTXVideo

[[autodoc]] AutoencoderKLLTXVideo
    - decode
    - encode
    - all

## AutoencoderKLOutput

[[autodoc]] models.autoencoders.autoencoder_kl.AutoencoderKLOutput

## DecoderOutput

[[autodoc]] models.autoencoders.vae.DecoderOutput

================================================
FILE: docs/source/en/api/models/autoencoderkl_magvit.md
================================================
# AutoencoderKLMagvit

The 3D variational autoencoder (VAE) model with KL loss used in [EasyAnimate](https://github.com/aigc-apps/EasyAnimate) was introduced by Alibaba PAI.

The model can be loaded with the following code snippet.
```python
import torch
from diffusers import AutoencoderKLMagvit

vae = AutoencoderKLMagvit.from_pretrained("alibaba-pai/EasyAnimateV5.1-12b-zh", subfolder="vae", torch_dtype=torch.float16).to("cuda")
```

## AutoencoderKLMagvit

[[autodoc]] AutoencoderKLMagvit
    - decode
    - encode
    - all

## AutoencoderKLOutput

[[autodoc]] models.autoencoders.autoencoder_kl.AutoencoderKLOutput

## DecoderOutput

[[autodoc]] models.autoencoders.vae.DecoderOutput

================================================
FILE: docs/source/en/api/models/autoencoderkl_mochi.md
================================================
# AutoencoderKLMochi

The 3D variational autoencoder (VAE) model with KL loss used in [Mochi](https://github.com/genmoai/models) was introduced in [Mochi 1 Preview](https://huggingface.co/genmo/mochi-1-preview) by Genmo.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import AutoencoderKLMochi

vae = AutoencoderKLMochi.from_pretrained("genmo/mochi-1-preview", subfolder="vae", torch_dtype=torch.float32).to("cuda")
```

## AutoencoderKLMochi

[[autodoc]] AutoencoderKLMochi
    - decode
    - all

## DecoderOutput

[[autodoc]] models.autoencoders.vae.DecoderOutput

================================================
FILE: docs/source/en/api/models/autoencoderkl_qwenimage.md
================================================
# AutoencoderKLQwenImage

The model can be loaded with the following code snippet.
```python
from diffusers import AutoencoderKLQwenImage

vae = AutoencoderKLQwenImage.from_pretrained("Qwen/QwenImage-20B", subfolder="vae")
```

## AutoencoderKLQwenImage

[[autodoc]] AutoencoderKLQwenImage
    - decode
    - encode
    - all

## AutoencoderKLOutput

[[autodoc]] models.autoencoders.autoencoder_kl.AutoencoderKLOutput

## DecoderOutput

[[autodoc]] models.autoencoders.vae.DecoderOutput

================================================
FILE: docs/source/en/api/models/bria_transformer.md
================================================
# BriaTransformer2DModel

A modified Flux Transformer model from [Bria](https://huggingface.co/briaai/BRIA-3.2).

## BriaTransformer2DModel

[[autodoc]] BriaTransformer2DModel

================================================
FILE: docs/source/en/api/models/chroma_transformer.md
================================================
# ChromaTransformer2DModel

A modified Flux Transformer model from [Chroma](https://huggingface.co/lodestones/Chroma1-HD).

## ChromaTransformer2DModel

[[autodoc]] ChromaTransformer2DModel

================================================
FILE: docs/source/en/api/models/chronoedit_transformer_3d.md
================================================
# ChronoEditTransformer3DModel

A Diffusion Transformer model for 3D video-like data from [ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation](https://huggingface.co/papers/2510.04290) from NVIDIA and University of Toronto, by Jay Zhangjie Wu, Xuanchi Ren, Tianchang Shen, Tianshi Cao, Kai He, Yifan Lu, Ruiyuan Gao, Enze Xie, Shiyi Lan, Jose M. Alvarez, Jun Gao, Sanja Fidler, Zian Wang, Huan Ling.

> **TL;DR:** ChronoEdit reframes image editing as a video generation task, using input and edited images as start/end frames to leverage pretrained video models with temporal consistency. A temporal reasoning stage introduces reasoning tokens to ensure physically plausible edits and visualize the editing trajectory.
The model can be loaded with the following code snippet.

```python
import torch
from diffusers import ChronoEditTransformer3DModel

transformer = ChronoEditTransformer3DModel.from_pretrained("nvidia/ChronoEdit-14B-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16)
```

## ChronoEditTransformer3DModel

[[autodoc]] ChronoEditTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput

================================================
FILE: docs/source/en/api/models/cogvideox_transformer3d.md
================================================
# CogVideoXTransformer3DModel

A Diffusion Transformer model for 3D data from [CogVideoX](https://github.com/THUDM/CogVideo) was introduced in [CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf) by Tsinghua University & ZhipuAI.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import CogVideoXTransformer3DModel

transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX-2b", subfolder="transformer", torch_dtype=torch.float16).to("cuda")
```

## CogVideoXTransformer3DModel

[[autodoc]] CogVideoXTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput

================================================
FILE: docs/source/en/api/models/cogview3plus_transformer2d.md
================================================
# CogView3PlusTransformer2DModel

A Diffusion Transformer model for 2D data from [CogView3Plus](https://github.com/THUDM/CogView3) was introduced in [CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion](https://huggingface.co/papers/2403.05121) by Tsinghua University & ZhipuAI.

The model can be loaded with the following code snippet.
```python
import torch
from diffusers import CogView3PlusTransformer2DModel

transformer = CogView3PlusTransformer2DModel.from_pretrained("THUDM/CogView3Plus-3b", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda")
```

## CogView3PlusTransformer2DModel

[[autodoc]] CogView3PlusTransformer2DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput

================================================
FILE: docs/source/en/api/models/cogview4_transformer2d.md
================================================
# CogView4Transformer2DModel

A Diffusion Transformer model for 2D data from [CogView4](https://huggingface.co/THUDM/CogView4-6B).

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import CogView4Transformer2DModel

transformer = CogView4Transformer2DModel.from_pretrained("THUDM/CogView4-6B", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda")
```

## CogView4Transformer2DModel

[[autodoc]] CogView4Transformer2DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput

================================================
FILE: docs/source/en/api/models/consisid_transformer3d.md
================================================
# ConsisIDTransformer3DModel

A Diffusion Transformer model for 3D data from [ConsisID](https://github.com/PKU-YuanGroup/ConsisID) was introduced in [Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://huggingface.co/papers/2411.17440) by Peking University, University of Rochester, and others.

The model can be loaded with the following code snippet.
```python
import torch
from diffusers import ConsisIDTransformer3DModel

transformer = ConsisIDTransformer3DModel.from_pretrained("BestWishYsh/ConsisID-preview", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda")
```

## ConsisIDTransformer3DModel

[[autodoc]] ConsisIDTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput

================================================
FILE: docs/source/en/api/models/consistency_decoder_vae.md
================================================
# Consistency Decoder

Consistency decoder can be used to decode the latents from the denoising UNet in the [`StableDiffusionPipeline`]. This decoder was introduced in the [DALL-E 3 technical report](https://openai.com/dall-e-3).

The original codebase can be found at [openai/consistencydecoder](https://github.com/openai/consistencydecoder).

> [!WARNING]
> Inference is only supported for 2 iterations as of now.

The pipeline could not have been contributed without the help of [madebyollin](https://github.com/madebyollin) and [mrsteyk](https://github.com/mrsteyk) from [this issue](https://github.com/openai/consistencydecoder/issues/1).

## ConsistencyDecoderVAE

[[autodoc]] ConsistencyDecoderVAE
    - all
    - decode

================================================
FILE: docs/source/en/api/models/controlnet.md
================================================
# ControlNetModel

The ControlNet model was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, Maneesh Agrawala. It provides a greater degree of control over text-to-image generation by conditioning the model on additional inputs such as edge maps, depth maps, segmentation maps, and keypoints for pose detection.

The abstract from the paper is:

*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models.
ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* ## Loading from the original format By default the [`ControlNetModel`] should be loaded with [`~ModelMixin.from_pretrained`], but it can also be loaded from the original format using [`FromOriginalModelMixin.from_single_file`] as follows: ```py from diffusers import StableDiffusionControlNetPipeline, ControlNetModel url = "https://huggingface.co/lllyasviel/ControlNet-v1-1/blob/main/control_v11p_sd15_canny.pth" # can also be a local path controlnet = ControlNetModel.from_single_file(url) url = "https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/blob/main/v1-5-pruned.safetensors" # can also be a local path pipe = StableDiffusionControlNetPipeline.from_single_file(url, controlnet=controlnet) ``` ## Loading from Control LoRA Control-LoRA is introduced by Stability AI in [stabilityai/control-lora](https://huggingface.co/stabilityai/control-lora) by adding low-rank parameter efficient fine tuning to ControlNet. This approach offers a more efficient and compact method to bring model control to a wider variety of consumer GPUs. 
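To give a rough sense of why a low-rank adapter is more compact than a full weight update, the sketch below compares parameter counts for a single linear layer. The 1280×1280 layer shape and both helper functions are assumptions chosen for illustration; they are not values or APIs taken from the Control-LoRA checkpoint or from Diffusers. The rank of 128 mirrors the `rank128` checkpoints published in the Control-LoRA repository.

```python
def full_update_params(d_in: int, d_out: int) -> int:
    """Parameters in a dense weight update of shape (d_out, d_in)."""
    return d_in * d_out


def lora_update_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in a low-rank update B @ A, with A: (rank, d_in) and B: (d_out, rank)."""
    return rank * (d_in + d_out)


# Hypothetical layer size, chosen only for illustration.
d_in, d_out, rank = 1280, 1280, 128

full = full_update_params(d_in, d_out)
lora = lora_update_params(d_in, d_out, rank)
print(f"full update: {full:,} parameters")
print(f"rank-{rank} update: {lora:,} parameters ({lora / full:.0%} of full)")
```

Under these assumed dimensions the low-rank update stores about a fifth of the parameters of a dense update. The actual savings depend on the shapes of the layers being adapted.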
```py
import torch
from diffusers import ControlNetModel, UNet2DConditionModel

lora_id = "stabilityai/control-lora"
lora_filename = "control-LoRAs-rank128/control-lora-canny-rank128.safetensors"

# Initialize a ControlNet from the SDXL UNet, then load the Control-LoRA weights into it.
unet = UNet2DConditionModel.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet", torch_dtype=torch.bfloat16).to("cuda")
controlnet = ControlNetModel.from_unet(unet).to(device="cuda", dtype=torch.bfloat16)
controlnet.load_lora_adapter(lora_id, weight_name=lora_filename, prefix=None, controlnet_config=controlnet.config)
```

## ControlNetModel

[[autodoc]] ControlNetModel

## ControlNetOutput

[[autodoc]] models.controlnets.controlnet.ControlNetOutput

================================================
FILE: docs/source/en/api/models/controlnet_flux.md
================================================
# FluxControlNetModel

FluxControlNetModel is an implementation of ControlNet for Flux.1.

The ControlNet model was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, Maneesh Agrawala. It provides a greater degree of control over text-to-image generation by conditioning the model on additional inputs such as edge maps, depth maps, segmentation maps, and keypoints for pose detection.

The abstract from the paper is:

*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning.
We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.*

## Loading from the original format

By default the [`FluxControlNetModel`] should be loaded with [`~ModelMixin.from_pretrained`].

```py
from diffusers import FluxControlNetPipeline
from diffusers.models import FluxControlNetModel, FluxMultiControlNetModel

# Single ControlNet
controlnet = FluxControlNetModel.from_pretrained("InstantX/FLUX.1-dev-Controlnet-Canny")
pipe = FluxControlNetPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", controlnet=controlnet)

# Multiple ControlNets: wrap a list of models in FluxMultiControlNetModel
controlnet = FluxControlNetModel.from_pretrained("InstantX/FLUX.1-dev-Controlnet-Canny")
controlnet = FluxMultiControlNetModel([controlnet])
pipe = FluxControlNetPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", controlnet=controlnet)
```

## FluxControlNetModel

[[autodoc]] FluxControlNetModel

## FluxControlNetOutput

[[autodoc]] models.controlnets.controlnet_flux.FluxControlNetOutput

================================================
FILE: docs/source/en/api/models/controlnet_hunyuandit.md
================================================
# HunyuanDiT2DControlNetModel

HunyuanDiT2DControlNetModel is an implementation of ControlNet for [Hunyuan-DiT](https://huggingface.co/papers/2405.08748).

ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala.

With a ControlNet model, you can provide an additional control image to condition and control Hunyuan-DiT generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map.
It is a more flexible and accurate way to control the image generation process. The abstract from the paper is: *We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* This code is implemented by Tencent Hunyuan Team. You can find pre-trained checkpoints for Hunyuan-DiT ControlNets on [Tencent Hunyuan](https://huggingface.co/Tencent-Hunyuan). ## Example For Loading HunyuanDiT2DControlNetModel ```py from diffusers import HunyuanDiT2DControlNetModel import torch controlnet = HunyuanDiT2DControlNetModel.from_pretrained("Tencent-Hunyuan/HunyuanDiT-v1.1-ControlNet-Diffusers-Pose", torch_dtype=torch.float16) ``` ## HunyuanDiT2DControlNetModel [[autodoc]] HunyuanDiT2DControlNetModel ================================================ FILE: docs/source/en/api/models/controlnet_sana.md ================================================ # SanaControlNetModel The ControlNet model was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, Maneesh Agrawala. 
It provides a greater degree of control over text-to-image generation by conditioning the model on additional inputs such as edge maps, depth maps, segmentation maps, and keypoints for pose detection. The abstract from the paper is: *We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* This model was contributed by [ishan24](https://huggingface.co/ishan24). ❤️ The original codebase can be found at [NVlabs/Sana](https://github.com/NVlabs/Sana), and you can find official ControlNet checkpoints on [Efficient-Large-Model's](https://huggingface.co/Efficient-Large-Model) Hub profile. ## SanaControlNetModel [[autodoc]] SanaControlNetModel ## SanaControlNetOutput [[autodoc]] models.controlnets.controlnet_sana.SanaControlNetOutput ================================================ FILE: docs/source/en/api/models/controlnet_sd3.md ================================================ # SD3ControlNetModel SD3ControlNetModel is an implementation of ControlNet for Stable Diffusion 3. 
The ControlNet model was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, Maneesh Agrawala. It provides a greater degree of control over text-to-image generation by conditioning the model on additional inputs such as edge maps, depth maps, segmentation maps, and keypoints for pose detection. The abstract from the paper is: *We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* ## Loading from the original format By default the [`SD3ControlNetModel`] should be loaded with [`~ModelMixin.from_pretrained`]. 
```py from diffusers import StableDiffusion3ControlNetPipeline from diffusers.models import SD3ControlNetModel, SD3MultiControlNetModel controlnet = SD3ControlNetModel.from_pretrained("InstantX/SD3-Controlnet-Canny") pipe = StableDiffusion3ControlNetPipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", controlnet=controlnet) ``` ## SD3ControlNetModel [[autodoc]] SD3ControlNetModel ## SD3ControlNetOutput [[autodoc]] models.controlnets.controlnet_sd3.SD3ControlNetOutput ================================================ FILE: docs/source/en/api/models/controlnet_sparsectrl.md ================================================ # SparseControlNetModel SparseControlNetModel is an implementation of ControlNet for [AnimateDiff](https://huggingface.co/papers/2307.04725). ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. The SparseCtrl version of ControlNet was introduced in [SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models](https://huggingface.co/papers/2311.16933) for achieving controlled generation in text-to-video diffusion models by Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai. The abstract from the paper is: *The development of text-to-video (T2V), i.e., generating videos with a given text prompt, has been significantly advanced in recent years. However, relying solely on text prompts often results in ambiguous frame composition due to spatial uncertainty. The research community thus leverages the dense structure signals, e.g., per-frame depth/edge sequences, to enhance controllability, whose collection accordingly increases the burden of inference. In this work, we present SparseCtrl to enable flexible structure control with temporally sparse signals, requiring only one or a few inputs, as shown in Figure 1. 
It incorporates an additional condition encoder to process these sparse signals while leaving the pre-trained T2V model untouched. The proposed approach is compatible with various modalities, including sketches, depth maps, and RGB images, providing more practical control for video generation and promoting applications such as storyboarding, depth rendering, keyframe animation, and interpolation. Extensive experiments demonstrate the generalization of SparseCtrl on both original and personalized T2V generators. Codes and models will be publicly available at [this https URL](https://guoyww.github.io/projects/SparseCtrl).* ## Example for loading SparseControlNetModel ```python import torch from diffusers import SparseControlNetModel # fp32 variant in float16 # 1. Scribble checkpoint controlnet = SparseControlNetModel.from_pretrained("guoyww/animatediff-sparsectrl-scribble", torch_dtype=torch.float16) # 2. RGB checkpoint controlnet = SparseControlNetModel.from_pretrained("guoyww/animatediff-sparsectrl-rgb", torch_dtype=torch.float16) # For loading fp16 variant, pass `variant="fp16"` as an additional parameter ``` ## SparseControlNetModel [[autodoc]] SparseControlNetModel ## SparseControlNetOutput [[autodoc]] models.controlnets.controlnet_sparsectrl.SparseControlNetOutput ================================================ FILE: docs/source/en/api/models/controlnet_union.md ================================================ # ControlNetUnionModel ControlNetUnionModel is an implementation of ControlNet for Stable Diffusion XL. The ControlNet model was introduced in [ControlNetPlus](https://github.com/xinsir6/ControlNetPlus) by xinsir6. It supports multiple conditioning inputs without increasing computation. *We design a new architecture that can support 10+ control types in condition text-to-image generation and can generate high resolution images visually comparable with midjourney. 
The network is based on the original ControlNet architecture, we propose two new modules to: 1 Extend the original ControlNet to support different image conditions using the same network parameter. 2 Support multiple conditions input without increasing computation offload, which is especially important for designers who want to edit image in detail, different conditions use the same condition encoder, without adding extra computations or parameters.*

## Loading

By default the [`ControlNetUnionModel`] should be loaded with [`~ModelMixin.from_pretrained`].

```py
from diffusers import StableDiffusionXLControlNetUnionPipeline, ControlNetUnionModel

controlnet = ControlNetUnionModel.from_pretrained("xinsir/controlnet-union-sdxl-1.0")
pipe = StableDiffusionXLControlNetUnionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnet)
```

## ControlNetUnionModel

[[autodoc]] ControlNetUnionModel

================================================
FILE: docs/source/en/api/models/cosmos_transformer3d.md
================================================
# CosmosTransformer3DModel

A Diffusion Transformer model for 3D video-like data was introduced in [Cosmos World Foundation Model Platform for Physical AI](https://huggingface.co/papers/2501.03575) by NVIDIA.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import CosmosTransformer3DModel

transformer = CosmosTransformer3DModel.from_pretrained("nvidia/Cosmos-1.0-Diffusion-7B-Text2World", subfolder="transformer", torch_dtype=torch.bfloat16)
```

## CosmosTransformer3DModel

[[autodoc]] CosmosTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput

================================================
FILE: docs/source/en/api/models/dit_transformer2d.md
================================================
# DiTTransformer2DModel

A Transformer model for image-like data from [DiT](https://huggingface.co/papers/2212.09748).
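As a rough illustration of how a DiT-style model treats image-like data, the sketch below counts the tokens produced when a square latent is split into non-overlapping patches. The latent size of 32 and patch size of 2 are illustrative assumptions (they correspond to the DiT-XL/2 setting at 256×256 resolution described in the paper), and `num_patch_tokens` is a hypothetical helper, not part of the Diffusers API.

```python
def num_patch_tokens(latent_size: int, patch_size: int) -> int:
    """Number of tokens when a square latent is split into non-overlapping square patches."""
    if latent_size % patch_size != 0:
        raise ValueError("latent_size must be divisible by patch_size")
    per_side = latent_size // patch_size
    return per_side * per_side


# Illustrative values, not read from any checkpoint config.
print(num_patch_tokens(latent_size=32, patch_size=2))  # 256
print(num_patch_tokens(latent_size=32, patch_size=4))  # 64
```

Halving the patch size quadruples the sequence length, which is the main reason smaller patches cost more compute in this family of models.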
## DiTTransformer2DModel

[[autodoc]] DiTTransformer2DModel

================================================
FILE: docs/source/en/api/models/easyanimate_transformer3d.md
================================================
# EasyAnimateTransformer3DModel

A Diffusion Transformer model for 3D data from [EasyAnimate](https://github.com/aigc-apps/EasyAnimate), introduced by Alibaba PAI.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import EasyAnimateTransformer3DModel

transformer = EasyAnimateTransformer3DModel.from_pretrained("alibaba-pai/EasyAnimateV5.1-12b-zh", subfolder="transformer", torch_dtype=torch.float16).to("cuda")
```

## EasyAnimateTransformer3DModel

[[autodoc]] EasyAnimateTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput

================================================
FILE: docs/source/en/api/models/flux2_transformer.md
================================================
# Flux2Transformer2DModel

A Transformer model for image-like data from [Flux2](https://hf.co/black-forest-labs/FLUX.2-dev).

## Flux2Transformer2DModel

[[autodoc]] Flux2Transformer2DModel

## Flux2Transformer2DModelOutput

[[autodoc]] models.transformers.transformer_flux2.Flux2Transformer2DModelOutput

================================================
FILE: docs/source/en/api/models/flux_transformer.md
================================================
# FluxTransformer2DModel

A Transformer model for image-like data from [Flux](https://blackforestlabs.ai/announcing-black-forest-labs/).

## FluxTransformer2DModel

[[autodoc]] FluxTransformer2DModel

================================================
FILE: docs/source/en/api/models/glm_image_transformer2d.md
================================================
# GlmImageTransformer2DModel

A Diffusion Transformer model for 2D data from [GlmImageTransformer2DModel] (TODO).
## GlmImageTransformer2DModel

[[autodoc]] GlmImageTransformer2DModel

================================================
FILE: docs/source/en/api/models/helios_transformer3d.md
================================================
# HeliosTransformer3DModel

A 14B real-time autoregressive Diffusion Transformer model (supporting T2V, I2V, and V2V) for 3D video-like data from [Helios](https://github.com/PKU-YuanGroup/Helios), introduced in [Helios: Real Real-Time Long Video Generation Model](https://huggingface.co/papers/2603.04379) by Peking University, ByteDance, and others.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import HeliosTransformer3DModel

# Best Quality
transformer = HeliosTransformer3DModel.from_pretrained("BestWishYsh/Helios-Base", subfolder="transformer", torch_dtype=torch.bfloat16)
# Intermediate Weight
transformer = HeliosTransformer3DModel.from_pretrained("BestWishYsh/Helios-Mid", subfolder="transformer", torch_dtype=torch.bfloat16)
# Best Efficiency
transformer = HeliosTransformer3DModel.from_pretrained("BestWishYsh/Helios-Distilled", subfolder="transformer", torch_dtype=torch.bfloat16)
```

## HeliosTransformer3DModel

[[autodoc]] HeliosTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput

================================================
FILE: docs/source/en/api/models/hidream_image_transformer.md
================================================
# HiDreamImageTransformer2DModel

A Transformer model for image-like data from [HiDream-I1](https://huggingface.co/HiDream-ai).

The model can be loaded with the following code snippet.
```python
import torch
from diffusers import HiDreamImageTransformer2DModel

transformer = HiDreamImageTransformer2DModel.from_pretrained("HiDream-ai/HiDream-I1-Full", subfolder="transformer", torch_dtype=torch.bfloat16)
```

## Loading GGUF quantized checkpoints for HiDream-I1

GGUF checkpoints for the `HiDreamImageTransformer2DModel` can be loaded using [`~FromOriginalModelMixin.from_single_file`].

```python
import torch
from diffusers import GGUFQuantizationConfig, HiDreamImageTransformer2DModel

ckpt_path = "https://huggingface.co/city96/HiDream-I1-Dev-gguf/blob/main/hidream-i1-dev-Q2_K.gguf"
transformer = HiDreamImageTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16
)
```

## HiDreamImageTransformer2DModel

[[autodoc]] HiDreamImageTransformer2DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput

================================================
FILE: docs/source/en/api/models/hunyuan_transformer2d.md
================================================
# HunyuanDiT2DModel

A Diffusion Transformer model for 2D data from [Hunyuan-DiT](https://github.com/Tencent/HunyuanDiT).

## HunyuanDiT2DModel

[[autodoc]] HunyuanDiT2DModel

================================================
FILE: docs/source/en/api/models/hunyuan_video15_transformer_3d.md
================================================
# HunyuanVideo15Transformer3DModel

A Diffusion Transformer model for 3D video-like data used in [HunyuanVideo1.5](https://github.com/Tencent/HunyuanVideo1-1.5).

The model can be loaded with the following code snippet.
```python
import torch
from diffusers import HunyuanVideo15Transformer3DModel

transformer = HunyuanVideo15Transformer3DModel.from_pretrained("hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v", subfolder="transformer", torch_dtype=torch.bfloat16)
```

## HunyuanVideo15Transformer3DModel

[[autodoc]] HunyuanVideo15Transformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput

================================================
FILE: docs/source/en/api/models/hunyuan_video_transformer_3d.md
================================================
# HunyuanVideoTransformer3DModel

A Diffusion Transformer model for 3D video-like data, introduced in [HunyuanVideo: A Systematic Framework For Large Video Generative Models](https://huggingface.co/papers/2412.03603) by Tencent.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import HunyuanVideoTransformer3DModel

transformer = HunyuanVideoTransformer3DModel.from_pretrained("hunyuanvideo-community/HunyuanVideo", subfolder="transformer", torch_dtype=torch.bfloat16)
```

## HunyuanVideoTransformer3DModel

[[autodoc]] HunyuanVideoTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput

================================================
FILE: docs/source/en/api/models/hunyuanimage_transformer_2d.md
================================================
# HunyuanImageTransformer2DModel

A Diffusion Transformer model for [HunyuanImage2.1](https://github.com/Tencent-Hunyuan/HunyuanImage-2.1).

The model can be loaded with the following code snippet.
```python
import torch
from diffusers import HunyuanImageTransformer2DModel

transformer = HunyuanImageTransformer2DModel.from_pretrained("hunyuanvideo-community/HunyuanImage-2.1-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16)
```

## HunyuanImageTransformer2DModel

[[autodoc]] HunyuanImageTransformer2DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput

================================================
FILE: docs/source/en/api/models/latte_transformer3d.md
================================================
# LatteTransformer3DModel

A Diffusion Transformer model for 3D data from [Latte](https://github.com/Vchitect/Latte).

## LatteTransformer3DModel

[[autodoc]] LatteTransformer3DModel

================================================
FILE: docs/source/en/api/models/longcat_image_transformer2d.md
================================================
# LongCatImageTransformer2DModel

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import LongCatImageTransformer2DModel

transformer = LongCatImageTransformer2DModel.from_pretrained("meituan-longcat/LongCat-Image", subfolder="transformer", torch_dtype=torch.bfloat16)
```

## LongCatImageTransformer2DModel

[[autodoc]] LongCatImageTransformer2DModel

================================================
FILE: docs/source/en/api/models/ltx2_video_transformer3d.md
================================================
# LTX2VideoTransformer3DModel

A Diffusion Transformer model for 3D data from [LTX](https://huggingface.co/Lightricks/LTX-2), introduced by Lightricks.

The model can be loaded with the following code snippet.
```python
import torch
from diffusers import LTX2VideoTransformer3DModel

transformer = LTX2VideoTransformer3DModel.from_pretrained("Lightricks/LTX-2", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda")
```

## LTX2VideoTransformer3DModel

[[autodoc]] LTX2VideoTransformer3DModel

================================================
FILE: docs/source/en/api/models/ltx_video_transformer3d.md
================================================
# LTXVideoTransformer3DModel

A Diffusion Transformer model for 3D data from [LTX](https://huggingface.co/Lightricks/LTX-Video), introduced by Lightricks.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import LTXVideoTransformer3DModel

transformer = LTXVideoTransformer3DModel.from_pretrained("Lightricks/LTX-Video", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda")
```

## LTXVideoTransformer3DModel

[[autodoc]] LTXVideoTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput

================================================
FILE: docs/source/en/api/models/lumina2_transformer2d.md
================================================
# Lumina2Transformer2DModel

A Diffusion Transformer model for 2D image-like data, introduced in [Lumina Image 2.0](https://huggingface.co/Alpha-VLLM/Lumina-Image-2.0) by Alpha-VLLM.

The model can be loaded with the following code snippet.
```python
import torch
from diffusers import Lumina2Transformer2DModel

transformer = Lumina2Transformer2DModel.from_pretrained("Alpha-VLLM/Lumina-Image-2.0", subfolder="transformer", torch_dtype=torch.bfloat16)
```

## Lumina2Transformer2DModel

[[autodoc]] Lumina2Transformer2DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput

================================================
FILE: docs/source/en/api/models/lumina_nextdit2d.md
================================================
# LuminaNextDiT2DModel

A next-generation Diffusion Transformer model for 2D data from [Lumina-T2X](https://github.com/Alpha-VLLM/Lumina-T2X).

## LuminaNextDiT2DModel

[[autodoc]] LuminaNextDiT2DModel

================================================
FILE: docs/source/en/api/models/mochi_transformer3d.md
================================================
# MochiTransformer3DModel

A Diffusion Transformer model for 3D video-like data, introduced in [Mochi-1 Preview](https://huggingface.co/genmo/mochi-1-preview) by Genmo.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import MochiTransformer3DModel

transformer = MochiTransformer3DModel.from_pretrained("genmo/mochi-1-preview", subfolder="transformer", torch_dtype=torch.float16).to("cuda")
```

## MochiTransformer3DModel

[[autodoc]] MochiTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput

================================================
FILE: docs/source/en/api/models/omnigen_transformer.md
================================================
# OmniGenTransformer2DModel

A Transformer model that accepts multimodal instructions to generate images for [OmniGen](https://github.com/VectorSpaceLab/OmniGen/).

The abstract from the paper is:

*The emergence of Large Language Models (LLMs) has unified language generation tasks and revolutionized human-machine interaction.
However, in the realm of image generation, a unified model capable of handling various tasks within a single framework remains largely unexplored. In this work, we introduce OmniGen, a new diffusion model for unified image generation. OmniGen is characterized by the following features: 1) Unification: OmniGen not only demonstrates text-to-image generation capabilities but also inherently supports various downstream tasks, such as image editing, subject-driven generation, and visual conditional generation. 2) Simplicity: The architecture of OmniGen is highly simplified, eliminating the need for additional plugins. Moreover, compared to existing diffusion models, it is more user-friendly and can complete complex tasks end-to-end through instructions without the need for extra intermediate steps, greatly simplifying the image generation workflow. 3) Knowledge Transfer: Benefit from learning in a unified format, OmniGen effectively transfers knowledge across different tasks, manages unseen tasks and domains, and exhibits novel capabilities. We also explore the model’s reasoning capabilities and potential applications of the chain-of-thought mechanism. This work represents the first attempt at a general-purpose image generation model, and we will release our resources at https://github.com/VectorSpaceLab/OmniGen to foster future advancements.* ```python import torch from diffusers import OmniGenTransformer2DModel transformer = OmniGenTransformer2DModel.from_pretrained("Shitao/OmniGen-v1-diffusers", subfolder="transformer", torch_dtype=torch.bfloat16) ``` ## OmniGenTransformer2DModel [[autodoc]] OmniGenTransformer2DModel ================================================ FILE: docs/source/en/api/models/overview.md ================================================ # Models 🤗 Diffusers provides pretrained models for popular algorithms and modules to create custom diffusion systems. 
The primary function of models is to denoise an input sample as modeled by the distribution \\(p_{\theta}(x_{t-1}|x_{t})\\).

All models are built from the base [`ModelMixin`] class, which is a [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html) providing basic functionality for saving and loading models, locally and from the Hugging Face Hub.

## ModelMixin

[[autodoc]] ModelMixin

## PushToHubMixin

[[autodoc]] utils.PushToHubMixin

================================================
FILE: docs/source/en/api/models/ovisimage_transformer2d.md
================================================
# OvisImageTransformer2DModel

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import OvisImageTransformer2DModel

transformer = OvisImageTransformer2DModel.from_pretrained("AIDC-AI/Ovis-Image-7B", subfolder="transformer", torch_dtype=torch.bfloat16)
```

## OvisImageTransformer2DModel

[[autodoc]] OvisImageTransformer2DModel

================================================
FILE: docs/source/en/api/models/pixart_transformer2d.md
================================================
# PixArtTransformer2DModel

A Transformer model for image-like data from [PixArt-Alpha](https://huggingface.co/papers/2310.00426) and [PixArt-Sigma](https://huggingface.co/papers/2403.04692).

## PixArtTransformer2DModel

[[autodoc]] PixArtTransformer2DModel

================================================
FILE: docs/source/en/api/models/prior_transformer.md
================================================
# PriorTransformer

The Prior Transformer was originally introduced in [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://huggingface.co/papers/2204.06125) by Ramesh et al. It is used to predict CLIP image embeddings from CLIP text embeddings; image embeddings are predicted through a denoising diffusion process.
The abstract from the paper is: *Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.* ## PriorTransformer [[autodoc]] PriorTransformer ## PriorTransformerOutput [[autodoc]] models.transformers.prior_transformer.PriorTransformerOutput ================================================ FILE: docs/source/en/api/models/qwenimage_transformer2d.md ================================================ # QwenImageTransformer2DModel The model can be loaded with the following code snippet. 
```python
import torch
from diffusers import QwenImageTransformer2DModel

transformer = QwenImageTransformer2DModel.from_pretrained("Qwen/QwenImage-20B", subfolder="transformer", torch_dtype=torch.bfloat16)
```

## QwenImageTransformer2DModel

[[autodoc]] QwenImageTransformer2DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput

================================================
FILE: docs/source/en/api/models/sana_transformer2d.md
================================================
# SanaTransformer2DModel

A Diffusion Transformer model for 2D data from [SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers](https://huggingface.co/papers/2410.10629) from NVIDIA and MIT HAN Lab, by Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, Song Han.

The abstract from the paper is:

*We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replaced T5 with modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is very competitive with modern giant diffusion model (e.g.
Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. Sana enables content creation at low cost. Code and model will be publicly released.*

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import SanaTransformer2DModel

transformer = SanaTransformer2DModel.from_pretrained("Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers", subfolder="transformer", torch_dtype=torch.bfloat16)
```

## SanaTransformer2DModel

[[autodoc]] SanaTransformer2DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput

================================================
FILE: docs/source/en/api/models/sana_video_transformer3d.md
================================================
# SanaVideoTransformer3DModel

A Diffusion Transformer model for 3D data (video) from [SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer](https://huggingface.co/papers/2509.24695) from NVIDIA and MIT HAN Lab, by Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie.

The abstract from the paper is:

*We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on RTX 5090 GPU. Two core designs ensure our efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation.
(2) Constant-Memory KV cache for Block Linear Attention: we design block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.* The model can be loaded with the following code snippet. ```python from diffusers import SanaVideoTransformer3DModel import torch transformer = SanaVideoTransformer3DModel.from_pretrained("Efficient-Large-Model/SANA-Video_2B_480p_diffusers", subfolder="transformer", torch_dtype=torch.bfloat16) ``` ## SanaVideoTransformer3DModel [[autodoc]] SanaVideoTransformer3DModel ## Transformer2DModelOutput [[autodoc]] models.modeling_outputs.Transformer2DModelOutput ================================================ FILE: docs/source/en/api/models/sd3_transformer2d.md ================================================ # SD3 Transformer Model The Transformer model introduced in [Stable Diffusion 3](https://hf.co/papers/2403.03206). Its novelty lies in the MMDiT transformer block. 
## SD3Transformer2DModel

[[autodoc]] SD3Transformer2DModel

================================================
FILE: docs/source/en/api/models/skyreels_v2_transformer_3d.md
================================================
# SkyReelsV2Transformer3DModel

A Diffusion Transformer model for 3D video-like data, introduced in [SkyReels-V2](https://github.com/SkyworkAI/SkyReels-V2) by Skywork AI.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import SkyReelsV2Transformer3DModel

transformer = SkyReelsV2Transformer3DModel.from_pretrained("Skywork/SkyReels-V2-DF-1.3B-540P-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16)
```

## SkyReelsV2Transformer3DModel

[[autodoc]] SkyReelsV2Transformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput

================================================
FILE: docs/source/en/api/models/stable_audio_transformer.md
================================================
# StableAudioDiTModel

A Transformer model for audio waveforms from [Stable Audio Open](https://huggingface.co/papers/2407.14358).

## StableAudioDiTModel

[[autodoc]] StableAudioDiTModel

================================================
FILE: docs/source/en/api/models/stable_cascade_unet.md
================================================
# StableCascadeUNet

A UNet model from the [Stable Cascade pipeline](../pipelines/stable_cascade.md).

## StableCascadeUNet

[[autodoc]] models.unets.unet_stable_cascade.StableCascadeUNet

================================================
FILE: docs/source/en/api/models/transformer2d.md
================================================
# Transformer2DModel

A Transformer model for image-like data from [CompVis](https://huggingface.co/CompVis) that is based on the [Vision Transformer](https://huggingface.co/papers/2010.11929) introduced by Dosovitskiy et al.
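
For continuous inputs, the model flattens an image-shaped input into a token sequence of shape `(batch_size, sequence_length, feature_dimension)`, runs the Transformer blocks, and reshapes the result back to an image. The sketch below only illustrates that reshaping round trip and is not the library's implementation: nested Python lists stand in for tensors so it has no dependencies, and `image_to_sequence`, `sequence_to_image`, and `transformer_blocks` are hypothetical helper names, with the projection and the blocks replaced by an identity placeholder.

```python
# Illustrative sketch only: nested lists stand in for tensors, and the
# projection and Transformer blocks are identity placeholders (assumptions).


def image_to_sequence(image):
    """Flatten [batch][channels][height][width] into [batch][height * width][channels]."""
    sequence = []
    for sample in image:
        channels, height, width = len(sample), len(sample[0]), len(sample[0][0])
        # One token per spatial position (row-major), one feature per channel.
        tokens = [
            [sample[c][y][x] for c in range(channels)]
            for y in range(height)
            for x in range(width)
        ]
        sequence.append(tokens)
    return sequence


def sequence_to_image(sequence, height, width):
    """Invert image_to_sequence: back to [batch][channels][height][width]."""
    image = []
    for tokens in sequence:
        channels = len(tokens[0])
        sample = [
            [[tokens[y * width + x][c] for x in range(width)] for y in range(height)]
            for c in range(channels)
        ]
        image.append(sample)
    return image


def transformer_blocks(sequence):
    """Hypothetical placeholder for the Transformer blocks; returns tokens unchanged."""
    return sequence


# A batch of 1 image with 2 channels, height 2, and width 3.
image = [[[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]]

tokens = image_to_sequence(image)            # (batch_size, sequence_length, feature_dimension)
tokens = transformer_blocks(tokens)          # the blocks operate on the token sequence
restored = sequence_to_image(tokens, 2, 3)   # reshape back to an image

print(len(tokens), len(tokens[0]), len(tokens[0][0]))  # 1 6 2
```

Here the sequence length is `height * width` and the feature dimension equals the channel count; in the real model a learned projection would typically change the feature dimension before the blocks run.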
The [`Transformer2DModel`] accepts discrete (classes of vector embeddings) or continuous (actual embeddings) inputs. When the input is **continuous**: 1. Project the input and reshape it to `(batch_size, sequence_length, feature_dimension)`. 2. Apply the Transformer blocks in the standard way. 3. Reshape to image. When the input is **discrete**: > [!TIP] > It is assumed one of the input classes is the masked latent pixel. The predicted classes of the unnoised image don't contain a prediction for the masked pixel because the unnoised image cannot be masked. 1. Convert input (classes of latent pixels) to embeddings and apply positional embeddings. 2. Apply the Transformer blocks in the standard way. 3. Predict classes of unnoised image. ## Transformer2DModel [[autodoc]] Transformer2DModel ## Transformer2DModelOutput [[autodoc]] models.modeling_outputs.Transformer2DModelOutput ================================================ FILE: docs/source/en/api/models/transformer_bria_fibo.md ================================================ # BriaFiboTransformer2DModel A modified flux Transformer model from [Bria](https://huggingface.co/briaai/FIBO) ## BriaFiboTransformer2DModel [[autodoc]] BriaFiboTransformer2DModel ================================================ FILE: docs/source/en/api/models/transformer_temporal.md ================================================ # TransformerTemporalModel A Transformer model for video-like data. 
## TransformerTemporalModel

[[autodoc]] models.transformers.transformer_temporal.TransformerTemporalModel

## TransformerTemporalModelOutput

[[autodoc]] models.transformers.transformer_temporal.TransformerTemporalModelOutput

================================================
FILE: docs/source/en/api/models/unet-motion.md
================================================
# UNetMotionModel

The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in 🤗 Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in 🤗 Diffusers, depending on its number of dimensions and whether it is a conditional model or not. This is a 2D UNet model.

The abstract from the paper is:

*There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU.
The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.*

## UNetMotionModel

[[autodoc]] UNetMotionModel

## UNet3DConditionOutput

[[autodoc]] models.unets.unet_3d_condition.UNet3DConditionOutput

================================================
FILE: docs/source/en/api/models/unet.md
================================================
# UNet1DModel

The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in 🤗 Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in 🤗 Diffusers, depending on its number of dimensions and whether it is a conditional model or not. This is a 1D UNet model.

The abstract from the paper is:

*There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU.
The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.*

## UNet1DModel

[[autodoc]] UNet1DModel

## UNet1DOutput

[[autodoc]] models.unets.unet_1d.UNet1DOutput

================================================
FILE: docs/source/en/api/models/unet2d-cond.md
================================================
# UNet2DConditionModel

The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in 🤗 Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in 🤗 Diffusers, depending on its number of dimensions and whether it is a conditional model or not. This is a 2D UNet conditional model.

The abstract from the paper is:

*There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU.
The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.*

## UNet2DConditionModel

[[autodoc]] UNet2DConditionModel

## UNet2DConditionOutput

[[autodoc]] models.unets.unet_2d_condition.UNet2DConditionOutput

================================================
FILE: docs/source/en/api/models/unet2d.md
================================================
# UNet2DModel

The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in 🤗 Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in 🤗 Diffusers, depending on its number of dimensions and whether it is a conditional model or not. This is a 2D UNet model.

The abstract from the paper is:

*There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU.
The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.*

## UNet2DModel
[[autodoc]] UNet2DModel

## UNet2DOutput
[[autodoc]] models.unets.unet_2d.UNet2DOutput

================================================
FILE: docs/source/en/api/models/unet3d-cond.md
================================================
# UNet3DConditionModel

The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in 🤗 Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in 🤗 Diffusers, depending on the number of dimensions and whether the model is conditional. This is a 3D UNet conditional model.

The abstract from the paper is:

*There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU.
The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.* ## UNet3DConditionModel [[autodoc]] UNet3DConditionModel ## UNet3DConditionOutput [[autodoc]] models.unets.unet_3d_condition.UNet3DConditionOutput ================================================ FILE: docs/source/en/api/models/uvit2d.md ================================================ # UVit2DModel The [U-ViT](https://hf.co/papers/2301.11093) model is a vision transformer (ViT) based UNet. This model incorporates elements from ViT (considers all inputs such as time, conditions and noisy image patches as tokens) and a UNet (long skip connections between the shallow and deep layers). The skip connection is important for predicting pixel-level features. An additional 3x3 convolutional block is applied prior to the final output to improve image quality. The abstract from the paper is: *Currently, applying diffusion models in pixel space of high resolution images is difficult. Instead, existing approaches focus on diffusion in lower dimensional spaces (latent diffusion), or have multiple super-resolution levels of generation referred to as cascades. The downside is that these approaches add additional complexity to the diffusion framework. This paper aims to improve denoising diffusion for high resolution images while keeping the model as simple as possible. The paper is centered around the research question: How can one train a standard denoising diffusion models on high resolution images, and still obtain performance comparable to these alternate approaches? The four main findings are: 1) the noise schedule should be adjusted for high resolution images, 2) It is sufficient to scale only a particular part of the architecture, 3) dropout should be added at specific locations in the architecture, and 4) downsampling is an effective strategy to avoid high resolution feature maps. 
Combining these simple yet effective techniques, we achieve state-of-the-art on image generation among diffusion models without sampling modifiers on ImageNet.* ## UVit2DModel [[autodoc]] UVit2DModel ## UVit2DConvEmbed [[autodoc]] models.unets.uvit_2d.UVit2DConvEmbed ## UVitBlock [[autodoc]] models.unets.uvit_2d.UVitBlock ## ConvNextBlock [[autodoc]] models.unets.uvit_2d.ConvNextBlock ## ConvMlmLayer [[autodoc]] models.unets.uvit_2d.ConvMlmLayer ================================================ FILE: docs/source/en/api/models/vq.md ================================================ # VQModel The VQ-VAE model was introduced in [Neural Discrete Representation Learning](https://huggingface.co/papers/1711.00937) by Aaron van den Oord, Oriol Vinyals and Koray Kavukcuoglu. The model is used in 🤗 Diffusers to decode latent representations into images. Unlike [`AutoencoderKL`], the [`VQModel`] works in a quantized latent space. The abstract from the paper is: *Learning useful representations without supervision remains a key challenge in machine learning. In this paper, we propose a simple yet powerful generative model that learns such discrete representations. Our model, the Vector Quantised-Variational AutoEncoder (VQ-VAE), differs from VAEs in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, we incorporate ideas from vector quantisation (VQ). Using the VQ method allows the model to circumvent issues of "posterior collapse" -- where the latents are ignored when they are paired with a powerful autoregressive decoder -- typically observed in the VAE framework. 
Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.*

## VQModel

[[autodoc]] VQModel

## VQEncoderOutput

[[autodoc]] models.autoencoders.vq_model.VQEncoderOutput

================================================
FILE: docs/source/en/api/models/wan_animate_transformer_3d.md
================================================
# WanAnimateTransformer3DModel

A Diffusion Transformer model for 3D video-like data was introduced in [Wan Animate](https://github.com/Wan-Video/Wan2.2) by the Alibaba Wan Team.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import WanAnimateTransformer3DModel

transformer = WanAnimateTransformer3DModel.from_pretrained("Wan-AI/Wan2.2-Animate-14B-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16)
```

## WanAnimateTransformer3DModel

[[autodoc]] WanAnimateTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput

================================================
FILE: docs/source/en/api/models/wan_transformer_3d.md
================================================
# WanTransformer3DModel

A Diffusion Transformer model for 3D video-like data was introduced in [Wan 2.1](https://github.com/Wan-Video/Wan2.1) by the Alibaba Wan Team.

The model can be loaded with the following code snippet.
```python
import torch
from diffusers import WanTransformer3DModel

transformer = WanTransformer3DModel.from_pretrained("Wan-AI/Wan2.1-T2V-1.3B-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16)
```

## WanTransformer3DModel

[[autodoc]] WanTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput

================================================
FILE: docs/source/en/api/models/z_image_transformer2d.md
================================================
# ZImageTransformer2DModel

A Transformer model for image-like data from [Z-Image](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo).

## ZImageTransformer2DModel

[[autodoc]] ZImageTransformer2DModel

================================================
FILE: docs/source/en/api/modular_diffusers/guiders.md
================================================
# Guiders

Guiders are components in Modular Diffusers that control how the diffusion process is guided during generation. They implement various guidance techniques to improve generation quality and control.
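
As a rough illustration of the kind of computation a guider performs, the sketch below shows the arithmetic commonly associated with classifier-free guidance: a conditional and an unconditional prediction are combined using a guidance scale. It uses plain Python lists instead of tensors, and the function `classifier_free_guidance` and its arguments are hypothetical names chosen for this example, not the API of the `ClassifierFreeGuidance` class.

```python
from typing import List


def classifier_free_guidance(
    pred_cond: List[float],
    pred_uncond: List[float],
    guidance_scale: float,
) -> List[float]:
    """Combine two predictions as uncond + scale * (cond - uncond)."""
    if len(pred_cond) != len(pred_uncond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for c, u in zip(pred_cond, pred_uncond)]


# Toy values standing in for model predictions (assumed, for illustration only).
pred_cond = [1.0, 2.0, 3.0]
pred_uncond = [0.5, 1.0, 1.5]

# A scale of 1.0 reproduces the conditional prediction.
same_as_cond = classifier_free_guidance(pred_cond, pred_uncond, guidance_scale=1.0)

# Larger scales push the result further along the (cond - uncond) direction.
guided = classifier_free_guidance(pred_cond, pred_uncond, guidance_scale=3.0)
```

The other guiders listed on this page use different techniques; refer to each class's API documentation for what it actually computes.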
## BaseGuidance [[autodoc]] diffusers.guiders.guider_utils.BaseGuidance ## ClassifierFreeGuidance [[autodoc]] diffusers.guiders.classifier_free_guidance.ClassifierFreeGuidance ## ClassifierFreeZeroStarGuidance [[autodoc]] diffusers.guiders.classifier_free_zero_star_guidance.ClassifierFreeZeroStarGuidance ## SkipLayerGuidance [[autodoc]] diffusers.guiders.skip_layer_guidance.SkipLayerGuidance ## SmoothedEnergyGuidance [[autodoc]] diffusers.guiders.smoothed_energy_guidance.SmoothedEnergyGuidance ## PerturbedAttentionGuidance [[autodoc]] diffusers.guiders.perturbed_attention_guidance.PerturbedAttentionGuidance ## AdaptiveProjectedGuidance [[autodoc]] diffusers.guiders.adaptive_projected_guidance.AdaptiveProjectedGuidance ## AutoGuidance [[autodoc]] diffusers.guiders.auto_guidance.AutoGuidance ## TangentialClassifierFreeGuidance [[autodoc]] diffusers.guiders.tangential_classifier_free_guidance.TangentialClassifierFreeGuidance ================================================ FILE: docs/source/en/api/modular_diffusers/pipeline.md ================================================ # Pipeline ## ModularPipeline [[autodoc]] diffusers.modular_pipelines.modular_pipeline.ModularPipeline ================================================ FILE: docs/source/en/api/modular_diffusers/pipeline_blocks.md ================================================ # Pipeline blocks ## ModularPipelineBlocks [[autodoc]] diffusers.modular_pipelines.modular_pipeline.ModularPipelineBlocks ## SequentialPipelineBlocks [[autodoc]] diffusers.modular_pipelines.modular_pipeline.SequentialPipelineBlocks ## LoopSequentialPipelineBlocks [[autodoc]] diffusers.modular_pipelines.modular_pipeline.LoopSequentialPipelineBlocks ## AutoPipelineBlocks [[autodoc]] diffusers.modular_pipelines.modular_pipeline.AutoPipelineBlocks ## ConditionalPipelineBlocks [[autodoc]] diffusers.modular_pipelines.modular_pipeline.ConditionalPipelineBlocks ================================================ FILE: 
docs/source/en/api/modular_diffusers/pipeline_components.md ================================================ # Components and configs ## ComponentSpec [[autodoc]] diffusers.modular_pipelines.modular_pipeline.ComponentSpec ## ConfigSpec [[autodoc]] diffusers.modular_pipelines.modular_pipeline.ConfigSpec ## ComponentsManager [[autodoc]] diffusers.modular_pipelines.components_manager.ComponentsManager ## InsertableDict [[autodoc]] diffusers.modular_pipelines.modular_pipeline_utils.InsertableDict ================================================ FILE: docs/source/en/api/modular_diffusers/pipeline_states.md ================================================ # Pipeline states ## PipelineState [[autodoc]] diffusers.modular_pipelines.modular_pipeline.PipelineState ## BlockState [[autodoc]] diffusers.modular_pipelines.modular_pipeline.BlockState ================================================ FILE: docs/source/en/api/normalization.md ================================================ # Normalization layers Customized normalization layers for supporting various models in 🤗 Diffusers. 
## AdaLayerNorm

[[autodoc]] models.normalization.AdaLayerNorm

## AdaLayerNormZero

[[autodoc]] models.normalization.AdaLayerNormZero

## AdaLayerNormSingle

[[autodoc]] models.normalization.AdaLayerNormSingle

## AdaGroupNorm

[[autodoc]] models.normalization.AdaGroupNorm

## AdaLayerNormContinuous

[[autodoc]] models.normalization.AdaLayerNormContinuous

## RMSNorm

[[autodoc]] models.normalization.RMSNorm

## GlobalResponseNorm

[[autodoc]] models.normalization.GlobalResponseNorm

## LuminaLayerNormContinuous

[[autodoc]] models.normalization.LuminaLayerNormContinuous

## SD35AdaLayerNormZeroX

[[autodoc]] models.normalization.SD35AdaLayerNormZeroX

## AdaLayerNormZeroSingle

[[autodoc]] models.normalization.AdaLayerNormZeroSingle

## LuminaRMSNormZero

[[autodoc]] models.normalization.LuminaRMSNormZero

## LpNorm

[[autodoc]] models.normalization.LpNorm

## CogView3PlusAdaLayerNormZeroTextImage

[[autodoc]] models.normalization.CogView3PlusAdaLayerNormZeroTextImage

## CogVideoXLayerNormZero

[[autodoc]] models.normalization.CogVideoXLayerNormZero

## MochiRMSNormZero

[[autodoc]] models.transformers.transformer_mochi.MochiRMSNormZero

## MochiRMSNorm

[[autodoc]] models.normalization.MochiRMSNorm

================================================
FILE: docs/source/en/api/outputs.md
================================================
# Outputs

All model outputs are subclasses of [`~utils.BaseOutput`], data structures containing all the information returned by the model. The outputs can also be used as tuples or dictionaries.

For example:

```python
from diffusers import DDIMPipeline

pipeline = DDIMPipeline.from_pretrained("google/ddpm-cifar10-32")
outputs = pipeline()
```

The `outputs` object is a [`~pipelines.ImagePipelineOutput`], which means it has an `images` attribute.
You can access each attribute as you normally would or with a keyword lookup, and if that attribute is not returned by the model, you will get `None`:

```python
outputs.images
outputs["images"]
```

When considering the `outputs` object as a tuple, it only considers the attributes that don't have `None` values. For instance, retrieving an image by indexing into it returns the one-element tuple `(outputs.images,)`:

```python
outputs[:1]
```

> [!TIP]
> To check a specific pipeline or model output, refer to its corresponding API documentation.

## BaseOutput

[[autodoc]] utils.BaseOutput
  - to_tuple

## ImagePipelineOutput

[[autodoc]] pipelines.ImagePipelineOutput

## AudioPipelineOutput

[[autodoc]] pipelines.AudioPipelineOutput

## ImageTextPipelineOutput

[[autodoc]] ImageTextPipelineOutput

================================================
FILE: docs/source/en/api/parallel.md
================================================
# Parallelism

Parallelism strategies help speed up diffusion transformers by distributing computations across multiple devices, allowing for faster inference/training times. Refer to the [Distributed inference](../training/distributed_inference) guide to learn more.

## ParallelConfig

[[autodoc]] ParallelConfig

## ContextParallelConfig

[[autodoc]] ContextParallelConfig

[[autodoc]] hooks.apply_context_parallel

================================================
FILE: docs/source/en/api/pipelines/allegro.md
================================================
# Allegro

[Allegro: Open the Black Box of Commercial-Level Video Generation Model](https://huggingface.co/papers/2410.15458) from RhymesAI, by Yuan Zhou, Qiuyue Wang, Yuxuan Cai, Huan Yang.

The abstract from the paper is:

*Significant advancements have been made in the field of video generation, with the open-source community contributing a wealth of research papers and tools for training high-quality models.
However, despite these efforts, the available information and resources remain insufficient for achieving commercial-level performance. In this report, we open the black box and introduce Allegro, an advanced video generation model that excels in both quality and temporal consistency. We also highlight the current limitations in the field and present a comprehensive methodology for training high-performance, commercial-level video generation models, addressing key aspects such as data, model architecture, training pipeline, and evaluation. Our user study shows that Allegro surpasses existing open-source models and most commercial models, ranking just behind Hailuo and Kling. Code: https://github.com/rhymes-ai/Allegro , Model: https://huggingface.co/rhymes-ai/Allegro , Gallery: https://rhymes.ai/allegro_gallery .* > [!TIP] > Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. ## Quantization Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model. Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`AllegroPipeline`] for inference with bitsandbytes. 
```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, AllegroTransformer3DModel, AllegroPipeline
from diffusers.utils import export_to_video
from transformers import BitsAndBytesConfig, T5EncoderModel

quant_config = BitsAndBytesConfig(load_in_8bit=True)
text_encoder_8bit = T5EncoderModel.from_pretrained(
    "rhymes-ai/Allegro",
    subfolder="text_encoder",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = AllegroTransformer3DModel.from_pretrained(
    "rhymes-ai/Allegro",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = AllegroPipeline.from_pretrained(
    "rhymes-ai/Allegro",
    text_encoder=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = (
    "A seaside harbor with bright sunlight and sparkling seawater, with many boats in the water. From an aerial view, "
    "the boats vary in size and color, some moving and some stationary. Fishing boats in the water suggest that this "
    "location might be a popular spot for docking fishing boats."
)
video = pipeline(prompt, guidance_scale=7.5, max_sequence_length=512).frames[0]
export_to_video(video, "harbor.mp4", fps=15)
```

## AllegroPipeline

[[autodoc]] AllegroPipeline
  - all
  - __call__

## AllegroPipelineOutput

[[autodoc]] pipelines.allegro.pipeline_output.AllegroPipelineOutput

================================================
FILE: docs/source/en/api/pipelines/amused.md
================================================
> [!WARNING]
> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model.
# aMUSEd

aMUSEd was introduced in [aMUSEd: An Open MUSE Reproduction](https://huggingface.co/papers/2401.01808) by Suraj Patil, William Berman, Robin Rombach, and Patrick von Platen.

aMUSEd is a lightweight text-to-image model based on the [MUSE](https://huggingface.co/papers/2301.00704) architecture. It is particularly useful in applications that require a lightweight and fast model, such as generating many images at once.

aMUSEd is a VQ-VAE token-based transformer that can generate an image in fewer forward passes than many diffusion models. In contrast with MUSE, it uses the smaller text encoder CLIP-L/14 instead of T5-XXL. Because of its small parameter count and the few forward passes it needs per image, aMUSEd can generate many images quickly. This benefit is most visible at larger batch sizes.

The abstract from the paper is:

*We present aMUSEd, an open-source, lightweight masked image model (MIM) for text-to-image generation based on MUSE. With 10 percent of MUSE's parameters, aMUSEd is focused on fast image generation. We believe MIM is under-explored compared to latent diffusion, the prevailing approach for text-to-image generation. Compared to latent diffusion, MIM requires fewer inference steps and is more interpretable. Additionally, MIM can be fine-tuned to learn additional styles with only a single image. We hope to encourage further exploration of MIM by demonstrating its effectiveness on large-scale text-to-image generation and releasing reproducible training code.
We also release checkpoints for two models which directly produce images at 256x256 and 512x512 resolutions.*

| Model | Params |
|-------|--------|
| [amused-256](https://huggingface.co/amused/amused-256) | 603M |
| [amused-512](https://huggingface.co/amused/amused-512) | 608M |

## AmusedPipeline

[[autodoc]] AmusedPipeline
  - __call__
  - all
  - enable_xformers_memory_efficient_attention
  - disable_xformers_memory_efficient_attention

[[autodoc]] AmusedImg2ImgPipeline
  - __call__
  - all
  - enable_xformers_memory_efficient_attention
  - disable_xformers_memory_efficient_attention

[[autodoc]] AmusedInpaintPipeline
  - __call__
  - all
  - enable_xformers_memory_efficient_attention
  - disable_xformers_memory_efficient_attention

================================================
FILE: docs/source/en/api/pipelines/animatediff.md
================================================
# Text-to-Video Generation with AnimateDiff
## Overview [AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning](https://huggingface.co/papers/2307.04725) by Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, Bo Dai. The abstract of the paper is the following: *With the advance of text-to-image models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. Subsequently, there is a great demand for image animation techniques to further combine generated static images with motion dynamics. In this report, we propose a practical framework to animate most of the existing personalized text-to-image models once and for all, saving efforts in model-specific tuning. At the core of the proposed framework is to insert a newly initialized motion modeling module into the frozen text-to-image model and train it on video clips to distill reasonable motion priors. Once trained, by simply injecting this motion modeling module, all personalized versions derived from the same base T2I readily become text-driven models that produce diverse and personalized animated images. We conduct our evaluation on several public representative personalized text-to-image models across anime pictures and realistic photographs, and demonstrate that our proposed framework helps these models generate temporally smooth animation clips while preserving the domain and diversity of their outputs. 
Code and pre-trained weights will be publicly available at [this https URL](https://animatediff.github.io/).*

## Available Pipelines

| Pipeline | Tasks | Demo |
|---|---|:---:|
| [AnimateDiffPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff.py) | *Text-to-Video Generation with AnimateDiff* |
| [AnimateDiffControlNetPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff_controlnet.py) | *Controlled Video-to-Video Generation with AnimateDiff using ControlNet* |
| [AnimateDiffSparseControlNetPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff_sparsectrl.py) | *Controlled Video-to-Video Generation with AnimateDiff using SparseCtrl* |
| [AnimateDiffSDXLPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff_sdxl.py) | *Text-to-Video Generation with AnimateDiff and Stable Diffusion XL* |
| [AnimateDiffVideoToVideoPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff_video2video.py) | *Video-to-Video Generation with AnimateDiff* |
| [AnimateDiffVideoToVideoControlNetPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff_video2video_controlnet.py) | *Video-to-Video Generation with AnimateDiff using ControlNet* |

## Available checkpoints

Motion Adapter checkpoints can be found under [guoyww](https://huggingface.co/guoyww/). These checkpoints are meant to work with any model based on Stable Diffusion 1.4/1.5.

## Usage example

### AnimateDiffPipeline

AnimateDiff works with a MotionAdapter checkpoint and a Stable Diffusion model checkpoint. The MotionAdapter is a collection of Motion Modules that are responsible for adding coherent motion across image frames.
These modules are applied after the Resnet and Attention blocks in the Stable Diffusion UNet.

The following example demonstrates how to use a *MotionAdapter* checkpoint with Diffusers for inference based on StableDiffusion-1.4/1.5.

```python
import torch
from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif

# Load the motion adapter
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
# load SD 1.5 based finetuned model
model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16)
scheduler = DDIMScheduler.from_pretrained(
    model_id,
    subfolder="scheduler",
    clip_sample=False,
    timestep_spacing="linspace",
    beta_schedule="linear",
    steps_offset=1,
)
pipe.scheduler = scheduler

# enable memory savings
pipe.enable_vae_slicing()
pipe.enable_model_cpu_offload()

output = pipe(
    prompt=(
        "masterpiece, bestquality, highlydetailed, ultradetailed, sunset, "
        "orange sky, warm lighting, fishing boats, ocean waves seagulls, "
        "rippling water, wharf, silhouette, serene atmosphere, dusk, evening glow, "
        "golden hour, coastal landscape, seaside scenery"
    ),
    negative_prompt="bad quality, worse quality",
    num_frames=16,
    guidance_scale=7.5,
    num_inference_steps=25,
    generator=torch.Generator("cpu").manual_seed(42),
)
frames = output.frames[0]
export_to_gif(frames, "animation.gif")
```

Here are some sample outputs:
Sample output for the prompt: masterpiece, bestquality, sunset.

> [!TIP]
> AnimateDiff tends to work better with finetuned Stable Diffusion models. If you plan on using a scheduler that can clip samples, make sure to disable clipping by setting `clip_sample=False` in the scheduler, as clipping can have an adverse effect on generated samples. Additionally, the AnimateDiff checkpoints can be sensitive to the beta schedule of the scheduler. We recommend setting this to `linear`.

### AnimateDiffControlNetPipeline

AnimateDiff can also be used with ControlNets. ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide depth maps, the ControlNet model generates a video that preserves the spatial information from the depth maps. It is a more flexible and accurate way to control the video generation process.
```python
import torch
from diffusers import AnimateDiffControlNetPipeline, AutoencoderKL, ControlNetModel, MotionAdapter, LCMScheduler
from diffusers.utils import export_to_gif, load_video

# Additionally, you will need to preprocess videos before they can be used with the ControlNet
# HF maintains just the right package for it: `pip install controlnet_aux`
from controlnet_aux.processor import ZoeDetector

# Download controlnets from https://huggingface.co/lllyasviel/ControlNet-v1-1 to use .from_single_file
# Download Diffusers-format controlnets, such as https://huggingface.co/lllyasviel/sd-controlnet-depth, to use .from_pretrained()
controlnet = ControlNetModel.from_single_file("control_v11f1p_sd15_depth.pth", torch_dtype=torch.float16)

# We use AnimateLCM for this example but one can use the original motion adapters as well (for example, https://huggingface.co/guoyww/animatediff-motion-adapter-v1-5-3)
motion_adapter = MotionAdapter.from_pretrained("wangfuyun/AnimateLCM")

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16)
pipe: AnimateDiffControlNetPipeline = AnimateDiffControlNetPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE",
    motion_adapter=motion_adapter,
    controlnet=controlnet,
    vae=vae,
).to(device="cuda", dtype=torch.float16)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config, beta_schedule="linear")
pipe.load_lora_weights("wangfuyun/AnimateLCM", weight_name="AnimateLCM_sd15_t2v_lora.safetensors", adapter_name="lcm-lora")
pipe.set_adapters(["lcm-lora"], [0.8])

depth_detector = ZoeDetector.from_pretrained("lllyasviel/Annotators").to("cuda")
video = load_video("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-vid2vid-input-1.gif")
conditioning_frames = []

with pipe.progress_bar(total=len(video)) as progress_bar:
    for frame in video:
        conditioning_frames.append(depth_detector(frame))
        progress_bar.update()

prompt = "a panda, playing a guitar, sitting in a pink boat, in the ocean, mountains in background, realistic, high quality"
negative_prompt = "bad quality, worst quality"

video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_frames=len(video),
    num_inference_steps=10,
    guidance_scale=2.0,
    conditioning_frames=conditioning_frames,
    generator=torch.Generator().manual_seed(42),
).frames[0]

export_to_gif(video, "animatediff_controlnet.gif", fps=8)
```

Here are some sample outputs:

| Source Video | Output Video |
|---|---|
| raccoon playing a guitar | a panda, playing a guitar, sitting in a pink boat, in the ocean, mountains in background, realistic, high quality |

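
Both examples above set `beta_schedule="linear"`, as the earlier tip recommends. As a rough sketch of what that setting refers to, the code below builds a `linear` schedule and, for comparison, a `scaled_linear` one in plain Python, following the way these two schedules are commonly defined. The helper names and the `beta_start`, `beta_end`, and step-count values are illustrative assumptions, not values read from any checkpoint or scheduler config.

```python
from typing import List


def linspace(start: float, stop: float, num: int) -> List[float]:
    """Return `num` evenly spaced values from `start` to `stop` inclusive."""
    if num == 1:
        return [start]
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def linear_betas(beta_start: float, beta_end: float, num_steps: int) -> List[float]:
    """Betas increase linearly from beta_start to beta_end."""
    return linspace(beta_start, beta_end, num_steps)


def scaled_linear_betas(beta_start: float, beta_end: float, num_steps: int) -> List[float]:
    """Square roots of the betas increase linearly; the values are then squared."""
    return [b ** 2 for b in linspace(beta_start ** 0.5, beta_end ** 0.5, num_steps)]


# Illustrative values only (assumed, not taken from a specific model config).
linear = linear_betas(0.00085, 0.012, 1000)
scaled = scaled_linear_betas(0.00085, 0.012, 1000)

# Both schedules share their endpoints but differ in between.
print(linear[500], scaled[500])
```

Because the two schedules differ at every intermediate step, a checkpoint trained with one can behave differently when sampled with the other, which is one way to read the tip's advice.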
### AnimateDiffSparseControlNetPipeline [SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models](https://huggingface.co/papers/2311.16933) for achieving controlled generation in text-to-video diffusion models by Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai. The abstract from the paper is: *The development of text-to-video (T2V), i.e., generating videos with a given text prompt, has been significantly advanced in recent years. However, relying solely on text prompts often results in ambiguous frame composition due to spatial uncertainty. The research community thus leverages the dense structure signals, e.g., per-frame depth/edge sequences, to enhance controllability, whose collection accordingly increases the burden of inference. In this work, we present SparseCtrl to enable flexible structure control with temporally sparse signals, requiring only one or a few inputs, as shown in Figure 1. It incorporates an additional condition encoder to process these sparse signals while leaving the pre-trained T2V model untouched. The proposed approach is compatible with various modalities, including sketches, depth maps, and RGB images, providing more practical control for video generation and promoting applications such as storyboarding, depth rendering, keyframe animation, and interpolation. Extensive experiments demonstrate the generalization of SparseCtrl on both original and personalized T2V generators. 
Codes and models will be publicly available at [this https URL](https://guoyww.github.io/projects/SparseCtrl).*

SparseCtrl introduces the following checkpoints for controlled text-to-video generation:

- [SparseCtrl Scribble](https://huggingface.co/guoyww/animatediff-sparsectrl-scribble)
- [SparseCtrl RGB](https://huggingface.co/guoyww/animatediff-sparsectrl-rgb)

#### Using SparseCtrl Scribble

```python
import torch

from diffusers import AnimateDiffSparseControlNetPipeline
from diffusers.models import AutoencoderKL, MotionAdapter, SparseControlNetModel
from diffusers.schedulers import DPMSolverMultistepScheduler
from diffusers.utils import export_to_gif, load_image


model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
motion_adapter_id = "guoyww/animatediff-motion-adapter-v1-5-3"
controlnet_id = "guoyww/animatediff-sparsectrl-scribble"
lora_adapter_id = "guoyww/animatediff-motion-lora-v1-5-3"
vae_id = "stabilityai/sd-vae-ft-mse"
device = "cuda"

motion_adapter = MotionAdapter.from_pretrained(motion_adapter_id, torch_dtype=torch.float16).to(device)
controlnet = SparseControlNetModel.from_pretrained(controlnet_id, torch_dtype=torch.float16).to(device)
vae = AutoencoderKL.from_pretrained(vae_id, torch_dtype=torch.float16).to(device)
scheduler = DPMSolverMultistepScheduler.from_pretrained(
    model_id,
    subfolder="scheduler",
    beta_schedule="linear",
    algorithm_type="dpmsolver++",
    use_karras_sigmas=True,
)
pipe = AnimateDiffSparseControlNetPipeline.from_pretrained(
    model_id,
    motion_adapter=motion_adapter,
    controlnet=controlnet,
    vae=vae,
    scheduler=scheduler,
    torch_dtype=torch.float16,
).to(device)
pipe.load_lora_weights(lora_adapter_id, adapter_name="motion_lora")
pipe.fuse_lora(lora_scale=1.0)

prompt = "an aerial view of a cyberpunk city, night time, neon lights, masterpiece, high quality"
negative_prompt = "low quality, worst quality, letterboxed"

image_files = [
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-1.png",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-2.png",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-3.png"
]
condition_frame_indices = [0, 8, 15]
conditioning_frames = [load_image(img_file) for img_file in image_files]

video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=25,
    conditioning_frames=conditioning_frames,
    controlnet_conditioning_scale=1.0,
    controlnet_frame_indices=condition_frame_indices,
    generator=torch.Generator().manual_seed(1337),
).frames[0]
export_to_gif(video, "output.gif")
```

Here are some sample outputs:
Sample output for the prompt "an aerial view of a cyberpunk city, night time, neon lights, masterpiece, high quality", conditioned on the frames scribble-1, scribble-2, and scribble-3 (images omitted).
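The scribble example pairs three conditioning images with three frame indices (`[0, 8, 15]`). As a minimal sketch of how such indices could be chosen, assuming a 16-frame clip (an assumption, since the example does not set `num_frames`) and a hypothetical helper named `evenly_spaced_frame_indices` that is not part of Diffusers:

```python
def evenly_spaced_frame_indices(num_frames: int, num_conditions: int) -> list[int]:
    """Hypothetical helper: spread `num_conditions` indices over `num_frames` frames."""
    if num_conditions < 1 or num_conditions > num_frames:
        raise ValueError("need 1 <= num_conditions <= num_frames")
    if num_conditions == 1:
        return [0]
    last = num_frames - 1
    gaps = num_conditions - 1
    # Integer round-half-up of i * last / gaps, so the result does not
    # depend on floating-point rounding.
    return [(2 * i * last + gaps) // (2 * gaps) for i in range(num_conditions)]


indices = evenly_spaced_frame_indices(num_frames=16, num_conditions=3)
print(indices)  # [0, 8, 15]
```

The resulting list could then be passed as `controlnet_frame_indices`, with one conditioning image per index.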
#### Using SparseCtrl RGB ```python import torch from diffusers import AnimateDiffSparseControlNetPipeline from diffusers.models import AutoencoderKL, MotionAdapter, SparseControlNetModel from diffusers.schedulers import DPMSolverMultistepScheduler from diffusers.utils import export_to_gif, load_image model_id = "SG161222/Realistic_Vision_V5.1_noVAE" motion_adapter_id = "guoyww/animatediff-motion-adapter-v1-5-3" controlnet_id = "guoyww/animatediff-sparsectrl-rgb" lora_adapter_id = "guoyww/animatediff-motion-lora-v1-5-3" vae_id = "stabilityai/sd-vae-ft-mse" device = "cuda" motion_adapter = MotionAdapter.from_pretrained(motion_adapter_id, torch_dtype=torch.float16).to(device) controlnet = SparseControlNetModel.from_pretrained(controlnet_id, torch_dtype=torch.float16).to(device) vae = AutoencoderKL.from_pretrained(vae_id, torch_dtype=torch.float16).to(device) scheduler = DPMSolverMultistepScheduler.from_pretrained( model_id, subfolder="scheduler", beta_schedule="linear", algorithm_type="dpmsolver++", use_karras_sigmas=True, ) pipe = AnimateDiffSparseControlNetPipeline.from_pretrained( model_id, motion_adapter=motion_adapter, controlnet=controlnet, vae=vae, scheduler=scheduler, torch_dtype=torch.float16, ).to(device) pipe.load_lora_weights(lora_adapter_id, adapter_name="motion_lora") image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-firework.png") video = pipe( prompt="closeup face photo of man in black clothes, night city street, bokeh, fireworks in background", negative_prompt="low quality, worst quality", num_inference_steps=25, conditioning_frames=image, controlnet_frame_indices=[0], controlnet_conditioning_scale=1.0, generator=torch.Generator().manual_seed(42), ).frames[0] export_to_gif(video, "output.gif") ``` Here are some sample outputs:
Sample output for the prompt "closeup face photo of man in black clothes, night city street, bokeh, fireworks in background" (images omitted).
### AnimateDiffSDXLPipeline AnimateDiff can also be used with SDXL models. This is currently an experimental feature as only a beta release of the motion adapter checkpoint is available. ```python import torch from diffusers.models import MotionAdapter from diffusers import AnimateDiffSDXLPipeline, DDIMScheduler from diffusers.utils import export_to_gif adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-sdxl-beta", torch_dtype=torch.float16) model_id = "stabilityai/stable-diffusion-xl-base-1.0" scheduler = DDIMScheduler.from_pretrained( model_id, subfolder="scheduler", clip_sample=False, timestep_spacing="linspace", beta_schedule="linear", steps_offset=1, ) pipe = AnimateDiffSDXLPipeline.from_pretrained( model_id, motion_adapter=adapter, scheduler=scheduler, torch_dtype=torch.float16, variant="fp16", ).to("cuda") # enable memory savings pipe.enable_vae_slicing() pipe.enable_vae_tiling() output = pipe( prompt="a panda surfing in the ocean, realistic, high quality", negative_prompt="low quality, worst quality", num_inference_steps=20, guidance_scale=8, width=1024, height=1024, num_frames=16, ) frames = output.frames[0] export_to_gif(frames, "animation.gif") ``` ### AnimateDiffVideoToVideoPipeline AnimateDiff can also be used to generate visually similar videos or enable style/character/background or other edits starting from an initial video, allowing you to seamlessly explore creative possibilities. 
```python import imageio import requests import torch from diffusers import AnimateDiffVideoToVideoPipeline, DDIMScheduler, MotionAdapter from diffusers.utils import export_to_gif from io import BytesIO from PIL import Image # Load the motion adapter adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16) # load SD 1.5 based finetuned model model_id = "SG161222/Realistic_Vision_V5.1_noVAE" pipe = AnimateDiffVideoToVideoPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16) scheduler = DDIMScheduler.from_pretrained( model_id, subfolder="scheduler", clip_sample=False, timestep_spacing="linspace", beta_schedule="linear", steps_offset=1, ) pipe.scheduler = scheduler # enable memory savings pipe.enable_vae_slicing() pipe.enable_model_cpu_offload() # helper function to load videos def load_video(file_path: str): images = [] if file_path.startswith(('http://', 'https://')): # If the file_path is a URL response = requests.get(file_path) response.raise_for_status() content = BytesIO(response.content) vid = imageio.get_reader(content) else: # Assuming it's a local file path vid = imageio.get_reader(file_path) for frame in vid: pil_image = Image.fromarray(frame) images.append(pil_image) return images video = load_video("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-vid2vid-input-1.gif") output = pipe( video = video, prompt="panda playing a guitar, on a boat, in the ocean, high quality", negative_prompt="bad quality, worse quality", guidance_scale=7.5, num_inference_steps=25, strength=0.5, generator=torch.Generator("cpu").manual_seed(42), ) frames = output.frames[0] export_to_gif(frames, "animation.gif") ``` Here are some sample outputs:
| Source Video | Output Video |
|--------------|--------------|
| raccoon playing a guitar | panda playing a guitar |
| closeup of margot robbie, fireworks in the background, high quality | closeup of tony stark, robert downey jr, fireworks |
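The `strength` argument controls how far the output may drift from the source video. As a rough sketch, assuming the video-to-video pipeline trims the schedule the way Diffusers image-to-image pipelines commonly do (this is an assumption, and `effective_denoising_steps` is a hypothetical helper, not a library function), the number of denoising steps that actually run can be estimated as:

```python
def effective_denoising_steps(num_inference_steps: int, strength: float) -> int:
    """Hypothetical estimate of how many scheduler steps run for a given strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    # Only the last `strength` fraction of the schedule is executed;
    # the earlier steps are skipped because the source video is noised
    # to an intermediate timestep instead of starting from pure noise.
    return min(int(num_inference_steps * strength), num_inference_steps)


print(effective_denoising_steps(num_inference_steps=25, strength=0.5))  # 12
```

Under this assumption, a higher `strength` runs more of the schedule and leaves less of the source video intact.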
### AnimateDiffVideoToVideoControlNetPipeline AnimateDiff can be used together with ControlNets to enhance video-to-video generation by allowing for precise control over the output. ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala, and allows you to condition Stable Diffusion with an additional control image to ensure that the spatial information is preserved throughout the video. This pipeline allows you to condition your generation both on the original video and on a sequence of control images. ```python import torch from PIL import Image from tqdm.auto import tqdm from controlnet_aux.processor import OpenposeDetector from diffusers import AnimateDiffVideoToVideoControlNetPipeline from diffusers.utils import export_to_gif, load_video from diffusers import AutoencoderKL, ControlNetModel, MotionAdapter, LCMScheduler # Load the ControlNet controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16) # Load the motion adapter motion_adapter = MotionAdapter.from_pretrained("wangfuyun/AnimateLCM") # Load SD 1.5 based finetuned model vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16) pipe = AnimateDiffVideoToVideoControlNetPipeline.from_pretrained( "SG161222/Realistic_Vision_V5.1_noVAE", motion_adapter=motion_adapter, controlnet=controlnet, vae=vae, ).to(device="cuda", dtype=torch.float16) # Enable LCM to speed up inference pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config, beta_schedule="linear") pipe.load_lora_weights("wangfuyun/AnimateLCM", weight_name="AnimateLCM_sd15_t2v_lora.safetensors", adapter_name="lcm-lora") pipe.set_adapters(["lcm-lora"], [0.8]) video = load_video("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/dance.gif") video = [frame.convert("RGB") for frame in video] prompt = 
"astronaut in space, dancing" negative_prompt = "bad quality, worst quality, jpeg artifacts, ugly" # Create controlnet preprocessor open_pose = OpenposeDetector.from_pretrained("lllyasviel/Annotators").to("cuda") # Preprocess controlnet images conditioning_frames = [] for frame in tqdm(video): conditioning_frames.append(open_pose(frame)) strength = 0.8 with torch.inference_mode(): video = pipe( video=video, prompt=prompt, negative_prompt=negative_prompt, num_inference_steps=10, guidance_scale=2.0, controlnet_conditioning_scale=0.75, conditioning_frames=conditioning_frames, strength=strength, generator=torch.Generator().manual_seed(42), ).frames[0] video = [frame.resize(conditioning_frames[0].size) for frame in video] export_to_gif(video, f"animatediff_vid2vid_controlnet.gif", fps=8) ``` Here are some sample outputs:
| Source Video | Output Video |
|--------------|--------------|
| anime girl, dancing | astronaut in space, dancing |
**The lights and composition were transferred from the Source Video.** ## Using Motion LoRAs Motion LoRAs are a collection of LoRAs that work with the `guoyww/animatediff-motion-adapter-v1-5-2` checkpoint. These LoRAs are responsible for adding specific types of motion to the animations. ```python import torch from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter from diffusers.utils import export_to_gif # Load the motion adapter adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16) # load SD 1.5 based finetuned model model_id = "SG161222/Realistic_Vision_V5.1_noVAE" pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16) pipe.load_lora_weights( "guoyww/animatediff-motion-lora-zoom-out", adapter_name="zoom-out" ) scheduler = DDIMScheduler.from_pretrained( model_id, subfolder="scheduler", clip_sample=False, beta_schedule="linear", timestep_spacing="linspace", steps_offset=1, ) pipe.scheduler = scheduler # enable memory savings pipe.enable_vae_slicing() pipe.enable_model_cpu_offload() output = pipe( prompt=( "masterpiece, bestquality, highlydetailed, ultradetailed, sunset, " "orange sky, warm lighting, fishing boats, ocean waves seagulls, " "rippling water, wharf, silhouette, serene atmosphere, dusk, evening glow, " "golden hour, coastal landscape, seaside scenery" ), negative_prompt="bad quality, worse quality", num_frames=16, guidance_scale=7.5, num_inference_steps=25, generator=torch.Generator("cpu").manual_seed(42), ) frames = output.frames[0] export_to_gif(frames, "animation.gif") ```
Sample output (image omitted): masterpiece, bestquality, sunset.
## Using Motion LoRAs with PEFT

You can also leverage the [PEFT](https://github.com/huggingface/peft) backend to combine Motion LoRAs and create more complex animations.

First install PEFT with

```shell
pip install peft
```

Then you can use the following code to combine Motion LoRAs.

```python
import torch
from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif

# Load the motion adapter
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
# load SD 1.5 based finetuned model
model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16)

pipe.load_lora_weights(
    "diffusers/animatediff-motion-lora-zoom-out", adapter_name="zoom-out",
)
pipe.load_lora_weights(
    "diffusers/animatediff-motion-lora-pan-left", adapter_name="pan-left",
)
# Apply both Motion LoRAs with equal weight
pipe.set_adapters(["zoom-out", "pan-left"], adapter_weights=[1.0, 1.0])

scheduler = DDIMScheduler.from_pretrained(
    model_id,
    subfolder="scheduler",
    clip_sample=False,
    timestep_spacing="linspace",
    beta_schedule="linear",
    steps_offset=1,
)
pipe.scheduler = scheduler

# enable memory savings
pipe.enable_vae_slicing()
pipe.enable_model_cpu_offload()

output = pipe(
    prompt=(
        "masterpiece, bestquality, highlydetailed, ultradetailed, sunset, "
        "orange sky, warm lighting, fishing boats, ocean waves seagulls, "
        "rippling water, wharf, silhouette, serene atmosphere, dusk, evening glow, "
        "golden hour, coastal landscape, seaside scenery"
    ),
    negative_prompt="bad quality, worse quality",
    num_frames=16,
    guidance_scale=7.5,
    num_inference_steps=25,
    generator=torch.Generator("cpu").manual_seed(42),
)
frames = output.frames[0]
export_to_gif(frames, "animation.gif")
```
Sample output (image omitted): masterpiece, bestquality, sunset.
## Using FreeInit

[FreeInit: Bridging Initialization Gap in Video Diffusion Models](https://huggingface.co/papers/2312.07537) by Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, Ziwei Liu.

FreeInit is an effective method that improves temporal consistency and overall quality of videos generated using video-diffusion-models without any additional training. It can be applied to AnimateDiff, ModelScope, VideoCrafter and various other video generation models seamlessly at inference time, and works by iteratively refining the latent-initialization noise. More details can be found in the paper.

The following example demonstrates the usage of FreeInit.

```python
import torch
from diffusers import MotionAdapter, AnimateDiffPipeline, DDIMScheduler
from diffusers.utils import export_to_gif

adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16).to("cuda")
pipe.scheduler = DDIMScheduler.from_pretrained(
    model_id,
    subfolder="scheduler",
    beta_schedule="linear",
    clip_sample=False,
    timestep_spacing="linspace",
    steps_offset=1
)

# enable memory savings
pipe.enable_vae_slicing()
pipe.enable_vae_tiling()

# enable FreeInit
# Refer to the enable_free_init documentation for a full list of configurable parameters
pipe.enable_free_init(method="butterworth", use_fast_sampling=True)

# run inference
output = pipe(
    prompt="a panda playing a guitar, on a boat, in the ocean, high quality",
    negative_prompt="bad quality, worse quality",
    num_frames=16,
    guidance_scale=7.5,
    num_inference_steps=20,
    generator=torch.Generator("cpu").manual_seed(666),
)

# disable FreeInit
pipe.disable_free_init()

frames = output.frames[0]
export_to_gif(frames, "animation.gif")
```

> [!WARNING]
> FreeInit is not really free - the improved quality comes at the cost of extra computation. It requires sampling a few extra times depending on the `num_iters` parameter that is set when enabling it. Setting the `use_fast_sampling` parameter to `True` can speed up sampling (at the cost of lower quality compared to when `use_fast_sampling=False` but still better results than vanilla video generation models).

> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
| Without FreeInit enabled | With FreeInit enabled |
|--------------------------|-----------------------|
| panda playing a guitar | panda playing a guitar |
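To make the extra cost of FreeInit concrete, here is a back-of-the-envelope sketch. It assumes every FreeInit iteration runs the full schedule (so it does not model `use_fast_sampling=True`), and `free_init_total_steps` is a hypothetical helper, not part of Diffusers. The value `num_iters=3` is only illustrative.

```python
def free_init_total_steps(num_inference_steps: int, num_iters: int) -> int:
    """Rough upper bound on denoising steps: each FreeInit iteration re-runs sampling."""
    if num_inference_steps < 1 or num_iters < 1:
        raise ValueError("both arguments must be positive")
    return num_inference_steps * num_iters


baseline_steps = 20
total_steps = free_init_total_steps(num_inference_steps=baseline_steps, num_iters=3)
print(total_steps)                    # 60
print(total_steps // baseline_steps)  # 3
```

Under these assumptions, generation takes roughly `num_iters` times as long as a run without FreeInit.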
## Using AnimateLCM [AnimateLCM](https://animatelcm.github.io/) is a motion module checkpoint and an [LCM LoRA](https://huggingface.co/docs/diffusers/using-diffusers/inference_with_lcm_lora) that have been created using a consistency learning strategy that decouples the distillation of the image generation priors and the motion generation priors. ```python import torch from diffusers import AnimateDiffPipeline, LCMScheduler, MotionAdapter from diffusers.utils import export_to_gif adapter = MotionAdapter.from_pretrained("wangfuyun/AnimateLCM") pipe = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=adapter) pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config, beta_schedule="linear") pipe.load_lora_weights("wangfuyun/AnimateLCM", weight_name="sd15_lora_beta.safetensors", adapter_name="lcm-lora") pipe.enable_vae_slicing() pipe.enable_model_cpu_offload() output = pipe( prompt="A space rocket with trails of smoke behind it launching into space from the desert, 4k, high resolution", negative_prompt="bad quality, worse quality, low resolution", num_frames=16, guidance_scale=1.5, num_inference_steps=6, generator=torch.Generator("cpu").manual_seed(0), ) frames = output.frames[0] export_to_gif(frames, "animatelcm.gif") ```
Sample output (image omitted): A space rocket, 4K.
AnimateLCM is also compatible with existing [Motion LoRAs](https://huggingface.co/collections/dn6/animatediff-motion-loras-654cb8ad732b9e3cf4d3c17e). ```python import torch from diffusers import AnimateDiffPipeline, LCMScheduler, MotionAdapter from diffusers.utils import export_to_gif adapter = MotionAdapter.from_pretrained("wangfuyun/AnimateLCM") pipe = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=adapter) pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config, beta_schedule="linear") pipe.load_lora_weights("wangfuyun/AnimateLCM", weight_name="sd15_lora_beta.safetensors", adapter_name="lcm-lora") pipe.load_lora_weights("guoyww/animatediff-motion-lora-tilt-up", adapter_name="tilt-up") pipe.set_adapters(["lcm-lora", "tilt-up"], [1.0, 0.8]) pipe.enable_vae_slicing() pipe.enable_model_cpu_offload() output = pipe( prompt="A space rocket with trails of smoke behind it launching into space from the desert, 4k, high resolution", negative_prompt="bad quality, worse quality, low resolution", num_frames=16, guidance_scale=1.5, num_inference_steps=6, generator=torch.Generator("cpu").manual_seed(0), ) frames = output.frames[0] export_to_gif(frames, "animatelcm-motion-lora.gif") ```
Sample output (image omitted): A space rocket, 4K.
## Using FreeNoise

[FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling](https://huggingface.co/papers/2310.15169) by Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, Ziwei Liu.

FreeNoise is a sampling mechanism that can generate longer videos with short-video generation models by employing noise-rescheduling, temporal attention over sliding windows, and weighted averaging of latent frames. It can also be used with multiple prompts to allow for interpolated video generations. More details are available in the paper.

The currently supported AnimateDiff pipelines that can be used with FreeNoise are:
- [`AnimateDiffPipeline`]
- [`AnimateDiffControlNetPipeline`]
- [`AnimateDiffVideoToVideoPipeline`]
- [`AnimateDiffVideoToVideoControlNetPipeline`]

To use FreeNoise, add a single line to the inference code after loading your pipeline.

```diff
+ pipe.enable_free_noise()
```

After this, you can pass either a single prompt or multiple prompts as a dictionary of integer-string pairs. Each integer key is the frame index at which the influence of that prompt is at its maximum, and each frame index should map to a single string prompt. The prompts for intermediate frame indices that are not passed in the dictionary are created by interpolating between the frame prompts that are passed. By default, simple linear interpolation is used. However, you can customize this behaviour by passing a callback to the `prompt_interpolation_callback` parameter when enabling FreeNoise.
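The linear interpolation between frame prompts can be pictured with a small sketch. This is not the library's implementation: the helper names `blend_weights` and `lerp` are hypothetical, and the two-dimensional vectors stand in for real text-encoder embeddings.

```python
from bisect import bisect_right


def blend_weights(frame: int, keyframes: list[int]) -> tuple[int, int, float]:
    """Return (left_key, right_key, weight_of_right_key) for `frame`.

    `keyframes` must be sorted in ascending order.
    """
    pos = bisect_right(keyframes, frame)
    if pos == 0:
        # Frame lies before the first keyframe: use the first prompt only.
        return keyframes[0], keyframes[0], 0.0
    if pos == len(keyframes):
        # Frame lies at or after the last keyframe: use the last prompt only.
        return keyframes[-1], keyframes[-1], 0.0
    left, right = keyframes[pos - 1], keyframes[pos]
    return left, right, (frame - left) / (right - left)


def lerp(a: list[float], b: list[float], w: float) -> list[float]:
    """Linear interpolation between two equally sized vectors."""
    return [(1.0 - w) * x + w * y for x, y in zip(a, b)]


# Toy "embeddings" for prompts placed at frames 0 and 40 (illustrative values).
embeddings = {0: [1.0, 0.0], 40: [0.0, 1.0]}
keys = sorted(embeddings)

left, right, w = blend_weights(20, keys)
print(lerp(embeddings[left], embeddings[right], w))  # [0.5, 0.5]
```

Frame 20 sits halfway between the two keyframes, so both prompts contribute equally; frames closer to a keyframe are dominated by that keyframe's prompt.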
Full example: ```python import torch from diffusers import AutoencoderKL, AnimateDiffPipeline, LCMScheduler, MotionAdapter from diffusers.utils import export_to_video, load_image # Load pipeline dtype = torch.float16 motion_adapter = MotionAdapter.from_pretrained("wangfuyun/AnimateLCM", torch_dtype=dtype) vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=dtype) pipe = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=motion_adapter, vae=vae, torch_dtype=dtype) pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config, beta_schedule="linear") pipe.load_lora_weights( "wangfuyun/AnimateLCM", weight_name="AnimateLCM_sd15_t2v_lora.safetensors", adapter_name="lcm_lora" ) pipe.set_adapters(["lcm_lora"], [0.8]) # Enable FreeNoise for long prompt generation pipe.enable_free_noise(context_length=16, context_stride=4) pipe.to("cuda") # Can be a single prompt, or a dictionary with frame timesteps prompt = { 0: "A caterpillar on a leaf, high quality, photorealistic", 40: "A caterpillar transforming into a cocoon, on a leaf, near flowers, photorealistic", 80: "A cocoon on a leaf, flowers in the background, photorealistic", 120: "A cocoon maturing and a butterfly being born, flowers and leaves visible in the background, photorealistic", 160: "A beautiful butterfly, vibrant colors, sitting on a leaf, flowers in the background, photorealistic", 200: "A beautiful butterfly, flying away in a forest, photorealistic", 240: "A cyberpunk butterfly, neon lights, glowing", } negative_prompt = "bad quality, worst quality, jpeg artifacts" # Run inference output = pipe( prompt=prompt, negative_prompt=negative_prompt, num_frames=256, guidance_scale=2.5, num_inference_steps=10, generator=torch.Generator("cpu").manual_seed(0), ) # Save video frames = output.frames[0] export_to_video(frames, "output.mp4", fps=16) ``` ### FreeNoise memory savings Since FreeNoise processes multiple frames together, there are parts in the modeling where the 
memory required exceeds that available on normal consumer GPUs. The main memory bottlenecks that we identified are spatial and temporal attention blocks, upsampling and downsampling blocks, resnet blocks and feed-forward layers. Since most of these blocks operate effectively only on the channel/embedding dimension, one can perform chunked inference across the batch dimensions. The batch dimensions in AnimateDiff are either spatial (`[B x F, H x W, C]`) or temporal (`[B x H x W, F, C]`) in nature. This may seem counter-intuitive, but these batch dimensions are correct: spatial blocks process across the `B x F` dimension while temporal blocks process across the `B x H x W` dimension. We introduce a `SplitInferenceModule` that makes it easier to chunk across any dimension and perform inference. This saves a lot of memory but comes at the cost of requiring more time for inference.

```diff
# Load pipeline and adapters
# ...
+ pipe.enable_free_noise_split_inference()
+ pipe.unet.enable_forward_chunking(16)
```

The `pipe.enable_free_noise_split_inference` method accepts two parameters: `spatial_split_size` (defaults to `256`) and `temporal_split_size` (defaults to `16`). These can be configured based on how much VRAM you have available. A lower split size results in lower memory usage but slower inference, whereas a larger split size results in faster inference at the cost of more memory.
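The idea behind chunked inference can be sketched in a few lines. This is a simplified illustration using plain Python lists, not the actual `SplitInferenceModule`; the names `split_inference` and `scale_rows` are hypothetical.

```python
from typing import Callable, List

Row = List[float]


def split_inference(fn: Callable[[List[Row]], List[Row]], batch: List[Row], split_size: int) -> List[Row]:
    """Apply `fn` to `batch` in chunks of `split_size` rows along the batch dimension."""
    if split_size < 1:
        raise ValueError("split_size must be positive")
    outputs: List[Row] = []
    for start in range(0, len(batch), split_size):
        # Only `split_size` rows are processed at once, which bounds peak memory.
        outputs.extend(fn(batch[start:start + split_size]))
    return outputs


def scale_rows(rows: List[Row]) -> List[Row]:
    """Stand-in for a block that acts only on the channel dimension of each row."""
    return [[2.0 * value for value in row] for row in rows]


batch = [[float(i), float(i + 1)] for i in range(10)]
chunked = split_inference(scale_rows, batch, split_size=4)
print(chunked == scale_rows(batch))  # True
```

Because the stand-in block treats each row independently, the chunked result matches the full-batch result; a smaller `split_size` means more, smaller calls, which mirrors the memory versus speed tradeoff described above.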
## Using `from_single_file` with the MotionAdapter

`diffusers>=0.30.0` supports loading the AnimateDiff checkpoints into the `MotionAdapter` in their original format via `from_single_file`.

```python
import torch
from diffusers import AnimateDiffPipeline, MotionAdapter

ckpt_path = "https://huggingface.co/Lightricks/LongAnimateDiff/blob/main/lt_long_mm_32_frames.ckpt"

# Load the original-format checkpoint into a MotionAdapter
adapter = MotionAdapter.from_single_file(ckpt_path, torch_dtype=torch.float16)
pipe = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=adapter)
```

## AnimateDiffPipeline

[[autodoc]] AnimateDiffPipeline
	- all
	- __call__

## AnimateDiffControlNetPipeline

[[autodoc]] AnimateDiffControlNetPipeline
	- all
	- __call__

## AnimateDiffSparseControlNetPipeline

[[autodoc]] AnimateDiffSparseControlNetPipeline
	- all
	- __call__

## AnimateDiffSDXLPipeline

[[autodoc]] AnimateDiffSDXLPipeline
	- all
	- __call__

## AnimateDiffVideoToVideoPipeline

[[autodoc]] AnimateDiffVideoToVideoPipeline
	- all
	- __call__

## AnimateDiffVideoToVideoControlNetPipeline

[[autodoc]] AnimateDiffVideoToVideoControlNetPipeline
	- all
	- __call__

## AnimateDiffPipelineOutput

[[autodoc]] pipelines.animatediff.AnimateDiffPipelineOutput

================================================
FILE: docs/source/en/api/pipelines/attend_and_excite.md
================================================
> [!WARNING]
> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model.

# Attend-and-Excite

Attend-and-Excite for Stable Diffusion was proposed in [Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models](https://attendandexcite.github.io/Attend-and-Excite/) and provides textual attention control over image generation.
The abstract from the paper is: *Recent text-to-image generative models have demonstrated an unparalleled ability to generate diverse and creative imagery guided by a target text prompt. While revolutionary, current state-of-the-art diffusion models may still fail in generating images that fully convey the semantics in the given text prompt. We analyze the publicly available Stable Diffusion model and assess the existence of catastrophic neglect, where the model fails to generate one or more of the subjects from the input prompt. Moreover, we find that in some cases the model also fails to correctly bind attributes (e.g., colors) to their corresponding subjects. To help mitigate these failure cases, we introduce the concept of Generative Semantic Nursing (GSN), where we seek to intervene in the generative process on the fly during inference time to improve the faithfulness of the generated images. Using an attention-based formulation of GSN, dubbed Attend-and-Excite, we guide the model to refine the cross-attention units to attend to all subject tokens in the text prompt and strengthen - or excite - their activations, encouraging the model to generate all subjects described in the text prompt. We compare our approach to alternative approaches and demonstrate that it conveys the desired concepts more faithfully across a range of text prompts.* You can find additional information about Attend-and-Excite on the [project page](https://attendandexcite.github.io/Attend-and-Excite/), the [original codebase](https://github.com/AttendAndExcite/Attend-and-Excite), or try it out in a [demo](https://huggingface.co/spaces/AttendAndExcite/Attend-and-Excite). 
> [!TIP] > Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. ## StableDiffusionAttendAndExcitePipeline [[autodoc]] StableDiffusionAttendAndExcitePipeline - all - __call__ ## StableDiffusionPipelineOutput [[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/audioldm.md ================================================ > [!WARNING] > This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. # AudioLDM AudioLDM was proposed in [AudioLDM: Text-to-Audio Generation with Latent Diffusion Models](https://huggingface.co/papers/2301.12503) by Haohe Liu et al. Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview), AudioLDM is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap) latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects, human speech and music. The abstract from the paper is: *Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation quality with high computational costs. In this study, we propose AudioLDM, a TTA system that is built on a latent space to learn the continuous audio representations from contrastive language-audio pretraining (CLAP) latents. 
The pretrained CLAP models enable us to train LDMs with audio embedding while providing text embedding as a condition during sampling. By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM is advantageous in both generation quality and computational efficiency. Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance measured by both objective and subjective metrics (e.g., frechet distance). Moreover, AudioLDM is the first TTA system that enables various text-guided audio manipulations (e.g., style transfer) in a zero-shot fashion. Our implementation and demos are available at [this https URL](https://audioldm.github.io/).* The original codebase can be found at [haoheliu/AudioLDM](https://github.com/haoheliu/AudioLDM). ## Tips When constructing a prompt, keep in mind: * Descriptive prompt inputs work best; you can use adjectives to describe the sound (for example, "high quality" or "clear") and make the prompt context specific (for example, "water stream in a forest" instead of "stream"). * It's best to use general terms like "cat" or "dog" instead of specific names or abstract objects the model may not be familiar with. During inference: * The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference. * The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument. > [!TIP] > Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. 
## AudioLDMPipeline [[autodoc]] AudioLDMPipeline - all - __call__ ## AudioPipelineOutput [[autodoc]] pipelines.AudioPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/audioldm2.md ================================================ # AudioLDM 2 AudioLDM 2 was proposed in [AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining](https://huggingface.co/papers/2308.05734) by Haohe Liu et al. AudioLDM 2 takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects, human speech and music. Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview), AudioLDM 2 is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from text embeddings. Two text encoder models are used to compute the text embeddings from a prompt input: the text-branch of [CLAP](https://huggingface.co/docs/transformers/main/en/model_doc/clap) and the encoder of [Flan-T5](https://huggingface.co/docs/transformers/main/en/model_doc/flan-t5). These text embeddings are then projected to a shared embedding space by an [AudioLDM2ProjectionModel](https://huggingface.co/docs/diffusers/main/api/pipelines/audioldm2#diffusers.AudioLDM2ProjectionModel). A [GPT2](https://huggingface.co/docs/transformers/main/en/model_doc/gpt2) _language model (LM)_ is used to auto-regressively predict eight new embedding vectors, conditional on the projected CLAP and Flan-T5 embeddings. The generated embedding vectors and Flan-T5 text embeddings are used as cross-attention conditioning in the LDM. The [UNet](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2UNet2DConditionModel) of AudioLDM 2 is unique in the sense that it takes **two** cross-attention embeddings, as opposed to one cross-attention conditioning, as in most other LDMs. 
The abstract of the paper is the following: *Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called "language of audio" (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate state-of-the-art or competitive performance against previous approaches. Our code, pretrained model, and demo are available at [this https URL](https://audioldm.github.io/audioldm2).* This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi) and [Nguyễn Công Tú Anh](https://github.com/tuanh123789). The original codebase can be found at [haoheliu/audioldm2](https://github.com/haoheliu/audioldm2). ## Tips ### Choosing a checkpoint AudioLDM2 comes in three variants. Two of these checkpoints are applicable to the general task of text-to-audio generation. The third checkpoint is trained exclusively on text-to-music generation. All checkpoints share the same model size for the text encoders and VAE. They differ in the size and depth of the UNet. 
See table below for details on the three checkpoints: | Checkpoint | Task | UNet Model Size | Total Model Size | Training Data / h | |-----------------------------------------------------------------|---------------|-----------------|------------------|-------------------| | [audioldm2](https://huggingface.co/cvssp/audioldm2) | Text-to-audio | 350M | 1.1B | 1150k | | [audioldm2-large](https://huggingface.co/cvssp/audioldm2-large) | Text-to-audio | 750M | 1.5B | 1150k | | [audioldm2-music](https://huggingface.co/cvssp/audioldm2-music) | Text-to-music | 350M | 1.1B | 665k | | [audioldm2-gigaspeech](https://huggingface.co/anhnct/audioldm2_gigaspeech) | Text-to-speech | 350M | 1.1B |10k | | [audioldm2-ljspeech](https://huggingface.co/anhnct/audioldm2_ljspeech) | Text-to-speech | 350M | 1.1B | | ### Constructing a prompt * Descriptive prompt inputs work best: use adjectives to describe the sound (e.g. "high quality" or "clear") and make the prompt context specific (e.g. "water stream in a forest" instead of "stream"). * It's best to use general terms like "cat" or "dog" instead of specific names or abstract objects the model may not be familiar with. * Using a **negative prompt** can significantly improve the quality of the generated waveform, by guiding the generation away from terms that correspond to poor quality audio. Try using a negative prompt of "Low quality." ### Controlling inference * The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference. * The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument. ### Evaluating generated waveforms: * The quality of the generated waveforms can vary significantly based on the seed. Try generating with different seeds until you find a satisfactory generation. 
* Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1. Automatic scoring will be performed between the generated waveforms and prompt text, and the audios ranked from best to worst accordingly. The following example demonstrates how to construct good music and speech generation using the aforementioned tips: [example](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2Pipeline.__call__.example). > [!TIP] > Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. ## AudioLDM2Pipeline [[autodoc]] AudioLDM2Pipeline - all - __call__ ## AudioLDM2ProjectionModel [[autodoc]] AudioLDM2ProjectionModel - forward ## AudioLDM2UNet2DConditionModel [[autodoc]] AudioLDM2UNet2DConditionModel - forward ## AudioPipelineOutput [[autodoc]] pipelines.AudioPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/aura_flow.md ================================================ # AuraFlow AuraFlow is inspired by [Stable Diffusion 3](../pipelines/stable_diffusion/stable_diffusion_3) and is by far the largest text-to-image generation model that comes with an Apache 2.0 license. This model achieves state-of-the-art results on the [GenEval](https://github.com/djghosh13/geneval) benchmark. It was developed by the Fal team and more details about it can be found in [this blog post](https://blog.fal.ai/auraflow/). > [!TIP] > AuraFlow can be quite expensive to run on consumer hardware devices. However, you can perform a suite of optimizations to run it faster and in a more memory-friendly manner. 
Check out [this section](https://huggingface.co/blog/sd3#memory-optimizations-for-sd3) for more details. ## Quantization Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on image quality depending on the model. Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`AuraFlowPipeline`] for inference with bitsandbytes. ```py import torch from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, AuraFlowTransformer2DModel, AuraFlowPipeline from transformers import BitsAndBytesConfig as BitsAndBytesConfig, T5EncoderModel quant_config = BitsAndBytesConfig(load_in_8bit=True) text_encoder_8bit = T5EncoderModel.from_pretrained( "fal/AuraFlow", subfolder="text_encoder", quantization_config=quant_config, torch_dtype=torch.float16, ) quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True) transformer_8bit = AuraFlowTransformer2DModel.from_pretrained( "fal/AuraFlow", subfolder="transformer", quantization_config=quant_config, torch_dtype=torch.float16, ) pipeline = AuraFlowPipeline.from_pretrained( "fal/AuraFlow", text_encoder=text_encoder_8bit, transformer=transformer_8bit, torch_dtype=torch.float16, device_map="balanced", ) prompt = "a tiny astronaut hatching from an egg on the moon" image = pipeline(prompt).images[0] image.save("auraflow.png") ``` Loading [GGUF checkpoints](https://huggingface.co/docs/diffusers/quantization/gguf) is also supported: ```py import torch from diffusers import ( AuraFlowPipeline, GGUFQuantizationConfig, AuraFlowTransformer2DModel, ) transformer = AuraFlowTransformer2DModel.from_single_file( "https://huggingface.co/city96/AuraFlow-v0.3-gguf/blob/main/aura_flow_0.3-Q2_K.gguf",
quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16), torch_dtype=torch.bfloat16, ) pipeline = AuraFlowPipeline.from_pretrained( "fal/AuraFlow-v0.3", transformer=transformer, torch_dtype=torch.bfloat16, ) prompt = "a cute pony in a field of flowers" image = pipeline(prompt).images[0] image.save("auraflow.png") ``` ## Support for `torch.compile()` AuraFlow can be compiled with `torch.compile()` to reduce inference latency, even across different resolutions. First, install PyTorch nightly following the instructions from [here](https://pytorch.org/). The snippet below shows the changes needed to enable this: ```diff + torch.fx.experimental._config.use_duck_shape = False + pipeline.transformer = torch.compile( pipeline.transformer, fullgraph=True, dynamic=True ) ``` Setting `use_duck_shape` to `False` tells the compiler not to use the same symbolic variable to represent input sizes that happen to be equal. For more details, check out [this comment](https://github.com/huggingface/diffusers/pull/11327#discussion_r2047659790). This yields speedups ranging from 100% (at low resolutions) to 30% (at 1536x1536 resolution). Thanks to [AstraliteHeart](https://github.com/huggingface/diffusers/pull/11297/) who helped us rewrite the [`AuraFlowTransformer2DModel`] class so that the above works for different resolutions ([PR](https://github.com/huggingface/diffusers/pull/11297/)). ## AuraFlowPipeline [[autodoc]] AuraFlowPipeline - all - __call__ ================================================ FILE: docs/source/en/api/pipelines/auto_pipeline.md ================================================ # AutoPipeline The `AutoPipeline` is designed to make it easy to load a checkpoint for a task without needing to know the specific pipeline class. Based on the task, the `AutoPipeline` automatically retrieves the correct pipeline class from the checkpoint `model_index.json` file.
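
The lookup described above can be pictured with a toy example. This is a simplified illustration under stated assumptions, not the library's actual code: the `TASK_TO_CLASS` table and the `resolve_pipeline_class` helper are hypothetical, and the table covers a single pipeline family only. The one detail taken from real checkpoints is that `model_index.json` records the pipeline class under the `_class_name` key.

```python
import json

# Toy lookup table (illustrative only): maps the class recorded in
# model_index.json to the class that serves each task.
TASK_TO_CLASS = {
    "StableDiffusionPipeline": {
        "text2image": "StableDiffusionPipeline",
        "image2image": "StableDiffusionImg2ImgPipeline",
        "inpainting": "StableDiffusionInpaintPipeline",
    },
}


def resolve_pipeline_class(model_index_json: str, task: str) -> str:
    """Hypothetical helper: pick a pipeline class name for a task."""
    config = json.loads(model_index_json)
    try:
        base_class = config["_class_name"]
        return TASK_TO_CLASS[base_class][task]
    except KeyError as err:
        raise ValueError(f"No pipeline registered for task {task!r}") from err


# Minimal stand-in for the contents of a checkpoint's model_index.json.
model_index = '{"_class_name": "StableDiffusionPipeline"}'
print(resolve_pipeline_class(model_index, "inpainting"))
```

In practice none of this is written by hand: calling `from_pretrained` on `AutoPipelineForText2Image`, `AutoPipelineForImage2Image`, or `AutoPipelineForInpainting` performs the resolution for you.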
> [!TIP] > Check out the [AutoPipeline](../../tutorials/autopipeline) tutorial to learn how to use this API! ## AutoPipelineForText2Image [[autodoc]] AutoPipelineForText2Image - all - from_pretrained - from_pipe ## AutoPipelineForImage2Image [[autodoc]] AutoPipelineForImage2Image - all - from_pretrained - from_pipe ## AutoPipelineForInpainting [[autodoc]] AutoPipelineForInpainting - all - from_pretrained - from_pipe ================================================ FILE: docs/source/en/api/pipelines/blip_diffusion.md ================================================ > [!WARNING] > This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. # BLIP-Diffusion BLIP-Diffusion was proposed in [BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing](https://huggingface.co/papers/2305.14720). It enables zero-shot subject-driven generation and control-guided zero-shot generation. The abstract from the paper is: *Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and difficulties preserving the subject fidelity. To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control which consumes inputs of subject images and text prompts. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce visual representation aligned with the text. Then we design a subject representation learning task which enables a diffusion model to leverage such visual representation and generates new subject renditions. 
Compared with previous methods such as DreamBooth, our model enables zero-shot subject-driven generation, and efficient fine-tuning for customized subject with up to 20x speedup. We also demonstrate that BLIP-Diffusion can be flexibly combined with existing techniques such as ControlNet and prompt-to-prompt to enable novel subject-driven generation and editing applications. Project page at [this https URL](https://dxli94.github.io/BLIP-Diffusion-website/).* The original codebase can be found at [salesforce/LAVIS](https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion). You can find the official BLIP-Diffusion checkpoints under the [hf.co/SalesForce](https://hf.co/SalesForce) organization. `BlipDiffusionPipeline` and `BlipDiffusionControlNetPipeline` were contributed by [`ayushtues`](https://github.com/ayushtues/). > [!TIP] > Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. ## BlipDiffusionPipeline [[autodoc]] BlipDiffusionPipeline - all - __call__ ## BlipDiffusionControlNetPipeline [[autodoc]] BlipDiffusionControlNetPipeline - all - __call__ ================================================ FILE: docs/source/en/api/pipelines/bria_3_2.md ================================================ # Bria 3.2 Bria 3.2 is the next-generation commercial-ready text-to-image model. With just 4 billion parameters, it provides exceptional aesthetics and text rendering, evaluated to provide on par results to leading open-source models, and outperforming other licensed models. 
In addition to being built entirely on licensed data, 3.2 provides several advantages for enterprise and commercial use: - Efficient Compute: the model is 3x smaller than equivalent models on the market (4B parameters vs. 12B parameters for other open-source models). - Architecture Consistency: Same architecture as 3.1, ideal for users looking to upgrade without disruption. - Fine-tuning Speedup: 2x faster fine-tuning on L40S and A100. Original model checkpoints for Bria 3.2 can be found [here](https://huggingface.co/briaai/BRIA-3.2). The GitHub repo for Bria 3.2 can be found [here](https://github.com/Bria-AI/BRIA-3.2). If you want to learn more about the Bria platform and get free trial access, please visit [bria.ai](https://bria.ai). ## Usage _As the model is gated, before using it with diffusers you first need to go to the [Bria 3.2 Hugging Face page](https://huggingface.co/briaai/BRIA-3.2), fill in the form and accept the gate. Once you are in, you need to log in so that your system knows you’ve accepted the gate._ Use the command below to log in: ```bash hf auth login ``` ## BriaPipeline [[autodoc]] BriaPipeline - all - __call__ ================================================ FILE: docs/source/en/api/pipelines/bria_fibo.md ================================================ # Bria Fibo Text-to-image models have mastered imagination - but not control. FIBO changes that. FIBO is trained on structured JSON captions of up to 1,000+ words and designed to understand and control different visual parameters such as lighting, composition, color, and camera settings, enabling precise and reproducible outputs. With only 8 billion parameters, FIBO provides a new level of image quality, prompt adherence and professional control. FIBO is trained exclusively on structured prompts and will not work with freeform text prompts.
You can use the [FIBO-VLM-prompt-to-JSON](https://huggingface.co/briaai/FIBO-VLM-prompt-to-JSON) model or the [FIBO-gemini-prompt-to-JSON](https://huggingface.co/briaai/FIBO-gemini-prompt-to-JSON) model to convert your freeform text prompt to a structured JSON prompt. > [!NOTE] > Avoid using freeform text prompts directly with FIBO because doing so does not produce the best results. Refer to the Bria Fibo Hugging Face [page](https://huggingface.co/briaai/FIBO) to learn more. ## Usage _As the model is gated, before using it with diffusers you first need to go to the [Bria Fibo Hugging Face page](https://huggingface.co/briaai/FIBO), fill in the form and accept the gate. Once you are in, you need to log in so that your system knows you’ve accepted the gate._ Use the command below to log in: ```bash hf auth login ``` ## BriaFiboPipeline [[autodoc]] BriaFiboPipeline - all - __call__ ================================================ FILE: docs/source/en/api/pipelines/bria_fibo_edit.md ================================================ # Bria Fibo Edit Fibo Edit is an 8B parameter image-to-image model that introduces a new paradigm of structured control, operating on JSON inputs paired with source images to enable deterministic and repeatable editing workflows. Featuring native masking for granular precision, it moves beyond simple prompt-based diffusion to offer explicit, interpretable control optimized for production environments. Its lightweight architecture is designed for deep customization, empowering researchers to build specialized "Edit" models for domain-specific tasks while delivering top-tier aesthetic quality. ## Usage _As the model is gated, before using it with diffusers you first need to go to the [Bria Fibo Edit Hugging Face page](https://huggingface.co/briaai/Fibo-Edit), fill in the form and accept the gate.
Once you are in, you need to log in so that your system knows you’ve accepted the gate._ Use the command below to log in: ```bash hf auth login ``` ## BriaFiboEditPipeline [[autodoc]] BriaFiboEditPipeline - all - __call__ ================================================ FILE: docs/source/en/api/pipelines/chroma.md ================================================ # Chroma
Chroma is a text to image generation model based on Flux. Original model checkpoints for Chroma can be found here: * High-resolution finetune: [lodestones/Chroma1-HD](https://huggingface.co/lodestones/Chroma1-HD) * Base model: [lodestones/Chroma1-Base](https://huggingface.co/lodestones/Chroma1-Base) * Original repo with progress checkpoints: [lodestones/Chroma](https://huggingface.co/lodestones/Chroma) (loading this repo with `from_pretrained` will load a Diffusers-compatible version of the `unlocked-v37` checkpoint) > [!TIP] > Chroma can use all the same optimizations as Flux. ## Inference ```python import torch from diffusers import ChromaPipeline pipe = ChromaPipeline.from_pretrained("lodestones/Chroma1-HD", torch_dtype=torch.bfloat16) pipe.enable_model_cpu_offload() prompt = [ "A high-fashion close-up portrait of a blonde woman in clear sunglasses. The image uses a bold teal and red color split for dramatic lighting. The background is a simple teal-green. The photo is sharp and well-composed, and is designed for viewing with anaglyph 3D glasses for optimal effect. It looks professionally done." ] negative_prompt = ["low quality, ugly, unfinished, out of focus, deformed, disfigure, blurry, smudged, restricted palette, flat colors"] image = pipe( prompt=prompt, negative_prompt=negative_prompt, generator=torch.Generator("cpu").manual_seed(433), num_inference_steps=40, guidance_scale=3.0, num_images_per_prompt=1, ).images[0] image.save("chroma.png") ``` ## Loading from a single file To use updated model checkpoints that are not in the Diffusers format, you can use the `ChromaTransformer2DModel` class to load the model from a single file in the original format. This is also useful when trying to load finetunes or quantized versions of the models that have been published by the community. The following example demonstrates how to run Chroma from a single file. 
```python import torch from diffusers import ChromaTransformer2DModel, ChromaPipeline model_id = "lodestones/Chroma1-HD" dtype = torch.bfloat16 transformer = ChromaTransformer2DModel.from_single_file("https://huggingface.co/lodestones/Chroma1-HD/blob/main/Chroma1-HD.safetensors", torch_dtype=dtype) pipe = ChromaPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=dtype) pipe.enable_model_cpu_offload() prompt = [ "A high-fashion close-up portrait of a blonde woman in clear sunglasses. The image uses a bold teal and red color split for dramatic lighting. The background is a simple teal-green. The photo is sharp and well-composed, and is designed for viewing with anaglyph 3D glasses for optimal effect. It looks professionally done." ] negative_prompt = ["low quality, ugly, unfinished, out of focus, deformed, disfigure, blurry, smudged, restricted palette, flat colors"] image = pipe( prompt=prompt, negative_prompt=negative_prompt, generator=torch.Generator("cpu").manual_seed(433), num_inference_steps=40, guidance_scale=3.0, ).images[0] image.save("chroma-single-file.png") ``` ## ChromaPipeline [[autodoc]] ChromaPipeline - all - __call__ ## ChromaImg2ImgPipeline [[autodoc]] ChromaImg2ImgPipeline - all - __call__ ## ChromaInpaintPipeline [[autodoc]] ChromaInpaintPipeline - all - __call__ ================================================ FILE: docs/source/en/api/pipelines/chronoedit.md ================================================ # ChronoEdit [ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation](https://huggingface.co/papers/2510.04290) from NVIDIA and University of Toronto, by Jay Zhangjie Wu, Xuanchi Ren, Tianchang Shen, Tianshi Cao, Kai He, Yifan Lu, Ruiyuan Gao, Enze Xie, Shiyi Lan, Jose M. Alvarez, Jun Gao, Sanja Fidler, Zian Wang, Huan Ling.
> **TL;DR:** ChronoEdit reframes image editing as a video generation task, using input and edited images as start/end frames to leverage pretrained video models with temporal consistency. A temporal reasoning stage introduces reasoning tokens to ensure physically plausible edits and visualize the editing trajectory. *Recent advances in large generative models have greatly enhanced both image editing and in-context image generation, yet a critical gap remains in ensuring physical consistency, where edited objects must remain coherent. This capability is especially vital for world simulation related tasks. In this paper, we present ChronoEdit, a framework that reframes image editing as a video generation problem. First, ChronoEdit treats the input and edited images as the first and last frames of a video, allowing it to leverage large pretrained video generative models that capture not only object appearance but also the implicit physics of motion and interaction through learned temporal consistency. Second, ChronoEdit introduces a temporal reasoning stage that explicitly performs editing at inference time. Under this setting, target frame is jointly denoised with reasoning tokens to imagine a plausible editing trajectory that constrains the solution space to physically viable transformations. The reasoning tokens are then dropped after a few steps to avoid the high computational cost of rendering a full video. To validate ChronoEdit, we introduce PBench-Edit, a new benchmark of image-prompt pairs for contexts that require physical consistency, and demonstrate that ChronoEdit surpasses state-of-the-art baselines in both visual fidelity and physical plausibility. Project page for code and models: [this https URL](https://research.nvidia.com/labs/toronto-ai/chronoedit).* The ChronoEdit pipeline is developed by the ChronoEdit Team. 
The original code is available on [GitHub](https://github.com/nv-tlabs/ChronoEdit), and pretrained models can be found in the [nvidia/ChronoEdit](https://huggingface.co/collections/nvidia/chronoedit) collection on Hugging Face. Available Models/LoRAs: - [nvidia/ChronoEdit-14B-Diffusers](https://huggingface.co/nvidia/ChronoEdit-14B-Diffusers) - [nvidia/ChronoEdit-14B-Diffusers-Upscaler-Lora](https://huggingface.co/nvidia/ChronoEdit-14B-Diffusers-Upscaler-Lora) - [nvidia/ChronoEdit-14B-Diffusers-Paint-Brush-Lora](https://huggingface.co/nvidia/ChronoEdit-14B-Diffusers-Paint-Brush-Lora) ### Image Editing ```py import torch import numpy as np from diffusers import AutoencoderKLWan, ChronoEditTransformer3DModel, ChronoEditPipeline from diffusers.utils import export_to_video, load_image from transformers import CLIPVisionModel from PIL import Image model_id = "nvidia/ChronoEdit-14B-Diffusers" image_encoder = CLIPVisionModel.from_pretrained(model_id, subfolder="image_encoder", torch_dtype=torch.float32) vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32) transformer = ChronoEditTransformer3DModel.from_pretrained(model_id, subfolder="transformer", torch_dtype=torch.bfloat16) pipe = ChronoEditPipeline.from_pretrained(model_id, image_encoder=image_encoder, transformer=transformer, vae=vae, torch_dtype=torch.bfloat16) pipe.to("cuda") image = load_image( "https://huggingface.co/spaces/nvidia/ChronoEdit/resolve/main/examples/3.png" ) max_area = 720 * 1280 aspect_ratio = image.height / image.width mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1] height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value print("width", width, "height", height) image = image.resize((width, height)) prompt = ( "The user wants to transform the image by adding a small, cute mouse sitting inside the floral teacup, enjoying a spa bath. 
The mouse should appear relaxed and cheerful, with a tiny white bath towel draped over its head like a turban. It should be positioned comfortably in the cup’s liquid, with gentle steam rising around it to blend with the cozy atmosphere. " "The mouse’s pose should be natural—perhaps sitting upright with paws resting lightly on the rim or submerged in the tea. The teacup’s floral design, gold trim, and warm lighting must remain unchanged to preserve the original aesthetic. The steam should softly swirl around the mouse, enhancing the spa-like, whimsical mood." ) output = pipe( image=image, prompt=prompt, height=height, width=width, num_frames=5, num_inference_steps=50, guidance_scale=5.0, enable_temporal_reasoning=False, num_temporal_reasoning_steps=0, ).frames[0] Image.fromarray((output[-1] * 255).clip(0, 255).astype("uint8")).save("output.png") ``` Optionally, enable **temporal reasoning** for improved physical consistency: ```py output = pipe( image=image, prompt=prompt, height=height, width=width, num_frames=29, num_inference_steps=50, guidance_scale=5.0, enable_temporal_reasoning=True, num_temporal_reasoning_steps=50, ).frames[0] export_to_video(output, "output.mp4", fps=16) Image.fromarray((output[-1] * 255).clip(0, 255).astype("uint8")).save("output.png") ``` ### Inference with 8-Step Distillation Lora ```py import torch import numpy as np from diffusers import AutoencoderKLWan, ChronoEditTransformer3DModel, ChronoEditPipeline from diffusers.schedulers import UniPCMultistepScheduler from diffusers.utils import export_to_video, load_image from transformers import CLIPVisionModel from PIL import Image model_id = "nvidia/ChronoEdit-14B-Diffusers" image_encoder = CLIPVisionModel.from_pretrained(model_id, subfolder="image_encoder", torch_dtype=torch.float32) vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32) transformer = ChronoEditTransformer3DModel.from_pretrained(model_id, subfolder="transformer", 
torch_dtype=torch.bfloat16) pipe = ChronoEditPipeline.from_pretrained(model_id, image_encoder=image_encoder, transformer=transformer, vae=vae, torch_dtype=torch.bfloat16) pipe.load_lora_weights("nvidia/ChronoEdit-14B-Diffusers", weight_name="lora/chronoedit_distill_lora.safetensors", adapter_name="distill") pipe.fuse_lora(adapter_names=["distill"], lora_scale=1.0) pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=2.0) pipe.to("cuda") image = load_image( "https://huggingface.co/spaces/nvidia/ChronoEdit/resolve/main/examples/3.png" ) max_area = 720 * 1280 aspect_ratio = image.height / image.width mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1] height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value print("width", width, "height", height) image = image.resize((width, height)) prompt = ( "The user wants to transform the image by adding a small, cute mouse sitting inside the floral teacup, enjoying a spa bath. The mouse should appear relaxed and cheerful, with a tiny white bath towel draped over its head like a turban. It should be positioned comfortably in the cup’s liquid, with gentle steam rising around it to blend with the cozy atmosphere. " "The mouse’s pose should be natural—perhaps sitting upright with paws resting lightly on the rim or submerged in the tea. The teacup’s floral design, gold trim, and warm lighting must remain unchanged to preserve the original aesthetic. The steam should softly swirl around the mouse, enhancing the spa-like, whimsical mood." 
) output = pipe( image=image, prompt=prompt, height=height, width=width, num_frames=5, num_inference_steps=8, guidance_scale=1.0, enable_temporal_reasoning=False, num_temporal_reasoning_steps=0, ).frames[0] export_to_video(output, "output.mp4", fps=16) Image.fromarray((output[-1] * 255).clip(0, 255).astype("uint8")).save("output.png") ``` ### Inference with Multiple LoRAs ```py import torch import numpy as np from diffusers import AutoencoderKLWan, ChronoEditTransformer3DModel, ChronoEditPipeline from diffusers.schedulers import UniPCMultistepScheduler from diffusers.utils import export_to_video, load_image from transformers import CLIPVisionModel from PIL import Image model_id = "nvidia/ChronoEdit-14B-Diffusers" image_encoder = CLIPVisionModel.from_pretrained(model_id, subfolder="image_encoder", torch_dtype=torch.float32) vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32) transformer = ChronoEditTransformer3DModel.from_pretrained(model_id, subfolder="transformer", torch_dtype=torch.bfloat16) pipe = ChronoEditPipeline.from_pretrained(model_id, image_encoder=image_encoder, transformer=transformer, vae=vae, torch_dtype=torch.bfloat16) pipe.load_lora_weights("nvidia/ChronoEdit-14B-Diffusers-Paint-Brush-Lora", weight_name="paintbrush_lora_diffusers.safetensors", adapter_name="paintbrush") pipe.load_lora_weights("nvidia/ChronoEdit-14B-Diffusers", weight_name="lora/chronoedit_distill_lora.safetensors", adapter_name="distill") pipe.fuse_lora(adapter_names=["paintbrush", "distill"], lora_scale=1.0) pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=2.0) pipe.to("cuda") image = load_image( "https://raw.githubusercontent.com/nv-tlabs/ChronoEdit/refs/heads/main/assets/images/input_paintbrush.png" ) max_area = 720 * 1280 aspect_ratio = image.height / image.width mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1] height = round(np.sqrt(max_area * aspect_ratio)) // 
mod_value * mod_value width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value print("width", width, "height", height) image = image.resize((width, height)) prompt = ( "Turn the pencil sketch in the image into an actual object that is consistent with the image’s content. The user wants to change the sketch to a crown and a hat." ) output = pipe( image=image, prompt=prompt, height=height, width=width, num_frames=5, num_inference_steps=8, guidance_scale=1.0, enable_temporal_reasoning=False, num_temporal_reasoning_steps=0, ).frames[0] export_to_video(output, "output.mp4", fps=16) Image.fromarray((output[-1] * 255).clip(0, 255).astype("uint8")).save("output_1.png") ``` ## ChronoEditPipeline [[autodoc]] ChronoEditPipeline - all - __call__ ## ChronoEditPipelineOutput [[autodoc]] pipelines.chronoedit.pipeline_output.ChronoEditPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/cogvideox.md ================================================ # CogVideoX [CogVideoX](https://huggingface.co/papers/2408.06072) is a large diffusion transformer model - available in 2B and 5B parameters - designed to generate longer and more consistent videos from text. This model uses a 3D causal variational autoencoder to more efficiently process video data by reducing sequence length (and associated training compute) and preventing flickering in generated videos. An "expert" transformer with adaptive LayerNorm improves alignment between text and video, and 3D full attention helps accurately capture motion and time in generated videos. You can find all the original CogVideoX checkpoints under the [CogVideoX](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce) collection. > [!TIP] > Click on the CogVideoX models in the right sidebar for more examples of other video generation tasks. The example below demonstrates how to generate a video optimized for memory or inference speed. 
Refer to the [Reduce memory usage](../../optimization/memory) guide for more details about the various memory-saving techniques.

The quantized CogVideoX 5B model below requires ~16GB of VRAM.

```py
import torch
from diffusers import CogVideoXPipeline, AutoModel
from diffusers.quantizers import PipelineQuantizationConfig
from diffusers.utils import export_to_video

# quantize weights to int8 with torchao
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="torchao",
    quant_kwargs={"quant_type": "int8wo"},
    components_to_quantize="transformer"
)

# fp8 layerwise weight-casting
transformer = AutoModel.from_pretrained(
    "THUDM/CogVideoX-5b",
    subfolder="transformer",
    torch_dtype=torch.bfloat16
)
transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16
)

pipeline = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    transformer=transformer,
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16
)

# model-offloading (handles device placement, so the pipeline is not moved to "cuda" manually)
pipeline.enable_model_cpu_offload()

prompt = """
A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.
"""

video = pipeline(
    prompt=prompt,
    guidance_scale=6,
    num_inference_steps=50
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```

[Compilation](../../optimization/fp16#torchcompile) is slow the first time but subsequent calls to the pipeline are faster.
The average inference time with torch.compile on an 80GB A100 is 76.27 seconds compared to 96.89 seconds for an uncompiled model.

```py
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipeline = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16
).to("cuda")

# torch.compile
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.transformer = torch.compile(
    pipeline.transformer, mode="max-autotune", fullgraph=True
)

prompt = """
A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.
"""

video = pipeline(
    prompt=prompt,
    guidance_scale=6,
    num_inference_steps=50
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```

## Notes

- CogVideoX supports LoRAs with [`~loaders.CogVideoXLoraLoaderMixin.load_lora_weights`].

  ```py
  import torch
  from diffusers import CogVideoXPipeline
  from diffusers.utils import export_to_video

  pipeline = CogVideoXPipeline.from_pretrained(
      "THUDM/CogVideoX-5b",
      torch_dtype=torch.bfloat16
  )

  # load LoRA weights
  pipeline.load_lora_weights("finetrainers/CogVideoX-1.5-crush-smol-v0", adapter_name="crush-lora")
  pipeline.set_adapters("crush-lora", 0.9)

  # model-offloading (handles device placement, so the pipeline is not moved to "cuda" manually)
  pipeline.enable_model_cpu_offload()

  prompt = """
  PIKA_CRUSH A large metal cylinder is seen pressing down on a pile of Oreo cookies, flattening them as if they were under a hydraulic press.
  """
  negative_prompt = "inconsistent motion, blurry motion, worse quality, degenerate outputs, deformed outputs"

  video = pipeline(
      prompt=prompt,
      negative_prompt=negative_prompt,
      num_frames=81,
      height=480,
      width=768,
      num_inference_steps=50
  ).frames[0]
  export_to_video(video, "output.mp4", fps=16)
  ```

- The text-to-video (T2V) checkpoints work best with a resolution of 1360x768 because that is the resolution they were pretrained on.
- The image-to-video (I2V) checkpoints work with multiple resolutions. The width can vary from 768 to 1360, but the height must be 768. Both height and width must be divisible by 16.
- Both T2V and I2V checkpoints work best with 81 and 161 frames. It is recommended to export the generated video at 16fps.
- Refer to the table below to view memory usage when various memory-saving techniques are enabled.

  | method | memory usage (enabled) | memory usage (disabled) |
  |---|---|---|
  | enable_model_cpu_offload | 19GB | 33GB |
  | enable_sequential_cpu_offload | <4GB | ~33GB (very slow inference speed) |
  | enable_tiling | 11GB (with enable_model_cpu_offload) | --- |

## CogVideoXPipeline

[[autodoc]] CogVideoXPipeline
  - all
  - __call__

## CogVideoXImageToVideoPipeline

[[autodoc]] CogVideoXImageToVideoPipeline
  - all
  - __call__

## CogVideoXVideoToVideoPipeline

[[autodoc]] CogVideoXVideoToVideoPipeline
  - all
  - __call__

## CogVideoXFunControlPipeline

[[autodoc]] CogVideoXFunControlPipeline
  - all
  - __call__

## CogVideoXPipelineOutput

[[autodoc]] pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput

================================================
FILE: docs/source/en/api/pipelines/cogview3.md
================================================

# CogView3Plus

[CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion](https://huggingface.co/papers/2403.05121) from Tsinghua University & ZhipuAI, by Wendi Zheng, Jiayan Teng, Zhuoyi Yang, Weihan Wang, Jidong Chen, Xiaotao Gu, Yuxiao Dong, Ming Ding, Jie Tang.

The abstract from the paper is:

*Recent advancements in text-to-image generative systems have been largely driven by diffusion models. However, single-stage text-to-image diffusion models still face challenges, in terms of computational efficiency and the refinement of image details.
To tackle the issue, we propose CogView3, an innovative cascaded framework that enhances the performance of text-to-image diffusion. CogView3 is the first model implementing relay diffusion in the realm of text-to-image generation, executing the task by first creating low-resolution images and subsequently applying relay-based super-resolution. This methodology not only results in competitive text-to-image outputs but also greatly reduces both training and inference costs. Our experimental results demonstrate that CogView3 outperforms SDXL, the current state-of-the-art open-source text-to-image diffusion model, by 77.0% in human evaluations, all while requiring only about 1/2 of the inference time. The distilled variant of CogView3 achieves comparable performance while only utilizing 1/10 of the inference time by SDXL.* > [!TIP] > Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. This pipeline was contributed by [zRzRzRzRzRzRzR](https://github.com/zRzRzRzRzRzRzR). The original codebase can be found [here](https://huggingface.co/THUDM). The original weights can be found under [hf.co/THUDM](https://huggingface.co/THUDM). 
## CogView3PlusPipeline [[autodoc]] CogView3PlusPipeline - all - __call__ ## CogView3PipelineOutput [[autodoc]] pipelines.cogview3.pipeline_output.CogView3PipelineOutput ================================================ FILE: docs/source/en/api/pipelines/cogview4.md ================================================ # CogView4 > [!TIP] > Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. This pipeline was contributed by [zRzRzRzRzRzRzR](https://github.com/zRzRzRzRzRzRzR). The original codebase can be found [here](https://huggingface.co/THUDM). The original weights can be found under [hf.co/THUDM](https://huggingface.co/THUDM). ## CogView4Pipeline [[autodoc]] CogView4Pipeline - all - __call__ ## CogView4PipelineOutput [[autodoc]] pipelines.cogview4.pipeline_output.CogView4PipelineOutput ================================================ FILE: docs/source/en/api/pipelines/consisid.md ================================================ # ConsisID
[Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://huggingface.co/papers/2411.17440) from Peking University & University of Rochester & etc, by Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyang Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, Li Yuan. The abstract from the paper is: *Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. It is an important task in video generation but remains an open problem for generative models. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in the literature: (1) A tuning-free pipeline without tedious case-by-case finetuning, and (2) A frequency-aware heuristic identity-preserving Diffusion Transformer (DiT)-based control scheme. To achieve these goals, we propose **ConsisID**, a tuning-free DiT-based controllable IPT2V model to keep human-**id**entity **consis**tent in the generated video. Inspired by prior findings in frequency analysis of vision/diffusion transformers, it employs identity-control signals in the frequency domain, where facial features can be decomposed into low-frequency global features (e.g., profile, proportions) and high-frequency intrinsic features (e.g., identity markers that remain unaffected by pose changes). First, from a low-frequency perspective, we introduce a global facial extractor, which encodes the reference image and facial key points into a latent space, generating features enriched with low-frequency information. These features are then integrated into the shallow layers of the network to alleviate training challenges associated with DiT. Second, from a high-frequency perspective, we design a local facial extractor to capture high-frequency details and inject them into the transformer blocks, enhancing the model's ability to preserve fine-grained features. 
To leverage the frequency information for identity preservation, we propose a hierarchical training strategy, transforming a vanilla pre-trained video generation model into an IPT2V model. Extensive experiments demonstrate that our frequency-aware heuristic scheme provides an optimal control solution for DiT-based models. Thanks to this scheme, our **ConsisID** achieves excellent results in generating high-quality, identity-preserving videos, making strides towards more effective IPT2V. The model weight of ConsID is publicly available at https://github.com/PKU-YuanGroup/ConsisID.*

> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

This pipeline was contributed by [SHYuanBest](https://github.com/SHYuanBest). The original codebase can be found [here](https://github.com/PKU-YuanGroup/ConsisID). The original weights can be found under [hf.co/BestWishYsh](https://huggingface.co/BestWishYsh).

There are two official ConsisID checkpoints for identity-preserving text-to-video.

| checkpoints | recommended inference dtype |
|:---:|:---:|
| [`BestWishYsh/ConsisID-preview`](https://huggingface.co/BestWishYsh/ConsisID-preview) | torch.bfloat16 |
| [`BestWishYsh/ConsisID-1.5`](https://huggingface.co/BestWishYsh/ConsisID-1.5) | torch.bfloat16 |

### Memory optimization

ConsisID requires about 44 GB of GPU memory to decode 49 frames (6 seconds of video at 8 FPS) with output resolution 720x480 (W x H), which makes it impossible to run on consumer GPUs or free-tier T4 Colab. The following memory optimizations can be used to reduce the memory footprint. To reproduce these numbers, refer to [this](https://gist.github.com/SHYuanBest/bc4207c36f454f9e969adbb50eaf8258) script.

| Feature (applied on top of the previous ones) | Max Memory Allocated | Max Memory Reserved |
| :----------------------------- | :------------------- | :------------------ |
| - | 37 GB | 44 GB |
| enable_model_cpu_offload | 22 GB | 25 GB |
| enable_sequential_cpu_offload | 16 GB | 22 GB |
| vae.enable_slicing | 16 GB | 22 GB |
| vae.enable_tiling | 5 GB | 7 GB |

## ConsisIDPipeline

[[autodoc]] ConsisIDPipeline
  - all
  - __call__

## ConsisIDPipelineOutput

[[autodoc]] pipelines.consisid.pipeline_output.ConsisIDPipelineOutput

================================================
FILE: docs/source/en/api/pipelines/consistency_models.md
================================================

# Consistency Models

Consistency Models were proposed in [Consistency Models](https://huggingface.co/papers/2303.01469) by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever.

The abstract from the paper is:

*Diffusion models have significantly advanced the fields of image, audio, and video generation, but they depend on an iterative sampling process that causes slow generation. To overcome this limitation, we propose consistency models, a new family of models that generate high quality samples by directly mapping noise to data. They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality. They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either by distilling pre-trained diffusion models, or as standalone generative models altogether. Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step sampling, achieving the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step generation.
When trained in isolation, consistency models become a new family of generative models that can outperform existing one-step, non-adversarial generative models on standard benchmarks such as CIFAR-10, ImageNet 64x64 and LSUN 256x256.* The original codebase can be found at [openai/consistency_models](https://github.com/openai/consistency_models), and additional checkpoints are available at [openai](https://huggingface.co/openai). The pipeline was contributed by [dg845](https://github.com/dg845) and [ayushtues](https://huggingface.co/ayushtues). ❤️ ## Tips For an additional speed-up, use `torch.compile` to generate multiple images in <1 second: ```diff import torch from diffusers import ConsistencyModelPipeline device = "cuda" # Load the cd_bedroom256_lpips checkpoint. model_id_or_path = "openai/diffusers-cd_bedroom256_lpips" pipe = ConsistencyModelPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16) pipe.to(device) + pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) # Multistep sampling # Timesteps can be explicitly specified; the particular timesteps below are from the original GitHub repo: # https://github.com/openai/consistency_models/blob/main/scripts/launch.sh#L83 for _ in range(10): image = pipe(timesteps=[17, 0]).images[0] image.show() ``` ## ConsistencyModelPipeline [[autodoc]] ConsistencyModelPipeline - all - __call__ ## ImagePipelineOutput [[autodoc]] pipelines.ImagePipelineOutput ================================================ FILE: docs/source/en/api/pipelines/control_flux_inpaint.md ================================================ # FluxControlInpaint
FluxControlInpaintPipeline is an implementation of Inpainting for Flux.1 Depth/Canny models. It is a pipeline that allows you to inpaint images using the Flux.1 Depth/Canny models. The pipeline takes an image and a mask as input and returns the inpainted image. FLUX.1 Depth and Canny [dev] is a 12 billion parameter rectified flow transformer capable of generating an image based on a text description while following the structure of a given input image. **This is not a ControlNet model**. | Control type | Developer | Link | | -------- | ---------- | ---- | | Depth | [Black Forest Labs](https://huggingface.co/black-forest-labs) | [Link](https://huggingface.co/black-forest-labs/FLUX.1-Depth-dev) | | Canny | [Black Forest Labs](https://huggingface.co/black-forest-labs) | [Link](https://huggingface.co/black-forest-labs/FLUX.1-Canny-dev) | > [!TIP] > Flux can be quite expensive to run on consumer hardware devices. However, you can perform a suite of optimizations to run it faster and in a more memory-friendly manner. Check out [this section](https://huggingface.co/blog/sd3#memory-optimizations-for-sd3) for more details. Additionally, Flux can benefit from quantization for memory efficiency with a trade-off in inference latency. Refer to [this blog post](https://huggingface.co/blog/quanto-diffusers) to learn more. For an exhaustive list of resources, check out [this gist](https://gist.github.com/sayakpaul/b664605caf0aa3bf8585ab109dd5ac9c). 
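
The mask passed as `mask_image` marks the region to repaint. The snippet below is a minimal sketch of building a rectangular mask with NumPy, assuming a hypothetical 768x1024 image and the common inpainting convention that white (255) pixels are repainted while black (0) pixels are kept. `box_mask` is an illustrative helper and not part of Diffusers.

```python
import numpy as np


def box_mask(height, width, top, bottom, left, right):
    """Return a uint8 mask: 255 inside rows [top, bottom) and columns [left, right), 0 elsewhere."""
    mask = np.zeros((height, width), dtype=np.uint8)
    mask[top:bottom, left:right] = 255
    return mask


# Hypothetical image size; the box coordinates mirror the ones used in this section's example.
mask = box_mask(height=768, width=1024, top=65, bottom=580, left=300, right=642)
print(mask.shape, int((mask == 255).sum()))
```

The resulting array can be wrapped with `Image.fromarray(mask)` from Pillow before it is handed to the pipeline.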
```python
import torch
from diffusers import FluxControlInpaintPipeline
from diffusers.models.transformers import FluxTransformer2DModel
from transformers import T5EncoderModel
from diffusers.utils import load_image, make_image_grid
from image_gen_aux import DepthPreprocessor  # https://github.com/huggingface/image_gen_aux
from PIL import Image
import numpy as np

pipe = FluxControlInpaintPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Depth-dev",
    torch_dtype=torch.bfloat16,
)
# use following lines if you have GPU constraints
# ---------------------------------------------------------------
transformer = FluxTransformer2DModel.from_pretrained(
    "sayakpaul/FLUX.1-Depth-dev-nf4", subfolder="transformer", torch_dtype=torch.bfloat16
)
text_encoder_2 = T5EncoderModel.from_pretrained(
    "sayakpaul/FLUX.1-Depth-dev-nf4", subfolder="text_encoder_2", torch_dtype=torch.bfloat16
)
pipe.transformer = transformer
pipe.text_encoder_2 = text_encoder_2
pipe.enable_model_cpu_offload()
# ---------------------------------------------------------------
# without the block above, move the whole pipeline to the GPU instead;
# do not combine this with enable_model_cpu_offload()
# pipe.to("cuda")

prompt = "a blue robot singing opera with human-like expressions"
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/robot.png")

# white pixels mark the region to inpaint
head_mask = np.zeros_like(image)
head_mask[65:580,300:642] = 255
mask_image = Image.fromarray(head_mask)

# depth map used as the control image
processor = DepthPreprocessor.from_pretrained("LiheYoung/depth-anything-large-hf")
control_image = processor(image)[0].convert("RGB")

output = pipe(
    prompt=prompt,
    image=image,
    control_image=control_image,
    mask_image=mask_image,
    num_inference_steps=30,
    strength=0.9,
    guidance_scale=10.0,
    generator=torch.Generator().manual_seed(42),
).images[0]
make_image_grid([image, control_image, mask_image, output.resize(image.size)], rows=1, cols=4).save("output.png")
```

## FluxControlInpaintPipeline

[[autodoc]] FluxControlInpaintPipeline
  - all
  - __call__

## FluxPipelineOutput

[[autodoc]] pipelines.flux.pipeline_output.FluxPipelineOutput

================================================ FILE: docs/source/en/api/pipelines/controlnet.md ================================================ # ControlNet
ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process. The abstract from the paper is: *We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* This model was contributed by [takuma104](https://huggingface.co/takuma104). ❤️ The original codebase can be found at [lllyasviel/ControlNet](https://github.com/lllyasviel/ControlNet), and you can find official ControlNet checkpoints on [lllyasviel's](https://huggingface.co/lllyasviel) Hub profile. 
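
As a rough illustration of the "zero convolutions" mentioned in the abstract, the pure-Python sketch below shows why zero-initialized layers leave the frozen model's output unchanged at the start of finetuning. The helper names (`conv1x1`, `zero_conv_params`, `controlled_block`, `toy_block`) and the wiring are hypothetical and heavily simplified; this is not the Diffusers or ControlNet implementation.

```python
def conv1x1(x, weight, bias):
    # Toy 1x1 "convolution" over a list of channel values: y[o] = sum_i weight[o][i] * x[i] + bias[o].
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def zero_conv_params(channels):
    # Weights and bias both start at zero, as the abstract describes.
    weight = [[0.0] * channels for _ in range(channels)]
    bias = [0.0] * channels
    return weight, bias


def controlled_block(x, control, frozen_block, trainable_copy, zero_in, zero_out):
    # Assumed wiring: the frozen output plus a contribution from a trainable copy,
    # where the control signal and the copy's output each pass through a zero convolution.
    base = frozen_block(x)
    injected = [a + b for a, b in zip(x, conv1x1(control, *zero_in))]
    delta = conv1x1(trainable_copy(injected), *zero_out)
    return [a + b for a, b in zip(base, delta)]


def toy_block(x):
    # Stand-in for a pretrained layer.
    return [2.0 * v + 1.0 for v in x]


x = [0.5, -1.0, 3.0]
control = [10.0, 20.0, 30.0]
out = controlled_block(x, control, toy_block, toy_block, zero_conv_params(3), zero_conv_params(3))
print(out == toy_block(x))
```

Before any training step, the control branch adds exactly zero, so the combined block reproduces the frozen block; the control signal only starts to matter once the zero-initialized parameters are updated.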
> [!TIP] > Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. ## StableDiffusionControlNetPipeline [[autodoc]] StableDiffusionControlNetPipeline - all - __call__ - enable_attention_slicing - disable_attention_slicing - enable_vae_slicing - disable_vae_slicing - enable_xformers_memory_efficient_attention - disable_xformers_memory_efficient_attention - load_textual_inversion ## StableDiffusionControlNetImg2ImgPipeline [[autodoc]] StableDiffusionControlNetImg2ImgPipeline - all - __call__ - enable_attention_slicing - disable_attention_slicing - enable_vae_slicing - disable_vae_slicing - enable_xformers_memory_efficient_attention - disable_xformers_memory_efficient_attention - load_textual_inversion ## StableDiffusionControlNetInpaintPipeline [[autodoc]] StableDiffusionControlNetInpaintPipeline - all - __call__ - enable_attention_slicing - disable_attention_slicing - enable_vae_slicing - disable_vae_slicing - enable_xformers_memory_efficient_attention - disable_xformers_memory_efficient_attention - load_textual_inversion ## StableDiffusionPipelineOutput [[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/controlnet_flux.md ================================================ # ControlNet with Flux.1
FluxControlNetPipeline is an implementation of ControlNet for Flux.1. ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process. The abstract from the paper is: *We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* This controlnet code is implemented by [The InstantX Team](https://huggingface.co/InstantX). 
You can find pre-trained checkpoints for Flux-ControlNet in the table below:

| ControlNet type | Developer | Link |
| -------- | ---------- | ---- |
| Canny | [The InstantX Team](https://huggingface.co/InstantX) | [Link](https://huggingface.co/InstantX/FLUX.1-dev-Controlnet-Canny) |
| Depth | [The InstantX Team](https://huggingface.co/InstantX) | [Link](https://huggingface.co/Shakker-Labs/FLUX.1-dev-ControlNet-Depth) |
| Union | [The InstantX Team](https://huggingface.co/InstantX) | [Link](https://huggingface.co/InstantX/FLUX.1-dev-Controlnet-Union) |

XLabs ControlNets are also supported; support for them was contributed by the [XLabs team](https://huggingface.co/XLabs-AI).

| ControlNet type | Developer | Link |
| -------- | ---------- | ---- |
| Canny | [The XLabs Team](https://huggingface.co/XLabs-AI) | [Link](https://huggingface.co/XLabs-AI/flux-controlnet-canny-diffusers) |
| Depth | [The XLabs Team](https://huggingface.co/XLabs-AI) | [Link](https://huggingface.co/XLabs-AI/flux-controlnet-depth-diffusers) |
| HED | [The XLabs Team](https://huggingface.co/XLabs-AI) | [Link](https://huggingface.co/XLabs-AI/flux-controlnet-hed-diffusers) |

> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

## FluxControlNetPipeline

[[autodoc]] FluxControlNetPipeline
  - all
  - __call__

## FluxPipelineOutput

[[autodoc]] pipelines.flux.pipeline_output.FluxPipelineOutput

================================================
FILE: docs/source/en/api/pipelines/controlnet_hunyuandit.md
================================================

# ControlNet with Hunyuan-DiT

HunyuanDiTControlNetPipeline is an implementation of ControlNet for [Hunyuan-DiT](https://huggingface.co/papers/2405.08748).
ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. With a ControlNet model, you can provide an additional control image to condition and control Hunyuan-DiT generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process. The abstract from the paper is: *We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* This code is implemented by Tencent Hunyuan Team. You can find pre-trained checkpoints for Hunyuan-DiT ControlNets on [Tencent Hunyuan](https://huggingface.co/Tencent-Hunyuan). 
> [!TIP] > Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. ## HunyuanDiTControlNetPipeline [[autodoc]] HunyuanDiTControlNetPipeline - all - __call__ ================================================ FILE: docs/source/en/api/pipelines/controlnet_sana.md ================================================ # ControlNet
ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process. The abstract from the paper is: *We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* This pipeline was contributed by [ishan24](https://huggingface.co/ishan24). ❤️ The original codebase can be found at [NVlabs/Sana](https://github.com/NVlabs/Sana), and you can find official ControlNet checkpoints on [Efficient-Large-Model's](https://huggingface.co/Efficient-Large-Model) Hub profile. 
## SanaControlNetPipeline [[autodoc]] SanaControlNetPipeline - all - __call__ ## SanaPipelineOutput [[autodoc]] pipelines.sana.pipeline_output.SanaPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/controlnet_sd3.md ================================================ # ControlNet with Stable Diffusion 3
StableDiffusion3ControlNetPipeline is an implementation of ControlNet for Stable Diffusion 3. ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process. The abstract from the paper is: *We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* This controlnet code is mainly implemented by [The InstantX Team](https://huggingface.co/InstantX). The inpainting-related code was developed by [The Alimama Creative Team](https://huggingface.co/alimama-creative). 
You can find pre-trained checkpoints for SD3-ControlNet in the table below:

| ControlNet type | Developer | Link |
| -------- | ---------- | ---- |
| Canny | [The InstantX Team](https://huggingface.co/InstantX) | [Link](https://huggingface.co/InstantX/SD3-Controlnet-Canny) |
| Depth | [The InstantX Team](https://huggingface.co/InstantX) | [Link](https://huggingface.co/InstantX/SD3-Controlnet-Depth) |
| Pose | [The InstantX Team](https://huggingface.co/InstantX) | [Link](https://huggingface.co/InstantX/SD3-Controlnet-Pose) |
| Tile | [The InstantX Team](https://huggingface.co/InstantX) | [Link](https://huggingface.co/InstantX/SD3-Controlnet-Tile) |
| Inpainting | [The Alimama Creative Team](https://huggingface.co/alimama-creative) | [Link](https://huggingface.co/alimama-creative/SD3-Controlnet-Inpainting) |

> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

## StableDiffusion3ControlNetPipeline
[[autodoc]] StableDiffusion3ControlNetPipeline
    - all
    - __call__

## StableDiffusion3ControlNetInpaintingPipeline
[[autodoc]] pipelines.controlnet_sd3.pipeline_stable_diffusion_3_controlnet_inpainting.StableDiffusion3ControlNetInpaintingPipeline
    - all
    - __call__

## StableDiffusion3PipelineOutput
[[autodoc]] pipelines.stable_diffusion_3.pipeline_output.StableDiffusion3PipelineOutput

================================================
FILE: docs/source/en/api/pipelines/controlnet_sdxl.md
================================================
# ControlNet with Stable Diffusion XL
LoRA
ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process. The abstract from the paper is: *We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* You can find additional smaller Stable Diffusion XL (SDXL) ControlNet checkpoints from the 🤗 [Diffusers](https://huggingface.co/diffusers) Hub organization, and browse [community-trained](https://huggingface.co/models?other=stable-diffusion-xl&other=controlnet) checkpoints on the Hub. > [!WARNING] > 🧪 Many of the SDXL ControlNet checkpoints are experimental, and there is a lot of room for improvement. 
Feel free to open an [Issue](https://github.com/huggingface/diffusers/issues/new/choose) and leave us feedback on how we can improve! If you don't see a checkpoint you're interested in, you can train your own SDXL ControlNet with our [training script](../../../../../examples/controlnet/README_sdxl). > [!TIP] > Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. ## StableDiffusionXLControlNetPipeline [[autodoc]] StableDiffusionXLControlNetPipeline - all - __call__ ## StableDiffusionXLControlNetImg2ImgPipeline [[autodoc]] StableDiffusionXLControlNetImg2ImgPipeline - all - __call__ ## StableDiffusionXLControlNetInpaintPipeline [[autodoc]] StableDiffusionXLControlNetInpaintPipeline - all - __call__ ## StableDiffusionPipelineOutput [[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/controlnet_union.md ================================================ # ControlNetUnion
LoRA
ControlNetUnionModel is an implementation of ControlNet for Stable Diffusion XL. The ControlNet model was introduced in [ControlNetPlus](https://github.com/xinsir6/ControlNetPlus) by xinsir6. It supports multiple conditioning inputs without increasing computation. *We design a new architecture that can support 10+ control types in condition text-to-image generation and can generate high resolution images visually comparable with midjourney. The network is based on the original ControlNet architecture, we propose two new modules to: 1 Extend the original ControlNet to support different image conditions using the same network parameter. 2 Support multiple conditions input without increasing computation offload, which is especially important for designers who want to edit image in detail, different conditions use the same condition encoder, without adding extra computations or parameters.* ## StableDiffusionXLControlNetUnionPipeline [[autodoc]] StableDiffusionXLControlNetUnionPipeline - all - __call__ ## StableDiffusionXLControlNetUnionImg2ImgPipeline [[autodoc]] StableDiffusionXLControlNetUnionImg2ImgPipeline - all - __call__ ## StableDiffusionXLControlNetUnionInpaintPipeline [[autodoc]] StableDiffusionXLControlNetUnionInpaintPipeline - all - __call__ ================================================ FILE: docs/source/en/api/pipelines/controlnetxs.md ================================================ > [!WARNING] > This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. # ControlNet-XS
LoRA
ControlNet-XS was introduced in [ControlNet-XS](https://vislearn.github.io/ControlNet-XS/) by Denis Zavadski and Carsten Rother. It is based on the observation that the control model in the [original ControlNet](https://huggingface.co/papers/2302.05543) can be made much smaller and still produce good results. Like the original ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process. ControlNet-XS generates images with comparable quality to a regular ControlNet, but it is 20-25% faster ([see benchmark](https://github.com/UmerHA/controlnet-xs-benchmark/blob/main/Speed%20Benchmark.ipynb) with StableDiffusion-XL) and uses ~45% less memory. Here's the overview from the [project page](https://vislearn.github.io/ControlNet-XS/): *With increasing computing capabilities, current model architectures appear to follow the trend of simply upscaling all components without validating the necessity for doing so. In this project we investigate the size and architectural design of ControlNet [Zhang et al., 2023] for controlling the image generation process with stable diffusion-based models. We show that a new architecture with as little as 1% of the parameters of the base model achieves state-of-the art results, considerably better than ControlNet in terms of FID score. Hence we call it ControlNet-XS. We provide the code for controlling StableDiffusion-XL [Podell et al., 2023] (Model B, 48M Parameters) and StableDiffusion 2.1 [Rombach et al. 2022] (Model B, 14M Parameters), all under openrail license.* This model was contributed by [UmerHA](https://twitter.com/UmerHAdil). 
❤️ > [!TIP] > Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. ## StableDiffusionControlNetXSPipeline [[autodoc]] StableDiffusionControlNetXSPipeline - all - __call__ ## StableDiffusionPipelineOutput [[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/controlnetxs_sdxl.md ================================================ > [!WARNING] > This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. # ControlNet-XS with Stable Diffusion XL ControlNet-XS was introduced in [ControlNet-XS](https://vislearn.github.io/ControlNet-XS/) by Denis Zavadski and Carsten Rother. It is based on the observation that the control model in the [original ControlNet](https://huggingface.co/papers/2302.05543) can be made much smaller and still produce good results. Like the original ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process. ControlNet-XS generates images with comparable quality to a regular ControlNet, but it is 20-25% faster ([see benchmark](https://github.com/UmerHA/controlnet-xs-benchmark/blob/main/Speed%20Benchmark.ipynb)) and uses ~45% less memory. 
Here's the overview from the [project page](https://vislearn.github.io/ControlNet-XS/): *With increasing computing capabilities, current model architectures appear to follow the trend of simply upscaling all components without validating the necessity for doing so. In this project we investigate the size and architectural design of ControlNet [Zhang et al., 2023] for controlling the image generation process with stable diffusion-based models. We show that a new architecture with as little as 1% of the parameters of the base model achieves state-of-the art results, considerably better than ControlNet in terms of FID score. Hence we call it ControlNet-XS. We provide the code for controlling StableDiffusion-XL [Podell et al., 2023] (Model B, 48M Parameters) and StableDiffusion 2.1 [Rombach et al. 2022] (Model B, 14M Parameters), all under openrail license.* This model was contributed by [UmerHA](https://twitter.com/UmerHAdil). ❤️ > [!WARNING] > 🧪 Many of the SDXL ControlNet checkpoints are experimental, and there is a lot of room for improvement. Feel free to open an [Issue](https://github.com/huggingface/diffusers/issues/new/choose) and leave us feedback on how we can improve! > [!TIP] > Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. 
## StableDiffusionXLControlNetXSPipeline [[autodoc]] StableDiffusionXLControlNetXSPipeline - all - __call__ ## StableDiffusionPipelineOutput [[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/cosmos.md ================================================ # Cosmos [Cosmos World Foundation Model Platform for Physical AI](https://huggingface.co/papers/2501.03575) by NVIDIA. *Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make our platform open-source and our models open-weight with permissive licenses available via https://github.com/NVIDIA/Cosmos.* > [!TIP] > Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. 
## Basic usage

```python
import torch
from diffusers import Cosmos2_5_PredictBasePipeline
from diffusers.utils import export_to_video

model_id = "nvidia/Cosmos-Predict2.5-2B"
pipe = Cosmos2_5_PredictBasePipeline.from_pretrained(
    model_id, revision="diffusers/base/post-trained", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt = "As the red light shifts to green, the red bus at the intersection begins to move forward, its headlights cutting through the falling snow. The snowy tire tracks deepen as the vehicle inches ahead, casting fresh lines onto the slushy road. Around it, streetlights glow warmer, illuminating the drifting flakes and wet reflections on the asphalt. Other cars behind start to edge forward, their beams joining the scene. The stillness of the urban street transitions into motion as the quiet snowfall is punctuated by the slow advance of traffic through the frosty city corridor."
negative_prompt = "The video captures a series of frames showing ugly scenes, static with no motion, motion blur, over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky movements, low frame rate, artifacting, color banding, unnatural transitions, outdated special effects, fake elements, unconvincing visuals, poorly edited content, jump cuts, visual noise, and flickering. Overall, the video is of poor quality."

output = pipe(
    image=None,
    video=None,
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_frames=93,
    generator=torch.Generator().manual_seed(1),
).frames[0]
export_to_video(output, "text2world.mp4", fps=16)
```

## Cosmos2_5_TransferPipeline
[[autodoc]] Cosmos2_5_TransferPipeline
    - all
    - __call__

## Cosmos2_5_PredictBasePipeline
[[autodoc]] Cosmos2_5_PredictBasePipeline
    - all
    - __call__

## CosmosTextToWorldPipeline
[[autodoc]] CosmosTextToWorldPipeline
    - all
    - __call__

## CosmosVideoToWorldPipeline
[[autodoc]] CosmosVideoToWorldPipeline
    - all
    - __call__

## Cosmos2TextToImagePipeline
[[autodoc]] Cosmos2TextToImagePipeline
    - all
    - __call__

## Cosmos2VideoToWorldPipeline
[[autodoc]] Cosmos2VideoToWorldPipeline
    - all
    - __call__

## CosmosPipelineOutput
[[autodoc]] pipelines.cosmos.pipeline_output.CosmosPipelineOutput

## CosmosImagePipelineOutput
[[autodoc]] pipelines.cosmos.pipeline_output.CosmosImagePipelineOutput

================================================
FILE: docs/source/en/api/pipelines/dance_diffusion.md
================================================
> [!WARNING]
> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model.

# Dance Diffusion

[Dance Diffusion](https://github.com/Harmonai-org/sample-generator) is by Zach Evans.

Dance Diffusion is the first in a suite of generative audio tools for producers and musicians released by [Harmonai](https://github.com/Harmonai-org).

> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
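A minimal usage sketch follows, assuming the `harmonai/maestro-150k` checkpoint (not named on this page) and SciPy for writing WAV files. The helper `expected_num_samples` is hypothetical and only illustrates how the requested duration relates to the model's sample rate; the pipeline may pad internally before trimming its output to that length.

```python
def expected_num_samples(audio_length_in_s: float, sample_rate: int) -> int:
    """Approximate number of samples in a clip of the given duration (hypothetical helper)."""
    return int(audio_length_in_s * sample_rate)


def generate_audio(model_id="harmonai/maestro-150k", audio_length_in_s=4.0, device="cuda"):
    """Generate unconditional audio clips and save each one as a WAV file."""
    # Imported lazily so the helper above works without diffusers or scipy installed.
    from diffusers import DanceDiffusionPipeline
    from scipy.io.wavfile import write

    pipe = DanceDiffusionPipeline.from_pretrained(model_id).to(device)
    sample_rate = pipe.unet.config.sample_rate

    # `audios` is a NumPy array shaped (batch, channels, samples).
    audios = pipe(audio_length_in_s=audio_length_in_s).audios

    for i, audio in enumerate(audios):
        # scipy expects (samples, channels), so each clip is transposed before writing.
        write(f"dance_diffusion_{i}.wav", sample_rate, audio.transpose())
    return audios
```

Because the pipeline is deprecated, the import path may differ in recent releases; if the import fails, the last Diffusers version that supported the model is the safer target.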
## DanceDiffusionPipeline [[autodoc]] DanceDiffusionPipeline - all - __call__ ## AudioPipelineOutput [[autodoc]] pipelines.AudioPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/ddim.md ================================================ # DDIM [Denoising Diffusion Implicit Models](https://huggingface.co/papers/2010.02502) (DDIM) by Jiaming Song, Chenlin Meng and Stefano Ermon. The abstract from the paper is: *Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from. We empirically demonstrate that DDIMs can produce high quality samples 10× to 50× faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.* The original codebase can be found at [ermongroup/ddim](https://github.com/ermongroup/ddim). ## DDIMPipeline [[autodoc]] DDIMPipeline - all - __call__ ## ImagePipelineOutput [[autodoc]] pipelines.ImagePipelineOutput ================================================ FILE: docs/source/en/api/pipelines/ddpm.md ================================================ # DDPM [Denoising Diffusion Probabilistic Models](https://huggingface.co/papers/2006.11239) (DDPM) by Jonathan Ho, Ajay Jain and Pieter Abbeel proposes a diffusion based model of the same name. 
In the 🤗 Diffusers library, DDPM refers to the *discrete denoising scheduler* from the paper as well as the pipeline.

The abstract from the paper is:

*We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN.*

The original codebase can be found at [hojonathanho/diffusion](https://github.com/hojonathanho/diffusion).

> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

## DDPMPipeline
[[autodoc]] DDPMPipeline
    - all
    - __call__

## ImagePipelineOutput
[[autodoc]] pipelines.ImagePipelineOutput

================================================
FILE: docs/source/en/api/pipelines/deepfloyd_if.md
================================================
# DeepFloyd IF
LoRA MPS
## Overview

DeepFloyd IF is a novel state-of-the-art open-source text-to-image model with a high degree of photorealism and language understanding.
The model is modular, composed of a frozen text encoder and three cascaded pixel diffusion modules:
- Stage 1: a base model that generates a 64x64 px image based on a text prompt,
- Stage 2: a 64x64 px => 256x256 px super-resolution model, and
- Stage 3: a 256x256 px => 1024x1024 px super-resolution model

Stage 1 and Stage 2 utilize a frozen text encoder based on the T5 transformer to extract text embeddings, which are then fed into a UNet architecture enhanced with cross-attention and attention pooling.
Stage 3 is [Stability AI's x4 Upscaling model](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler).
The result is a highly efficient model that outperforms current state-of-the-art models, achieving a zero-shot FID score of 6.66 on the COCO dataset.
Our work underscores the potential of larger UNet architectures in the first stage of cascaded diffusion models and depicts a promising future for text-to-image synthesis.

## Usage

Before you can use IF, you need to accept its usage conditions. To do so:
1. Make sure to have a [Hugging Face account](https://huggingface.co/join) and be logged in.
2. Accept the license on the model card of [DeepFloyd/IF-I-XL-v1.0](https://huggingface.co/DeepFloyd/IF-I-XL-v1.0). Accepting the license on the stage I model card automatically accepts it for the other IF models.
3. Make sure to log in locally. Install `huggingface_hub`:

```sh
pip install huggingface_hub --upgrade
```

run the login function in a Python shell:

```py
from huggingface_hub import login

login()
```

and enter your [Hugging Face Hub access token](https://huggingface.co/docs/hub/security-tokens#what-are-user-access-tokens).

Next we install `diffusers` and dependencies:

```sh
pip install -q diffusers accelerate transformers
```

The following sections give more detailed examples of how to use IF.
Specifically:

- [Text-to-Image Generation](#text-to-image-generation)
- [Image-to-Image Generation](#text-guided-image-to-image-generation)
- [Inpainting](#text-guided-inpainting-generation)
- [Reusing model weights](#converting-between-different-pipelines)
- [Speed optimization](#optimizing-for-speed)
- [Memory optimization](#optimizing-for-memory)

**Available checkpoints**
- *Stage-1*
  - [DeepFloyd/IF-I-XL-v1.0](https://huggingface.co/DeepFloyd/IF-I-XL-v1.0)
  - [DeepFloyd/IF-I-L-v1.0](https://huggingface.co/DeepFloyd/IF-I-L-v1.0)
  - [DeepFloyd/IF-I-M-v1.0](https://huggingface.co/DeepFloyd/IF-I-M-v1.0)

- *Stage-2*
  - [DeepFloyd/IF-II-L-v1.0](https://huggingface.co/DeepFloyd/IF-II-L-v1.0)
  - [DeepFloyd/IF-II-M-v1.0](https://huggingface.co/DeepFloyd/IF-II-M-v1.0)

- *Stage-3*
  - [stabilityai/stable-diffusion-x4-upscaler](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler)

**Google Colab**
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/deepfloyd_if_free_tier_google_colab.ipynb)

### Text-to-Image Generation

By default diffusers makes use of [model cpu offloading](../../optimization/memory#model-offloading) to run the whole IF pipeline with as little as 14 GB of VRAM.
```python
from diffusers import DiffusionPipeline
from diffusers.utils import pt_to_pil, make_image_grid
import torch

# stage 1
stage_1 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_1.enable_model_cpu_offload()

# stage 2
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()

# stage 3
safety_modules = {
    "feature_extractor": stage_1.feature_extractor,
    "safety_checker": stage_1.safety_checker,
    "watermarker": stage_1.watermarker,
}
stage_3 = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
)
stage_3.enable_model_cpu_offload()

prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"'
generator = torch.manual_seed(1)

# text embeds
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

# stage 1
stage_1_output = stage_1(
    prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt"
).images
#pt_to_pil(stage_1_output)[0].save("./if_stage_I.png")

# stage 2
stage_2_output = stage_2(
    image=stage_1_output,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images
#pt_to_pil(stage_2_output)[0].save("./if_stage_II.png")

# stage 3
stage_3_output = stage_3(prompt=prompt, image=stage_2_output, noise_level=100, generator=generator).images
#stage_3_output[0].save("./if_stage_III.png")

# one row with one column per stage
make_image_grid([pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0], stage_3_output[0]], rows=1, cols=3)
```

### Text Guided Image-to-Image Generation

The same IF model weights can be used for text-guided image-to-image translation or image variation.
In this case just make sure to load the weights using the [`IFImg2ImgPipeline`] and [`IFImg2ImgSuperResolutionPipeline`] pipelines.

**Note**: You can also directly move the weights of the text-to-image pipelines to the image-to-image pipelines without loading them twice by making use of the [`~DiffusionPipeline.components`] attribute as explained [here](#converting-between-different-pipelines).

```python
from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline, DiffusionPipeline
from diffusers.utils import pt_to_pil, load_image, make_image_grid
import torch

# download image
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
original_image = load_image(url)
original_image = original_image.resize((768, 512))

# stage 1
stage_1 = IFImg2ImgPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_1.enable_model_cpu_offload()

# stage 2
stage_2 = IFImg2ImgSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()

# stage 3
safety_modules = {
    "feature_extractor": stage_1.feature_extractor,
    "safety_checker": stage_1.safety_checker,
    "watermarker": stage_1.watermarker,
}
stage_3 = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
)
stage_3.enable_model_cpu_offload()

prompt = "A fantasy landscape in style minecraft"
generator = torch.manual_seed(1)

# text embeds
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

# stage 1
stage_1_output = stage_1(
    image=original_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images
#pt_to_pil(stage_1_output)[0].save("./if_stage_I.png")

# stage 2
stage_2_output = stage_2(
    image=stage_1_output,
    original_image=original_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images
#pt_to_pil(stage_2_output)[0].save("./if_stage_II.png")

# stage 3
stage_3_output = stage_3(prompt=prompt, image=stage_2_output, generator=generator, noise_level=100).images
#stage_3_output[0].save("./if_stage_III.png")

# one row: the input image followed by the output of each stage
make_image_grid([original_image, pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0], stage_3_output[0]], rows=1, cols=4)
```

### Text Guided Inpainting Generation

The same IF model weights can be used for text-guided inpainting.
In this case just make sure to load the weights using the [`IFInpaintingPipeline`] and [`IFInpaintingSuperResolutionPipeline`] pipelines.

**Note**: You can also directly move the weights of the text-to-image pipelines to the inpainting pipelines without loading them twice by making use of the [`~DiffusionPipeline.components`] attribute as explained [here](#converting-between-different-pipelines).

```python
from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline, DiffusionPipeline
from diffusers.utils import pt_to_pil, load_image, make_image_grid
import torch

# download image
url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/person.png"
original_image = load_image(url)

# download mask
url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/glasses_mask.png"
mask_image = load_image(url)

# stage 1
stage_1 = IFInpaintingPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_1.enable_model_cpu_offload()

# stage 2
stage_2 = IFInpaintingSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()

# stage 3
safety_modules = {
    "feature_extractor": stage_1.feature_extractor,
    "safety_checker": stage_1.safety_checker,
    "watermarker": stage_1.watermarker,
}
stage_3 = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
)
stage_3.enable_model_cpu_offload()

prompt = "blue sunglasses"
generator = torch.manual_seed(1)

# text embeds
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

# stage 1
stage_1_output = stage_1(
    image=original_image,
    mask_image=mask_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images
#pt_to_pil(stage_1_output)[0].save("./if_stage_I.png")

# stage 2
stage_2_output = stage_2(
    image=stage_1_output,
    original_image=original_image,
    mask_image=mask_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images
#pt_to_pil(stage_2_output)[0].save("./if_stage_II.png")

# stage 3
stage_3_output = stage_3(prompt=prompt, image=stage_2_output, generator=generator, noise_level=100).images
#stage_3_output[0].save("./if_stage_III.png")

# one row: input image, mask, then the output of each stage
make_image_grid([original_image, mask_image, pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0], stage_3_output[0]], rows=1, cols=5)
```

### Converting between different pipelines

In addition to being loaded with `from_pretrained`, pipelines can also be created directly from each other's components.

```python
from diffusers import IFPipeline, IFSuperResolutionPipeline

pipe_1 = IFPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0")
pipe_2 = IFSuperResolutionPipeline.from_pretrained("DeepFloyd/IF-II-L-v1.0")


from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline

pipe_1 = IFImg2ImgPipeline(**pipe_1.components)
pipe_2 = IFImg2ImgSuperResolutionPipeline(**pipe_2.components)


from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline

pipe_1 = IFInpaintingPipeline(**pipe_1.components)
pipe_2 = IFInpaintingSuperResolutionPipeline(**pipe_2.components)
```

### Optimizing for speed

The simplest optimization to run IF faster is to move all model components to the GPU.
```py
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.to("cuda")
```

You can also run the diffusion process for a shorter number of timesteps.

This can either be done with the `num_inference_steps` argument:

```py
pipe("", num_inference_steps=30)
```

Or with the `timesteps` argument:

```py
from diffusers.pipelines.deepfloyd_if import fast27_timesteps

pipe("", timesteps=fast27_timesteps)
```

When doing image variation or inpainting, you can also decrease the number of timesteps
with the `strength` argument. The `strength` argument is the amount of noise to add to the input image, which also determines how many steps to run in the denoising process.
A smaller number will vary the image less but run faster.

```py
from diffusers import IFImg2ImgPipeline

pipe = IFImg2ImgPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.to("cuda")

# `image` is the input image to vary, e.g. a PIL image loaded with `load_image`
image = pipe(image=image, prompt="", strength=0.3).images
```

You can also use [`torch.compile`](../../optimization/fp16#torchcompile). Note that we have not exhaustively tested `torch.compile`
with IF and it might not give expected results.

```py
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.to("cuda")

pipe.text_encoder = torch.compile(pipe.text_encoder, mode="reduce-overhead", fullgraph=True)
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
```

### Optimizing for memory

When optimizing for GPU memory, we can use the standard diffusers CPU offloading APIs.

Either the model-based CPU offloading,

```py
pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
```

or the more aggressive layer-based CPU offloading.
```py
pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.enable_sequential_cpu_offload()
```

Additionally, T5 can be loaded in 8-bit precision:

```py
from transformers import T5EncoderModel

text_encoder = T5EncoderModel.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder", device_map="auto", load_in_8bit=True, variant="8bit"
)

from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0",
    text_encoder=text_encoder,  # pass the previously instantiated 8bit text encoder
    unet=None,
    device_map="auto",
)

prompt_embeds, negative_embeds = pipe.encode_prompt("")
```

On machines with limited CPU RAM, such as the Google Colab free tier, where not all model components fit in CPU memory at once, we can manually load the pipeline with only the text encoder or only the UNet when the respective component is needed.

```py
from diffusers import DiffusionPipeline, IFPipeline, IFSuperResolutionPipeline
import torch
import gc
from transformers import T5EncoderModel
from diffusers.utils import pt_to_pil, make_image_grid

text_encoder = T5EncoderModel.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder", device_map="auto", load_in_8bit=True, variant="8bit"
)

# text to image
pipe = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0",
    text_encoder=text_encoder,  # pass the previously instantiated 8bit text encoder
    unet=None,
    device_map="auto",
)

prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"'
prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)

# Remove the pipeline so we can re-load the pipeline with the unet
del text_encoder
del pipe
gc.collect()
torch.cuda.empty_cache()

pipe = IFPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16, device_map="auto"
)

generator = torch.Generator().manual_seed(0)
stage_1_output = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    output_type="pt",
    generator=generator,
).images

#pt_to_pil(stage_1_output)[0].save("./if_stage_I.png")

# Remove the pipeline so we can load the super-resolution pipeline
del pipe
gc.collect()
torch.cuda.empty_cache()

# First super resolution
pipe = IFSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16, device_map="auto"
)

generator = torch.Generator().manual_seed(0)
stage_2_output = pipe(
    image=stage_1_output,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    output_type="pt",
    generator=generator,
).images

#pt_to_pil(stage_2_output)[0].save("./if_stage_II.png")
make_image_grid([pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0]], rows=1, cols=2)
```

## Available Pipelines:

| Pipeline | Tasks | Colab
|---|---|:---:|
| [pipeline_if.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if.py) | *Text-to-Image Generation* | - |
| [pipeline_if_superresolution.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_superresolution.py) | *Text-to-Image Generation* | - |
| [pipeline_if_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_img2img.py) | *Image-to-Image Generation* | - |
| [pipeline_if_img2img_superresolution.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_img2img_superresolution.py) | *Image-to-Image Generation* | - |
| [pipeline_if_inpainting.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_inpainting.py) | *Image-to-Image Generation* | - |
| [pipeline_if_inpainting_superresolution.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_inpainting_superresolution.py) | *Image-to-Image Generation* | - |

## IFPipeline
[[autodoc]] IFPipeline
	- all
	- __call__

## IFSuperResolutionPipeline
[[autodoc]] IFSuperResolutionPipeline
	- all
	- __call__

## IFImg2ImgPipeline
[[autodoc]] IFImg2ImgPipeline
	- all
	- __call__

## IFImg2ImgSuperResolutionPipeline
[[autodoc]] IFImg2ImgSuperResolutionPipeline
	- all
	- __call__

## IFInpaintingPipeline
[[autodoc]] IFInpaintingPipeline
	- all
	- __call__

## IFInpaintingSuperResolutionPipeline
[[autodoc]] IFInpaintingSuperResolutionPipeline
	- all
	- __call__

================================================
FILE: docs/source/en/api/pipelines/diffedit.md
================================================

> [!WARNING]
> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model.

# DiffEdit

[DiffEdit: Diffusion-based semantic image editing with mask guidance](https://huggingface.co/papers/2210.11427) is by Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord.

The abstract from the paper is:

*Image generation has recently seen tremendous advances, with diffusion models allowing to synthesize convincing images for a large variety of text prompts. In this article, we propose DiffEdit, a method to take advantage of text-conditioned diffusion models for the task of semantic image editing, where the goal is to edit an image based on a text query. Semantic image editing is an extension of image generation, with the additional constraint that the generated image should be as similar as possible to a given input image.
Current editing methods based on diffusion models usually require to provide a mask, making the task much easier by treating it as a conditional inpainting task. In contrast, our main contribution is able to automatically generate a mask highlighting regions of the input image that need to be edited, by contrasting predictions of a diffusion model conditioned on different text prompts. Moreover, we rely on latent inference to preserve content in those regions of interest and show excellent synergies with mask-based diffusion. DiffEdit achieves state-of-the-art editing performance on ImageNet. In addition, we evaluate semantic image editing in more challenging settings, using images from the COCO dataset as well as text-based generated images.* The original codebase can be found at [Xiang-cd/DiffEdit-stable-diffusion](https://github.com/Xiang-cd/DiffEdit-stable-diffusion), and you can try it out in this [demo](https://blog.problemsolversguild.com/posts/2022-11-02-diffedit-implementation.html). This pipeline was contributed by [clarencechen](https://github.com/clarencechen). ❤️ ## Tips * The pipeline can generate masks that can be fed into other inpainting pipelines. * In order to generate an image using this pipeline, both an image mask (source and target prompts can be manually specified or generated, and passed to [`~StableDiffusionDiffEditPipeline.generate_mask`]) and a set of partially inverted latents (generated using [`~StableDiffusionDiffEditPipeline.invert`]) _must_ be provided as arguments when calling the pipeline to generate the final edited image. * The function [`~StableDiffusionDiffEditPipeline.generate_mask`] exposes two prompt arguments, `source_prompt` and `target_prompt` that let you control the locations of the semantic edits in the final image to be generated. Let's say, you wanted to translate from "cat" to "dog". In this case, the edit direction will be "cat -> dog". 
To reflect this in the generated mask, you simply have to set the embeddings related to the phrases including "cat" to `source_prompt` and "dog" to `target_prompt`. * When generating partially inverted latents using `invert`, assign a caption or text embedding describing the overall image to the `prompt` argument to help guide the inverse latent sampling process. In most cases, the source concept is sufficiently descriptive to yield good results, but feel free to explore alternatives. * When calling the pipeline to generate the final edited image, assign the source concept to `negative_prompt` and the target concept to `prompt`. Taking the above example, you simply have to set the embeddings related to the phrases including "cat" to `negative_prompt` and "dog" to `prompt`. * If you wanted to reverse the direction in the example above, i.e., "dog -> cat", then it's recommended to: * Swap the `source_prompt` and `target_prompt` in the arguments to `generate_mask`. * Change the input prompt in [`~StableDiffusionDiffEditPipeline.invert`] to include "dog". * Swap the `prompt` and `negative_prompt` in the arguments to call the pipeline to generate the final edited image. * The source and target prompts, or their corresponding embeddings, can also be automatically generated. Please refer to the [DiffEdit](../../using-diffusers/diffedit) guide for more details. ## StableDiffusionDiffEditPipeline [[autodoc]] StableDiffusionDiffEditPipeline - all - generate_mask - invert - __call__ ## StableDiffusionPipelineOutput [[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/dit.md ================================================ # DiT [Scalable Diffusion Models with Transformers](https://huggingface.co/papers/2212.09748) (DiT) is by William Peebles and Saining Xie. The abstract from the paper is: *We explore a new class of diffusion models based on the transformer architecture. 
We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512x512 and 256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.*

The original codebase can be found at [facebookresearch/dit](https://github.com/facebookresearch/dit).

> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

## DiTPipeline
[[autodoc]] DiTPipeline
	- all
	- __call__

## ImagePipelineOutput
[[autodoc]] pipelines.ImagePipelineOutput

================================================
FILE: docs/source/en/api/pipelines/easyanimate.md
================================================

# EasyAnimate

[EasyAnimate](https://github.com/aigc-apps/EasyAnimate) by Alibaba PAI.

The description from its GitHub page:

*EasyAnimate is a pipeline based on the transformer architecture, designed for generating AI images and videos, and for training baseline models and Lora models for Diffusion Transformer. We support direct prediction from pre-trained EasyAnimate models, allowing for the generation of videos with various resolutions, approximately 6 seconds in length, at 8fps (EasyAnimateV5.1, 1 to 49 frames). Additionally, users can train their own baseline and Lora models for specific style transformations.*

This pipeline was contributed by [bubbliiiing](https://github.com/bubbliiiing). The original codebase can be found [here](https://huggingface.co/alibaba-pai). The original weights can be found under [hf.co/alibaba-pai](https://huggingface.co/alibaba-pai).

There are two official EasyAnimate checkpoints for text-to-video and video-to-video.

| checkpoints | recommended inference dtype |
|:---:|:---:|
| [`alibaba-pai/EasyAnimateV5.1-12b-zh`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh) | torch.float16 |
| [`alibaba-pai/EasyAnimateV5.1-12b-zh-InP`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-InP) | torch.float16 |

There is one official EasyAnimate checkpoint available for image-to-video and video-to-video.

| checkpoints | recommended inference dtype |
|:---:|:---:|
| [`alibaba-pai/EasyAnimateV5.1-12b-zh-InP`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-InP) | torch.float16 |

There are two official EasyAnimate checkpoints available for control-to-video.

| checkpoints | recommended inference dtype |
|:---:|:---:|
| [`alibaba-pai/EasyAnimateV5.1-12b-zh-Control`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-Control) | torch.float16 |
| [`alibaba-pai/EasyAnimateV5.1-12b-zh-Control-Camera`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-Control-Camera) | torch.float16 |

For the EasyAnimateV5.1 series:
- Text-to-video (T2V) and image-to-video (I2V) work at multiple resolutions. The width and height can vary from 256 to 1024.
- Both T2V and I2V models support generation with 1 to 49 frames and work best within this range. Exporting videos at 8 FPS is recommended.

## Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.
Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`EasyAnimatePipeline`] for inference with bitsandbytes. ```py import torch from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, EasyAnimateTransformer3DModel, EasyAnimatePipeline from diffusers.utils import export_to_video quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True) transformer_8bit = EasyAnimateTransformer3DModel.from_pretrained( "alibaba-pai/EasyAnimateV5.1-12b-zh", subfolder="transformer", quantization_config=quant_config, torch_dtype=torch.float16, ) pipeline = EasyAnimatePipeline.from_pretrained( "alibaba-pai/EasyAnimateV5.1-12b-zh", transformer=transformer_8bit, torch_dtype=torch.float16, device_map="balanced", ) prompt = "A cat walks on the grass, realistic style." negative_prompt = "bad detailed" video = pipeline(prompt=prompt, negative_prompt=negative_prompt, num_frames=49, num_inference_steps=30).frames[0] export_to_video(video, "cat.mp4", fps=8) ``` ## EasyAnimatePipeline [[autodoc]] EasyAnimatePipeline - all - __call__ ## EasyAnimatePipelineOutput [[autodoc]] pipelines.easyanimate.pipeline_output.EasyAnimatePipelineOutput ================================================ FILE: docs/source/en/api/pipelines/flux.md ================================================ # Flux
Flux is a series of text-to-image generation models based on diffusion transformers. To know more about Flux, check out the original [blog post](https://blackforestlabs.ai/announcing-black-forest-labs/) by the creators of Flux, Black Forest Labs. Original model checkpoints for Flux can be found [here](https://huggingface.co/black-forest-labs). Original inference code can be found [here](https://github.com/black-forest-labs/flux). > [!TIP] > Flux can be quite expensive to run on consumer hardware devices. However, you can perform a suite of optimizations to run it faster and in a more memory-friendly manner. Check out [this section](https://huggingface.co/blog/sd3#memory-optimizations-for-sd3) for more details. Additionally, Flux can benefit from quantization for memory efficiency with a trade-off in inference latency. Refer to [this blog post](https://huggingface.co/blog/quanto-diffusers) to learn more. For an exhaustive list of resources, check out [this gist](https://gist.github.com/sayakpaul/b664605caf0aa3bf8585ab109dd5ac9c). > > [Caching](../../optimization/cache) may also speed up inference by storing and reusing intermediate outputs. 
Flux comes in the following variants:

| model type | model id |
|:----------:|:--------:|
| Timestep-distilled | [`black-forest-labs/FLUX.1-schnell`](https://huggingface.co/black-forest-labs/FLUX.1-schnell) |
| Guidance-distilled | [`black-forest-labs/FLUX.1-dev`](https://huggingface.co/black-forest-labs/FLUX.1-dev) |
| Fill Inpainting/Outpainting (Guidance-distilled) | [`black-forest-labs/FLUX.1-Fill-dev`](https://huggingface.co/black-forest-labs/FLUX.1-Fill-dev) |
| Canny Control (Guidance-distilled) | [`black-forest-labs/FLUX.1-Canny-dev`](https://huggingface.co/black-forest-labs/FLUX.1-Canny-dev) |
| Depth Control (Guidance-distilled) | [`black-forest-labs/FLUX.1-Depth-dev`](https://huggingface.co/black-forest-labs/FLUX.1-Depth-dev) |
| Canny Control (LoRA) | [`black-forest-labs/FLUX.1-Canny-dev-lora`](https://huggingface.co/black-forest-labs/FLUX.1-Canny-dev-lora) |
| Depth Control (LoRA) | [`black-forest-labs/FLUX.1-Depth-dev-lora`](https://huggingface.co/black-forest-labs/FLUX.1-Depth-dev-lora) |
| Redux (Adapter) | [`black-forest-labs/FLUX.1-Redux-dev`](https://huggingface.co/black-forest-labs/FLUX.1-Redux-dev) |
| Kontext | [`black-forest-labs/FLUX.1-Kontext-dev`](https://huggingface.co/black-forest-labs/FLUX.1-Kontext-dev) |

Each checkpoint is used differently, as detailed below.

### Timestep-distilled

* `max_sequence_length` cannot be more than 256.
* `guidance_scale` needs to be 0.
* As this is a timestep-distilled model, it benefits from fewer sampling steps.
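
The two hard constraints listed for the timestep-distilled checkpoint can be checked before calling the pipeline. The helper below is only a minimal sketch: `check_schnell_kwargs` is a hypothetical name that is not part of Diffusers, and the limits it enforces are just the two rules stated above.

```python
def check_schnell_kwargs(guidance_scale: float, max_sequence_length: int) -> None:
    """Hypothetical pre-flight check for FLUX.1-schnell call arguments (not a Diffusers API)."""
    # Rule from the list above: guidance_scale needs to be 0.
    if guidance_scale != 0:
        raise ValueError(f"guidance_scale must be 0 for FLUX.1-schnell, got {guidance_scale}")
    # Rule from the list above: max_sequence_length cannot be more than 256.
    if max_sequence_length > 256:
        raise ValueError(f"max_sequence_length must be <= 256, got {max_sequence_length}")


# Arguments intended for a schnell pipeline call.
call_kwargs = {"guidance_scale": 0.0, "max_sequence_length": 256, "num_inference_steps": 4}

# Raises ValueError if one of the two constraints is violated; returns None otherwise.
check_schnell_kwargs(call_kwargs["guidance_scale"], call_kwargs["max_sequence_length"])
```

Failing early in plain Python avoids loading the checkpoint only to discover an unsupported argument combination.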
```python import torch from diffusers import FluxPipeline pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16) pipe.enable_model_cpu_offload() prompt = "A cat holding a sign that says hello world" out = pipe( prompt=prompt, guidance_scale=0., height=768, width=1360, num_inference_steps=4, max_sequence_length=256, ).images[0] out.save("image.png") ``` ### Guidance-distilled * The guidance-distilled variant takes about 50 sampling steps for good-quality generation. * It doesn't have any limitations around the `max_sequence_length`. ```python import torch from diffusers import FluxPipeline pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16) pipe.enable_model_cpu_offload() prompt = "a tiny astronaut hatching from an egg on the moon" out = pipe( prompt=prompt, guidance_scale=3.5, height=768, width=1360, num_inference_steps=50, ).images[0] out.save("image.png") ``` ### Fill Inpainting/Outpainting * Flux Fill pipeline does not require `strength` as an input like regular inpainting pipelines. * It supports both inpainting and outpainting. ```python import torch from diffusers import FluxFillPipeline from diffusers.utils import load_image image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/cup.png") mask = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/cup_mask.png") repo_id = "black-forest-labs/FLUX.1-Fill-dev" pipe = FluxFillPipeline.from_pretrained(repo_id, torch_dtype=torch.bfloat16).to("cuda") image = pipe( prompt="a white paper cup", image=image, mask_image=mask, height=1632, width=1232, max_sequence_length=512, generator=torch.Generator("cpu").manual_seed(0) ).images[0] image.save(f"output.png") ``` ### Canny Control **Note:** `black-forest-labs/Flux.1-Canny-dev` is _not_ a [`ControlNetModel`] model. 
ControlNet models are a separate component from the UNet/Transformer whose residuals are added to the actual underlying model. Canny Control is an alternate architecture that achieves effectively the same results as a ControlNet model would, by using channel-wise concatenation with input control condition and ensuring the transformer learns structure control by following the condition as closely as possible. ```python # !pip install -U controlnet-aux import torch from controlnet_aux import CannyDetector from diffusers import FluxControlPipeline from diffusers.utils import load_image pipe = FluxControlPipeline.from_pretrained("black-forest-labs/FLUX.1-Canny-dev", torch_dtype=torch.bfloat16).to("cuda") prompt = "A robot made of exotic candies and chocolates of different kinds. The background is filled with confetti and celebratory gifts." control_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/robot.png") processor = CannyDetector() control_image = processor(control_image, low_threshold=50, high_threshold=200, detect_resolution=1024, image_resolution=1024) image = pipe( prompt=prompt, control_image=control_image, height=1024, width=1024, num_inference_steps=50, guidance_scale=30.0, ).images[0] image.save("output.png") ``` Canny Control is also possible with a LoRA variant of this condition. The usage is as follows: ```python # !pip install -U controlnet-aux import torch from controlnet_aux import CannyDetector from diffusers import FluxControlPipeline from diffusers.utils import load_image pipe = FluxControlPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16).to("cuda") pipe.load_lora_weights("black-forest-labs/FLUX.1-Canny-dev-lora") prompt = "A robot made of exotic candies and chocolates of different kinds. The background is filled with confetti and celebratory gifts." 
control_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/robot.png") processor = CannyDetector() control_image = processor(control_image, low_threshold=50, high_threshold=200, detect_resolution=1024, image_resolution=1024) image = pipe( prompt=prompt, control_image=control_image, height=1024, width=1024, num_inference_steps=50, guidance_scale=30.0, ).images[0] image.save("output.png") ``` ### Depth Control **Note:** `black-forest-labs/Flux.1-Depth-dev` is _not_ a ControlNet model. [`ControlNetModel`] models are a separate component from the UNet/Transformer whose residuals are added to the actual underlying model. Depth Control is an alternate architecture that achieves effectively the same results as a ControlNet model would, by using channel-wise concatenation with input control condition and ensuring the transformer learns structure control by following the condition as closely as possible. ```python # !pip install git+https://github.com/huggingface/image_gen_aux import torch from diffusers import FluxControlPipeline, FluxTransformer2DModel from diffusers.utils import load_image from image_gen_aux import DepthPreprocessor pipe = FluxControlPipeline.from_pretrained("black-forest-labs/FLUX.1-Depth-dev", torch_dtype=torch.bfloat16).to("cuda") prompt = "A robot made of exotic candies and chocolates of different kinds. The background is filled with confetti and celebratory gifts." control_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/robot.png") processor = DepthPreprocessor.from_pretrained("LiheYoung/depth-anything-large-hf") control_image = processor(control_image)[0].convert("RGB") image = pipe( prompt=prompt, control_image=control_image, height=1024, width=1024, num_inference_steps=30, guidance_scale=10.0, generator=torch.Generator().manual_seed(42), ).images[0] image.save("output.png") ``` Depth Control is also possible with a LoRA variant of this condition. 
The usage is as follows:

```python
# !pip install git+https://github.com/huggingface/image_gen_aux
import torch
from diffusers import FluxControlPipeline, FluxTransformer2DModel
from diffusers.utils import load_image
from image_gen_aux import DepthPreprocessor

pipe = FluxControlPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16).to("cuda")
pipe.load_lora_weights("black-forest-labs/FLUX.1-Depth-dev-lora")

prompt = "A robot made of exotic candies and chocolates of different kinds. The background is filled with confetti and celebratory gifts."
control_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/robot.png")

processor = DepthPreprocessor.from_pretrained("LiheYoung/depth-anything-large-hf")
control_image = processor(control_image)[0].convert("RGB")

image = pipe(
    prompt=prompt,
    control_image=control_image,
    height=1024,
    width=1024,
    num_inference_steps=30,
    guidance_scale=10.0,
    generator=torch.Generator().manual_seed(42),
).images[0]
image.save("output.png")
```

### Redux

* Flux Redux pipeline is an adapter for FLUX.1 base models. It can be used with both flux-dev and flux-schnell, for image-to-image generation.
* You can first use the `FluxPriorReduxPipeline` to get the `prompt_embeds` and `pooled_prompt_embeds`, and then feed them into the `FluxPipeline` for image-to-image generation.
* When using `FluxPriorReduxPipeline` with a base pipeline, you can set `text_encoder=None` and `text_encoder_2=None` in the base pipeline to save VRAM.

```python
import torch
from diffusers import FluxPriorReduxPipeline, FluxPipeline
from diffusers.utils import load_image

device = "cuda"
dtype = torch.bfloat16

repo_redux = "black-forest-labs/FLUX.1-Redux-dev"
repo_base = "black-forest-labs/FLUX.1-dev"
pipe_prior_redux = FluxPriorReduxPipeline.from_pretrained(repo_redux, torch_dtype=dtype).to(device)
pipe = FluxPipeline.from_pretrained(
    repo_base,
    text_encoder=None,
    text_encoder_2=None,
    torch_dtype=torch.bfloat16
).to(device)

image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/style_ziggy/img5.png")
pipe_prior_output = pipe_prior_redux(image)
images = pipe(
    guidance_scale=2.5,
    num_inference_steps=50,
    generator=torch.Generator("cpu").manual_seed(0),
    **pipe_prior_output,
).images
images[0].save("flux-redux.png")
```

### Kontext

Flux Kontext is a model that allows in-context control of the image generation process, allowing for editing, refinement, relighting, style transfer, character customization, and more.

```python
import torch
from diffusers import FluxKontextPipeline
from diffusers.utils import load_image

pipe = FluxKontextPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/yarn-art-pikachu.png").convert("RGB")
prompt = "Make Pikachu hold a sign that says 'Black Forest Labs is awesome', yarn art style, detailed, vibrant colors"
image = pipe(
    image=image,
    prompt=prompt,
    guidance_scale=2.5,
    generator=torch.Generator().manual_seed(42),
).images[0]
image.save("flux-kontext.png")
```

Flux Kontext comes with an integrity safety checker, which should be run after the image generation step. To run the safety checker, install the official repository from [black-forest-labs/flux](https://github.com/black-forest-labs/flux) and add the following code:

```python
import numpy as np
import torch
from flux.content_filters import PixtralContentFilter

# ... pipeline invocation to generate images

integrity_checker = PixtralContentFilter(torch.device("cuda"))
# scale the PIL image to [-1, 1] and reorder it to (batch, channels, height, width)
image_ = np.array(image) / 255.0
image_ = 2 * image_ - 1
image_ = torch.from_numpy(image_).to("cuda", dtype=torch.float32).unsqueeze(0).permute(0, 3, 1, 2)
if integrity_checker.test_image(image_):
    raise ValueError("Your image has been flagged. Choose another prompt/image or try again.")
```

### Kontext Inpainting

`FluxKontextInpaintPipeline` enables image modification within a fixed mask region. It currently supports both text-based conditioning and image-reference conditioning.

```python
import torch
from diffusers import FluxKontextInpaintPipeline
from diffusers.utils import load_image

prompt = "Change the yellow dinosaur to green one"
img_url = (
    "https://github.com/ZenAI-Vietnam/Flux-Kontext-pipelines/blob/main/assets/dinosaur_input.jpeg?raw=true"
)
mask_url = (
    "https://github.com/ZenAI-Vietnam/Flux-Kontext-pipelines/blob/main/assets/dinosaur_mask.png?raw=true"
)

source = load_image(img_url)
mask = load_image(mask_url)

pipe = FluxKontextInpaintPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

image = pipe(prompt=prompt, image=source, mask_image=mask, strength=1.0).images[0]
image.save("kontext_inpainting_normal.png")
```

```python
import torch
from diffusers import FluxKontextInpaintPipeline
from diffusers.utils import load_image

pipe = FluxKontextInpaintPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt = "Replace this ball"
img_url = "https://images.pexels.com/photos/39362/the-ball-stadion-football-the-pitch-39362.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500"
mask_url = "https://github.com/ZenAI-Vietnam/Flux-Kontext-pipelines/blob/main/assets/ball_mask.png?raw=true"
image_reference_url = "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTah3x6OL_ECMBaZ5ZlJJhNsyC-OSMLWAI-xw&s"

source = load_image(img_url)
mask = load_image(mask_url)
image_reference = load_image(image_reference_url)

mask = pipe.mask_processor.blur(mask, blur_factor=12)
image = pipe(
    prompt=prompt, image=source, mask_image=mask, image_reference=image_reference, strength=1.0
).images[0]
image.save("kontext_inpainting_ref.png")
```

## Combining Flux Turbo LoRAs with Flux Control, Fill, and Redux

We can combine Flux Turbo LoRAs with Flux Control and other pipelines like Fill and Redux to enable few-step inference. The example below shows how to do that for Flux Control LoRA for depth and turbo LoRA from [`ByteDance/Hyper-SD`](https://hf.co/ByteDance/Hyper-SD).

```py
from diffusers import FluxControlPipeline
from image_gen_aux import DepthPreprocessor
from diffusers.utils import load_image
from huggingface_hub import hf_hub_download
import torch

control_pipe = FluxControlPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
control_pipe.load_lora_weights("black-forest-labs/FLUX.1-Depth-dev-lora", adapter_name="depth")
control_pipe.load_lora_weights(
    hf_hub_download("ByteDance/Hyper-SD", "Hyper-FLUX.1-dev-8steps-lora.safetensors"), adapter_name="hyper-sd"
)
control_pipe.set_adapters(["depth", "hyper-sd"], adapter_weights=[0.85, 0.125])
control_pipe.enable_model_cpu_offload()

prompt = "A robot made of exotic candies and chocolates of different kinds. The background is filled with confetti and celebratory gifts."
control_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/robot.png")

processor = DepthPreprocessor.from_pretrained("LiheYoung/depth-anything-large-hf")
control_image = processor(control_image)[0].convert("RGB")

image = control_pipe(
    prompt=prompt,
    control_image=control_image,
    height=1024,
    width=1024,
    num_inference_steps=8,
    guidance_scale=10.0,
    generator=torch.Generator().manual_seed(42),
).images[0]
image.save("output.png")
```

## Note about `unload_lora_weights()` when using Flux LoRAs

When unloading the Control LoRA weights, call `pipe.unload_lora_weights(reset_to_overwritten_params=True)` to reset the `pipe.transformer` completely back to its original form. The resultant pipeline can then be used with methods like [`DiffusionPipeline.from_pipe`]. More details about this argument are available in [this PR](https://github.com/huggingface/diffusers/pull/10397).

## IP-Adapter

> [!TIP]
> Check out [IP-Adapter](../../using-diffusers/ip_adapter) to learn more about how IP-Adapters work.

An IP-Adapter lets you prompt Flux with images, in addition to the text prompt. This is especially useful when describing complex concepts that are difficult to articulate through text alone and you have reference images.
```python import torch from diffusers import FluxPipeline from diffusers.utils import load_image pipe = FluxPipeline.from_pretrained( "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16 ).to("cuda") image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flux_ip_adapter_input.jpg").resize((1024, 1024)) pipe.load_ip_adapter( "XLabs-AI/flux-ip-adapter", weight_name="ip_adapter.safetensors", image_encoder_pretrained_model_name_or_path="openai/clip-vit-large-patch14" ) pipe.set_ip_adapter_scale(1.0) image = pipe( width=1024, height=1024, prompt="wearing sunglasses", negative_prompt="", true_cfg_scale=4.0, generator=torch.Generator().manual_seed(4444), ip_adapter_image=image, ).images[0] image.save('flux_ip_adapter_output.jpg') ```
IP-Adapter examples with prompt "wearing sunglasses"
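
The call above passes `true_cfg_scale=4.0` together with an (empty) `negative_prompt`, which turns on true classifier-free guidance alongside the image prompt. The sketch below illustrates how such a scale is typically applied, assuming the standard classifier-free guidance formula. The helper `apply_true_cfg` is hypothetical and not part of Diffusers, and it works on plain Python lists rather than tensors.

```python
def apply_true_cfg(cond, uncond, true_cfg_scale):
    """Blend conditional and unconditional predictions.

    Assumes the standard classifier-free guidance formula:
    guided = uncond + scale * (cond - uncond).
    """
    return [u + true_cfg_scale * (c - u) for c, u in zip(cond, uncond)]


# Toy stand-ins for the model outputs of the positive and negative prompts.
cond = [0.5, -1.0, 2.0]
uncond = [0.25, -0.5, 1.0]

# A scale of 1.0 returns the conditional prediction unchanged;
# larger values push the result further away from the unconditional one.
print(apply_true_cfg(cond, uncond, 1.0))
print(apply_true_cfg(cond, uncond, 4.0))
```

Under this assumption, a larger `true_cfg_scale` strengthens the influence of the prompt (and of the IP-Adapter image) relative to the negative prompt, at the cost of an extra forward pass per step.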
## Optimize Flux is a very large model and requires ~50GB of RAM/VRAM to load all the modeling components. Enable some of the optimizations below to lower the memory requirements. ### Group offloading [Group offloading](../../optimization/memory#group-offloading) lowers VRAM usage by offloading groups of internal layers rather than the whole model or weights. You need to use [`~hooks.apply_group_offloading`] on all the model components of a pipeline. The `offload_type` parameter allows you to toggle between block and leaf-level offloading. Setting it to `leaf_level` offloads the lowest leaf-level parameters to the CPU instead of offloading at the module-level. On CUDA devices that support asynchronous data streaming, set `use_stream=True` to overlap data transfer and computation to accelerate inference. > [!TIP] > It is possible to mix block and leaf-level offloading for different components in a pipeline. ```py import torch from diffusers import FluxPipeline from diffusers.hooks import apply_group_offloading model_id = "black-forest-labs/FLUX.1-dev" dtype = torch.bfloat16 pipe = FluxPipeline.from_pretrained( model_id, torch_dtype=dtype, ) apply_group_offloading( pipe.transformer, offload_type="leaf_level", offload_device=torch.device("cpu"), onload_device=torch.device("cuda"), use_stream=True, ) apply_group_offloading( pipe.text_encoder, offload_device=torch.device("cpu"), onload_device=torch.device("cuda"), offload_type="leaf_level", use_stream=True, ) apply_group_offloading( pipe.text_encoder_2, offload_device=torch.device("cpu"), onload_device=torch.device("cuda"), offload_type="leaf_level", use_stream=True, ) apply_group_offloading( pipe.vae, offload_device=torch.device("cpu"), onload_device=torch.device("cuda"), offload_type="leaf_level", use_stream=True, ) prompt="A cat wearing sunglasses and working as a lifeguard at pool." 
generator = torch.Generator().manual_seed(181201)
image = pipe(
    prompt,
    width=576,
    height=1024,
    num_inference_steps=30,
    generator=generator
).images[0]
image
```

### Running FP16 inference

Flux can generate high-quality images with FP16 (e.g. to accelerate inference on Turing/Volta GPUs) but produces different outputs compared to FP32/BF16. The issue is that some activations in the text encoders have to be clipped when running in FP16, which affects the overall image. Forcing the text encoders to run in FP32 removes this output difference. See [here](https://github.com/huggingface/diffusers/pull/9097#issuecomment-2272292516) for details.

FP16 inference code:

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16) # can replace schnell with dev
# to run on low vram GPUs (i.e. between 4 and 32 GB VRAM)
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

pipe.to(torch.float16) # casting here instead of in the pipeline constructor because doing so in the constructor loads all models into CPU memory at once

prompt = "A cat holding a sign that says hello world"
out = pipe(
    prompt=prompt,
    guidance_scale=0.,
    height=768,
    width=1360,
    num_inference_steps=4,
    max_sequence_length=256,
).images[0]
out.save("image.png")
```

### Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on image quality depending on the model.

Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`FluxPipeline`] for inference with bitsandbytes.
```py import torch from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, FluxTransformer2DModel, FluxPipeline from transformers import BitsAndBytesConfig as BitsAndBytesConfig, T5EncoderModel quant_config = BitsAndBytesConfig(load_in_8bit=True) text_encoder_8bit = T5EncoderModel.from_pretrained( "black-forest-labs/FLUX.1-dev", subfolder="text_encoder_2", quantization_config=quant_config, torch_dtype=torch.float16, ) quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True) transformer_8bit = FluxTransformer2DModel.from_pretrained( "black-forest-labs/FLUX.1-dev", subfolder="transformer", quantization_config=quant_config, torch_dtype=torch.float16, ) pipeline = FluxPipeline.from_pretrained( "black-forest-labs/FLUX.1-dev", text_encoder_2=text_encoder_8bit, transformer=transformer_8bit, torch_dtype=torch.float16, device_map="balanced", ) prompt = "a tiny astronaut hatching from an egg on the moon" image = pipeline(prompt, guidance_scale=3.5, height=768, width=1360, num_inference_steps=50).images[0] image.save("flux.png") ``` ## Single File Loading for the `FluxTransformer2DModel` The `FluxTransformer2DModel` supports loading checkpoints in the original format shipped by Black Forest Labs. This is also useful when trying to load finetunes or quantized versions of the models that have been published by the community. > [!TIP] > `FP8` inference can be brittle depending on the GPU type, CUDA version, and `torch` version that you are using. It is recommended that you use the `optimum-quanto` library in order to run FP8 inference on your machine. The following example demonstrates how to run Flux with less than 16GB of VRAM. 
First install `optimum-quanto` ```shell pip install optimum-quanto ``` Then run the following example ```python import torch from diffusers import FluxTransformer2DModel, FluxPipeline from transformers import T5EncoderModel, CLIPTextModel from optimum.quanto import freeze, qfloat8, quantize bfl_repo = "black-forest-labs/FLUX.1-dev" dtype = torch.bfloat16 transformer = FluxTransformer2DModel.from_single_file("https://huggingface.co/Kijai/flux-fp8/blob/main/flux1-dev-fp8.safetensors", torch_dtype=dtype) quantize(transformer, weights=qfloat8) freeze(transformer) text_encoder_2 = T5EncoderModel.from_pretrained(bfl_repo, subfolder="text_encoder_2", torch_dtype=dtype) quantize(text_encoder_2, weights=qfloat8) freeze(text_encoder_2) pipe = FluxPipeline.from_pretrained(bfl_repo, transformer=None, text_encoder_2=None, torch_dtype=dtype) pipe.transformer = transformer pipe.text_encoder_2 = text_encoder_2 pipe.enable_model_cpu_offload() prompt = "A cat holding a sign that says hello world" image = pipe( prompt, guidance_scale=3.5, output_type="pil", num_inference_steps=20, generator=torch.Generator("cpu").manual_seed(0) ).images[0] image.save("flux-fp8-dev.png") ``` ## FluxPipeline [[autodoc]] FluxPipeline - all - __call__ ## FluxImg2ImgPipeline [[autodoc]] FluxImg2ImgPipeline - all - __call__ ## FluxInpaintPipeline [[autodoc]] FluxInpaintPipeline - all - __call__ ## FluxControlNetInpaintPipeline [[autodoc]] FluxControlNetInpaintPipeline - all - __call__ ## FluxControlNetImg2ImgPipeline [[autodoc]] FluxControlNetImg2ImgPipeline - all - __call__ ## FluxControlPipeline [[autodoc]] FluxControlPipeline - all - __call__ ## FluxControlImg2ImgPipeline [[autodoc]] FluxControlImg2ImgPipeline - all - __call__ ## FluxPriorReduxPipeline [[autodoc]] FluxPriorReduxPipeline - all - __call__ ## FluxFillPipeline [[autodoc]] FluxFillPipeline - all - __call__ ## FluxKontextPipeline [[autodoc]] FluxKontextPipeline - all - __call__ ## FluxKontextInpaintPipeline [[autodoc]] 
FluxKontextInpaintPipeline - all - __call__ ================================================ FILE: docs/source/en/api/pipelines/flux2.md ================================================ # Flux2
Flux.2 is the most recent series of image generation models from Black Forest Labs, preceded by the [Flux.1](./flux.md) series. It is an entirely new model with a new architecture and pre-training done from scratch!

Original model checkpoints for Flux.2 can be found [here](https://huggingface.co/black-forest-labs). Original inference code can be found [here](https://github.com/black-forest-labs/flux2).

> [!TIP]
> Flux2 can be quite expensive to run on consumer hardware devices. However, you can perform a suite of optimizations to run it faster and in a more memory-friendly manner. Check out [this section](https://huggingface.co/blog/sd3#memory-optimizations-for-sd3) for more details. Additionally, Flux can benefit from quantization for memory efficiency with a trade-off in inference latency. Refer to [this blog post](https://huggingface.co/blog/quanto-diffusers) to learn more.
>
> [Caching](../../optimization/cache) may also speed up inference by storing and reusing intermediate outputs.

## Caption upsampling

Flux.2 can generate better outputs with better prompts. You can "upsample" an input prompt by setting the `caption_upsample_temperature` argument in the pipeline call. The [official implementation](https://github.com/black-forest-labs/flux2/blob/5a5d316b1b42f6b59a8c9194b77c8256be848432/src/flux2/text_encoder.py#L140) recommends a value of 0.15.

## Flux2Pipeline

[[autodoc]] Flux2Pipeline
  - all
  - __call__

## Flux2KleinPipeline

[[autodoc]] Flux2KleinPipeline
  - all
  - __call__

## Flux2KleinKVPipeline

[[autodoc]] Flux2KleinKVPipeline
  - all
  - __call__

================================================
FILE: docs/source/en/api/pipelines/framepack.md
================================================
# Framepack
[Packing Input Frame Context in Next-Frame Prediction Models for Video Generation](https://huggingface.co/papers/2504.12626) by Lvmin Zhang and Maneesh Agrawala.

*We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. The FramePack compresses input frames to make the transformer context length a fixed number regardless of the video length. As a result, we are able to process a large number of frames using video diffusion with computation bottleneck similar to image diffusion. This also makes the training video batch sizes significantly higher (batch sizes become comparable to image diffusion training). We also propose an anti-drifting sampling method that generates frames in inverted temporal order with early-established endpoints to avoid exposure bias (error accumulation over iterations). Finally, we show that existing video diffusion models can be finetuned with FramePack, and their visual quality may be improved because the next-frame prediction supports more balanced diffusion schedulers with less extreme flow shift timesteps.*

> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

## Available models

| Model name | Description |
|:---|:---|
| [`lllyasviel/FramePackI2V_HY`](https://huggingface.co/lllyasviel/FramePackI2V_HY) | Trained with the "inverted anti-drifting" strategy as described in the paper. Inference requires setting `sampling_type="inverted_anti_drifting"` when running the pipeline. |
| [`lllyasviel/FramePack_F1_I2V_HY_20250503`](https://huggingface.co/lllyasviel/FramePack_F1_I2V_HY_20250503) | Trained with a novel anti-drifting strategy, but inference uses the "vanilla" strategy as described in the paper. Inference requires setting `sampling_type="vanilla"` when running the pipeline. |

## Usage

Refer to the pipeline documentation for basic usage examples. The following section contains examples of offloading, different sampling methods, quantization, and more.

### First and last frame to video

The following example shows how to use Framepack with start and end image controls, using the model trained with inverted anti-drifting sampling.

```python
import torch
from diffusers import HunyuanVideoFramepackPipeline, HunyuanVideoFramepackTransformer3DModel
from diffusers.utils import export_to_video, load_image
from transformers import SiglipImageProcessor, SiglipVisionModel

transformer = HunyuanVideoFramepackTransformer3DModel.from_pretrained(
    "lllyasviel/FramePackI2V_HY", torch_dtype=torch.bfloat16
)
feature_extractor = SiglipImageProcessor.from_pretrained(
    "lllyasviel/flux_redux_bfl", subfolder="feature_extractor"
)
image_encoder = SiglipVisionModel.from_pretrained(
    "lllyasviel/flux_redux_bfl", subfolder="image_encoder", torch_dtype=torch.float16
)
pipe = HunyuanVideoFramepackPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    transformer=transformer,
    feature_extractor=feature_extractor,
    image_encoder=image_encoder,
    torch_dtype=torch.float16,
)

# Enable memory optimizations
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
first_image = load_image( "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png" ) last_image = load_image( "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png" ) output = pipe( image=first_image, last_image=last_image, prompt=prompt, height=512, width=512, num_frames=91, num_inference_steps=30, guidance_scale=9.0, generator=torch.Generator().manual_seed(0), sampling_type="inverted_anti_drifting", ).frames[0] export_to_video(output, "output.mp4", fps=30) ``` ### Vanilla sampling The following example shows how to use Framepack with the F1 model trained with vanilla sampling but new regulation approach for anti-drifting. ```python import torch from diffusers import HunyuanVideoFramepackPipeline, HunyuanVideoFramepackTransformer3DModel from diffusers.utils import export_to_video, load_image from transformers import SiglipImageProcessor, SiglipVisionModel transformer = HunyuanVideoFramepackTransformer3DModel.from_pretrained( "lllyasviel/FramePack_F1_I2V_HY_20250503", torch_dtype=torch.bfloat16 ) feature_extractor = SiglipImageProcessor.from_pretrained( "lllyasviel/flux_redux_bfl", subfolder="feature_extractor" ) image_encoder = SiglipVisionModel.from_pretrained( "lllyasviel/flux_redux_bfl", subfolder="image_encoder", torch_dtype=torch.float16 ) pipe = HunyuanVideoFramepackPipeline.from_pretrained( "hunyuanvideo-community/HunyuanVideo", transformer=transformer, feature_extractor=feature_extractor, image_encoder=image_encoder, torch_dtype=torch.float16, ) # Enable memory optimizations pipe.enable_model_cpu_offload() pipe.vae.enable_tiling() image = load_image( "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/penguin.png" ) output = pipe( image=image, prompt="A penguin dancing in the snow", height=832, width=480, num_frames=91, num_inference_steps=30, guidance_scale=9.0, 
generator=torch.Generator().manual_seed(0), sampling_type="vanilla", ).frames[0] export_to_video(output, "output.mp4", fps=30) ``` ### Group offloading Group offloading ([`~hooks.apply_group_offloading`]) provides aggressive memory optimizations for offloading internal parts of any model to the CPU, with possibly no additional overhead to generation time. If you have very low VRAM available, this approach may be suitable for you depending on the amount of CPU RAM available. ```python import torch from diffusers import HunyuanVideoFramepackPipeline, HunyuanVideoFramepackTransformer3DModel from diffusers.hooks import apply_group_offloading from diffusers.utils import export_to_video, load_image from transformers import SiglipImageProcessor, SiglipVisionModel transformer = HunyuanVideoFramepackTransformer3DModel.from_pretrained( "lllyasviel/FramePack_F1_I2V_HY_20250503", torch_dtype=torch.bfloat16 ) feature_extractor = SiglipImageProcessor.from_pretrained( "lllyasviel/flux_redux_bfl", subfolder="feature_extractor" ) image_encoder = SiglipVisionModel.from_pretrained( "lllyasviel/flux_redux_bfl", subfolder="image_encoder", torch_dtype=torch.float16 ) pipe = HunyuanVideoFramepackPipeline.from_pretrained( "hunyuanvideo-community/HunyuanVideo", transformer=transformer, feature_extractor=feature_extractor, image_encoder=image_encoder, torch_dtype=torch.float16, ) # Enable group offloading onload_device = torch.device("cuda") offload_device = torch.device("cpu") list(map( lambda x: apply_group_offloading(x, onload_device, offload_device, offload_type="leaf_level", use_stream=True, low_cpu_mem_usage=True), [pipe.text_encoder, pipe.text_encoder_2, pipe.transformer] )) pipe.image_encoder.to(onload_device) pipe.vae.to(onload_device) pipe.vae.enable_tiling() image = load_image( "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/penguin.png" ) output = pipe( image=image, prompt="A penguin dancing in the snow", height=832, width=480, 
    num_frames=91,
    num_inference_steps=30,
    guidance_scale=9.0,
    generator=torch.Generator().manual_seed(0),
    sampling_type="vanilla",
).frames[0]
print(f"Max memory: {torch.cuda.max_memory_allocated() / 1024**3:.3f} GB")
export_to_video(output, "output.mp4", fps=30)
```

## HunyuanVideoFramepackPipeline

[[autodoc]] HunyuanVideoFramepackPipeline
  - all
  - __call__

## HunyuanVideoPipelineOutput

[[autodoc]] pipelines.hunyuan_video.pipeline_output.HunyuanVideoPipelineOutput

================================================
FILE: docs/source/en/api/pipelines/glm_image.md
================================================
# GLM-Image

## Overview

GLM-Image is an image generation model that adopts a hybrid autoregressive + diffusion decoder architecture, effectively pushing the upper bound of visual fidelity and fine-grained details. In general image generation quality, it aligns with industry-standard LDM-based approaches, while demonstrating significant advantages in knowledge-intensive image generation scenarios.

Model architecture: a hybrid autoregressive + diffusion decoder design.

+ Autoregressive generator: a 9B-parameter model initialized from [GLM-4-9B-0414](https://huggingface.co/zai-org/GLM-4-9B-0414), with an expanded vocabulary to incorporate visual tokens. The model first generates a compact encoding of approximately 256 tokens, then expands to 1K–4K tokens, corresponding to 1K–2K high-resolution image outputs. The AR model is available as the `GlmImageForConditionalGeneration` class in the `transformers` library.
+ Diffusion Decoder: a 7B-parameter decoder based on a single-stream DiT architecture for latent-space image decoding. It is equipped with a Glyph Encoder text module, significantly improving accurate text rendering within images.

Post-training with decoupled reinforcement learning: the model introduces a fine-grained, modular feedback strategy using the GRPO algorithm, substantially enhancing both semantic understanding and visual detail quality.
+ Autoregressive module: provides low-frequency feedback signals focused on aesthetics and semantic alignment, improving instruction following and artistic expressiveness. + Decoder module: delivers high-frequency feedback targeting detail fidelity and text accuracy, resulting in highly realistic textures, lighting, and color reproduction, as well as more precise text rendering. GLM-Image supports both text-to-image and image-to-image generation within a single model + Text-to-image: generates high-detail images from textual descriptions, with particularly strong performance in information-dense scenarios. + Image-to-image: supports a wide range of tasks, including image editing, style transfer, multi-subject consistency, and identity-preserving generation for people and objects. This pipeline was contributed by [zRzRzRzRzRzRzR](https://github.com/zRzRzRzRzRzRzR). The codebase can be found [here](https://huggingface.co/zai-org/GLM-Image). ## Usage examples ### Text to Image Generation ```python import torch from diffusers.pipelines.glm_image import GlmImagePipeline pipe = GlmImagePipeline.from_pretrained("zai-org/GLM-Image",torch_dtype=torch.bfloat16,device_map="cuda") prompt = "A beautifully designed modern food magazine style dessert recipe illustration, themed around a raspberry mousse cake. 
The overall layout is clean and bright, divided into four main areas: the top left features a bold black title 'Raspberry Mousse Cake Recipe Guide', with a soft-lit close-up photo of the finished cake on the right, showcasing a light pink cake adorned with fresh raspberries and mint leaves; the bottom left contains an ingredient list section, titled 'Ingredients' in a simple font, listing 'Flour 150g', 'Eggs 3', 'Sugar 120g', 'Raspberry puree 200g', 'Gelatin sheets 10g', 'Whipping cream 300ml', and 'Fresh raspberries', each accompanied by minimalist line icons (like a flour bag, eggs, sugar jar, etc.); the bottom right displays four equally sized step boxes, each containing high-definition macro photos and corresponding instructions, arranged from top to bottom as follows: Step 1 shows a whisk whipping white foam (with the instruction 'Whip egg whites to stiff peaks'), Step 2 shows a red-and-white mixture being folded with a spatula (with the instruction 'Gently fold in the puree and batter'), Step 3 shows pink liquid being poured into a round mold (with the instruction 'Pour into mold and chill for 4 hours'), Step 4 shows the finished cake decorated with raspberries and mint leaves (with the instruction 'Decorate with raspberries and mint'); a light brown information bar runs along the bottom edge, with icons on the left representing 'Preparation time: 30 minutes', 'Cooking time: 20 minutes', and 'Servings: 8'. The overall color scheme is dominated by creamy white and light pink, with a subtle paper texture in the background, featuring compact and orderly text and image layout with clear information hierarchy." 
image = pipe(
    prompt=prompt,
    height=32 * 32,
    width=36 * 32,
    num_inference_steps=30,
    guidance_scale=1.5,
    generator=torch.Generator(device="cuda").manual_seed(42),
).images[0]

image.save("output_t2i.png")
```

### Image to Image Generation

```python
import torch
from diffusers.pipelines.glm_image import GlmImagePipeline
from PIL import Image

pipe = GlmImagePipeline.from_pretrained("zai-org/GLM-Image", torch_dtype=torch.bfloat16, device_map="cuda")
image_path = "cond.jpg"
prompt = "Replace the background of the snow forest with an underground station featuring an automatic escalator."
image = Image.open(image_path).convert("RGB")
image = pipe(
    prompt=prompt,
    image=[image],  # can input multiple images for multi-image-to-image generation such as [image, image1]
    height=33 * 32,
    width=32 * 32,
    num_inference_steps=30,
    guidance_scale=1.5,
    generator=torch.Generator(device="cuda").manual_seed(42),
).images[0]

image.save("output_i2i.png")
```

+ Since the AR model used in GLM-Image is configured with `do_sample=True` and a temperature of `0.95` by default, the generated images can vary significantly across runs. We do not recommend setting `do_sample=False`, as this may lead to incorrect or degenerate outputs from the AR model.

## GlmImagePipeline

[[autodoc]] pipelines.glm_image.pipeline_glm_image.GlmImagePipeline
  - all
  - __call__

## GlmImagePipelineOutput

[[autodoc]] pipelines.glm_image.pipeline_output.GlmImagePipelineOutput

================================================
FILE: docs/source/en/api/pipelines/helios.md
================================================
# Helios

[Helios: Real Real-Time Long Video Generation Model](https://huggingface.co/papers/2603.04379) from Peking University, ByteDance, and others, by Shenghai Yuan, Yuanyang Yin, Zongjian Li, Xinwei Huang, Xiao Yang, Li Yuan.

* We introduce Helios, the first 14B video generation model that runs at 17 FPS on a single NVIDIA H100 GPU and supports minute-scale generation while matching a strong baseline in quality.
We make breakthroughs along three key dimensions: (1) robustness to long-video drifting without commonly used anti-drift heuristics such as self-forcing, error banks, or keyframe sampling; (2) real-time generation without standard acceleration techniques such as KV-cache, causal masking, or sparse attention; and (3) training without parallelism or sharding frameworks, enabling image-diffusion-scale batch sizes while fitting up to four 14B models within 80 GB of GPU memory. Specifically, Helios is a 14B autoregressive diffusion model with a unified input representation that natively supports T2V, I2V, and V2V tasks. To mitigate drifting in long-video generation, we characterize its typical failure modes and propose simple yet effective training strategies that explicitly simulate drifting during training, while eliminating repetitive motion at its source. For efficiency, we heavily compress the historical and noisy context and reduce the number of sampling steps, yielding computational costs comparable to—or lower than—those of 1.3B video generative models. Moreover, we introduce infrastructure-level optimizations that accelerate both inference and training while reducing memory consumption. Extensive experiments demonstrate that Helios consistently outperforms prior methods on both short- and long-video generation. All the code and models are available at [this https URL](https://pku-yuangroup.github.io/Helios-Page). The following Helios models are supported in Diffusers: - [Helios-Base](https://huggingface.co/BestWishYsh/Helios-Base): Best Quality, with v-prediction, standard CFG and custom HeliosScheduler. - [Helios-Mid](https://huggingface.co/BestWishYsh/Helios-Mid): Intermediate Weight, with v-prediction, CFG-Zero* and custom HeliosScheduler. - [Helios-Distilled](https://huggingface.co/BestWishYsh/Helios-Distilled): Best Efficiency, with x0-prediction and custom HeliosDMDScheduler. 
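
As a rough guide to clip length, the Helios examples in this document request `num_frames=99` and export the result with `fps=24`. The sketch below only does that arithmetic; the helper name `clip_duration_seconds` is hypothetical and is not part of Diffusers.

```python
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Playback length of a clip with `num_frames` frames shown at `fps` frames per second."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


# Values taken from the examples in this document: 99 frames exported at 24 fps.
duration = clip_duration_seconds(99, 24)
print(f"{duration:.3f} seconds")
```

Changing the `fps` passed to `export_to_video` changes only the playback speed of the saved file, not the number of frames the pipeline generates.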
> [!TIP] > Click on the Helios models in the right sidebar for more examples of video generation. ### Optimizing Memory and Inference Speed The example below demonstrates how to generate a video from text optimized for memory or inference speed. Refer to the [Reduce memory usage](../../optimization/memory) guide for more details about the various memory saving techniques. The Helios model below requires ~6GB of VRAM. ```py import torch from diffusers import AutoModel, HeliosPipeline from diffusers.hooks.group_offloading import apply_group_offloading from diffusers.utils import export_to_video vae = AutoModel.from_pretrained("BestWishYsh/Helios-Base", subfolder="vae", torch_dtype=torch.float32) # group-offloading pipeline = HeliosPipeline.from_pretrained( "BestWishYsh/Helios-Base", vae=vae, torch_dtype=torch.bfloat16 ) pipeline.enable_group_offload( onload_device=torch.device("cuda"), offload_device=torch.device("cpu"), offload_type="leaf_level", use_stream=True, record_stream=True, ) prompt = """ A vibrant tropical fish swimming gracefully among colorful coral reefs in a clear, turquoise ocean. The fish has bright blue and yellow scales with a small, distinctive orange spot on its side, its fins moving fluidly. The coral reefs are alive with a variety of marine life, including small schools of colorful fish and sea turtles gliding by. The water is crystal clear, allowing for a view of the sandy ocean floor below. The reef itself is adorned with a mix of hard and soft corals in shades of red, orange, and green. The photo captures the fish from a slightly elevated angle, emphasizing its lively movements and the vivid colors of its surroundings. A close-up shot with dynamic movement. 
""" negative_prompt = """ Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards """ output = pipeline( prompt=prompt, negative_prompt=negative_prompt, num_frames=99, num_inference_steps=50, guidance_scale=5.0, generator=torch.Generator("cuda").manual_seed(42), ).frames[0] export_to_video(output, "helios_base_t2v_output.mp4", fps=24) ``` [Compilation](../../optimization/fp16#torchcompile) is slow the first time but subsequent calls to the pipeline are faster. [Attention Backends](../../optimization/attention_backends) such as FlashAttention and SageAttention can significantly increase speed by optimizing the computation of the attention mechanism. [Context Parallelism](../../training/distributed_inference#context-parallelism) splits the input sequence across multiple devices to enable processing of long contexts in parallel, reducing memory pressure and latency. [Caching](../../optimization/cache) may also speed up inference by storing and reusing intermediate outputs. 
```py import torch from diffusers import AutoModel, HeliosPipeline from diffusers.utils import export_to_video vae = AutoModel.from_pretrained("BestWishYsh/Helios-Base", subfolder="vae", torch_dtype=torch.float32) pipeline = HeliosPipeline.from_pretrained( "BestWishYsh/Helios-Base", vae=vae, torch_dtype=torch.bfloat16 ) pipeline.to("cuda") # attention backend # pipeline.transformer.set_attention_backend("flash") pipeline.transformer.set_attention_backend("_flash_3_hub") # For Hopper GPUs # torch.compile torch.backends.cudnn.benchmark = True pipeline.text_encoder.compile(mode="max-autotune-no-cudagraphs", dynamic=False) pipeline.vae.compile(mode="max-autotune-no-cudagraphs", dynamic=False) pipeline.transformer.compile(mode="max-autotune-no-cudagraphs", dynamic=False) prompt = """ A vibrant tropical fish swimming gracefully among colorful coral reefs in a clear, turquoise ocean. The fish has bright blue and yellow scales with a small, distinctive orange spot on its side, its fins moving fluidly. The coral reefs are alive with a variety of marine life, including small schools of colorful fish and sea turtles gliding by. The water is crystal clear, allowing for a view of the sandy ocean floor below. The reef itself is adorned with a mix of hard and soft corals in shades of red, orange, and green. The photo captures the fish from a slightly elevated angle, emphasizing its lively movements and the vivid colors of its surroundings. A close-up shot with dynamic movement. 
""" negative_prompt = """ Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards """ output = pipeline( prompt=prompt, negative_prompt=negative_prompt, num_frames=99, num_inference_steps=50, guidance_scale=5.0, generator=torch.Generator("cuda").manual_seed(42), ).frames[0] export_to_video(output, "helios_base_t2v_output.mp4", fps=24) ``` ### Generation with Helios-Base The example below demonstrates how to use Helios-Base to generate video based on text, image or video. ```python import torch from diffusers import AutoModel, HeliosPipeline from diffusers.utils import export_to_video, load_video, load_image vae = AutoModel.from_pretrained("BestWishYsh/Helios-Base", subfolder="vae", torch_dtype=torch.float32) pipeline = HeliosPipeline.from_pretrained( "BestWishYsh/Helios-Base", vae=vae, torch_dtype=torch.bfloat16 ) pipeline.to("cuda") negative_prompt = """ Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards """ # For Text-to-Video prompt = """ A vibrant tropical fish swimming gracefully among colorful coral reefs in a clear, turquoise ocean. The fish has bright blue and yellow scales with a small, distinctive orange spot on its side, its fins moving fluidly. The coral reefs are alive with a variety of marine life, including small schools of colorful fish and sea turtles gliding by. 
The water is crystal clear, allowing for a view of the sandy ocean floor below. The reef itself is adorned with a mix of hard and soft corals in shades of red, orange, and green. The photo captures the fish from a slightly elevated angle, emphasizing its lively movements and the vivid colors of its surroundings. A close-up shot with dynamic movement. """ output = pipeline( prompt=prompt, negative_prompt=negative_prompt, num_frames=99, num_inference_steps=50, guidance_scale=5.0, generator=torch.Generator("cuda").manual_seed(42), ).frames[0] export_to_video(output, "helios_base_t2v_output.mp4", fps=24) # For Image-to-Video prompt = """ A towering emerald wave surges forward, its crest curling with raw power and energy. Sunlight glints off the translucent water, illuminating the intricate textures and deep green hues within the wave’s body. A thick spray erupts from the breaking crest, casting a misty veil that dances above the churning surface. As the perspective widens, the immense scale of the wave becomes apparent, revealing the restless expanse of the ocean stretching beyond. The scene captures the ocean’s untamed beauty and relentless force, with every droplet and ripple shimmering in the light. The dynamic motion and vivid colors evoke both awe and respect for nature’s might. """ image_path = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/helios/wave.jpg" output = pipeline( prompt=prompt, negative_prompt=negative_prompt, image=load_image(image_path).resize((640, 384)), num_frames=99, num_inference_steps=50, guidance_scale=5.0, generator=torch.Generator("cuda").manual_seed(42), ).frames[0] export_to_video(output, "helios_base_i2v_output.mp4", fps=24) # For Video-to-Video prompt = """ A bright yellow Lamborghini Huracn Tecnica speeds along a curving mountain road, surrounded by lush green trees under a partly cloudy sky. 
The car's sleek design and vibrant color stand out against the natural backdrop, emphasizing its dynamic movement. The road curves gently, with a guardrail visible on one side, adding depth to the scene. The motion blur captures the sense of speed and energy, creating a thrilling and exhilarating atmosphere. A front-facing shot from a slightly elevated angle, highlighting the car's aggressive stance and the surrounding greenery. """ video_path = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/helios/car.mp4" output = pipeline( prompt=prompt, negative_prompt=negative_prompt, video=load_video(video_path), num_frames=99, num_inference_steps=50, guidance_scale=5.0, generator=torch.Generator("cuda").manual_seed(42), ).frames[0] export_to_video(output, "helios_base_v2v_output.mp4", fps=24) ``` ### Generation with Helios-Mid The example below demonstrates how to use Helios-Mid to generate video based on text, image or video. ```python import torch from diffusers import AutoModel, HeliosPyramidPipeline from diffusers.utils import export_to_video, load_video, load_image vae = AutoModel.from_pretrained("BestWishYsh/Helios-Mid", subfolder="vae", torch_dtype=torch.float32) pipeline = HeliosPyramidPipeline.from_pretrained( "BestWishYsh/Helios-Mid", vae=vae, torch_dtype=torch.bfloat16 ) pipeline.to("cuda") negative_prompt = """ Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards """ # For Text-to-Video prompt = """ A vibrant tropical fish swimming gracefully among colorful coral reefs in a clear, turquoise ocean. 
The fish has bright blue and yellow scales with a small, distinctive orange spot on its side, its fins moving fluidly. The coral reefs are alive with a variety of marine life, including small schools of colorful fish and sea turtles gliding by. The water is crystal clear, allowing for a view of the sandy ocean floor below. The reef itself is adorned with a mix of hard and soft corals in shades of red, orange, and green. The photo captures the fish from a slightly elevated angle, emphasizing its lively movements and the vivid colors of its surroundings. A close-up shot with dynamic movement. """ output = pipeline( prompt=prompt, negative_prompt=negative_prompt, num_frames=99, pyramid_num_inference_steps_list=[20, 20, 20], guidance_scale=5.0, use_zero_init=True, zero_steps=1, generator=torch.Generator("cuda").manual_seed(42), ).frames[0] export_to_video(output, "helios_pyramid_t2v_output.mp4", fps=24) # For Image-to-Video prompt = """ A towering emerald wave surges forward, its crest curling with raw power and energy. Sunlight glints off the translucent water, illuminating the intricate textures and deep green hues within the wave’s body. A thick spray erupts from the breaking crest, casting a misty veil that dances above the churning surface. As the perspective widens, the immense scale of the wave becomes apparent, revealing the restless expanse of the ocean stretching beyond. The scene captures the ocean’s untamed beauty and relentless force, with every droplet and ripple shimmering in the light. The dynamic motion and vivid colors evoke both awe and respect for nature’s might. 
""" image_path = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/helios/wave.jpg" output = pipeline( prompt=prompt, negative_prompt=negative_prompt, image=load_image(image_path).resize((640, 384)), num_frames=99, pyramid_num_inference_steps_list=[20, 20, 20], guidance_scale=5.0, use_zero_init=True, zero_steps=1, generator=torch.Generator("cuda").manual_seed(42), ).frames[0] export_to_video(output, "helios_pyramid_i2v_output.mp4", fps=24) # For Video-to-Video prompt = """ A bright yellow Lamborghini Huracn Tecnica speeds along a curving mountain road, surrounded by lush green trees under a partly cloudy sky. The car's sleek design and vibrant color stand out against the natural backdrop, emphasizing its dynamic movement. The road curves gently, with a guardrail visible on one side, adding depth to the scene. The motion blur captures the sense of speed and energy, creating a thrilling and exhilarating atmosphere. A front-facing shot from a slightly elevated angle, highlighting the car's aggressive stance and the surrounding greenery. """ video_path = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/helios/car.mp4" output = pipeline( prompt=prompt, negative_prompt=negative_prompt, video=load_video(video_path), num_frames=99, pyramid_num_inference_steps_list=[20, 20, 20], guidance_scale=5.0, use_zero_init=True, zero_steps=1, generator=torch.Generator("cuda").manual_seed(42), ).frames[0] export_to_video(output, "helios_pyramid_v2v_output.mp4", fps=24) ``` ### Generation with Helios-Distilled The example below demonstrates how to use Helios-Distilled to generate video based on text, image or video. 
```python import torch from diffusers import AutoModel, HeliosPyramidPipeline from diffusers.utils import export_to_video, load_video, load_image vae = AutoModel.from_pretrained("BestWishYsh/Helios-Distilled", subfolder="vae", torch_dtype=torch.float32) pipeline = HeliosPyramidPipeline.from_pretrained( "BestWishYsh/Helios-Distilled", vae=vae, torch_dtype=torch.bfloat16 ) pipeline.to("cuda") negative_prompt = """ Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards """ # For Text-to-Video prompt = """ A vibrant tropical fish swimming gracefully among colorful coral reefs in a clear, turquoise ocean. The fish has bright blue and yellow scales with a small, distinctive orange spot on its side, its fins moving fluidly. The coral reefs are alive with a variety of marine life, including small schools of colorful fish and sea turtles gliding by. The water is crystal clear, allowing for a view of the sandy ocean floor below. The reef itself is adorned with a mix of hard and soft corals in shades of red, orange, and green. The photo captures the fish from a slightly elevated angle, emphasizing its lively movements and the vivid colors of its surroundings. A close-up shot with dynamic movement. """ output = pipeline( prompt=prompt, negative_prompt=negative_prompt, num_frames=240, pyramid_num_inference_steps_list=[2, 2, 2], guidance_scale=1.0, is_amplify_first_chunk=True, generator=torch.Generator("cuda").manual_seed(42), ).frames[0] export_to_video(output, "helios_distilled_t2v_output.mp4", fps=24) # For Image-to-Video prompt = """ A towering emerald wave surges forward, its crest curling with raw power and energy. 
Sunlight glints off the translucent water, illuminating the intricate textures and deep green hues within the wave’s body. A thick spray erupts from the breaking crest, casting a misty veil that dances above the churning surface. As the perspective widens, the immense scale of the wave becomes apparent, revealing the restless expanse of the ocean stretching beyond. The scene captures the ocean’s untamed beauty and relentless force, with every droplet and ripple shimmering in the light. The dynamic motion and vivid colors evoke both awe and respect for nature’s might. """ image_path = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/helios/wave.jpg" output = pipeline( prompt=prompt, negative_prompt=negative_prompt, image=load_image(image_path).resize((640, 384)), num_frames=240, pyramid_num_inference_steps_list=[2, 2, 2], guidance_scale=1.0, is_amplify_first_chunk=True, generator=torch.Generator("cuda").manual_seed(42), ).frames[0] export_to_video(output, "helios_distilled_i2v_output.mp4", fps=24) # For Video-to-Video prompt = """ A bright yellow Lamborghini Huracn Tecnica speeds along a curving mountain road, surrounded by lush green trees under a partly cloudy sky. The car's sleek design and vibrant color stand out against the natural backdrop, emphasizing its dynamic movement. The road curves gently, with a guardrail visible on one side, adding depth to the scene. The motion blur captures the sense of speed and energy, creating a thrilling and exhilarating atmosphere. A front-facing shot from a slightly elevated angle, highlighting the car's aggressive stance and the surrounding greenery. 
""" video_path = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/helios/car.mp4" output = pipeline( prompt=prompt, negative_prompt=negative_prompt, video=load_video(video_path), num_frames=240, pyramid_num_inference_steps_list=[2, 2, 2], guidance_scale=1.0, is_amplify_first_chunk=True, generator=torch.Generator("cuda").manual_seed(42), ).frames[0] export_to_video(output, "helios_distilled_v2v_output.mp4", fps=24) ``` ## HeliosPipeline [[autodoc]] HeliosPipeline - all - __call__ ## HeliosPyramidPipeline [[autodoc]] HeliosPyramidPipeline - all - __call__ ## HeliosPipelineOutput [[autodoc]] pipelines.helios.pipeline_output.HeliosPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/hidream.md ================================================ # HiDreamImage [HiDream-I1](https://huggingface.co/HiDream-ai) by HiDream.ai > [!TIP] > [Caching](../../optimization/cache) may also speed up inference by storing and reusing intermediate outputs. ## Available models The following models are available for the [`HiDreamImagePipeline`] pipeline: | Model name | Description | |:---|:---| | [`HiDream-ai/HiDream-I1-Full`](https://huggingface.co/HiDream-ai/HiDream-I1-Full) | - | | [`HiDream-ai/HiDream-I1-Dev`](https://huggingface.co/HiDream-ai/HiDream-I1-Dev) | - | | [`HiDream-ai/HiDream-I1-Fast`](https://huggingface.co/HiDream-ai/HiDream-I1-Fast) | - | ## HiDreamImagePipeline [[autodoc]] HiDreamImagePipeline - all - __call__ ## HiDreamImagePipelineOutput [[autodoc]] pipelines.hidream_image.pipeline_output.HiDreamImagePipelineOutput ================================================ FILE: docs/source/en/api/pipelines/hunyuan_video.md ================================================ # HunyuanVideo [HunyuanVideo](https://huggingface.co/papers/2412.03603) is a 13B parameter diffusion transformer model designed to be competitive with closed-source video foundation models and enable wider community access. 
This model uses a "dual-stream to single-stream" architecture to separately process the video and text tokens first, before concatenating and feeding them to the transformer to fuse the multimodal information. A pretrained multimodal large language model (MLLM) is used as the encoder because it has better image-text alignment, better image detail description and reasoning, and it can be used as a zero-shot learner if system instructions are added to user prompts. Finally, HunyuanVideo uses a 3D causal variational autoencoder to more efficiently process video data at the original resolution and frame rate. You can find all the original HunyuanVideo checkpoints under the [Tencent](https://huggingface.co/tencent) organization. > [!TIP] > Click on the HunyuanVideo models in the right sidebar for more examples of video generation tasks. > > The examples below use a checkpoint from [hunyuanvideo-community](https://huggingface.co/hunyuanvideo-community) because the weights are stored in a layout compatible with Diffusers. The example below demonstrates how to generate a video optimized for memory or inference speed. Refer to the [Reduce memory usage](../../optimization/memory) guide for more details about the various memory saving techniques. The quantized HunyuanVideo model below requires ~14GB of VRAM. 
```py
import torch
from diffusers import HunyuanVideoPipeline
from diffusers.quantizers import PipelineQuantizationConfig
from diffusers.utils import export_to_video

# quantize weights to int4 with bitsandbytes
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": torch.bfloat16
    },
    components_to_quantize="transformer"
)

pipeline = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
)

# model-offloading and tiling
pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()

prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys."
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)
```

[Compilation](../../optimization/fp16#torchcompile) is slow the first time but subsequent calls to the pipeline are faster.
```py
import torch
from diffusers import HunyuanVideoPipeline
from diffusers.quantizers import PipelineQuantizationConfig
from diffusers.utils import export_to_video

# quantize weights to int4 with bitsandbytes
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": torch.bfloat16
    },
    components_to_quantize="transformer"
)

pipeline = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
)

# model-offloading and tiling
pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()

# torch.compile
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.transformer = torch.compile(
    pipeline.transformer, mode="max-autotune", fullgraph=True
)

prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys."
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)
```

## Notes

- HunyuanVideo supports LoRAs with [`~loaders.HunyuanVideoLoraLoaderMixin.load_lora_weights`].
  ```py
  import torch
  from diffusers import HunyuanVideoPipeline
  from diffusers.quantizers import PipelineQuantizationConfig
  from diffusers.utils import export_to_video

  # quantize weights to int4 with bitsandbytes
  pipeline_quant_config = PipelineQuantizationConfig(
      quant_backend="bitsandbytes_4bit",
      quant_kwargs={
          "load_in_4bit": True,
          "bnb_4bit_quant_type": "nf4",
          "bnb_4bit_compute_dtype": torch.bfloat16
      },
      components_to_quantize="transformer"
  )

  pipeline = HunyuanVideoPipeline.from_pretrained(
      "hunyuanvideo-community/HunyuanVideo",
      quantization_config=pipeline_quant_config,
      torch_dtype=torch.bfloat16,
  )

  # load LoRA weights
  pipeline.load_lora_weights("https://huggingface.co/lucataco/hunyuan-steamboat-willie-10", adapter_name="steamboat-willie")
  pipeline.set_adapters("steamboat-willie", 0.9)

  # model-offloading and tiling
  pipeline.enable_model_cpu_offload()
  pipeline.vae.enable_tiling()

  # use "In the style of SWR" to trigger the LoRA
  prompt = """
  In the style of SWR. A black and white animated scene featuring a fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys.
  """
  video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
  export_to_video(video, "output.mp4", fps=15)
  ```
- Refer to the table below for recommended inference values.

  | parameter | recommended value |
  |---|---|
  | text encoder dtype | `torch.float16` |
  | transformer dtype | `torch.bfloat16` |
  | vae dtype | `torch.float16` |
  | `num_frames` | `4 * k + 1` for an integer `k` |

- Try lower `shift` values (`2.0` to `5.0`) for lower resolution videos and higher `shift` values (`7.0` to `12.0`) for higher resolution videos.

## HunyuanVideoPipeline

[[autodoc]] HunyuanVideoPipeline
  - all
  - __call__

## HunyuanVideoPipelineOutput

[[autodoc]] pipelines.hunyuan_video.pipeline_output.HunyuanVideoPipelineOutput

================================================
FILE: docs/source/en/api/pipelines/hunyuan_video15.md
================================================

# HunyuanVideo-1.5

HunyuanVideo-1.5 is a lightweight yet powerful video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. This achievement is built upon several key components, including meticulous data curation, an advanced DiT architecture with selective and sliding tile attention (SSTA), enhanced bilingual understanding through glyph-aware text encoding, progressive pre-training and post-training, and an efficient video super-resolution network. Leveraging these designs, we developed a unified framework capable of high-quality text-to-video and image-to-video generation across multiple durations and resolutions. Extensive experiments demonstrate that this compact and proficient model establishes a new state-of-the-art among open-source models.

You can find all the original HunyuanVideo checkpoints under the [Tencent](https://huggingface.co/tencent) organization.

> [!TIP]
> Click on the HunyuanVideo models in the right sidebar for more examples of video generation tasks.
>
> The examples below use a checkpoint from [hunyuanvideo-community](https://huggingface.co/hunyuanvideo-community) because the weights are stored in a layout compatible with Diffusers.

The example below demonstrates how to generate a video optimized for memory or inference speed.

Refer to the [Reduce memory usage](../../optimization/memory) guide for more details about the various memory saving techniques.

```py
import torch
from diffusers import HunyuanVideo15Pipeline
from diffusers.utils import export_to_video

pipeline = HunyuanVideo15Pipeline.from_pretrained(
    "HunyuanVideo-1.5-Diffusers-480p_t2v",
    torch_dtype=torch.bfloat16,
)

# model-offloading and tiling
pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()

prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys."
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)
```

## Notes

- HunyuanVideo-1.5 uses attention masks with variable-length sequences. For best performance, we recommend using an attention backend that handles padding efficiently.

  - **H100/H800:** `_flash_3_hub` or `_flash_3_varlen_hub`
  - **A100/A800/RTX 4090:** `flash_hub` or `flash_varlen_hub`
  - **Other GPUs:** `sage_hub`

  Refer to the [Attention backends](../../optimization/attention_backends) guide for more details about using a different backend.

  ```py
  pipe.transformer.set_attention_backend("flash_hub")  # or your preferred backend
  ```

- [`HunyuanVideo15Pipeline`] uses a guider and does not take a `guidance_scale` parameter at runtime.
You can check the default guider configuration using `pipe.guider`: ```py >>> pipe.guider ClassifierFreeGuidance { "_class_name": "ClassifierFreeGuidance", "_diffusers_version": "0.36.0.dev0", "enabled": true, "guidance_rescale": 0.0, "guidance_scale": 6.0, "start": 0.0, "stop": 1.0, "use_original_formulation": false } State: step: None num_inference_steps: None timestep: None count_prepared: 0 enabled: True num_conditions: 2 ``` To update guider configuration, you can run `pipe.guider = pipe.guider.new(...)` ```py pipe.guider = pipe.guider.new(guidance_scale=5.0) ``` Read more on Guider [here](../../using-diffusers/guiders). ## HunyuanVideo15Pipeline [[autodoc]] HunyuanVideo15Pipeline - all - __call__ ## HunyuanVideo15ImageToVideoPipeline [[autodoc]] HunyuanVideo15ImageToVideoPipeline - all - __call__ ## HunyuanVideo15PipelineOutput [[autodoc]] pipelines.hunyuan_video1_5.pipeline_output.HunyuanVideo15PipelineOutput ================================================ FILE: docs/source/en/api/pipelines/hunyuandit.md ================================================ # Hunyuan-DiT ![chinese elements understanding](https://github.com/gnobitab/diffusers-hunyuan/assets/1157982/39b99036-c3cb-4f16-bb1a-40ec25eda573) [Hunyuan-DiT : A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding](https://huggingface.co/papers/2405.08748) from Tencent Hunyuan. The abstract from the paper is: *We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. 
Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context. Through our holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image generation compared with other open-source models.*

You can find the original codebase at [Tencent/HunyuanDiT](https://github.com/Tencent/HunyuanDiT) and all the available checkpoints at [Tencent-Hunyuan](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT).

**Highlights**: HunyuanDiT supports Chinese/English-to-image, multi-resolution generation.

HunyuanDiT has the following components:
* It uses a diffusion transformer as the backbone
* It combines two text encoders, a bilingual CLIP and a multilingual T5 encoder

> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

> [!TIP]
> You can further improve generation quality by passing the generated image from [`HunyuanDiTPipeline`] to the [SDXL refiner](../../using-diffusers/sdxl#base-to-refiner-model) model.

## Optimization

You can optimize the pipeline's runtime and memory consumption with torch.compile and feed-forward chunking. To learn about other optimization methods, check out the [Speed up inference](../../optimization/fp16) and [Reduce memory usage](../../optimization/memory) guides.

### Inference

Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency.
First, load the pipeline:

```python
from diffusers import HunyuanDiTPipeline
import torch

pipeline = HunyuanDiTPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-Diffusers", torch_dtype=torch.float16
).to("cuda")
```

Then change the memory layout of the pipeline's `transformer` and `vae` components to `torch.channels_last`:

```python
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.vae.to(memory_format=torch.channels_last)
```

Finally, compile the components and run inference:

```python
pipeline.transformer = torch.compile(pipeline.transformer, mode="max-autotune", fullgraph=True)
pipeline.vae.decode = torch.compile(pipeline.vae.decode, mode="max-autotune", fullgraph=True)

# the prompt is Chinese for "An astronaut riding a horse"
image = pipeline(prompt="一个宇航员在骑马").images[0]
```

The [benchmark](https://gist.github.com/sayakpaul/29d3a14905cfcbf611fe71ebd22e9b23) results on an 80GB A100 machine are:

```bash
With torch.compile(): Average inference time: 12.470 seconds.
Without torch.compile(): Average inference time: 20.570 seconds.
```

### Memory optimization

By loading the T5 text encoder in 8 bits, you can run the pipeline in just under 6 GB of GPU VRAM. Refer to [this script](https://gist.github.com/sayakpaul/3154605f6af05b98a41081aaba5ca43e) for details.

Furthermore, you can use the [`~HunyuanDiT2DModel.enable_forward_chunking`] method to reduce memory usage. Feed-forward chunking runs the feed-forward layers in a transformer block in a loop instead of all at once. This gives you a trade-off between memory consumption and inference runtime.
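The trade-off behind feed-forward chunking can be illustrated with a small framework-free sketch. This is a toy model of the idea, not the Diffusers implementation: the helper names (`feed_forward`, `chunked_feed_forward`) and the `stats` dictionary are hypothetical, and plain Python lists stand in for tensors. Because a feed-forward layer acts on each token independently, running it on slices of the sequence produces the same output while keeping fewer intermediate activations alive at once, at the cost of more calls.

```python
def feed_forward(tokens, stats):
    """Toy position-wise feed-forward layer over a list of token vectors."""
    # The hidden activations are the memory-hungry intermediate.
    hidden = [[max(0.0, 2.0 * x - 1.0) for x in token] for token in tokens]
    stats["peak_rows"] = max(stats["peak_rows"], len(hidden))
    stats["calls"] += 1
    return [[0.5 * h + 0.25 for h in row] for row in hidden]


def chunked_feed_forward(tokens, chunk_size, stats):
    """Run the layer on `chunk_size` tokens at a time and concatenate the results."""
    outputs = []
    for start in range(0, len(tokens), chunk_size):
        outputs.extend(feed_forward(tokens[start:start + chunk_size], stats))
    return outputs


tokens = [[0.1 * i, 1.0 - 0.1 * i] for i in range(8)]

full_stats = {"peak_rows": 0, "calls": 0}
full = feed_forward(tokens, full_stats)

chunk_stats = {"peak_rows": 0, "calls": 0}
chunked = chunked_feed_forward(tokens, chunk_size=2, stats=chunk_stats)

print(chunked == full)          # identical outputs
print(full_stats, chunk_stats)  # fewer rows held at once, but more calls
```

In this toy setup, a smaller `chunk_size` lowers the peak number of rows processed together and raises the number of layer calls, which mirrors the memory versus runtime trade-off described above.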
```diff
+ pipeline.transformer.enable_forward_chunking(chunk_size=1, dim=1)
```

## HunyuanDiTPipeline

[[autodoc]] HunyuanDiTPipeline
  - all
  - __call__

================================================
FILE: docs/source/en/api/pipelines/hunyuanimage21.md
================================================

# HunyuanImage2.1

HunyuanImage-2.1 is a 17B text-to-image model that is capable of generating 2K (2048 x 2048) resolution images.

HunyuanImage-2.1 comes in the following variants:

| model type | model id |
|:----------:|:--------:|
| HunyuanImage-2.1 | [hunyuanvideo-community/HunyuanImage-2.1-Diffusers](https://huggingface.co/hunyuanvideo-community/HunyuanImage-2.1-Diffusers) |
| HunyuanImage-2.1-Distilled | [hunyuanvideo-community/HunyuanImage-2.1-Distilled-Diffusers](https://huggingface.co/hunyuanvideo-community/HunyuanImage-2.1-Distilled-Diffusers) |
| HunyuanImage-2.1-Refiner | [hunyuanvideo-community/HunyuanImage-2.1-Refiner-Diffusers](https://huggingface.co/hunyuanvideo-community/HunyuanImage-2.1-Refiner-Diffusers) |

> [!TIP]
> [Caching](../../optimization/cache) may also speed up inference by storing and reusing intermediate outputs.

## HunyuanImage-2.1

HunyuanImage-2.1 applies [Adaptive Projected Guidance (APG)](https://huggingface.co/papers/2410.02416) combined with Classifier-Free Guidance (CFG) in the denoising loop. `HunyuanImagePipeline` has a `guider` component (read more about [Guider](../../using-diffusers/guiders)) and does not take a `guidance_scale` parameter at runtime. To change guider-related parameters, e.g., `guidance_scale`, you can update the `guider` configuration instead.
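For intuition about what `guidance_scale` controls, the widely used classifier-free guidance combination can be sketched in plain Python. This is a toy illustration on lists of floats, not the guider's actual implementation: the function name `classifier_free_guidance` is hypothetical, the inputs are made-up numbers, and the projection and momentum terms that APG adds are not modeled.

```python
def classifier_free_guidance(pred_cond, pred_uncond, guidance_scale):
    """Move from the unconditional prediction toward the conditional one."""
    return [u + guidance_scale * (c - u) for c, u in zip(pred_cond, pred_uncond)]


pred_cond = [1.0, 2.0]    # toy model prediction with the prompt
pred_uncond = [0.5, 1.0]  # toy model prediction without the prompt

# A scale of 1.0 reproduces the conditional prediction; larger values
# extrapolate further in the direction the prompt pushes the prediction.
same_as_cond = classifier_free_guidance(pred_cond, pred_uncond, guidance_scale=1.0)
guided = classifier_free_guidance(pred_cond, pred_uncond, guidance_scale=3.0)
print(same_as_cond, guided)
```

Under this formulation, raising `guidance_scale` typically strengthens prompt adherence, which is why it is the parameter most often adjusted on the guider.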
```python import torch from diffusers import HunyuanImagePipeline pipe = HunyuanImagePipeline.from_pretrained( "hunyuanvideo-community/HunyuanImage-2.1-Diffusers", torch_dtype=torch.bfloat16 ) pipe = pipe.to("cuda") ``` You can inspect the `guider` object: ```py >>> pipe.guider AdaptiveProjectedMixGuidance { "_class_name": "AdaptiveProjectedMixGuidance", "_diffusers_version": "0.36.0.dev0", "adaptive_projected_guidance_momentum": -0.5, "adaptive_projected_guidance_rescale": 10.0, "adaptive_projected_guidance_scale": 10.0, "adaptive_projected_guidance_start_step": 5, "enabled": true, "eta": 0.0, "guidance_rescale": 0.0, "guidance_scale": 3.5, "start": 0.0, "stop": 1.0, "use_original_formulation": false } State: step: None num_inference_steps: None timestep: None count_prepared: 0 enabled: True num_conditions: 2 momentum_buffer: None is_apg_enabled: False is_cfg_enabled: True ``` To update the guider with a different configuration, use the `new()` method. For example, to generate an image with `guidance_scale=5.0` while keeping all other default guidance parameters: ```py import torch from diffusers import HunyuanImagePipeline pipe = HunyuanImagePipeline.from_pretrained( "hunyuanvideo-community/HunyuanImage-2.1-Diffusers", torch_dtype=torch.bfloat16 ) pipe = pipe.to("cuda") # Update the guider configuration pipe.guider = pipe.guider.new(guidance_scale=5.0) prompt = ( "A cute, cartoon-style anthropomorphic penguin plush toy with fluffy fur, standing in a painting studio, " "wearing a red knitted scarf and a red beret with the word 'Tencent' on it, holding a paintbrush with a " "focused expression as it paints an oil painting of the Mona Lisa, rendered in a photorealistic photographic style." 
)

image = pipe(
    prompt=prompt,
    num_inference_steps=50,
    height=2048,
    width=2048,
).images[0]
image.save("image.png")
```

## HunyuanImage-2.1-Distilled

Use `distilled_guidance_scale` with the guidance-distilled checkpoint:

```py
import torch
from diffusers import HunyuanImagePipeline

pipe = HunyuanImagePipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanImage-2.1-Distilled-Diffusers", torch_dtype=torch.bfloat16
)
pipe = pipe.to("cuda")

# seed a generator for reproducible results (the seed value is arbitrary)
generator = torch.Generator(device="cuda").manual_seed(0)

prompt = (
    "A cute, cartoon-style anthropomorphic penguin plush toy with fluffy fur, standing in a painting studio, "
    "wearing a red knitted scarf and a red beret with the word 'Tencent' on it, holding a paintbrush with a "
    "focused expression as it paints an oil painting of the Mona Lisa, rendered in a photorealistic photographic style."
)

out = pipe(
    prompt,
    num_inference_steps=8,
    distilled_guidance_scale=3.25,
    height=2048,
    width=2048,
    generator=generator,
).images[0]
```

## HunyuanImagePipeline

[[autodoc]] HunyuanImagePipeline
  - all
  - __call__

## HunyuanImageRefinerPipeline

[[autodoc]] HunyuanImageRefinerPipeline
  - all
  - __call__

## HunyuanImagePipelineOutput

[[autodoc]] pipelines.hunyuan_image.pipeline_output.HunyuanImagePipelineOutput

================================================
FILE: docs/source/en/api/pipelines/i2vgenxl.md
================================================

> [!WARNING]
> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model.

# I2VGen-XL

[I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models](https://hf.co/papers/2311.04145.pdf) by Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, and Jingren Zhou.

The abstract from the paper is:

*Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models.
However, it still encounters challenges in terms of semantic accuracy, clarity and spatio-temporal continuity. They primarily arise from the scarcity of well-aligned text-video data and the complex inherent structure of videos, making it difficult for the model to simultaneously ensure semantic and qualitative excellence. In this report, we propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors and ensures the alignment of the input data by utilizing static images as a form of crucial guidance. I2VGen-XL consists of two stages: i) the base stage guarantees coherent semantics and preserves content from input images by using two hierarchical encoders, and ii) the refinement stage enhances the video's details by incorporating an additional brief text and improves the resolution to 1280×720. To improve the diversity, we collect around 35 million single-shot text-video pairs and 6 billion text-image pairs to optimize the model. By this means, I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos. Through extensive experiments, we have investigated the underlying principles of I2VGen-XL and compared it with current top methods, which can demonstrate its effectiveness on diverse data. The source code and models will be publicly available at [this https URL](https://i2vgen-xl.github.io/).* The original codebase can be found [here](https://github.com/ali-vilab/i2vgen-xl/). The model checkpoints can be found [here](https://huggingface.co/ali-vilab/). > [!TIP] > Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. 
Also, to learn more about reducing the memory usage of this pipeline, refer to the ["Reduce memory usage"](../../using-diffusers/svd#reduce-memory-usage) section.

Sample output with I2VGenXL:
*(Sample GIF not reproduced here; its caption reads "library.")*
## Notes

* I2VGenXL always uses a `clip_skip` value of 1. This means it leverages the penultimate layer representations from the text encoder of CLIP.
* It can generate videos of quality that is often on par with [Stable Video Diffusion](../../using-diffusers/svd) (SVD).
* Unlike SVD, it additionally accepts text prompts as inputs.
* It can generate higher resolution videos.
* When using the [`DDIMScheduler`] (which is the default for this pipeline), fewer than 50 inference steps lead to bad results.
* This implementation is the 1-stage variant of I2VGenXL. The main figure in the [I2VGen-XL](https://huggingface.co/papers/2311.04145) paper shows a 2-stage variant; however, the 1-stage variant works well. See [this discussion](https://github.com/huggingface/diffusers/discussions/7952) for more details.

## I2VGenXLPipeline

[[autodoc]] I2VGenXLPipeline
  - all
  - __call__

## I2VGenXLPipelineOutput

[[autodoc]] pipelines.i2vgen_xl.pipeline_i2vgen_xl.I2VGenXLPipelineOutput

================================================
FILE: docs/source/en/api/pipelines/kandinsky.md
================================================
# Kandinsky 2.1

Kandinsky 2.1 is created by [Arseniy Shakhmatov](https://github.com/cene555), [Anton Razzhigaev](https://github.com/razzant), [Aleksandr Nikolich](https://github.com/AlexWortega), [Vladimir Arkhipkin](https://github.com/oriBetelgeuse), [Igor Pavlov](https://github.com/boomb0om), [Andrey Kuznetsov](https://github.com/kuznetsoffandrey), and [Denis Dimitrov](https://github.com/denndimitrov).

The description from its GitHub page is:

*Kandinsky 2.1 inherits best practicies from Dall-E 2 and Latent diffusion, while introducing some new ideas. As text and image encoder it uses CLIP model and diffusion image prior (mapping) between latent spaces of CLIP modalities.
This approach increases the visual performance of the model and unveils new horizons in blending images and text-guided image manipulation.*

The original codebase can be found at [ai-forever/Kandinsky-2](https://github.com/ai-forever/Kandinsky-2).

> [!TIP]
> Check out the [Kandinsky Community](https://huggingface.co/kandinsky-community) organization on the Hub for the official model checkpoints for tasks like text-to-image, image-to-image, and inpainting.

> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

## KandinskyPriorPipeline

[[autodoc]] KandinskyPriorPipeline
  - all
  - __call__
  - interpolate

## KandinskyPipeline

[[autodoc]] KandinskyPipeline
  - all
  - __call__

## KandinskyCombinedPipeline

[[autodoc]] KandinskyCombinedPipeline
  - all
  - __call__

## KandinskyImg2ImgPipeline

[[autodoc]] KandinskyImg2ImgPipeline
  - all
  - __call__

## KandinskyImg2ImgCombinedPipeline

[[autodoc]] KandinskyImg2ImgCombinedPipeline
  - all
  - __call__

## KandinskyInpaintPipeline

[[autodoc]] KandinskyInpaintPipeline
  - all
  - __call__

## KandinskyInpaintCombinedPipeline

[[autodoc]] KandinskyInpaintCombinedPipeline
  - all
  - __call__

================================================
FILE: docs/source/en/api/pipelines/kandinsky3.md
================================================
# Kandinsky 3
Kandinsky 3 is created by [Vladimir Arkhipkin](https://github.com/oriBetelgeuse), [Anastasia Maltseva](https://github.com/NastyaMittseva), [Igor Pavlov](https://github.com/boomb0om), [Andrei Filatov](https://github.com/anvilarth), [Arseniy Shakhmatov](https://github.com/cene555), [Andrey Kuznetsov](https://github.com/kuznetsoffandrey), [Denis Dimitrov](https://github.com/denndimitrov), and [Zein Shaheen](https://github.com/zeinsh).

The description from its GitHub page is:

*Kandinsky 3.0 is an open-source text-to-image diffusion model built upon the Kandinsky2-x model family. In comparison to its predecessors, enhancements have been made to the text understanding and visual quality of the model, achieved by increasing the size of the text encoder and Diffusion U-Net models, respectively.*

Its architecture includes 3 main components:
1. [FLAN-UL2](https://huggingface.co/google/flan-ul2), which is an encoder-decoder model based on the T5 architecture.
2. A new U-Net architecture featuring BigGAN-deep blocks, which doubles the depth while maintaining the same number of parameters.
3. Sber-MoVQGAN, a decoder proven to have superior results in image restoration.

The original codebase can be found at [ai-forever/Kandinsky-3](https://github.com/ai-forever/Kandinsky-3).

> [!TIP]
> Check out the [Kandinsky Community](https://huggingface.co/kandinsky-community) organization on the Hub for the official model checkpoints for tasks like text-to-image, image-to-image, and inpainting.

> [!TIP]
> Make sure to check out the schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
## Kandinsky3Pipeline [[autodoc]] Kandinsky3Pipeline - all - __call__ ## Kandinsky3Img2ImgPipeline [[autodoc]] Kandinsky3Img2ImgPipeline - all - __call__ ================================================ FILE: docs/source/en/api/pipelines/kandinsky5_image.md ================================================ # Kandinsky 5.0 Image [Kandinsky 5.0](https://arxiv.org/abs/2511.14993) is a family of diffusion models for Video & Image generation. Kandinsky 5.0 Image Lite is a lightweight image generation model (6B parameters). The model introduces several key innovations: - **Latent diffusion pipeline** with **Flow Matching** for improved training stability - **Diffusion Transformer (DiT)** as the main generative backbone with cross-attention to text embeddings - Dual text encoding using **Qwen2.5-VL** and **CLIP** for comprehensive text understanding - **Flux VAE** for efficient image encoding and decoding The original codebase can be found at [kandinskylab/Kandinsky-5](https://github.com/kandinskylab/Kandinsky-5). > [!TIP] > Check out the [Kandinsky Lab](https://huggingface.co/kandinskylab) organization on the Hub for the official model checkpoints for text-to-video generation, including pretrained, SFT, no-CFG, and distilled variants. 
## Available Models

Kandinsky 5.0 Image Lite:

| model_id | Description | Use Cases |
|------------|-------------|-----------|
| [**kandinskylab/Kandinsky-5.0-T2I-Lite-sft-Diffusers**](https://huggingface.co/kandinskylab/Kandinsky-5.0-T2I-Lite-sft-Diffusers) | 6B image Supervised Fine-Tuned model | Highest generation quality |
| [**kandinskylab/Kandinsky-5.0-I2I-Lite-sft-Diffusers**](https://huggingface.co/kandinskylab/Kandinsky-5.0-I2I-Lite-sft-Diffusers) | 6B image editing Supervised Fine-Tuned model | Highest generation quality |
| [**kandinskylab/Kandinsky-5.0-T2I-Lite-pretrain-Diffusers**](https://huggingface.co/kandinskylab/Kandinsky-5.0-T2I-Lite-pretrain-Diffusers) | 6B image Base pretrained model | Research and fine-tuning |
| [**kandinskylab/Kandinsky-5.0-I2I-Lite-pretrain-Diffusers**](https://huggingface.co/kandinskylab/Kandinsky-5.0-I2I-Lite-pretrain-Diffusers) | 6B image editing Base pretrained model | Research and fine-tuning |

## Usage Examples

### Basic Text-to-Image Generation

```python
import torch
from diffusers import Kandinsky5T2IPipeline

# Load the pipeline
model_id = "kandinskylab/Kandinsky-5.0-T2I-Lite-sft-Diffusers"
pipe = Kandinsky5T2IPipeline.from_pretrained(model_id)
_ = pipe.to(device="cuda", dtype=torch.bfloat16)

# Generate image
prompt = "A fluffy, expressive cat wearing a bright red hat with a soft, slightly textured fabric. The hat should look cozy and well-fitted on the cat’s head. On the front of the hat, add clean, bold white text that reads “SWEET”, clearly visible and neatly centered. Ensure the overall lighting highlights the hat’s color and the cat’s fur details."

output = pipe(
    prompt=prompt,
    negative_prompt="",
    height=1024,
    width=1024,
    num_inference_steps=50,
    guidance_scale=3.5,
).image[0]
```

### Basic Image-to-Image Generation

```python
import torch
from diffusers import Kandinsky5I2IPipeline
from diffusers.utils import load_image

# Load the pipeline
model_id = "kandinskylab/Kandinsky-5.0-I2I-Lite-sft-Diffusers"
pipe = Kandinsky5I2IPipeline.from_pretrained(model_id)
_ = pipe.to(device="cuda", dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # <--- Enable CPU offloading for single GPU inference

# Edit the input image
image = load_image(
    "https://huggingface.co/kandinsky-community/kandinsky-3/resolve/main/assets/title.jpg?download=true"
)

prompt = "Change the background from a winter night scene to a bright summer day. Place the character on a sandy beach with clear blue sky, soft sunlight, and gentle waves in the distance. Replace the winter clothing with a light short-sleeved T-shirt (in soft pastel colors) and casual shorts. Ensure the character’s fur reflects warm daylight instead of cold winter tones. Add small beach details such as seashells, footprints in the sand, and a few scattered beach toys nearby. Keep the oranges in the scene, but place them naturally on the sand."
negative_prompt = ""

output = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=3.5,
).image[0]
```

## Kandinsky5T2IPipeline

[[autodoc]] Kandinsky5T2IPipeline
  - all
  - __call__

## Kandinsky5I2IPipeline

[[autodoc]] Kandinsky5I2IPipeline
  - all
  - __call__

## Citation

```bibtex
@misc{kandinsky2025,
    author = {Alexander Belykh and Alexander Varlamov and Alexey Letunovskiy and Anastasia Aliaskina and Anastasia Maltseva and Anastasiia Kargapoltseva and Andrey Shutkin and Anna Averchenkova and Anna Dmitrienko and Bulat Akhmatov and Denis Dimitrov and Denis Koposov and Denis Parkhomenko and Dmitrii and Ilya Vasiliev and Ivan Kirillov and Julia Agafonova and Kirill Chernyshev and Kormilitsyn Semen and Lev Novitskiy and Maria Kovaleva and Mikhail Mamaev and Mikhailov and Nikita Kiselev and Nikita Osterov and Nikolai Gerasimenko and Nikolai Vaulin and Olga Kim and Olga Vdovchenko and Polina Gavrilova and Polina Mikhailova and Tatiana Nikulina and Viacheslav Vasilev and Vladimir Arkhipkin and Vladimir Korviakov and Vladimir Polovnikov and Yury Kolabushin},
    title = {Kandinsky 5.0: A family of diffusion models for Video & Image generation},
    howpublished = {\url{https://github.com/kandinskylab/Kandinsky-5}},
    year = 2025
}
```

================================================
FILE: docs/source/en/api/pipelines/kandinsky5_video.md
================================================
# Kandinsky 5.0 Video

[Kandinsky 5.0](https://arxiv.org/abs/2511.14993) is a family of diffusion models for Video & Image generation.

Kandinsky 5.0 Lite is a line-up of lightweight video generation models (2B parameters) that ranks #1 among open-source models in its class. It outperforms larger models and offers the best understanding of Russian concepts in the open-source ecosystem.

Kandinsky 5.0 Pro is a line-up of large, high-quality video generation models (19B parameters). It offers high-quality generation in HD and more generation formats like I2V.
The model introduces several key innovations:
- **Latent diffusion pipeline** with **Flow Matching** for improved training stability
- **Diffusion Transformer (DiT)** as the main generative backbone with cross-attention to text embeddings
- Dual text encoding using **Qwen2.5-VL** and **CLIP** for comprehensive text understanding
- **HunyuanVideo 3D VAE** for efficient video encoding and decoding
- **Sparse attention mechanisms** (NABLA) for efficient long-sequence processing

The original codebase can be found at [kandinskylab/Kandinsky-5](https://github.com/kandinskylab/Kandinsky-5).

> [!TIP]
> Check out the [Kandinsky Lab](https://huggingface.co/kandinskylab) organization on the Hub for the official model checkpoints for text-to-video generation, including pretrained, SFT, no-CFG, and distilled variants.

## Available Models

Kandinsky 5.0 T2V Pro:

| model_id | Description | Use Cases |
|------------|-------------|-----------|
| **kandinskylab/Kandinsky-5.0-T2V-Pro-sft-5s-Diffusers** | 5 second Text-to-Video Pro model | High-quality text-to-video generation |
| **kandinskylab/Kandinsky-5.0-I2V-Pro-sft-5s-Diffusers** | 5 second Image-to-Video Pro model | High-quality image-to-video generation |

Kandinsky 5.0 T2V Lite:

| model_id | Description | Use Cases |
|------------|-------------|-----------|
| **kandinskylab/Kandinsky-5.0-T2V-Lite-sft-5s-Diffusers** | 5 second Supervised Fine-Tuned model | Highest generation quality |
| **kandinskylab/Kandinsky-5.0-T2V-Lite-sft-10s-Diffusers** | 10 second Supervised Fine-Tuned model | Highest generation quality |
| **kandinskylab/Kandinsky-5.0-T2V-Lite-nocfg-5s-Diffusers** | 5 second Classifier-Free Guidance distilled | 2× faster inference |
| **kandinskylab/Kandinsky-5.0-T2V-Lite-nocfg-10s-Diffusers** | 10 second Classifier-Free Guidance distilled | 2× faster inference |
| **kandinskylab/Kandinsky-5.0-T2V-Lite-distilled16steps-5s-Diffusers** | 5 second Diffusion distilled to 16 steps | 6× faster inference, minimal quality loss |
| **kandinskylab/Kandinsky-5.0-T2V-Lite-distilled16steps-10s-Diffusers** | 10 second Diffusion distilled to 16 steps | 6× faster inference, minimal quality loss |
| **kandinskylab/Kandinsky-5.0-T2V-Lite-pretrain-5s-Diffusers** | 5 second Base pretrained model | Research and fine-tuning |
| **kandinskylab/Kandinsky-5.0-T2V-Lite-pretrain-10s-Diffusers** | 10 second Base pretrained model | Research and fine-tuning |

## Usage Examples

### Basic Text-to-Video Generation

#### Pro

**⚠️ Warning!** All Pro models should be run with `pipe.enable_model_cpu_offload()`.

```python
import torch
from diffusers import Kandinsky5T2VPipeline
from diffusers.utils import export_to_video

# Load the pipeline
model_id = "kandinskylab/Kandinsky-5.0-T2V-Pro-sft-5s-Diffusers"
pipe = Kandinsky5T2VPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

pipe.transformer.set_attention_backend("flex")  # <--- Set attention backend to Flex
pipe.enable_model_cpu_offload()  # <--- Enable CPU offloading for single GPU inference
pipe.transformer.compile(mode="max-autotune-no-cudagraphs", dynamic=True)  # <--- Compile with max-autotune-no-cudagraphs

# Generate video
prompt = "A cat and a dog baking a cake together in a kitchen."
negative_prompt = "Static, 2D cartoon, cartoon, 2d animation, paintings, images, worst quality, low quality, ugly, deformed, walking backwards"

output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=768,
    width=1024,
    num_frames=121,  # ~5 seconds at 24fps
    num_inference_steps=50,
    guidance_scale=5.0,
).frames[0]

export_to_video(output, "output.mp4", fps=24, quality=9)
```

#### Lite

```python
import torch
from diffusers import Kandinsky5T2VPipeline
from diffusers.utils import export_to_video

# Load the pipeline
model_id = "kandinskylab/Kandinsky-5.0-T2V-Lite-sft-5s-Diffusers"
pipe = Kandinsky5T2VPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

# Generate video
prompt = "A cat and a dog baking a cake together in a kitchen."
negative_prompt = "Static, 2D cartoon, cartoon, 2d animation, paintings, images, worst quality, low quality, ugly, deformed, walking backwards"

output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=512,
    width=768,
    num_frames=121,  # ~5 seconds at 24fps
    num_inference_steps=50,
    guidance_scale=5.0,
).frames[0]

export_to_video(output, "output.mp4", fps=24, quality=9)
```

### 10 second Models

**⚠️ Warning!** All 10-second models should be used with Flex attention and max-autotune-no-cudagraphs compilation:

```python
import torch
from diffusers import Kandinsky5T2VPipeline
from diffusers.utils import export_to_video

pipe = Kandinsky5T2VPipeline.from_pretrained(
    "kandinskylab/Kandinsky-5.0-T2V-Lite-sft-10s-Diffusers",
    torch_dtype=torch.bfloat16
)
pipe = pipe.to("cuda")

pipe.transformer.set_attention_backend("flex")  # <--- Set attention backend to Flex
pipe.transformer.compile(mode="max-autotune-no-cudagraphs", dynamic=True)  # <--- Compile with max-autotune-no-cudagraphs

prompt = "A cat and a dog baking a cake together in a kitchen."
negative_prompt = "Static, 2D cartoon, cartoon, 2d animation, paintings, images, worst quality, low quality, ugly, deformed, walking backwards"

output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=512,
    width=768,
    num_frames=241,
    num_inference_steps=50,
    guidance_scale=5.0,
).frames[0]

export_to_video(output, "output.mp4", fps=24, quality=9)
```

### Diffusion Distilled model

**⚠️ Warning!** All nocfg and diffusion-distilled models should be run without CFG (`guidance_scale=1.0`):

```python
import torch
from diffusers import Kandinsky5T2VPipeline
from diffusers.utils import export_to_video

model_id = "kandinskylab/Kandinsky-5.0-T2V-Lite-distilled16steps-5s-Diffusers"
pipe = Kandinsky5T2VPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

output = pipe(
    prompt="A beautiful sunset over mountains",
    num_inference_steps=16,  # <--- Model is distilled in 16 steps
    guidance_scale=1.0,  # <--- no CFG
).frames[0]

export_to_video(output, "output.mp4", fps=24, quality=9)
```

### Basic Image-to-Video Generation

**⚠️ Warning!** All Pro models should be run with `pipe.enable_model_cpu_offload()`.

```python
import torch
from diffusers import Kandinsky5I2VPipeline
from diffusers.utils import export_to_video, load_image

# Load the pipeline
model_id = "kandinskylab/Kandinsky-5.0-I2V-Pro-sft-5s-Diffusers"
pipe = Kandinsky5I2VPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

pipe.transformer.set_attention_backend("flex")  # <--- Set attention backend to Flex
pipe.enable_model_cpu_offload()  # <--- Enable CPU offloading for single GPU inference
pipe.transformer.compile(mode="max-autotune-no-cudagraphs", dynamic=True)  # <--- Compile with max-autotune-no-cudagraphs

# Load the input image and resize it to the target resolution
image = load_image(
    "https://huggingface.co/kandinsky-community/kandinsky-3/resolve/main/assets/title.jpg?download=true"
)
height = 896
width = 896
image = image.resize((width, height))

# Generate video
prompt = "A funny furry creature smiles happily and holds a sign that says 'Kandinsky'"
negative_prompt = ""

output = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=height,
    width=width,
    num_frames=121,  # ~5 seconds at 24fps
    num_inference_steps=50,
    guidance_scale=5.0,
).frames[0]

export_to_video(output, "output.mp4", fps=24, quality=9)
```

## Kandinsky 5.0 Pro Side-by-Side evaluation
*(Figures not reproduced here: side-by-side comparisons with Veo 3, with Veo 3 fast, with Wan 2.2 A14B in Text-to-Video mode, and with Wan 2.2 A14B in Image-to-Video mode.)*
## Kandinsky 5.0 Lite Side-by-Side evaluation

The evaluation is based on the expanded prompts from the [Movie Gen benchmark](https://github.com/facebookresearch/MovieGenBench), which are available in the `expanded_prompt` column of the `benchmark/moviegen_bench.csv` file.
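To reuse those prompts for your own generations, the snippet below is a minimal sketch of reading the `expanded_prompt` column with Python's standard `csv` module. The helper name `load_expanded_prompts` is hypothetical (it is not part of Diffusers), and the file path is an assumption: point it at wherever your local copy of `benchmark/moviegen_bench.csv` lives.

```python
import csv
from pathlib import Path


def load_expanded_prompts(csv_path, column="expanded_prompt"):
    """Return the non-empty values of `column` from a CSV file, in file order.

    Hypothetical helper for illustration only; the default column name is
    taken from the text above.
    """
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Fail early with a clear message if the expected column is missing.
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError(f"column {column!r} not found in {csv_path}")
        # Skip rows whose prompt cell is missing or blank.
        return [row[column].strip() for row in reader if (row[column] or "").strip()]
```

Calling `load_expanded_prompts("benchmark/moviegen_bench.csv")` then gives a plain list of strings, each of which can be passed as `prompt` to a pipeline call like the ones shown in the usage examples.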
## Kandinsky 5.0 Lite Distill Side-by-Side evaluation
## Kandinsky5T2VPipeline

[[autodoc]] Kandinsky5T2VPipeline
  - all
  - __call__

## Kandinsky5I2VPipeline

[[autodoc]] Kandinsky5I2VPipeline
  - all
  - __call__

## Citation

```bibtex
@misc{kandinsky2025,
    author = {Alexander Belykh and Alexander Varlamov and Alexey Letunovskiy and Anastasia Aliaskina and Anastasia Maltseva and Anastasiia Kargapoltseva and Andrey Shutkin and Anna Averchenkova and Anna Dmitrienko and Bulat Akhmatov and Denis Dimitrov and Denis Koposov and Denis Parkhomenko and Dmitrii and Ilya Vasiliev and Ivan Kirillov and Julia Agafonova and Kirill Chernyshev and Kormilitsyn Semen and Lev Novitskiy and Maria Kovaleva and Mikhail Mamaev and Mikhailov and Nikita Kiselev and Nikita Osterov and Nikolai Gerasimenko and Nikolai Vaulin and Olga Kim and Olga Vdovchenko and Polina Gavrilova and Polina Mikhailova and Tatiana Nikulina and Viacheslav Vasilev and Vladimir Arkhipkin and Vladimir Korviakov and Vladimir Polovnikov and Yury Kolabushin},
    title = {Kandinsky 5.0: A family of diffusion models for Video & Image generation},
    howpublished = {\url{https://github.com/kandinskylab/Kandinsky-5}},
    year = 2025
}
```

================================================
FILE: docs/source/en/api/pipelines/kandinsky_v22.md
================================================
# Kandinsky 2.2

Kandinsky 2.2 is created by [Arseniy Shakhmatov](https://github.com/cene555), [Anton Razzhigaev](https://github.com/razzant), [Aleksandr Nikolich](https://github.com/AlexWortega), [Vladimir Arkhipkin](https://github.com/oriBetelgeuse), [Igor Pavlov](https://github.com/boomb0om), [Andrey Kuznetsov](https://github.com/kuznetsoffandrey), and [Denis Dimitrov](https://github.com/denndimitrov).

The description from its GitHub page is:

*Kandinsky 2.2 brings substantial improvements upon its predecessor, Kandinsky 2.1, by introducing a new, more powerful image encoder - CLIP-ViT-G and the ControlNet support.
The switch to CLIP-ViT-G as the image encoder significantly increases the model's capability to generate more aesthetic pictures and better understand text, thus enhancing the model's overall performance. The addition of the ControlNet mechanism allows the model to effectively control the process of generating images. This leads to more accurate and visually appealing outputs and opens new possibilities for text-guided image manipulation.* The original codebase can be found at [ai-forever/Kandinsky-2](https://github.com/ai-forever/Kandinsky-2). > [!TIP] > Check out the [Kandinsky Community](https://huggingface.co/kandinsky-community) organization on the Hub for the official model checkpoints for tasks like text-to-image, image-to-image, and inpainting. > [!TIP] > Make sure to check out the schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. 
## KandinskyV22PriorPipeline

[[autodoc]] KandinskyV22PriorPipeline
  - all
  - __call__
  - interpolate

## KandinskyV22Pipeline

[[autodoc]] KandinskyV22Pipeline
  - all
  - __call__

## KandinskyV22CombinedPipeline

[[autodoc]] KandinskyV22CombinedPipeline
  - all
  - __call__

## KandinskyV22ControlnetPipeline

[[autodoc]] KandinskyV22ControlnetPipeline
  - all
  - __call__

## KandinskyV22PriorEmb2EmbPipeline

[[autodoc]] KandinskyV22PriorEmb2EmbPipeline
  - all
  - __call__
  - interpolate

## KandinskyV22Img2ImgPipeline

[[autodoc]] KandinskyV22Img2ImgPipeline
  - all
  - __call__

## KandinskyV22Img2ImgCombinedPipeline

[[autodoc]] KandinskyV22Img2ImgCombinedPipeline
  - all
  - __call__

## KandinskyV22ControlnetImg2ImgPipeline

[[autodoc]] KandinskyV22ControlnetImg2ImgPipeline
  - all
  - __call__

## KandinskyV22InpaintPipeline

[[autodoc]] KandinskyV22InpaintPipeline
  - all
  - __call__

## KandinskyV22InpaintCombinedPipeline

[[autodoc]] KandinskyV22InpaintCombinedPipeline
  - all
  - __call__

================================================
FILE: docs/source/en/api/pipelines/kolors.md
================================================
# Kolors: Effective Training of Diffusion Model for Photorealistic Text-to-Image Synthesis
![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/kolors/kolors_header_collage.png) Kolors is a large-scale text-to-image generation model based on latent diffusion, developed by [the Kuaishou Kolors team](https://github.com/Kwai-Kolors/Kolors). Trained on billions of text-image pairs, Kolors exhibits significant advantages over both open-source and closed-source models in visual quality, complex semantic accuracy, and text rendering for both Chinese and English characters. Furthermore, Kolors supports both Chinese and English inputs, demonstrating strong performance in understanding and generating Chinese-specific content. For more details, please refer to this [technical report](https://github.com/Kwai-Kolors/Kolors/blob/master/imgs/Kolors_paper.pdf). The abstract from the technical report is: *We present Kolors, a latent diffusion model for text-to-image synthesis, characterized by its profound understanding of both English and Chinese, as well as an impressive degree of photorealism. There are three key insights contributing to the development of Kolors. Firstly, unlike large language model T5 used in Imagen and Stable Diffusion 3, Kolors is built upon the General Language Model (GLM), which enhances its comprehension capabilities in both English and Chinese. Moreover, we employ a multimodal large language model to recaption the extensive training dataset for fine-grained text understanding. These strategies significantly improve Kolors’ ability to comprehend intricate semantics, particularly those involving multiple entities, and enable its advanced text rendering capabilities. Secondly, we divide the training of Kolors into two phases: the concept learning phase with broad knowledge and the quality improvement phase with specifically curated high-aesthetic data. Furthermore, we investigate the critical role of the noise schedule and introduce a novel schedule to optimize high-resolution image generation. 
These strategies collectively enhance the visual appeal of the generated high-resolution images. Lastly, we propose a category-balanced benchmark KolorsPrompts, which serves as a guide for the training and evaluation of Kolors. Consequently, even when employing the commonly used U-Net backbone, Kolors has demonstrated remarkable performance in human evaluations, surpassing the existing open-source models and achieving Midjourney-v6 level performance, especially in terms of visual appeal. We will release the code and weights of Kolors at , and hope that it will benefit future research and applications in the visual generation community.*

## Usage Example

```python
import torch

from diffusers import DPMSolverMultistepScheduler, KolorsPipeline

pipe = KolorsPipeline.from_pretrained("Kwai-Kolors/Kolors-diffusers", torch_dtype=torch.float16, variant="fp16")
pipe.to("cuda")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True)

image = pipe(
    prompt='一张瓢虫的照片,微距,变焦,高质量,电影,拿着一个牌子,写着"可图"',
    negative_prompt="",
    guidance_scale=6.5,
    num_inference_steps=25,
).images[0]

image.save("kolors_sample.png")
```

### IP Adapter

Kolors needs a different IP Adapter to work, and it uses [Openai-CLIP-336](https://huggingface.co/openai/clip-vit-large-patch14-336) as an image encoder.

> [!TIP]
> Using an IP Adapter with Kolors requires more than 24GB of VRAM. To use it, we recommend using [`~DiffusionPipeline.enable_model_cpu_offload`] on consumer GPUs.

> [!TIP]
> While Kolors is integrated in Diffusers, you need to load the image encoder from a revision to use the safetensor files. You can still use the main branch of the original repository if you're comfortable loading pickle checkpoints.

```python
import torch
from transformers import CLIPVisionModelWithProjection

from diffusers import DPMSolverMultistepScheduler, KolorsPipeline
from diffusers.utils import load_image

image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "Kwai-Kolors/Kolors-IP-Adapter-Plus",
    subfolder="image_encoder",
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    revision="refs/pr/4",
)

pipe = KolorsPipeline.from_pretrained(
    "Kwai-Kolors/Kolors-diffusers", image_encoder=image_encoder, torch_dtype=torch.float16, variant="fp16"
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, use_karras_sigmas=True)

pipe.load_ip_adapter(
    "Kwai-Kolors/Kolors-IP-Adapter-Plus",
    subfolder="",
    weight_name="ip_adapter_plus_general.safetensors",
    revision="refs/pr/4",
    image_encoder_folder=None,
)
pipe.enable_model_cpu_offload()

ipa_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/kolors/cat_square.png")

image = pipe(
    prompt="best quality, high quality",
    negative_prompt="",
    guidance_scale=6.5,
    num_inference_steps=25,
    ip_adapter_image=ipa_image,
).images[0]

image.save("kolors_ipa_sample.png")
```

## KolorsPipeline

[[autodoc]] KolorsPipeline
  - all
  - __call__

## KolorsImg2ImgPipeline

[[autodoc]] KolorsImg2ImgPipeline
  - all
  - __call__

================================================
FILE: docs/source/en/api/pipelines/latent_consistency_models.md
================================================
# Latent Consistency Models
Latent Consistency Models (LCMs) were proposed in [Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference](https://huggingface.co/papers/2310.04378) by Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. The abstract of the paper is as follows: *Latent Diffusion models (LDMs) have achieved remarkable results in synthesizing high-resolution images. However, the iterative sampling process is computationally intensive and leads to slow generation. Inspired by Consistency Models (song et al.), we propose Latent Consistency Models (LCMs), enabling swift inference with minimal steps on any pre-trained LDMs, including Stable Diffusion (rombach et al). Viewing the guided reverse diffusion process as solving an augmented probability flow ODE (PF-ODE), LCMs are designed to directly predict the solution of such ODE in latent space, mitigating the need for numerous iterations and allowing rapid, high-fidelity sampling. Efficiently distilled from pre-trained classifier-free guided diffusion models, a high-quality 768 x 768 2~4-step LCM takes only 32 A100 GPU hours for training. Furthermore, we introduce Latent Consistency Fine-tuning (LCF), a novel method that is tailored for fine-tuning LCMs on customized image datasets. Evaluation on the LAION-5B-Aesthetics dataset demonstrates that LCMs achieve state-of-the-art text-to-image generation performance with few-step inference. Project Page: [this https URL](https://latent-consistency-models.github.io/).* A demo for the [SimianLuo/LCM_Dreamshaper_v7](https://huggingface.co/SimianLuo/LCM_Dreamshaper_v7) checkpoint can be found [here](https://huggingface.co/spaces/SimianLuo/Latent_Consistency_Model). The pipelines were contributed by [luosiallen](https://luosiallen.github.io/), [nagolinc](https://github.com/nagolinc), and [dg845](https://github.com/dg845). 
## LatentConsistencyModelPipeline [[autodoc]] LatentConsistencyModelPipeline - all - __call__ - enable_freeu - disable_freeu - enable_vae_slicing - disable_vae_slicing - enable_vae_tiling - disable_vae_tiling ## LatentConsistencyModelImg2ImgPipeline [[autodoc]] LatentConsistencyModelImg2ImgPipeline - all - __call__ - enable_freeu - disable_freeu - enable_vae_slicing - disable_vae_slicing - enable_vae_tiling - disable_vae_tiling ## StableDiffusionPipelineOutput [[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/latent_diffusion.md ================================================ # Latent Diffusion Latent Diffusion was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://huggingface.co/papers/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer. The abstract from the paper is: *By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. 
By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.* The original codebase can be found at [CompVis/latent-diffusion](https://github.com/CompVis/latent-diffusion). > [!TIP] > Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. ## LDMTextToImagePipeline [[autodoc]] LDMTextToImagePipeline - all - __call__ ## LDMSuperResolutionPipeline [[autodoc]] LDMSuperResolutionPipeline - all - __call__ ## ImagePipelineOutput [[autodoc]] pipelines.ImagePipelineOutput ================================================ FILE: docs/source/en/api/pipelines/latte.md ================================================ # Latte ![latte text-to-video](https://github.com/Vchitect/Latte/blob/52bc0029899babbd6e9250384c83d8ed2670ff7a/visuals/latte.gif?raw=true) [Latte: Latent Diffusion Transformer for Video Generation](https://huggingface.co/papers/2401.03048) from Monash University, Shanghai AI Lab, Nanjing University, and Nanyang Technological University. The abstract from the paper is: *We propose a novel Latent Diffusion Transformer, namely Latte, for video generation. 
Latte first extracts spatio-temporal tokens from input videos and then adopts a series of Transformer blocks to model video distribution in the latent space. In order to model a substantial number of tokens extracted from videos, four efficient variants are introduced from the perspective of decomposing the spatial and temporal dimensions of input videos. To improve the quality of generated videos, we determine the best practices of Latte through rigorous experimental analysis, including video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Our comprehensive evaluation demonstrates that Latte achieves state-of-the-art performance across four standard video generation datasets, i.e., FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In addition, we extend Latte to text-to-video generation (T2V) task, where Latte achieves comparable results compared to recent T2V models. We strongly believe that Latte provides valuable insights for future research on incorporating Transformers into diffusion models for video generation.* **Highlights**: Latte is a latent diffusion transformer proposed as a backbone for modeling different modalities (trained for text-to-video generation here). It achieves state-of-the-art performance across four standard video benchmarks - [FaceForensics](https://huggingface.co/papers/1803.09179), [SkyTimelapse](https://huggingface.co/papers/1709.07592), [UCF101](https://huggingface.co/papers/1212.0402) and [Taichi-HD](https://huggingface.co/papers/2003.00196). To prepare and download the datasets for evaluation, please refer to [this https URL](https://github.com/Vchitect/Latte/blob/main/docs/datasets_evaluation.md). This pipeline was contributed by [maxin-cn](https://github.com/maxin-cn). The original codebase can be found [here](https://github.com/Vchitect/Latte). The original weights can be found under [hf.co/maxin-cn](https://huggingface.co/maxin-cn). 
> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

### Inference

Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency.

First, load the pipeline:

```python
import torch
from diffusers import LattePipeline

pipeline = LattePipeline.from_pretrained(
    "maxin-cn/Latte-1", torch_dtype=torch.float16
).to("cuda")
```

Then change the memory layout of the pipeline's `transformer` and `vae` components to `torch.channels_last`:

```python
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.vae.to(memory_format=torch.channels_last)
```

Finally, compile the components and run inference:

```python
pipeline.transformer = torch.compile(pipeline.transformer)
pipeline.vae.decode = torch.compile(pipeline.vae.decode)

video = pipeline(prompt="A dog wearing sunglasses floating in space, surreal, nebulae in background").frames[0]
```

The [benchmark](https://gist.github.com/a-r-r-o-w/4e1694ca46374793c0361d740a99ff19) results on an 80GB A100 machine are:

```
Without torch.compile(): Average inference time: 16.246 seconds.
With torch.compile(): Average inference time: 14.573 seconds.
```

## Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, the impact of quantization on video quality varies from one video model to another.

Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`LattePipeline`] for inference with bitsandbytes.
```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, LatteTransformer3DModel, LattePipeline
from diffusers.utils import export_to_gif
from transformers import BitsAndBytesConfig, T5EncoderModel

# Quantize the T5 text encoder to 8-bit with the transformers config
quant_config = BitsAndBytesConfig(load_in_8bit=True)
text_encoder_8bit = T5EncoderModel.from_pretrained(
    "maxin-cn/Latte-1",
    subfolder="text_encoder",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

# Quantize the transformer to 8-bit with the diffusers config
quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = LatteTransformer3DModel.from_pretrained(
    "maxin-cn/Latte-1",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = LattePipeline.from_pretrained(
    "maxin-cn/Latte-1",
    text_encoder=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "A small cactus with a happy face in the Sahara desert."
video = pipeline(prompt).frames[0]
export_to_gif(video, "latte.gif")
```

## LattePipeline

[[autodoc]] LattePipeline
  - all
  - __call__

================================================
FILE: docs/source/en/api/pipelines/ledits_pp.md
================================================
# LEDITS++
LEDITS++ was proposed in [LEDITS++: Limitless Image Editing using Text-to-Image Models](https://huggingface.co/papers/2311.16711) by Manuel Brack, Felix Friedrich, Katharina Kornmeier, Linoy Tsaban, Patrick Schramowski, Kristian Kersting, Apolinário Passos. The abstract from the paper is: *Text-to-image diffusion models have recently received increasing interest for their astonishing ability to produce high-fidelity images from solely text inputs. Subsequent research efforts aim to exploit and apply their capabilities to real image editing. However, existing image-to-image methods are often inefficient, imprecise, and of limited versatility. They either require time-consuming fine-tuning, deviate unnecessarily strongly from the input image, and/or lack support for multiple, simultaneous edits. To address these issues, we introduce LEDITS++, an efficient yet versatile and precise textual image manipulation technique. LEDITS++'s novel inversion approach requires no tuning nor optimization and produces high-fidelity results with a few diffusion steps. Second, our methodology supports multiple simultaneous edits and is architecture-agnostic. Third, we use a novel implicit masking technique that limits changes to relevant image regions. We propose the novel TEdBench++ benchmark as part of our exhaustive evaluation. Our results demonstrate the capabilities of LEDITS++ and its improvements over previous methods. The project page is available at https://leditsplusplus-project.static.hf.space .* > [!TIP] > You can find additional information about LEDITS++ on the [project page](https://leditsplusplus-project.static.hf.space/index.html) and try it out in a [demo](https://huggingface.co/spaces/editing-images/leditsplusplus). > [!WARNING] > Due to some backward compatibility issues with the current diffusers implementation of [`~schedulers.DPMSolverMultistepScheduler`] this implementation of LEdits++ can no longer guarantee perfect inversion. 
> This issue is unlikely to have any noticeable effects on applied use-cases. However, we provide an alternative implementation that guarantees perfect inversion in a dedicated [GitHub repo](https://github.com/ml-research/ledits_pp). We provide two distinct pipelines based on different pre-trained models. ## LEditsPPPipelineStableDiffusion [[autodoc]] pipelines.ledits_pp.LEditsPPPipelineStableDiffusion - all - __call__ - invert ## LEditsPPPipelineStableDiffusionXL [[autodoc]] pipelines.ledits_pp.LEditsPPPipelineStableDiffusionXL - all - __call__ - invert ## LEditsPPDiffusionPipelineOutput [[autodoc]] pipelines.ledits_pp.pipeline_output.LEditsPPDiffusionPipelineOutput - all ## LEditsPPInversionPipelineOutput [[autodoc]] pipelines.ledits_pp.pipeline_output.LEditsPPInversionPipelineOutput - all ================================================ FILE: docs/source/en/api/pipelines/longcat_image.md ================================================ # LongCat-Image
We introduce LongCat-Image, a pioneering open-source and bilingual (Chinese-English) foundation model for image generation, designed to address core challenges in multilingual text rendering, photorealism, deployment efficiency, and developer accessibility prevalent in current leading models. ### Key Features - 🌟 **Exceptional Efficiency and Performance**: With only **6B parameters**, LongCat-Image surpasses numerous open-source models that are several times larger across multiple benchmarks, demonstrating the immense potential of efficient model design. - 🌟 **Superior Editing Performance**: LongCat-Image-Edit model achieves state-of-the-art performance among open-source models, delivering leading instruction-following and image quality with superior visual consistency. - 🌟 **Powerful Chinese Text Rendering**: LongCat-Image demonstrates superior accuracy and stability in rendering common Chinese characters compared to existing SOTA open-source models and achieves industry-leading coverage of the Chinese dictionary. - 🌟 **Remarkable Photorealism**: Through an innovative data strategy and training framework, LongCat-Image achieves remarkable photorealism in generated images. - 🌟 **Comprehensive Open-Source Ecosystem**: We provide a complete toolchain, from intermediate checkpoints to full training code, significantly lowering the barrier for further research and development. 
For more details, please refer to the comprehensive [***LongCat-Image Technical Report***](https://arxiv.org/abs/2412.11963).

## Usage Example

```py
import torch
from diffusers import LongCatImagePipeline

pipe = LongCatImagePipeline.from_pretrained("meituan-longcat/LongCat-Image", torch_dtype=torch.bfloat16)
pipe.to("cuda")
# Alternatively, offload idle components to the CPU to save GPU memory:
# pipe.enable_model_cpu_offload()

prompt = '一个年轻的亚裔女性,身穿黄色针织衫,搭配白色项链。她的双手放在膝盖上,表情恬静。背景是一堵粗糙的砖墙,午后的阳光温暖地洒在她身上,营造出一种宁静而温馨的氛围。镜头采用中距离视角,突出她的神态和服饰的细节。光线柔和地打在她的脸上,强调她的五官和饰品的质感,增加画面的层次感与亲和力。整个画面构图简洁,砖墙的纹理与阳光的光影效果相得益彰,突显出人物的优雅与从容。'

image = pipe(
    prompt,
    height=768,
    width=1344,
    guidance_scale=4.0,
    num_inference_steps=50,
    num_images_per_prompt=1,
    generator=torch.Generator("cpu").manual_seed(43),
    enable_cfg_renorm=True,
    enable_prompt_rewrite=True,
).images[0]
image.save("./longcat_image_t2i_example.png")
```

This pipeline was contributed by the LongCat-Image Team. The original codebase can be found [here](https://github.com/meituan-longcat/LongCat-Image).

Available models:
| Models | Type | Description | Download Link |
|---|---|---|---|
| LongCat‑Image | Text‑to‑Image | Final Release. The standard model for out‑of‑the‑box inference. | 🤗 Huggingface |
| LongCat‑Image‑Dev | Text‑to‑Image | Development. Mid-training checkpoint, suitable for fine-tuning. | 🤗 Huggingface |
| LongCat‑Image‑Edit | Image Editing | Specialized model for image editing. | 🤗 Huggingface |
## LongCatImagePipeline [[autodoc]] LongCatImagePipeline - all - __call__ ## LongCatImagePipelineOutput [[autodoc]] pipelines.longcat_image.pipeline_output.LongCatImagePipelineOutput ================================================ FILE: docs/source/en/api/pipelines/ltx2.md ================================================ # LTX-2
LTX-2 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.

You can find all the original LTX-Video checkpoints under the [Lightricks](https://huggingface.co/Lightricks) organization.

The original codebase for LTX-2 can be found [here](https://github.com/Lightricks/LTX-2).

## Two-stages Generation

This is the recommended pipeline for production-quality generation. It is composed of two stages:

- Stage 1: Generate a video at the target resolution using diffusion sampling with classifier-free guidance (CFG). This stage produces a coherent low-noise video sequence that respects the text/image conditioning.
- Stage 2: Upsample the Stage 1 output by 2x and refine details using a distilled LoRA model to improve fidelity and visual quality. Stage 2 may apply lighter CFG to preserve the structure from Stage 1 while enhancing texture and sharpness.

Sample usage of the two-stage text-to-video pipeline:

```py
import torch
from diffusers import FlowMatchEulerDiscreteScheduler
from diffusers.pipelines.ltx2 import LTX2Pipeline, LTX2LatentUpsamplePipeline
from diffusers.pipelines.ltx2.latent_upsampler import LTX2LatentUpsamplerModel
from diffusers.pipelines.ltx2.utils import STAGE_2_DISTILLED_SIGMA_VALUES
from diffusers.pipelines.ltx2.export_utils import encode_video

device = "cuda:0"
width = 768
height = 512

pipe = LTX2Pipeline.from_pretrained(
    "Lightricks/LTX-2", torch_dtype=torch.bfloat16
)
pipe.enable_sequential_cpu_offload(device=device)

prompt = "A beautiful sunset over the ocean"
negative_prompt = "shaky, glitchy, low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly, transition, static."

# Stage 1 default (non-distilled) inference frame_rate = 24.0 video_latent, audio_latent = pipe( prompt=prompt, negative_prompt=negative_prompt, width=width, height=height, num_frames=121, frame_rate=frame_rate, num_inference_steps=40, sigmas=None, guidance_scale=4.0, output_type="latent", return_dict=False, ) latent_upsampler = LTX2LatentUpsamplerModel.from_pretrained( "Lightricks/LTX-2", subfolder="latent_upsampler", torch_dtype=torch.bfloat16, ) upsample_pipe = LTX2LatentUpsamplePipeline(vae=pipe.vae, latent_upsampler=latent_upsampler) upsample_pipe.enable_model_cpu_offload(device=device) upscaled_video_latent = upsample_pipe( latents=video_latent, output_type="latent", return_dict=False, )[0] # Load Stage 2 distilled LoRA pipe.load_lora_weights( "Lightricks/LTX-2", adapter_name="stage_2_distilled", weight_name="ltx-2-19b-distilled-lora-384.safetensors" ) pipe.set_adapters("stage_2_distilled", 1.0) # VAE tiling is usually necessary to avoid OOM error when VAE decoding pipe.vae.enable_tiling() # Change scheduler to use Stage 2 distilled sigmas as is new_scheduler = FlowMatchEulerDiscreteScheduler.from_config( pipe.scheduler.config, use_dynamic_shifting=False, shift_terminal=None ) pipe.scheduler = new_scheduler # Stage 2 inference with distilled LoRA and sigmas video, audio = pipe( latents=upscaled_video_latent, audio_latents=audio_latent, prompt=prompt, negative_prompt=negative_prompt, num_inference_steps=3, noise_scale=STAGE_2_DISTILLED_SIGMA_VALUES[0], # renoise with first sigma value https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-pipelines/src/ltx_pipelines/ti2vid_two_stages.py#L218 sigmas=STAGE_2_DISTILLED_SIGMA_VALUES, guidance_scale=1.0, output_type="np", return_dict=False, ) encode_video( video[0], fps=frame_rate, audio=audio[0].float().cpu(), audio_sample_rate=pipe.vocoder.config.output_sampling_rate, output_path="ltx2_lora_distilled_sample.mp4", ) ``` ## Distilled checkpoint generation Fastest two-stages generation pipeline using a distilled 
checkpoint. ```py import torch from diffusers.pipelines.ltx2 import LTX2Pipeline, LTX2LatentUpsamplePipeline from diffusers.pipelines.ltx2.latent_upsampler import LTX2LatentUpsamplerModel from diffusers.pipelines.ltx2.utils import DISTILLED_SIGMA_VALUES, STAGE_2_DISTILLED_SIGMA_VALUES from diffusers.pipelines.ltx2.export_utils import encode_video device = "cuda" width = 768 height = 512 random_seed = 42 generator = torch.Generator(device).manual_seed(random_seed) model_path = "rootonchair/LTX-2-19b-distilled" pipe = LTX2Pipeline.from_pretrained( model_path, torch_dtype=torch.bfloat16 ) pipe.enable_sequential_cpu_offload(device=device) prompt = "A beautiful sunset over the ocean" negative_prompt = "shaky, glitchy, low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly, transition, static." frame_rate = 24.0 video_latent, audio_latent = pipe( prompt=prompt, negative_prompt=negative_prompt, width=width, height=height, num_frames=121, frame_rate=frame_rate, num_inference_steps=8, sigmas=DISTILLED_SIGMA_VALUES, guidance_scale=1.0, generator=generator, output_type="latent", return_dict=False, ) latent_upsampler = LTX2LatentUpsamplerModel.from_pretrained( model_path, subfolder="latent_upsampler", torch_dtype=torch.bfloat16, ) upsample_pipe = LTX2LatentUpsamplePipeline(vae=pipe.vae, latent_upsampler=latent_upsampler) upsample_pipe.enable_model_cpu_offload(device=device) upscaled_video_latent = upsample_pipe( latents=video_latent, output_type="latent", return_dict=False, )[0] video, audio = pipe( latents=upscaled_video_latent, audio_latents=audio_latent, prompt=prompt, negative_prompt=negative_prompt, num_inference_steps=3, noise_scale=STAGE_2_DISTILLED_SIGMA_VALUES[0], # renoise with first sigma value https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-pipelines/src/ltx_pipelines/distilled.py#L178 sigmas=STAGE_2_DISTILLED_SIGMA_VALUES, generator=generator, guidance_scale=1.0, 
output_type="np", return_dict=False, ) encode_video( video[0], fps=frame_rate, audio=audio[0].float().cpu(), audio_sample_rate=pipe.vocoder.config.output_sampling_rate, output_path="ltx2_distilled_sample.mp4", ) ``` ## Condition Pipeline Generation You can use `LTX2ConditionPipeline` to specify image and/or video conditions at arbitrary latent indices. For example, we can specify both a first-frame and last-frame condition to perform first-last-frame-to-video (FLF2V) generation: ```py import torch from diffusers import LTX2ConditionPipeline, LTX2LatentUpsamplePipeline from diffusers.pipelines.ltx2.latent_upsampler import LTX2LatentUpsamplerModel from diffusers.pipelines.ltx2.pipeline_ltx2_condition import LTX2VideoCondition from diffusers.pipelines.ltx2.utils import DISTILLED_SIGMA_VALUES, STAGE_2_DISTILLED_SIGMA_VALUES from diffusers.pipelines.ltx2.export_utils import encode_video from diffusers.utils import load_image device = "cuda" width = 768 height = 512 random_seed = 42 generator = torch.Generator(device).manual_seed(random_seed) model_path = "rootonchair/LTX-2-19b-distilled" pipe = LTX2ConditionPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16) pipe.enable_sequential_cpu_offload(device=device) pipe.vae.enable_tiling() prompt = ( "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are " "delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright " "sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, " "low-angle perspective." 
) first_image = load_image( "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png", ) last_image = load_image( "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png", ) first_cond = LTX2VideoCondition(frames=first_image, index=0, strength=1.0) last_cond = LTX2VideoCondition(frames=last_image, index=-1, strength=1.0) conditions = [first_cond, last_cond] frame_rate = 24.0 video_latent, audio_latent = pipe( conditions=conditions, prompt=prompt, width=width, height=height, num_frames=121, frame_rate=frame_rate, num_inference_steps=8, sigmas=DISTILLED_SIGMA_VALUES, guidance_scale=1.0, generator=generator, output_type="latent", return_dict=False, ) latent_upsampler = LTX2LatentUpsamplerModel.from_pretrained( model_path, subfolder="latent_upsampler", torch_dtype=torch.bfloat16, ) upsample_pipe = LTX2LatentUpsamplePipeline(vae=pipe.vae, latent_upsampler=latent_upsampler) upsample_pipe.enable_model_cpu_offload(device=device) upscaled_video_latent = upsample_pipe( latents=video_latent, output_type="latent", return_dict=False, )[0] video, audio = pipe( latents=upscaled_video_latent, audio_latents=audio_latent, prompt=prompt, width=width * 2, height=height * 2, num_inference_steps=3, sigmas=STAGE_2_DISTILLED_SIGMA_VALUES, generator=generator, guidance_scale=1.0, output_type="np", return_dict=False, ) encode_video( video[0], fps=frame_rate, audio=audio[0].float().cpu(), audio_sample_rate=pipe.vocoder.config.output_sampling_rate, output_path="ltx2_distilled_flf2v.mp4", ) ``` You can use both image and video conditions: ```py import torch from diffusers import LTX2ConditionPipeline from diffusers.pipelines.ltx2.pipeline_ltx2_condition import LTX2VideoCondition from diffusers.pipelines.ltx2.export_utils import encode_video from diffusers.utils import load_image, load_video device = "cuda" width = 768 height = 512 random_seed = 42 generator = 
torch.Generator(device).manual_seed(random_seed) model_path = "rootonchair/LTX-2-19b-distilled" pipe = LTX2ConditionPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16) pipe.enable_sequential_cpu_offload(device=device) pipe.vae.enable_tiling() prompt = ( "The video depicts a long, straight highway stretching into the distance, flanked by metal guardrails. The road is " "divided into multiple lanes, with a few vehicles visible in the far distance. The surrounding landscape features " "dry, grassy fields on one side and rolling hills on the other. The sky is mostly clear with a few scattered " "clouds, suggesting a bright, sunny day. And then the camera switch to a winding mountain road covered in snow, " "with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The " "landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the " "solitude and beauty of a winter drive through a mountainous region." 
) negative_prompt = ( "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, " "grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, " "deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, " "wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of " "field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent " "lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny " "valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, " "mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, " "off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward " "pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, " "inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts." 
) cond_video = load_video( "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input-vid.mp4" ) cond_image = load_image( "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input.jpg" ) video_cond = LTX2VideoCondition(frames=cond_video, index=0, strength=1.0) image_cond = LTX2VideoCondition(frames=cond_image, index=8, strength=1.0) conditions = [video_cond, image_cond] frame_rate = 24.0 video, audio = pipe( conditions=conditions, prompt=prompt, negative_prompt=negative_prompt, width=width, height=height, num_frames=121, frame_rate=frame_rate, num_inference_steps=40, guidance_scale=4.0, generator=generator, output_type="np", return_dict=False, ) encode_video( video[0], fps=frame_rate, audio=audio[0].float().cpu(), audio_sample_rate=pipe.vocoder.config.output_sampling_rate, output_path="ltx2_cond_video.mp4", ) ``` Because the conditioning is done via latent frames, the 8 data space frames corresponding to the specified latent frame for an image condition will tend to be static. ## LTX2Pipeline [[autodoc]] LTX2Pipeline - all - __call__ ## LTX2ImageToVideoPipeline [[autodoc]] LTX2ImageToVideoPipeline - all - __call__ ## LTX2ConditionPipeline [[autodoc]] LTX2ConditionPipeline - all - __call__ ## LTX2LatentUpsamplePipeline [[autodoc]] LTX2LatentUpsamplePipeline - all - __call__ ## LTX2PipelineOutput [[autodoc]] pipelines.ltx2.pipeline_output.LTX2PipelineOutput ================================================ FILE: docs/source/en/api/pipelines/ltx_video.md ================================================ # LTX-Video [LTX-Video](https://huggingface.co/Lightricks/LTX-Video) is a diffusion transformer designed for fast and real-time generation of high-resolution videos from text and images. The main feature of LTX-Video is the Video-VAE. 
The Video-VAE has a higher pixel-to-latent compression ratio (1:192), which enables more efficient video data processing and faster generation. To prevent finer details from being lost at this compression ratio, the Video-VAE decoder performs the latent-to-pixel conversion *and* the last denoising step.

You can find all the original LTX-Video checkpoints under the [Lightricks](https://huggingface.co/Lightricks) organization.

> [!TIP]
> Click on the LTX-Video models in the right sidebar for more examples of other video generation tasks.

The example below demonstrates how to generate a video optimized for memory or inference speed.

Refer to the [Reduce memory usage](../../optimization/memory) guide for more details about the various memory saving techniques.

The LTX-Video model below requires ~10GB of VRAM.

```py
import torch
from diffusers import LTXPipeline, AutoModel
from diffusers.hooks import apply_group_offloading
from diffusers.utils import export_to_video

# fp8 layerwise weight-casting
transformer = AutoModel.from_pretrained(
    "Lightricks/LTX-Video",
    subfolder="transformer",
    torch_dtype=torch.bfloat16
)
transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16
)

pipeline = LTXPipeline.from_pretrained("Lightricks/LTX-Video", transformer=transformer, torch_dtype=torch.bfloat16)

# group-offloading
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")
pipeline.transformer.enable_group_offload(onload_device=onload_device, offload_device=offload_device, offload_type="leaf_level", use_stream=True)
apply_group_offloading(pipeline.text_encoder, onload_device=onload_device, offload_type="block_level", num_blocks_per_group=2)
apply_group_offloading(pipeline.vae, onload_device=onload_device, offload_type="leaf_level")

prompt = """
A woman with long brown hair and light skin smiles at another woman with long blonde hair.
The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage """ negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted" video = pipeline( prompt=prompt, negative_prompt=negative_prompt, width=768, height=512, num_frames=161, decode_timestep=0.03, decode_noise_scale=0.025, num_inference_steps=50, ).frames[0] export_to_video(video, "output.mp4", fps=24) ``` [Compilation](../../optimization/fp16#torchcompile) is slow the first time but subsequent calls to the pipeline are faster. [Caching](../../optimization/cache) may also speed up inference by storing and reusing intermediate outputs. ```py import torch from diffusers import LTXPipeline from diffusers.utils import export_to_video pipeline = LTXPipeline.from_pretrained( "Lightricks/LTX-Video", torch_dtype=torch.bfloat16 ) # torch.compile pipeline.transformer.to(memory_format=torch.channels_last) pipeline.transformer = torch.compile( pipeline.transformer, mode="max-autotune", fullgraph=True ) prompt = """ A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. 
The scene appears to be real-life footage """ negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted" video = pipeline( prompt=prompt, negative_prompt=negative_prompt, width=768, height=512, num_frames=161, decode_timestep=0.03, decode_noise_scale=0.025, num_inference_steps=50, ).frames[0] export_to_video(video, "output.mp4", fps=24) ``` ## Notes - Refer to the following recommended settings for generation from the [LTX-Video](https://github.com/Lightricks/LTX-Video) repository. - The recommended dtype for the transformer, VAE, and text encoder is `torch.bfloat16`. The VAE and text encoder can also be `torch.float32` or `torch.float16`. - For guidance-distilled variants of LTX-Video, set `guidance_scale` to `1.0`. The `guidance_scale` for any other model should be set higher, like `5.0`, for good generation quality. - For timestep-aware VAE variants (LTX-Video 0.9.1 and above), set `decode_timestep` to `0.05` and `image_cond_noise_scale` to `0.025`. - For variants that support interpolation between multiple conditioning images and videos (LTX-Video 0.9.5 and above), use similar images and videos for the best results. Divergence from the conditioning inputs may lead to abrupt transitions in the generated video. - LTX-Video 0.9.7 includes a spatial latent upscaler and a 13B parameter transformer. During inference, a low resolution video is quickly generated first and then upscaled and refined.
Show example code

```py
import torch
from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
from diffusers.utils import export_to_video, load_video

pipeline = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-dev", torch_dtype=torch.bfloat16)
pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipeline.vae, torch_dtype=torch.bfloat16)
pipeline.to("cuda")
pipe_upsample.to("cuda")
pipeline.vae.enable_tiling()

def round_to_nearest_resolution_acceptable_by_vae(height, width):
    # Height and width must be divisible by the VAE's spatial compression ratio
    height = height - (height % pipeline.vae_spatial_compression_ratio)
    width = width - (width % pipeline.vae_spatial_compression_ratio)
    return height, width

video = load_video(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input-vid.mp4"
)[:21]  # only use the first 21 frames as conditioning
condition1 = LTXVideoCondition(video=video, frame_index=0)

prompt = """
The video depicts a winding mountain road covered in snow, with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the solitude and beauty of a winter drive through a mountainous region.
"""
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
expected_height, expected_width = 768, 1152
downscale_factor = 2 / 3
num_frames = 161

# 1. Generate video at smaller resolution
# Text-only conditioning is also supported without the need to pass `conditions`
downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width)
latents = pipeline(
    conditions=[condition1],
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=downscaled_width,
    height=downscaled_height,
    num_frames=num_frames,
    num_inference_steps=30,
    decode_timestep=0.05,
    decode_noise_scale=0.025,
    image_cond_noise_scale=0.0,
    guidance_scale=5.0,
    guidance_rescale=0.7,
    generator=torch.Generator().manual_seed(0),
    output_type="latent",
).frames

# 2. Upscale generated video using latent upsampler with fewer inference steps
# The available latent upsampler upscales the height/width by 2x
upscaled_height, upscaled_width = downscaled_height * 2, downscaled_width * 2
upscaled_latents = pipe_upsample(
    latents=latents,
    output_type="latent"
).frames

# 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
video = pipeline(
    conditions=[condition1],
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=upscaled_width,
    height=upscaled_height,
    num_frames=num_frames,
    denoise_strength=0.4,  # Effectively, 4 inference steps out of 10
    num_inference_steps=10,
    latents=upscaled_latents,
    decode_timestep=0.05,
    decode_noise_scale=0.025,
    image_cond_noise_scale=0.0,
    guidance_scale=5.0,
    guidance_rescale=0.7,
    generator=torch.Generator().manual_seed(0),
    output_type="pil",
).frames[0]

# 4. Downscale the video to the expected resolution
video = [frame.resize((expected_width, expected_height)) for frame in video]

export_to_video(video, "output.mp4", fps=24)
```
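The resolution arithmetic in the example above can be checked without loading any model. The following is a minimal sketch, assuming a spatial compression ratio of 32 and a 2x latent upsampler. `plan_resolutions` and `round_down_to_multiple` are hypothetical helpers, not part of Diffusers, and in real code the ratio should be read from `pipeline.vae_spatial_compression_ratio` rather than hard-coded.

```python
def round_down_to_multiple(value: int, multiple: int) -> int:
    """Round `value` down to the nearest multiple of `multiple`."""
    return value - (value % multiple)


def plan_resolutions(expected_height, expected_width, downscale_factor=2 / 3, spatial_ratio=32, upscale=2):
    """Return ((low_h, low_w), (up_h, up_w)) for the two-stage workflow.

    `spatial_ratio=32` and `upscale=2` are assumptions for illustration.
    """
    low_h = round_down_to_multiple(int(expected_height * downscale_factor), spatial_ratio)
    low_w = round_down_to_multiple(int(expected_width * downscale_factor), spatial_ratio)
    return (low_h, low_w), (low_h * upscale, low_w * upscale)


low_res, upscaled_res = plan_resolutions(768, 1152)
print("first pass:", low_res, "after upsampling:", upscaled_res)
```

Under these assumptions the upsampled resolution is larger than the requested one, which is why the example ends by resizing the frames back down to the expected size.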
- The LTX-Video 0.9.7 distilled model is guidance- and timestep-distilled to speed up generation. Set `guidance_scale` to `1.0` and `num_inference_steps` to a value between `4` and `10` for good generation quality. You should also use the following custom timesteps for the best results.
  - Base model inference to prepare for upscaling: `[1000, 993, 987, 981, 975, 909, 725, 0.03]`.
  - Upscaling: `[1000, 909, 725, 421, 0]`.
Show example code ```py import torch from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition from diffusers.utils import export_to_video, load_video pipeline = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-distilled", torch_dtype=torch.bfloat16) pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipeline.vae, torch_dtype=torch.bfloat16) pipeline.to("cuda") pipe_upsample.to("cuda") pipeline.vae.enable_tiling() def round_to_nearest_resolution_acceptable_by_vae(height, width): height = height - (height % pipeline.vae_spatial_compression_ratio) width = width - (width % pipeline.vae_spatial_compression_ratio) return height, width prompt = """ artistic anatomical 3d render, utlra quality, human half full male body with transparent skin revealing structure instead of organs, muscular, intricate creative patterns, monochromatic with backlighting, lightning mesh, scientific concept art, blending biology with botany, surreal and ethereal quality, unreal engine 5, ray tracing, ultra realistic, 16K UHD, rich details. camera zooms out in a rotating fashion """ negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted" expected_height, expected_width = 768, 1152 downscale_factor = 2 / 3 num_frames = 161 # 1. 
Generate video at smaller resolution downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor) downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width) latents = pipeline( prompt=prompt, negative_prompt=negative_prompt, width=downscaled_width, height=downscaled_height, num_frames=num_frames, timesteps=[1000, 993, 987, 981, 975, 909, 725, 0.03], decode_timestep=0.05, decode_noise_scale=0.025, image_cond_noise_scale=0.0, guidance_scale=1.0, guidance_rescale=0.7, generator=torch.Generator().manual_seed(0), output_type="latent", ).frames # 2. Upscale generated video using latent upsampler with fewer inference steps # The available latent upsampler upscales the height/width by 2x upscaled_height, upscaled_width = downscaled_height * 2, downscaled_width * 2 upscaled_latents = pipe_upsample( latents=latents, adain_factor=1.0, output_type="latent" ).frames # 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended) video = pipeline( prompt=prompt, negative_prompt=negative_prompt, width=upscaled_width, height=upscaled_height, num_frames=num_frames, denoise_strength=0.999, # Effectively, 4 inference steps out of 5 timesteps=[1000, 909, 725, 421, 0], latents=upscaled_latents, decode_timestep=0.05, decode_noise_scale=0.025, image_cond_noise_scale=0.0, guidance_scale=1.0, guidance_rescale=0.7, generator=torch.Generator().manual_seed(0), output_type="pil", ).frames[0] # 4. Downscale the video to the expected resolution video = [frame.resize((expected_width, expected_height)) for frame in video] export_to_video(video, "output.mp4", fps=24) ```
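The `denoise_strength` comments in the examples above ("4 inference steps out of 10", "4 inference steps out of 5") follow from simple arithmetic. The sketch below assumes the pipeline trims the schedule the way Diffusers image-to-image pipelines commonly do, keeping the last `int(num_steps * strength)` entries. It is an illustration of that assumption, not a copy of the pipeline source, and `remaining_timesteps` is a hypothetical helper.

```python
def remaining_timesteps(timesteps, denoise_strength):
    """Return the tail of `timesteps` that is actually run for a given strength."""
    num_steps = len(timesteps)
    kept = min(int(num_steps * denoise_strength), num_steps)
    start = max(num_steps - kept, 0)
    return timesteps[start:]


# Custom upscaling schedule from the distilled example
upscale_schedule = [1000, 909, 725, 421, 0]
print(remaining_timesteps(upscale_schedule, 0.999))

# Ten evenly indexed steps with denoise_strength=0.4, as in the non-distilled example
print(len(remaining_timesteps(list(range(10)), 0.4)))
```

Under this assumption, a strength just below `1.0` drops only the first, noisiest timestep, so the refinement pass starts from the upscaled latents rather than from pure noise.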
- LTX-Video 0.9.8 distilled model is similar to the 0.9.7 variant. It is guidance and timestep-distilled, and similar inference code can be used as above. An improvement of this version is that it supports generating very long videos. Additionally, it supports using tone mapping to improve the quality of the generated video using the `tone_map_compression_ratio` parameter. The default value of `0.6` is recommended.
Show example code ```python import torch from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition from diffusers.pipelines.ltx.modeling_latent_upsampler import LTXLatentUpsamplerModel from diffusers.utils import export_to_video, load_video pipeline = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.8-13B-distilled", torch_dtype=torch.bfloat16) # TODO: Update the checkpoint here once updated in LTX org upsampler = LTXLatentUpsamplerModel.from_pretrained("a-r-r-o-w/LTX-0.9.8-Latent-Upsampler", torch_dtype=torch.bfloat16) pipe_upsample = LTXLatentUpsamplePipeline(vae=pipeline.vae, latent_upsampler=upsampler).to(torch.bfloat16) pipeline.to("cuda") pipe_upsample.to("cuda") pipeline.vae.enable_tiling() def round_to_nearest_resolution_acceptable_by_vae(height, width): height = height - (height % pipeline.vae_spatial_compression_ratio) width = width - (width % pipeline.vae_spatial_compression_ratio) return height, width prompt = """The camera pans over a snow-covered mountain range, revealing a vast expanse of snow-capped peaks and valleys.The mountains are covered in a thick layer of snow, with some areas appearing almost white while others have a slightly darker, almost grayish hue. The peaks are jagged and irregular, with some rising sharply into the sky while others are more rounded. The valleys are deep and narrow, with steep slopes that are also covered in snow. The trees in the foreground are mostly bare, with only a few leaves remaining on their branches. The sky is overcast, with thick clouds obscuring the sun. The overall impression is one of peace and tranquility, with the snow-covered mountains standing as a testament to the power and beauty of nature.""" # prompt = """A woman walks away from a white Jeep parked on a city street at night, then ascends a staircase and knocks on a door. 
The woman, wearing a dark jacket and jeans, walks away from the Jeep parked on the left side of the street, her back to the camera; she walks at a steady pace, her arms swinging slightly by her sides; the street is dimly lit, with streetlights casting pools of light on the wet pavement; a man in a dark jacket and jeans walks past the Jeep in the opposite direction; the camera follows the woman from behind as she walks up a set of stairs towards a building with a green door; she reaches the top of the stairs and turns left, continuing to walk towards the building; she reaches the door and knocks on it with her right hand; the camera remains stationary, focused on the doorway; the scene is captured in real-life footage.""" negative_prompt = "bright colors, symbols, graffiti, watermarks, worst quality, inconsistent motion, blurry, jittery, distorted" expected_height, expected_width = 480, 832 downscale_factor = 2 / 3 # num_frames = 161 num_frames = 361 # 1. Generate video at smaller resolution downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor) downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width) latents = pipeline( prompt=prompt, negative_prompt=negative_prompt, width=downscaled_width, height=downscaled_height, num_frames=num_frames, timesteps=[1000, 993, 987, 981, 975, 909, 725, 0.03], decode_timestep=0.05, decode_noise_scale=0.025, image_cond_noise_scale=0.0, guidance_scale=1.0, guidance_rescale=0.7, generator=torch.Generator().manual_seed(0), output_type="latent", ).frames # 2. Upscale generated video using latent upsampler with fewer inference steps # The available latent upsampler upscales the height/width by 2x upscaled_height, upscaled_width = downscaled_height * 2, downscaled_width * 2 upscaled_latents = pipe_upsample( latents=latents, adain_factor=1.0, tone_map_compression_ratio=0.6, output_type="latent" ).frames # 3. 
Denoise the upscaled video with few steps to improve texture (optional, but recommended) video = pipeline( prompt=prompt, negative_prompt=negative_prompt, width=upscaled_width, height=upscaled_height, num_frames=num_frames, denoise_strength=0.999, # Effectively, 4 inference steps out of 5 timesteps=[1000, 909, 725, 421, 0], latents=upscaled_latents, decode_timestep=0.05, decode_noise_scale=0.025, image_cond_noise_scale=0.0, guidance_scale=1.0, guidance_rescale=0.7, generator=torch.Generator().manual_seed(0), output_type="pil", ).frames[0] # 4. Downscale the video to the expected resolution video = [frame.resize((expected_width, expected_height)) for frame in video] export_to_video(video, "output.mp4", fps=24) ```
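Both frame counts used in these examples, 161 and 361, have the form `8 * k + 1`. The sketch below relates pixel frames to latent frames, assuming a temporal compression ratio of 8. That value is an assumption here and should be read from `pipeline.vae_temporal_compression_ratio` in real code. Both helper names are hypothetical.

```python
def latent_frame_count(num_frames: int, temporal_ratio: int = 8) -> int:
    """Number of latent frames for `num_frames` pixel frames (assumed ratio of 8)."""
    return (num_frames - 1) // temporal_ratio + 1


def nearest_valid_num_frames(requested: int, temporal_ratio: int = 8) -> int:
    """Round `requested` down to the closest value of the form k * ratio + 1."""
    requested = max(requested, 1)
    return ((requested - 1) // temporal_ratio) * temporal_ratio + 1


for frames in (161, 361):
    print(frames, "pixel frames ->", latent_frame_count(frames), "latent frames")
```

A helper like `nearest_valid_num_frames` is one way to snap a duration-based frame count (for example, seconds times fps) to a value of this form before calling the pipeline.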
- LTX-Video supports LoRAs with [`~loaders.LTXVideoLoraLoaderMixin.load_lora_weights`].
Show example code

```py
import torch
from diffusers import LTXConditionPipeline
from diffusers.utils import export_to_video, load_image

pipeline = LTXConditionPipeline.from_pretrained(
    "Lightricks/LTX-Video-0.9.5", torch_dtype=torch.bfloat16
)

pipeline.load_lora_weights("Lightricks/LTX-Video-Cakeify-LoRA", adapter_name="cakeify")
pipeline.set_adapters("cakeify")

# use "CAKEIFY" to trigger the LoRA
prompt = "CAKEIFY a person using a knife to cut a cake shaped like a Pikachu plushie"
image = load_image("https://huggingface.co/Lightricks/LTX-Video-Cakeify-LoRA/resolve/main/assets/images/pikachu.png")

video = pipeline(
    prompt=prompt,
    image=image,
    width=576,
    height=576,
    num_frames=161,
    decode_timestep=0.03,
    decode_noise_scale=0.025,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=26)
```
- LTX-Video supports loading from single files, such as [GGUF checkpoints](../../quantization/gguf), with [`loaders.FromOriginalModelMixin.from_single_file`] or [`loaders.FromSingleFileMixin.from_single_file`].
Show example code

```py
import torch
from diffusers.utils import export_to_video
from diffusers import LTXPipeline, AutoModel, GGUFQuantizationConfig

transformer = AutoModel.from_single_file(
    "https://huggingface.co/city96/LTX-Video-gguf/blob/main/ltx-video-2b-v0.9-Q3_K_S.gguf",
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16
)
pipeline = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video",
    transformer=transformer,
    torch_dtype=torch.bfloat16
)
```
## LTXI2VLongMultiPromptPipeline [[autodoc]] LTXI2VLongMultiPromptPipeline - all - __call__ ## LTXPipeline [[autodoc]] LTXPipeline - all - __call__ ## LTXImageToVideoPipeline [[autodoc]] LTXImageToVideoPipeline - all - __call__ ## LTXConditionPipeline [[autodoc]] LTXConditionPipeline - all - __call__ ## LTXLatentUpsamplePipeline [[autodoc]] LTXLatentUpsamplePipeline - all - __call__ ## LTXPipelineOutput [[autodoc]] pipelines.ltx.pipeline_output.LTXPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/lumina.md ================================================ # Lumina-T2X ![concepts](https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/9f52eabb-07dc-4881-8257-6d8a5f2a0a5a) [Lumina-Next : Making Lumina-T2X Stronger and Faster with Next-DiT](https://github.com/Alpha-VLLM/Lumina-T2X/blob/main/assets/lumina-next.pdf) from Alpha-VLLM, OpenGVLab, Shanghai AI Laboratory. The abstract from the paper is: *Lumina-T2X is a nascent family of Flow-based Large Diffusion Transformers (Flag-DiT) that establishes a unified framework for transforming noise into various modalities, such as images and videos, conditioned on text instructions. Despite its promising capabilities, Lumina-T2X still encounters challenges including training instability, slow inference, and extrapolation artifacts. In this paper, we present Lumina-Next, an improved version of Lumina-T2X, showcasing stronger generation performance with increased training and inference efficiency. We begin with a comprehensive analysis of the Flag-DiT architecture and identify several suboptimal components, which we address by introducing the Next-DiT architecture with 3D RoPE and sandwich normalizations. To enable better resolution extrapolation, we thoroughly compare different context extrapolation methods applied to text-to-image generation with 3D RoPE, and propose Frequency- and Time-Aware Scaled RoPE tailored for diffusion transformers. 
Additionally, we introduce a sigmoid time discretization schedule to reduce sampling steps in solving the Flow ODE and the Context Drop method to merge redundant visual tokens for faster network evaluation, effectively boosting the overall sampling speed. Thanks to these improvements, Lumina-Next not only improves the quality and efficiency of basic text-to-image generation but also demonstrates superior resolution extrapolation capabilities and multilingual generation using decoder-based LLMs as the text encoder, all in a zero-shot manner. To further validate Lumina-Next as a versatile generative framework, we instantiate it on diverse tasks including visual recognition, multi-view, audio, music, and point cloud generation, showcasing strong performance across these domains. By releasing all codes and model weights at https://github.com/Alpha-VLLM/Lumina-T2X, we aim to advance the development of next-generation generative AI capable of universal modeling.* **Highlights**: Lumina-Next is a next-generation Diffusion Transformer that significantly enhances text-to-image generation, multilingual generation, and multitask performance by introducing the Next-DiT architecture, 3D RoPE, and frequency- and time-aware RoPE, among other improvements. Lumina-Next has the following components: * It improves sampling efficiency with fewer and faster Steps. * It uses a Next-DiT as a transformer backbone with Sandwichnorm 3D RoPE, and Grouped-Query Attention. * It uses a Frequency- and Time-Aware Scaled RoPE. --- [Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers](https://huggingface.co/papers/2405.05945) from Alpha-VLLM, OpenGVLab, Shanghai AI Laboratory. The abstract from the paper is: *Sora unveils the potential of scaling Diffusion Transformer for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details. 
In this technical report, we introduce the Lumina-T2X family - a series of Flow-based Large Diffusion Transformers (Flag-DiT) equipped with zero-initialized attention, as a unified framework designed to transform noise into images, videos, multi-view 3D objects, and audio clips conditioned on text instructions. By tokenizing the latent spatial-temporal space and incorporating learnable placeholders such as [nextline] and [nextframe] tokens, Lumina-T2X seamlessly unifies the representations of different modalities across various spatial-temporal resolutions. This unified approach enables training within a single framework for different modalities and allows for flexible generation of multimodal data at any resolution, aspect ratio, and length during inference. Advanced techniques like RoPE, RMSNorm, and flow matching enhance the stability, flexibility, and scalability of Flag-DiT, enabling models of Lumina-T2X to scale up to 7 billion parameters and extend the context window to 128K tokens. This is particularly beneficial for creating ultra-high-definition images with our Lumina-T2I model and long 720p videos with our Lumina-T2V model. Remarkably, Lumina-T2I, powered by a 5-billion-parameter Flag-DiT, requires only 35% of the training computational costs of a 600-million-parameter naive DiT. Our further comprehensive analysis underscores Lumina-T2X's preliminary capability in resolution extrapolation, high-resolution editing, generating consistent 3D views, and synthesizing videos with seamless transitions. We expect that the open-sourcing of Lumina-T2X will further foster creativity, transparency, and diversity in the generative AI community.* You can find the original codebase at [Alpha-VLLM](https://github.com/Alpha-VLLM/Lumina-T2X) and all the available checkpoints at [Alpha-VLLM Lumina Family](https://huggingface.co/collections/Alpha-VLLM/lumina-family-66423205bedb81171fd0644b). **Highlights**: Lumina-T2X supports Any Modality, Resolution, and Duration. 
Lumina-T2X has the following components:
* It uses a Flow-based Large Diffusion Transformer as the backbone.
* It supports different modalities with one backbone and the corresponding encoder and decoder.

This pipeline was contributed by [PommesPeter](https://github.com/PommesPeter). The original codebase can be found [here](https://github.com/Alpha-VLLM/Lumina-T2X). The original weights can be found under [hf.co/Alpha-VLLM](https://huggingface.co/Alpha-VLLM).

> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

### Inference (Text-to-Image)

Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency.

First, load the pipeline:

```python
from diffusers import LuminaPipeline
import torch

pipeline = LuminaPipeline.from_pretrained(
    "Alpha-VLLM/Lumina-Next-SFT-diffusers", torch_dtype=torch.bfloat16
).to("cuda")
```

Then change the memory layout of the pipeline's `transformer` and `vae` components to `torch.channels_last`:

```python
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.vae.to(memory_format=torch.channels_last)
```

Finally, compile the components and run inference:

```python
pipeline.transformer = torch.compile(pipeline.transformer, mode="max-autotune", fullgraph=True)
pipeline.vae.decode = torch.compile(pipeline.vae.decode, mode="max-autotune", fullgraph=True)

image = pipeline(prompt="Upper body of a young woman in a Victorian-era outfit with brass goggles and leather straps.
Background shows an industrial revolution cityscape with smoky skies and tall, metal structures").images[0]
```

## Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on image quality depending on the model.

Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`LuminaPipeline`] for inference with bitsandbytes.

```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, LuminaNextDiT2DModel, LuminaPipeline
from transformers import AutoModel, BitsAndBytesConfig

# Load the text encoder with `AutoModel` so the class matches the checkpoint's own config
quant_config = BitsAndBytesConfig(load_in_8bit=True)
text_encoder_8bit = AutoModel.from_pretrained(
    "Alpha-VLLM/Lumina-Next-SFT-diffusers",
    subfolder="text_encoder",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

# `LuminaPipeline` expects a `LuminaNextDiT2DModel` transformer
quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = LuminaNextDiT2DModel.from_pretrained(
    "Alpha-VLLM/Lumina-Next-SFT-diffusers",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = LuminaPipeline.from_pretrained(
    "Alpha-VLLM/Lumina-Next-SFT-diffusers",
    text_encoder=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "a tiny astronaut hatching from an egg on the moon"
image = pipeline(prompt).images[0]
image.save("lumina.png")
```

## LuminaPipeline

[[autodoc]] LuminaPipeline
  - all
  - __call__

================================================
FILE: docs/source/en/api/pipelines/lumina2.md
================================================
# Lumina2
[Lumina Image 2.0: A Unified and Efficient Image Generative Model](https://huggingface.co/Alpha-VLLM/Lumina-Image-2.0) is a 2 billion parameter flow-based diffusion transformer capable of generating diverse images from text descriptions. The abstract from the paper is: *We introduce Lumina-Image 2.0, an advanced text-to-image model that surpasses previous state-of-the-art methods across multiple benchmarks, while also shedding light on its potential to evolve into a generalist vision intelligence model. Lumina-Image 2.0 exhibits three key properties: (1) Unification – it adopts a unified architecture that treats text and image tokens as a joint sequence, enabling natural cross-modal interactions and facilitating task expansion. Besides, since high-quality captioners can provide semantically better-aligned text-image training pairs, we introduce a unified captioning system, UniCaptioner, which generates comprehensive and precise captions for the model. This not only accelerates model convergence but also enhances prompt adherence, variable-length prompt handling, and task generalization via prompt templates. (2) Efficiency – to improve the efficiency of the unified architecture, we develop a set of optimization techniques that improve semantic learning and fine-grained texture generation during training while incorporating inference-time acceleration strategies without compromising image quality. (3) Transparency – we open-source all training details, code, and models to ensure full reproducibility, aiming to bridge the gap between well-resourced closed-source research teams and independent developers.* > [!TIP] > Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. 
## Using Single File loading with Lumina Image 2.0

Single file loading for Lumina Image 2.0 is available for the `Lumina2Transformer2DModel`.

```python
import torch
from diffusers import Lumina2Transformer2DModel, Lumina2Pipeline

ckpt_path = "https://huggingface.co/Alpha-VLLM/Lumina-Image-2.0/blob/main/consolidated.00-of-01.pth"
transformer = Lumina2Transformer2DModel.from_single_file(
    ckpt_path, torch_dtype=torch.bfloat16
)

pipe = Lumina2Pipeline.from_pretrained(
    "Alpha-VLLM/Lumina-Image-2.0", transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()
image = pipe(
    "a cat holding a sign that says hello",
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("lumina-single-file.png")
```

## Using GGUF Quantized Checkpoints with Lumina Image 2.0

GGUF quantized checkpoints for the `Lumina2Transformer2DModel` can be loaded via `from_single_file` with the `GGUFQuantizationConfig`.

```python
import torch
from diffusers import Lumina2Transformer2DModel, Lumina2Pipeline, GGUFQuantizationConfig

ckpt_path = "https://huggingface.co/calcuis/lumina-gguf/blob/main/lumina2-q4_0.gguf"
transformer = Lumina2Transformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)

pipe = Lumina2Pipeline.from_pretrained(
    "Alpha-VLLM/Lumina-Image-2.0", transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()
image = pipe(
    "a cat holding a sign that says hello",
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("lumina-gguf.png")
```

## Lumina2Pipeline

[[autodoc]] Lumina2Pipeline
  - all
  - __call__

================================================
FILE: docs/source/en/api/pipelines/marigold.md
================================================
# Marigold Computer Vision

![marigold](https://marigoldmonodepth.github.io/images/teaser_collage_compressed.jpg)

Marigold was proposed in [Repurposing Diffusion-Based Image Generators for
Monocular Depth Estimation](https://huggingface.co/papers/2312.02145), a CVPR 2024 Oral paper by [Bingxin Ke](http://www.kebingxin.com/), [Anton Obukhov](https://www.obukhov.ai/), [Shengyu Huang](https://shengyuh.github.io/), [Nando Metzger](https://nandometzger.github.io/), [Rodrigo Caye Daudt](https://rcdaudt.github.io/), and [Konrad Schindler](https://scholar.google.com/citations?user=FZuNgqIAAAAJ&hl=en). The core idea is to **repurpose the generative prior of Text-to-Image Latent Diffusion Models (LDMs) for traditional computer vision tasks**. This approach was explored by fine-tuning Stable Diffusion for **Monocular Depth Estimation**, as demonstrated in the teaser above.

Marigold was later extended in the follow-up paper, [Marigold: Affordable Adaptation of Diffusion-Based Image Generators for Image Analysis](https://huggingface.co/papers/2505.09358), authored by [Bingxin Ke](http://www.kebingxin.com/), [Kevin Qu](https://www.linkedin.com/in/kevin-qu-b3417621b/?locale=en_US), [Tianfu Wang](https://tianfwang.github.io/), [Nando Metzger](https://nandometzger.github.io/), [Shengyu Huang](https://shengyuh.github.io/), [Bo Li](https://www.linkedin.com/in/bobboli0202/), [Anton Obukhov](https://www.obukhov.ai/), and [Konrad Schindler](https://scholar.google.com/citations?user=FZuNgqIAAAAJ&hl=en). This work expanded Marigold to support new modalities such as **Surface Normals** and **Intrinsic Image Decomposition** (IID), introduced a training protocol for **Latent Consistency Models** (LCM), and demonstrated **High-Resolution** (HR) processing capability.

> [!TIP]
> The early Marigold models (`v1-0` and earlier) were optimized for best results with at least 10 inference steps.
> LCM models were later developed to enable high-quality inference in just 1 to 4 steps.
> Marigold models `v1-1` and later use the DDIM scheduler to achieve optimal
> results in as few as 1 to 4 steps.

## Available Pipelines

Each pipeline is tailored for a specific computer vision task, processing an input RGB image and generating a corresponding prediction. Currently, the following computer vision tasks are implemented:

| Pipeline | Recommended Model Checkpoints | Spaces (Interactive Apps) | Predicted Modalities |
|---|---|:---:|---|
| [MarigoldDepthPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py) | [prs-eth/marigold-depth-v1-1](https://huggingface.co/prs-eth/marigold-depth-v1-1) | [Depth Estimation](https://huggingface.co/spaces/prs-eth/marigold) | [Depth](https://en.wikipedia.org/wiki/Depth_map), [Disparity](https://en.wikipedia.org/wiki/Binocular_disparity) |
| [MarigoldNormalsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py) | [prs-eth/marigold-normals-v1-1](https://huggingface.co/prs-eth/marigold-normals-v1-1) | [Surface Normals Estimation](https://huggingface.co/spaces/prs-eth/marigold-normals) | [Surface normals](https://en.wikipedia.org/wiki/Normal_mapping) |
| [MarigoldIntrinsicsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_intrinsics.py) | [prs-eth/marigold-iid-appearance-v1-1](https://huggingface.co/prs-eth/marigold-iid-appearance-v1-1), [prs-eth/marigold-iid-lighting-v1-1](https://huggingface.co/prs-eth/marigold-iid-lighting-v1-1) | [Intrinsic Image Decomposition](https://huggingface.co/spaces/prs-eth/marigold-iid) | [Albedo](https://en.wikipedia.org/wiki/Albedo), [Materials](https://www.n.aiq3d.com/wiki/roughnessmetalnessao-map), [Lighting](https://en.wikipedia.org/wiki/Diffuse_reflection) |

## Available Checkpoints

All original checkpoints are available under the [PRS-ETH](https://huggingface.co/prs-eth/) organization on Hugging Face. They are designed for use with diffusers pipelines and the [original codebase](https://github.com/prs-eth/marigold), which can also be used to train new model checkpoints. The following is a summary of the recommended checkpoints, all of which produce reliable results with 1 to 4 steps.

| Checkpoint | Modality | Comment |
|---|---|---|
| [prs-eth/marigold-depth-v1-1](https://huggingface.co/prs-eth/marigold-depth-v1-1) | Depth | Affine-invariant depth prediction assigns each pixel a value between 0 (near plane) and 1 (far plane), with both planes determined by the model during inference. |
| [prs-eth/marigold-normals-v1-1](https://huggingface.co/prs-eth/marigold-normals-v1-1) | Normals | The surface normals predictions are unit-length 3D vectors in the screen space camera, with values in the range from -1 to 1. |
| [prs-eth/marigold-iid-appearance-v1-1](https://huggingface.co/prs-eth/marigold-iid-appearance-v1-1) | Intrinsics | InteriorVerse decomposition is comprised of Albedo and two BRDF material properties: Roughness and Metallicity. |
| [prs-eth/marigold-iid-lighting-v1-1](https://huggingface.co/prs-eth/marigold-iid-lighting-v1-1) | Intrinsics | HyperSim decomposition of an image $I$ is comprised of Albedo $A$, Diffuse shading $S$, and Non-diffuse residual $R$: $I = A*S+R$. |

> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff
> between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to
> efficiently load the same components into multiple pipelines.
> Also, to know more about reducing the memory usage of this pipeline, refer to the ["Reduce memory usage"] section
> [here](../../using-diffusers/svd#reduce-memory-usage).

> [!WARNING]
> Marigold pipelines were designed and tested with the scheduler embedded in the model checkpoint.
> The optimal number of inference steps varies by scheduler, with no universal value that works best across all cases.
> To accommodate this, the `num_inference_steps` parameter in the pipeline's `__call__` method defaults to `None` (see the
> API reference).
> Unless set explicitly, it inherits the value from the `default_denoising_steps` field in the checkpoint configuration
> file (`model_index.json`).
> This ensures high-quality predictions when invoking the pipeline with only the `image` argument.

See also Marigold [usage examples](../../using-diffusers/marigold_usage).
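The fallback described in the warning above can be sketched in a few lines. This is a minimal illustration of the described behavior, not the pipeline source: `resolve_num_inference_steps` is a hypothetical helper, and the JSON snippet, including the value `4`, is made up for the example rather than copied from a real checkpoint.

```python
import json


def resolve_num_inference_steps(num_inference_steps, model_index):
    """Use the explicit value if given, else the checkpoint's `default_denoising_steps`."""
    if num_inference_steps is not None:
        return num_inference_steps
    default_steps = model_index.get("default_denoising_steps")
    if default_steps is None:
        raise ValueError("No `num_inference_steps` given and no default in the checkpoint config.")
    return default_steps


# Illustrative stand-in for a checkpoint's `model_index.json`
model_index = json.loads('{"_class_name": "MarigoldDepthPipeline", "default_denoising_steps": 4}')

print(resolve_num_inference_steps(None, model_index))  # falls back to the checkpoint default
print(resolve_num_inference_steps(10, model_index))    # explicit value wins
```

Keeping the default in the checkpoint configuration means that swapping checkpoints also swaps the recommended step count, without any change to the calling code.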
## Marigold Depth Prediction API [[autodoc]] MarigoldDepthPipeline - __call__ [[autodoc]] pipelines.marigold.pipeline_marigold_depth.MarigoldDepthOutput [[autodoc]] pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_depth ## Marigold Normals Estimation API [[autodoc]] MarigoldNormalsPipeline - __call__ [[autodoc]] pipelines.marigold.pipeline_marigold_normals.MarigoldNormalsOutput [[autodoc]] pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_normals ## Marigold Intrinsic Image Decomposition API [[autodoc]] MarigoldIntrinsicsPipeline - __call__ [[autodoc]] pipelines.marigold.pipeline_marigold_intrinsics.MarigoldIntrinsicsOutput [[autodoc]] pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_intrinsics ================================================ FILE: docs/source/en/api/pipelines/mochi.md ================================================ # Mochi 1 Preview
> [!TIP]
> Only a research preview of the model weights is available at the moment.

[Mochi 1](https://huggingface.co/genmo/mochi-1-preview) is a video generation model by Genmo with a strong focus on prompt adherence and motion quality. The model features a 10B parameter Asymmetric Diffusion Transformer (AsymmDiT) architecture, and uses non-square QKV and output projection layers to reduce inference memory requirements. A single T5-XXL model is used to encode prompts.

*Mochi 1 preview is an open state-of-the-art video generation model with high-fidelity motion and strong prompt adherence in preliminary evaluation. This model dramatically closes the gap between closed and open video generation systems. The model is released under a permissive Apache 2.0 license.*

> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

## Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.

Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`MochiPipeline`] for inference with bitsandbytes.
```py import torch from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, MochiTransformer3DModel, MochiPipeline from diffusers.utils import export_to_video from transformers import BitsAndBytesConfig as BitsAndBytesConfig, T5EncoderModel quant_config = BitsAndBytesConfig(load_in_8bit=True) text_encoder_8bit = T5EncoderModel.from_pretrained( "genmo/mochi-1-preview", subfolder="text_encoder", quantization_config=quant_config, torch_dtype=torch.float16, ) quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True) transformer_8bit = MochiTransformer3DModel.from_pretrained( "genmo/mochi-1-preview", subfolder="transformer", quantization_config=quant_config, torch_dtype=torch.float16, ) pipeline = MochiPipeline.from_pretrained( "genmo/mochi-1-preview", text_encoder=text_encoder_8bit, transformer=transformer_8bit, torch_dtype=torch.float16, device_map="balanced", ) video = pipeline( "Close-up of a cats eye, with the galaxy reflected in the cats eye. Ultra high resolution 4k.", num_inference_steps=28, guidance_scale=3.5 ).frames[0] export_to_video(video, "cat.mp4") ``` ## Generating videos with Mochi-1 Preview The following example will download the full precision `mochi-1-preview` weights and produce the highest quality results but will require at least 42GB VRAM to run. ```python import torch from diffusers import MochiPipeline from diffusers.utils import export_to_video pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview") # Enable memory savings pipe.enable_model_cpu_offload() pipe.enable_vae_tiling() prompt = "Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k." with torch.autocast("cuda", torch.bfloat16, cache_enabled=False): frames = pipe(prompt, num_frames=85).frames[0] export_to_video(frames, "mochi.mp4", fps=30) ``` ## Using a lower precision variant to save memory The following example will use the `bfloat16` variant of the model and requires 22GB VRAM to run. 
There is a slight drop in the quality of the generated video as a result. ```python import torch from diffusers import MochiPipeline from diffusers.utils import export_to_video pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview", variant="bf16", torch_dtype=torch.bfloat16) # Enable memory savings pipe.enable_model_cpu_offload() pipe.enable_vae_tiling() prompt = "Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k." frames = pipe(prompt, num_frames=85).frames[0] export_to_video(frames, "mochi.mp4", fps=30) ``` ## Reproducing the results from the Genmo Mochi repo The [Genmo Mochi implementation](https://github.com/genmoai/mochi/tree/main) uses different precision values for each stage in the inference process. The text encoder and VAE use `torch.float32`, while the DiT uses `torch.bfloat16` with the [attention kernel](https://pytorch.org/docs/stable/generated/torch.nn.attention.sdpa_kernel.html#torch.nn.attention.sdpa_kernel) set to `EFFICIENT_ATTENTION`. Diffusers pipelines currently do not support setting different `dtypes` for different stages of the pipeline. In order to run inference in the same way as the original implementation, please refer to the following example. > [!TIP] > The original Mochi implementation zeros out empty prompts. However, enabling this option and placing the entire pipeline under autocast can lead to numerical overflows with the T5 text encoder. > > When enabling `force_zeros_for_empty_prompt`, it is recommended to run the text encoding step outside the autocast context in full precision. > [!TIP] > Decoding the latents in full precision is very memory intensive. You will need at least 70GB VRAM to generate the 163 frames in this example. To reduce memory, either reduce the number of frames or run the decoding step in `torch.bfloat16`. 
```python import torch from torch.nn.attention import SDPBackend, sdpa_kernel from diffusers import MochiPipeline from diffusers.utils import export_to_video from diffusers.video_processor import VideoProcessor pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview", force_zeros_for_empty_prompt=True) pipe.enable_vae_tiling() pipe.enable_model_cpu_offload() prompt = "An aerial shot of a parade of elephants walking across the African savannah. The camera showcases the herd and the surrounding landscape." with torch.no_grad(): prompt_embeds, prompt_attention_mask, negative_prompt_embeds, negative_prompt_attention_mask = ( pipe.encode_prompt(prompt=prompt) ) with torch.autocast("cuda", torch.bfloat16): with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION): frames = pipe( prompt_embeds=prompt_embeds, prompt_attention_mask=prompt_attention_mask, negative_prompt_embeds=negative_prompt_embeds, negative_prompt_attention_mask=negative_prompt_attention_mask, guidance_scale=4.5, num_inference_steps=64, height=480, width=848, num_frames=163, generator=torch.Generator("cuda").manual_seed(0), output_type="latent", return_dict=False, )[0] video_processor = VideoProcessor(vae_scale_factor=8) has_latents_mean = hasattr(pipe.vae.config, "latents_mean") and pipe.vae.config.latents_mean is not None has_latents_std = hasattr(pipe.vae.config, "latents_std") and pipe.vae.config.latents_std is not None if has_latents_mean and has_latents_std: latents_mean = ( torch.tensor(pipe.vae.config.latents_mean).view(1, 12, 1, 1, 1).to(frames.device, frames.dtype) ) latents_std = ( torch.tensor(pipe.vae.config.latents_std).view(1, 12, 1, 1, 1).to(frames.device, frames.dtype) ) frames = frames * latents_std / pipe.vae.config.scaling_factor + latents_mean else: frames = frames / pipe.vae.config.scaling_factor with torch.no_grad(): video = pipe.vae.decode(frames.to(pipe.vae.dtype), return_dict=False)[0] video = video_processor.postprocess_video(video)[0] export_to_video(video, "mochi.mp4", fps=30) ``` 
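The latent rescaling step in the example above (`frames * latents_std / scaling_factor + latents_mean`) undoes a per-channel normalization before the VAE decodes. The sketch below shows the arithmetic on a single scalar. It assumes the forward normalization is the exact inverse of the expression used in the example; the function names and the numeric values are made up for illustration and are not taken from the Mochi VAE configuration.

```python
import math


def normalize_latent(x: float, mean: float, std: float, scaling_factor: float) -> float:
    """Assumed forward direction: center, then scale into the range the transformer sees."""
    return (x - mean) * scaling_factor / std


def denormalize_latent(z: float, mean: float, std: float, scaling_factor: float) -> float:
    """Same expression as in the example above, applied to one value instead of a tensor."""
    return z * std / scaling_factor + mean


# Illustrative statistics only (hypothetical values, one channel).
mean, std, scaling_factor = 0.25, 2.0, 1.0

raw = 1.75
scaled = normalize_latent(raw, mean, std, scaling_factor)
restored = denormalize_latent(scaled, mean, std, scaling_factor)
print(scaled, restored)
```

In the pipeline example the same formula is applied with tensors reshaped to `(1, 12, 1, 1, 1)`, so each of the 12 latent channels is rescaled with its own mean and standard deviation through broadcasting.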
## Running inference with multiple GPUs It is possible to split the large Mochi transformer across multiple GPUs using the `device_map` and `max_memory` options in `from_pretrained`. In the following example we split the model across two GPUs, each with 24GB of VRAM. ```python import torch from diffusers import MochiPipeline, MochiTransformer3DModel from diffusers.utils import export_to_video model_id = "genmo/mochi-1-preview" transformer = MochiTransformer3DModel.from_pretrained( model_id, subfolder="transformer", device_map="auto", max_memory={0: "24GB", 1: "24GB"} ) pipe = MochiPipeline.from_pretrained(model_id, transformer=transformer) pipe.enable_model_cpu_offload() pipe.enable_vae_tiling() with torch.autocast(device_type="cuda", dtype=torch.bfloat16, cache_enabled=False): frames = pipe( prompt="Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k.", negative_prompt="", height=480, width=848, num_frames=85, num_inference_steps=50, guidance_scale=4.5, num_videos_per_prompt=1, generator=torch.Generator(device="cuda").manual_seed(0), max_sequence_length=256, output_type="pil", ).frames[0] export_to_video(frames, "output.mp4", fps=30) ``` ## Using single file loading with the Mochi Transformer You can use `from_single_file` to load the Mochi transformer in its original format. > [!TIP] > Diffusers currently doesn't support using the FP8 scaled versions of the Mochi single file checkpoints. 
```python
import torch
from diffusers import MochiPipeline, MochiTransformer3DModel
from diffusers.utils import export_to_video

model_id = "genmo/mochi-1-preview"

ckpt_path = "https://huggingface.co/Comfy-Org/mochi_preview_repackaged/blob/main/split_files/diffusion_models/mochi_preview_bf16.safetensors"

# Load the transformer from the single-file checkpoint in its original format
transformer = MochiTransformer3DModel.from_single_file(ckpt_path, torch_dtype=torch.bfloat16)

pipe = MochiPipeline.from_pretrained(model_id, transformer=transformer)
pipe.enable_model_cpu_offload()
pipe.enable_vae_tiling()

with torch.autocast(device_type="cuda", dtype=torch.bfloat16, cache_enabled=False):
    frames = pipe(
        prompt="Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k.",
        negative_prompt="",
        height=480,
        width=848,
        num_frames=85,
        num_inference_steps=50,
        guidance_scale=4.5,
        num_videos_per_prompt=1,
        generator=torch.Generator(device="cuda").manual_seed(0),
        max_sequence_length=256,
        output_type="pil",
    ).frames[0]

export_to_video(frames, "output.mp4", fps=30)
```

## MochiPipeline

[[autodoc]] MochiPipeline
  - all
  - __call__

## MochiPipelineOutput

[[autodoc]] pipelines.mochi.pipeline_output.MochiPipelineOutput

================================================
FILE: docs/source/en/api/pipelines/musicldm.md
================================================

> [!WARNING]
> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model.

# MusicLDM

MusicLDM was proposed in [MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies](https://huggingface.co/papers/2308.01546) by Ke Chen, Yusong Wu, Haohe Liu, Marianna Nezhurina, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
MusicLDM takes a text prompt as input and predicts the corresponding music sample.
Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview) and [AudioLDM](https://huggingface.co/docs/diffusers/api/pipelines/audioldm), MusicLDM is a text-to-music _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap) latents. MusicLDM is trained on a corpus of 466 hours of music data. Beat-synchronous data augmentation strategies are applied to the music samples, both in the time domain and in the latent space. Using beat-synchronous data augmentation strategies encourages the model to interpolate between the training samples, but stay within the domain of the training data. The result is generated music that is more diverse while staying faithful to the corresponding style. The abstract of the paper is the following: *Diffusion models have shown promising results in cross-modal generation tasks, including text-to-image and text-to-audio generation. However, generating music, as a special type of audio, presents unique challenges due to limited availability of music data and sensitive issues related to copyright and plagiarism. In this paper, to tackle these challenges, we first construct a state-of-the-art text-to-music model, MusicLDM, that adapts Stable Diffusion and AudioLDM architectures to the music domain. We achieve this by retraining the contrastive language-audio pretraining model (CLAP) and the Hifi-GAN vocoder, as components of MusicLDM, on a collection of music data samples. Then, to address the limitations of training data and to avoid plagiarism, we leverage a beat tracking model and propose two different mixup strategies for data augmentation: beat-synchronous audio mixup and beat-synchronous latent mixup, which recombine training audio directly or via a latent embeddings space, respectively. 
Such mixup strategies encourage the model to interpolate between musical training samples and generate new music within the convex hull of the training data, making the generated music more diverse while still staying faithful to the corresponding style. In addition to popular evaluation metrics, we design several new evaluation metrics based on CLAP score to demonstrate that our proposed MusicLDM and beat-synchronous mixup strategies improve both the quality and novelty of generated music, as well as the correspondence between input text and generated music.* This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi). ## Tips When constructing a prompt, keep in mind: * Descriptive prompt inputs work best; use adjectives to describe the sound (for example, "high quality" or "clear") and make the prompt context specific where possible (e.g. "melodic techno with a fast beat and synths" works better than "techno"). * Using a *negative prompt* can significantly improve the quality of the generated audio. Try using a negative prompt of "low quality, average quality". During inference: * The _quality_ of the generated audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference. * Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1 to enable. Automatic scoring will be performed between the generated waveforms and prompt text, and the audios ranked from best to worst accordingly. * The _length_ of the generated audio sample can be controlled by varying the `audio_length_in_s` argument. 
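The tips above map onto a handful of call arguments. The sketch below collects them in one place; `build_generation_kwargs` and `generate` are hypothetical helpers, and the `ucsd-reach/musicldm` repository id, the `float16` dtype, and the `cuda` device are assumptions that do not appear in the text above. The heavy imports live inside `generate` so the argument-building part can be inspected without a GPU.

```python
def build_generation_kwargs(prompt, steps=200, length_s=10.0, candidates=3):
    """Hypothetical helper: gather the call arguments discussed in the tips."""
    return {
        "prompt": prompt,  # descriptive and context-specific
        "negative_prompt": "low quality, average quality",
        "num_inference_steps": steps,  # more steps: higher quality, slower inference
        "audio_length_in_s": length_s,  # length of the generated sample in seconds
        "num_waveforms_per_prompt": candidates,  # more than 1 enables automatic ranking
    }


def generate(kwargs, repo_id="ucsd-reach/musicldm"):
    """Run the pipeline; needs a GPU and downloads the checkpoint (assumed repo id)."""
    import torch
    from diffusers import MusicLDMPipeline

    pipe = MusicLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16).to("cuda")
    # With several waveforms per prompt, the first entry is the best-ranked candidate.
    return pipe(**kwargs).audios[0]


kwargs = build_generation_kwargs("melodic techno with a fast beat and synths")
print(kwargs)
```

Calling `generate(kwargs)` returns a single waveform; lowering `steps` or `candidates` trades quality and ranking for faster inference.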
> [!TIP] > Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. ## MusicLDMPipeline [[autodoc]] MusicLDMPipeline - all - __call__ ================================================ FILE: docs/source/en/api/pipelines/omnigen.md ================================================ # OmniGen [OmniGen: Unified Image Generation](https://huggingface.co/papers/2409.11340) from BAAI, by Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, Zheng Liu. The abstract from the paper is: *The emergence of Large Language Models (LLMs) has unified language generation tasks and revolutionized human-machine interaction. However, in the realm of image generation, a unified model capable of handling various tasks within a single framework remains largely unexplored. In this work, we introduce OmniGen, a new diffusion model for unified image generation. OmniGen is characterized by the following features: 1) Unification: OmniGen not only demonstrates text-to-image generation capabilities but also inherently supports various downstream tasks, such as image editing, subject-driven generation, and visual conditional generation. 2) Simplicity: The architecture of OmniGen is highly simplified, eliminating the need for additional plugins. Moreover, compared to existing diffusion models, it is more user-friendly and can complete complex tasks end-to-end through instructions without the need for extra intermediate steps, greatly simplifying the image generation workflow. 
3) Knowledge Transfer: Benefit from learning in a unified format, OmniGen effectively transfers knowledge across different tasks, manages unseen tasks and domains, and exhibits novel capabilities. We also explore the model’s reasoning capabilities and potential applications of the chain-of-thought mechanism. This work represents the first attempt at a general-purpose image generation model, and we will release our resources at https://github.com/VectorSpaceLab/OmniGen to foster future advancements.*

> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

This pipeline was contributed by [staoxiao](https://github.com/staoxiao). The original codebase can be found [here](https://github.com/VectorSpaceLab/OmniGen). The original weights can be found under [hf.co/shitao](https://huggingface.co/Shitao/OmniGen-v1).

## Inference

First, load the pipeline:

```python
import torch
from diffusers import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1-diffusers", torch_dtype=torch.bfloat16)
pipe.to("cuda")
```

For text-to-image, pass a text prompt. By default, OmniGen generates a 1024x1024 image.
You can set the `height` and `width` parameters to generate images of a different size.

```python
prompt = "Realistic photo. A young woman sits on a sofa, holding a book and facing the camera. She wears delicate silver hoop earrings adorned with tiny, sparkling diamonds that catch the light, with her long chestnut hair cascading over her shoulders. Her eyes are focused and gentle, framed by long, dark lashes. She is dressed in a cozy cream sweater, which complements her warm, inviting smile. Behind her, there is a table with a cup of water in a sleek, minimalist blue mug. The background is a serene indoor setting with soft natural light filtering through a window, adorned with tasteful art and flowers, creating a cozy and peaceful ambiance. 4K, HD."
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    guidance_scale=3,
    generator=torch.Generator(device="cpu").manual_seed(111),
).images[0]
image.save("output.png")
```

OmniGen supports multimodal inputs.
When the input includes an image, you need to add a placeholder `<|image_1|>` in the text prompt to represent the image.
It is recommended to enable `use_input_image_size_as_output` to keep the edited image the same size as the original image.

```python
from diffusers.utils import load_image

prompt = "<|image_1|> Remove the woman's earrings. Replace the mug with a clear glass filled with sparkling iced cola."
input_images = [load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/t2i_woman_with_book.png")]
image = pipe(
    prompt=prompt,
    input_images=input_images,
    guidance_scale=2,
    img_guidance_scale=1.6,
    use_input_image_size_as_output=True,
    generator=torch.Generator(device="cpu").manual_seed(222),
).images[0]
image.save("output.png")
```

## OmniGenPipeline

[[autodoc]] OmniGenPipeline
  - all
  - __call__

================================================
FILE: docs/source/en/api/pipelines/overview.md
================================================

# Pipelines

Pipelines provide a simple way to run state-of-the-art diffusion models in inference by bundling all of the necessary components (multiple independently-trained models, schedulers, and processors) into a single end-to-end class. Pipelines are flexible and they can be adapted to use different schedulers or even model components.

All pipelines are built from the base [`DiffusionPipeline`] class which provides basic functionality for loading, downloading, and saving all the components.
Specific pipeline types (for example [`StableDiffusionPipeline`]) loaded with [`~DiffusionPipeline.from_pretrained`] are automatically detected and the pipeline components are loaded and passed to the `__init__` function of the pipeline. > [!WARNING] > You shouldn't use the [`DiffusionPipeline`] class for training. Individual components (for example, [`UNet2DModel`] and [`UNet2DConditionModel`]) of diffusion pipelines are usually trained individually, so we suggest directly working with them instead. > >
> > Pipelines do not offer any training functionality. You'll notice PyTorch's autograd is disabled by decorating the [`~DiffusionPipeline.__call__`] method with a [`torch.no_grad`](https://pytorch.org/docs/stable/generated/torch.no_grad.html) decorator because pipelines should not be used for training. If you're interested in training, please take a look at the [Training](../../training/overview) guides instead! The table below lists all the pipelines currently available in 🤗 Diffusers and the tasks they support. Click on a pipeline to view its abstract and published paper. | Pipeline | Tasks | |---|---| | [aMUSEd](amused) | text2image | | [AnimateDiff](animatediff) | text2video | | [Attend-and-Excite](attend_and_excite) | text2image | | [AudioLDM](audioldm) | text2audio | | [AudioLDM2](audioldm2) | text2audio | | [AuraFlow](aura_flow) | text2image | | [BLIP Diffusion](blip_diffusion) | text2image | | [Bria 3.2](bria_3_2) | text2image | | [CogVideoX](cogvideox) | text2video | | [Consistency Models](consistency_models) | unconditional image generation | | [ControlNet](controlnet) | text2image, image2image, inpainting | | [ControlNet with Flux.1](controlnet_flux) | text2image | | [ControlNet with Hunyuan-DiT](controlnet_hunyuandit) | text2image | | [ControlNet with Stable Diffusion 3](controlnet_sd3) | text2image | | [ControlNet with Stable Diffusion XL](controlnet_sdxl) | text2image | | [ControlNet-XS](controlnetxs) | text2image | | [ControlNet-XS with Stable Diffusion XL](controlnetxs_sdxl) | text2image | | [Cosmos](cosmos) | text2video, video2video | | [Dance Diffusion](dance_diffusion) | unconditional audio generation | | [DDIM](ddim) | unconditional image generation | | [DDPM](ddpm) | unconditional image generation | | [DeepFloyd IF](deepfloyd_if) | text2image, image2image, inpainting, super-resolution | | [DiffEdit](diffedit) | inpainting | | [DiT](dit) | text2image | | [Flux](flux) | text2image | | [Hunyuan-DiT](hunyuandit) | text2image | | 
[I2VGen-XL](i2vgenxl) | image2video | | [InstructPix2Pix](pix2pix) | image editing | | [Kandinsky 2.1](kandinsky) | text2image, image2image, inpainting, interpolation | | [Kandinsky 2.2](kandinsky_v22) | text2image, image2image, inpainting | | [Kandinsky 3](kandinsky3) | text2image, image2image | | [Kolors](kolors) | text2image | | [Latent Consistency Models](latent_consistency_models) | text2image | | [Latent Diffusion](latent_diffusion) | text2image, super-resolution | | [Latte](latte) | text2image | | [LEDITS++](ledits_pp) | image editing | | [Lumina-T2X](lumina) | text2image | | [Marigold](marigold) | depth-estimation, normals-estimation, intrinsic-decomposition | | [MultiDiffusion](panorama) | text2image | | [MusicLDM](musicldm) | text2audio | | [PAG](pag) | text2image | | [Paint by Example](paint_by_example) | inpainting | | [PIA](pia) | image2video | | [PixArt-α](pixart) | text2image | | [PixArt-Σ](pixart_sigma) | text2image | | [Self-Attention Guidance](self_attention_guidance) | text2image | | [Semantic Guidance](semantic_stable_diffusion) | text2image | | [Shap-E](shap_e) | text-to-3D, image-to-3D | | [Stable Audio](stable_audio) | text2audio | | [Stable Cascade](stable_cascade) | text2image | | [Stable Diffusion](stable_diffusion/overview) | text2image, image2image, depth2image, inpainting, image variation, latent upscaler, super-resolution | | [Stable Diffusion XL](stable_diffusion/stable_diffusion_xl) | text2image, image2image, inpainting | | [Stable Diffusion XL Turbo](stable_diffusion/sdxl_turbo) | text2image, image2image, inpainting | | [Stable unCLIP](stable_unclip) | text2image, image variation | | [T2I-Adapter](stable_diffusion/adapter) | text2image | | [Text2Video](text_to_video) | text2video, video2video | | [Text2Video-Zero](text_to_video_zero) | text2video | | [unCLIP](unclip) | text2image, image variation | | [UniDiffuser](unidiffuser) | text2image, image2text, image variation, text variation, unconditional image generation, unconditional 
audio generation | | [Value-guided planning](value_guided_sampling) | value guided sampling | | [Wuerstchen](wuerstchen) | text2image | | [VisualCloze](visualcloze) | text2image, image2image, subject driven generation, inpainting, style transfer, image restoration, image editing, [depth,normal,edge,pose]2image, [depth,normal,edge,pose]-estimation, virtual try-on, image relighting | ## DiffusionPipeline [[autodoc]] DiffusionPipeline - all - __call__ - device - to - components [[autodoc]] pipelines.StableDiffusionMixin.enable_freeu [[autodoc]] pipelines.StableDiffusionMixin.disable_freeu ## PushToHubMixin [[autodoc]] utils.PushToHubMixin ## Callbacks [[autodoc]] callbacks.PipelineCallback [[autodoc]] callbacks.SDCFGCutoffCallback [[autodoc]] callbacks.SDXLCFGCutoffCallback [[autodoc]] callbacks.SDXLControlnetCFGCutoffCallback [[autodoc]] callbacks.IPAdapterScaleCutoffCallback [[autodoc]] callbacks.SD3CFGCutoffCallback ================================================ FILE: docs/source/en/api/pipelines/ovis_image.md ================================================ # Ovis-Image ![concepts](https://github.com/AIDC-AI/Ovis-Image/blob/main/docs/imgs/ovis_image_case.png) Ovis-Image is a 7B text-to-image model specifically optimized for high-quality text rendering, designed to operate efficiently under stringent computational constraints. [Ovis-Image Technical Report](https://arxiv.org/abs/2511.22982) from Alibaba Group, by Guo-Hua Wang, Liangfu Cao, Tianyu Cui, Minghao Fu, Xiaohao Chen, Pengxin Zhan, Jianshan Zhao, Lan Li, Bowen Fu, Jiaqi Liu, Qing-Guo Chen. The abstract from the paper is: *We introduce Ovis-Image, a 7B text-to-image model specifically optimized for high-quality text rendering, designed to operate efficiently under stringent computational constraints. 
Built upon our previous Ovis-U1 framework, Ovis-Image integrates a diffusion-based visual decoder with the stronger Ovis 2.5 multimodal backbone, leveraging a text-centric training pipeline that combines large-scale pre-training with carefully tailored post-training refinements. Despite its compact architecture, Ovis-Image achieves text rendering performance on par with significantly larger open models such as Qwen-Image and approaches closed-source systems like Seedream and GPT4o. Crucially, the model remains deployable on a single high-end GPU with moderate memory, narrowing the gap between frontier-level text rendering and practical deployment. Our results indicate that combining a strong multimodal backbone with a carefully designed, text-focused training recipe is sufficient to achieve reliable bilingual text rendering without resorting to oversized or proprietary models.* **Highlights**: * **Strong text rendering at a compact 7B scale**: Ovis-Image is a 7B text-to-image model that delivers text rendering quality comparable to much larger 20B-class systems such as Qwen-Image and competitive with leading closed-source models like GPT4o in text-centric scenarios, while remaining small enough to run on widely accessible hardware. * **High fidelity on text-heavy, layout-sensitive prompts**: The model excels on prompts that demand tight alignment between linguistic content and rendered typography (e.g., posters, banners, logos, UI mockups, infographics), producing legible, correctly spelled, and semantically consistent text across diverse fonts, sizes, and aspect ratios without compromising overall visual quality. * **Efficiency and deployability**: With its 7B parameter budget and streamlined architecture, Ovis-Image fits on a single high-end GPU with moderate memory, supports low-latency interactive use, and scales to batch production serving, bringing near–frontier text rendering to applications where tens-of-billions–parameter models are impractical. 
This pipeline was contributed by Ovis-Image Team. The original codebase can be found [here](https://github.com/AIDC-AI/Ovis-Image). Available models: | Model | Recommended dtype | |:-----:|:-----------------:| | [`AIDC-AI/Ovis-Image-7B`](https://huggingface.co/AIDC-AI/Ovis-Image-7B) | `torch.bfloat16` | Refer to [this](https://huggingface.co/collections/AIDC-AI/ovis-image) collection for more information. ## OvisImagePipeline [[autodoc]] OvisImagePipeline - all - __call__ ## OvisImagePipelineOutput [[autodoc]] pipelines.ovis_image.pipeline_output.OvisImagePipelineOutput ================================================ FILE: docs/source/en/api/pipelines/pag.md ================================================ # Perturbed-Attention Guidance
[Perturbed-Attention Guidance (PAG)](https://ku-cvlab.github.io/Perturbed-Attention-Guidance/) is a new diffusion sampling guidance that improves sample quality across both unconditional and conditional settings, achieving this without requiring further training or the integration of external modules. PAG was introduced in [Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance](https://huggingface.co/papers/2403.17377) by Donghoon Ahn, Hyoungwon Cho, Jaewon Min, Wooseok Jang, Jungwoo Kim, SeonHwa Kim, Hyun Hee Park, Kyong Hwan Jin and Seungryong Kim. The abstract from the paper is: *Recent studies have demonstrated that diffusion models are capable of generating high-quality samples, but their quality heavily depends on sampling guidance techniques, such as classifier guidance (CG) and classifier-free guidance (CFG). These techniques are often not applicable in unconditional generation or in various downstream tasks such as image restoration. In this paper, we propose a novel sampling guidance, called Perturbed-Attention Guidance (PAG), which improves diffusion sample quality across both unconditional and conditional settings, achieving this without requiring additional training or the integration of external modules. PAG is designed to progressively enhance the structure of samples throughout the denoising process. It involves generating intermediate samples with degraded structure by substituting selected self-attention maps in diffusion U-Net with an identity matrix, by considering the self-attention mechanisms' ability to capture structural information, and guiding the denoising process away from these degraded samples. In both ADM and Stable Diffusion, PAG surprisingly improves sample quality in conditional and even unconditional scenarios. 
Moreover, PAG significantly improves the baseline performance in various downstream tasks where existing guidances such as CG or CFG cannot be fully utilized, including ControlNet with empty prompts and image restoration such as inpainting and deblurring.*

PAG can be used by specifying the `pag_applied_layers` as a parameter when instantiating a PAG pipeline. It can be a single string or a list of strings. Each string can be a unique layer identifier or a regular expression to identify one or more layers.

- Full identifier as a normal string: `down_blocks.2.attentions.0.transformer_blocks.0.attn1.processor`
- Full identifier as a RegEx: `down_blocks.2.(attentions|motion_modules).0.transformer_blocks.0.attn1.processor`
- Partial identifier as a RegEx: `down_blocks.2`, or `attn1`
- List of identifiers (can be a combination of strings and RegEx): `["blocks.1", "blocks.(14|20)", r"down_blocks\.(2|3)"]`

> [!WARNING]
> Since RegEx is supported as a way of matching layer identifiers, it is crucial to use it correctly; otherwise, the behaviour may be unexpected. The recommended way to use PAG is by specifying layers as `blocks.{layer_index}` and `blocks.({layer_index_1|layer_index_2|...})`. Using it in any other way, while doable, may bypass our basic validation checks and give you unexpected results.
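To get a feel for how the identifier forms above select layers, the sketch below matches a few patterns against sample layer names with Python's `re.search`. It is an illustration under stated assumptions: the layer names are hypothetical examples modeled on the identifiers listed above, `select_layers` is not part of Diffusers, and the matching rule (a search anywhere in the name) is assumed rather than taken from the pipeline source.

```python
import re

# Hypothetical layer names, modeled on the identifiers listed above.
layer_names = [
    "down_blocks.2.attentions.0.transformer_blocks.0.attn1.processor",
    "down_blocks.2.attentions.0.transformer_blocks.0.attn2.processor",
    "mid_block.attentions.0.transformer_blocks.0.attn1.processor",
    "up_blocks.1.attentions.2.transformer_blocks.0.attn1.processor",
]


def select_layers(patterns, names):
    """Return the names matched by a single pattern or by any pattern in a list."""
    if isinstance(patterns, str):
        patterns = [patterns]
    return [name for name in names if any(re.search(p, name) for p in patterns)]


# A full identifier selects exactly one layer.
full_match = select_layers(
    "down_blocks.2.attentions.0.transformer_blocks.0.attn1.processor", layer_names
)
# A partial identifier selects every layer whose name contains it.
partial_match = select_layers("attn1", layer_names)
# A list can mix plain strings and regular expressions.
combined = select_layers(["mid_block", r"up_blocks\.(0|1)"], layer_names)

print(len(full_match), len(partial_match), len(combined))
```

Note that an unescaped `.` matches any character, so a broad partial pattern can select more layers than intended. Checking a pattern against the layer names of your own model before passing it as `pag_applied_layers` helps avoid the surprises mentioned in the warning.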
## AnimateDiffPAGPipeline [[autodoc]] AnimateDiffPAGPipeline - all - __call__ ## HunyuanDiTPAGPipeline [[autodoc]] HunyuanDiTPAGPipeline - all - __call__ ## KolorsPAGPipeline [[autodoc]] KolorsPAGPipeline - all - __call__ ## StableDiffusionPAGInpaintPipeline [[autodoc]] StableDiffusionPAGInpaintPipeline - all - __call__ ## StableDiffusionPAGPipeline [[autodoc]] StableDiffusionPAGPipeline - all - __call__ ## StableDiffusionPAGImg2ImgPipeline [[autodoc]] StableDiffusionPAGImg2ImgPipeline - all - __call__ ## StableDiffusionControlNetPAGPipeline [[autodoc]] StableDiffusionControlNetPAGPipeline ## StableDiffusionControlNetPAGInpaintPipeline [[autodoc]] StableDiffusionControlNetPAGInpaintPipeline - all - __call__ ## StableDiffusionXLPAGPipeline [[autodoc]] StableDiffusionXLPAGPipeline - all - __call__ ## StableDiffusionXLPAGImg2ImgPipeline [[autodoc]] StableDiffusionXLPAGImg2ImgPipeline - all - __call__ ## StableDiffusionXLPAGInpaintPipeline [[autodoc]] StableDiffusionXLPAGInpaintPipeline - all - __call__ ## StableDiffusionXLControlNetPAGPipeline [[autodoc]] StableDiffusionXLControlNetPAGPipeline - all - __call__ ## StableDiffusionXLControlNetPAGImg2ImgPipeline [[autodoc]] StableDiffusionXLControlNetPAGImg2ImgPipeline - all - __call__ ## StableDiffusion3PAGPipeline [[autodoc]] StableDiffusion3PAGPipeline - all - __call__ ## StableDiffusion3PAGImg2ImgPipeline [[autodoc]] StableDiffusion3PAGImg2ImgPipeline - all - __call__ ## PixArtSigmaPAGPipeline [[autodoc]] PixArtSigmaPAGPipeline - all - __call__ ================================================ FILE: docs/source/en/api/pipelines/paint_by_example.md ================================================ > [!WARNING] > This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. 
# Paint by Example [Paint by Example: Exemplar-based Image Editing with Diffusion Models](https://huggingface.co/papers/2211.13227) is by Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, Fang Wen. The abstract from the paper is: *Language-guided image editing has achieved great success recently. In this paper, for the first time, we investigate exemplar-guided image editing for more precise control. We achieve this goal by leveraging self-supervised training to disentangle and re-organize the source image and the exemplar. However, the naive approach will cause obvious fusing artifacts. We carefully analyze it and propose an information bottleneck and strong augmentations to avoid the trivial solution of directly copying and pasting the exemplar image. Meanwhile, to ensure the controllability of the editing process, we design an arbitrary shape mask for the exemplar image and leverage the classifier-free guidance to increase the similarity to the exemplar image. The whole framework involves a single forward of the diffusion model without any iterative optimization. We demonstrate that our method achieves an impressive performance and enables controllable editing on in-the-wild images with high fidelity.* The original codebase can be found at [Fantasy-Studio/Paint-by-Example](https://github.com/Fantasy-Studio/Paint-by-Example), and you can try it out in a [demo](https://huggingface.co/spaces/Fantasy-Studio/Paint-by-Example). ## Tips Paint by Example is supported by the official [Fantasy-Studio/Paint-by-Example](https://huggingface.co/Fantasy-Studio/Paint-by-Example) checkpoint. The checkpoint is warm-started from [CompVis/stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4) to inpaint partly masked images conditioned on example and reference images. 
> [!TIP] > Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. ## PaintByExamplePipeline [[autodoc]] PaintByExamplePipeline - all - __call__ ## StableDiffusionPipelineOutput [[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/panorama.md ================================================ > [!WARNING] > This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. # MultiDiffusion
[MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation](https://huggingface.co/papers/2302.08113) is by Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel.

The abstract from the paper is:

*Recent advances in text-to-image generation with diffusion models present transformative capabilities in image quality. However, user controllability of the generated image, and fast adaptation to new tasks still remains an open challenge, currently mostly addressed by costly and long re-training and fine-tuning or ad-hoc adaptations to specific image generation tasks. In this work, we present MultiDiffusion, a unified framework that enables versatile and controllable image generation, using a pre-trained text-to-image diffusion model, without any further training or finetuning. At the center of our approach is a new generation process, based on an optimization task that binds together multiple diffusion generation processes with a shared set of parameters or constraints. We show that MultiDiffusion can be readily applied to generate high quality and diverse images that adhere to user-provided controls, such as desired aspect ratio (e.g., panorama), and spatial guiding signals, ranging from tight segmentation masks to bounding boxes.*

You can find additional information about MultiDiffusion on the [project page](https://multidiffusion.github.io/), in the [original codebase](https://github.com/omerbt/MultiDiffusion), and try it out in a [demo](https://huggingface.co/spaces/weizmannscience/MultiDiffusion).

## Tips

When calling [`StableDiffusionPanoramaPipeline`], you can set the `view_batch_size` parameter to a value greater than 1. On high-performance GPUs, this can speed up generation at the cost of higher VRAM usage.

To generate panorama-like images, make sure you pass the `width` parameter accordingly. We recommend a width value of 2048, which is the default.
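
To give a feel for what `view_batch_size` trades off, the sketch below counts the overlapping latent windows ("views") that a 2048-pixel-wide panorama could be split into and groups them into batches. The window size (64 latents), stride (8 latents), and VAE downsampling factor (8) are assumptions chosen for illustration, and `get_views` and `batch_views` are hypothetical helpers, not a description of the pipeline's exact internals.

```python
def get_views(height, width, window_size=64, stride=8, vae_scale_factor=8):
    """List (h_start, h_end, w_start, w_end) latent-space windows covering the image.

    Assumed values: 64x64 latent windows, a stride of 8 latents, and a VAE
    that downsamples by a factor of 8.
    """
    latent_h = height // vae_scale_factor
    latent_w = width // vae_scale_factor
    rows = (latent_h - window_size) // stride + 1 if latent_h > window_size else 1
    cols = (latent_w - window_size) // stride + 1 if latent_w > window_size else 1
    views = []
    for i in range(rows * cols):
        h_start = (i // cols) * stride
        w_start = (i % cols) * stride
        views.append((h_start, h_start + window_size, w_start, w_start + window_size))
    return views


def batch_views(views, view_batch_size):
    """Group views so that each group can be denoised in one forward pass."""
    return [views[i:i + view_batch_size] for i in range(0, len(views), view_batch_size)]


views = get_views(height=512, width=2048)
batches = batch_views(views, view_batch_size=4)
print(len(views), "views in", len(batches), "batches")
```

With these assumed values, a larger `view_batch_size` means fewer forward passes per denoising step, with each pass holding more views in memory at once.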
Circular padding removes stitching artifacts in panoramas by ensuring a seamless transition from the rightmost part of the image to the leftmost part. When it is enabled (`circular_padding=True`), the operation applies additional crops after the rightmost point of the image, allowing the model to "see" the transition from the rightmost part to the leftmost part. This helps maintain visual consistency in a 360-degree sense and creates a proper "panorama" that can be viewed using 360-degree panorama viewers. When decoding latents in Stable Diffusion, circular padding is applied to ensure that the decoded latents match in the RGB space.

For example, without circular padding, there is a stitching artifact (default):
![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/indoor_%20no_circular_padding.png)

But with circular padding, the right and the left parts match (`circular_padding=True`):
![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/indoor_%20circular_padding.png)

> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

## StableDiffusionPanoramaPipeline
[[autodoc]] StableDiffusionPanoramaPipeline
	- __call__
	- all

## StableDiffusionPipelineOutput
[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput

================================================
FILE: docs/source/en/api/pipelines/pia.md
================================================
> [!WARNING]
> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it.
If you run into any issues, reinstall the last Diffusers version that supported this model. # Image-to-Video Generation with PIA (Personalized Image Animator)
## Overview [PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models](https://huggingface.co/papers/2312.13964) by Yiming Zhang, Zhening Xing, Yanhong Zeng, Youqing Fang, Kai Chen Recent advancements in personalized text-to-image (T2I) models have revolutionized content creation, empowering non-experts to generate stunning images with unique styles. While promising, adding realistic motions into these personalized images by text poses significant challenges in preserving distinct styles, high-fidelity details, and achieving motion controllability by text. In this paper, we present PIA, a Personalized Image Animator that excels in aligning with condition images, achieving motion controllability by text, and the compatibility with various personalized T2I models without specific tuning. To achieve these goals, PIA builds upon a base T2I model with well-trained temporal alignment layers, allowing for the seamless transformation of any personalized T2I model into an image animation model. A key component of PIA is the introduction of the condition module, which utilizes the condition frame and inter-frame affinity as input to transfer appearance information guided by the affinity hint for individual frame synthesis in the latent space. This design mitigates the challenges of appearance-related image alignment within and allows for a stronger focus on aligning with motion-related guidance. [Project page](https://pi-animator.github.io/) ## Available Pipelines | Pipeline | Tasks | Demo |---|---|:---:| | [PIAPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pia/pipeline_pia.py) | *Image-to-Video Generation with PIA* | ## Available checkpoints Motion Adapter checkpoints for PIA can be found under the [OpenMMLab org](https://huggingface.co/openmmlab/PIA-condition-adapter). 
These checkpoints are meant to work with any model based on Stable Diffusion 1.5.

## Usage example

PIA works with a MotionAdapter checkpoint and a Stable Diffusion 1.5 model checkpoint. The MotionAdapter is a collection of Motion Modules that are responsible for adding coherent motion across image frames. These modules are applied after the ResNet and Attention blocks in the Stable Diffusion UNet. In addition to the motion modules, PIA also replaces the input convolution layer of the SD 1.5 UNet model with a 9-channel input convolution layer.

The following example demonstrates how to use PIA to generate a video from a single image.

```python
import torch
from diffusers import (
    EulerDiscreteScheduler,
    MotionAdapter,
    PIAPipeline,
)
from diffusers.utils import export_to_gif, load_image

adapter = MotionAdapter.from_pretrained("openmmlab/PIA-condition-adapter")
pipe = PIAPipeline.from_pretrained("SG161222/Realistic_Vision_V6.0_B1_noVAE", motion_adapter=adapter, torch_dtype=torch.float16)

pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()
pipe.enable_vae_slicing()

image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/pix2pix/cat_6.png?download=true"
)
image = image.resize((512, 512))
prompt = "cat in a field"
negative_prompt = "wrong white balance, dark, sketches,worst quality,low quality"

generator = torch.Generator("cpu").manual_seed(0)
output = pipe(image=image, prompt=prompt, negative_prompt=negative_prompt, generator=generator)
frames = output.frames[0]
export_to_gif(frames, "pia-animation.gif")
```

Here are some sample outputs:
*Sample output for the prompt "cat in a field" (image not shown).*
> [!TIP]
> If you plan on using a scheduler that can clip samples, make sure to disable clipping by setting `clip_sample=False` in the scheduler, as clipping can have an adverse effect on generated samples. Additionally, the PIA checkpoints can be sensitive to the beta schedule of the scheduler. We recommend setting this to `linear`.

## Using FreeInit

[FreeInit: Bridging Initialization Gap in Video Diffusion Models](https://huggingface.co/papers/2312.07537) by Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, Ziwei Liu.

FreeInit is an effective method that improves the temporal consistency and overall quality of videos generated using video diffusion models without any additional training. It can be applied to PIA, AnimateDiff, ModelScope, VideoCrafter and various other video generation models seamlessly at inference time, and works by iteratively refining the latent-initialization noise. More details can be found in the paper.

The following example demonstrates the usage of FreeInit.

```python
import torch
from diffusers import (
    DDIMScheduler,
    MotionAdapter,
    PIAPipeline,
)
from diffusers.utils import export_to_gif, load_image

adapter = MotionAdapter.from_pretrained("openmmlab/PIA-condition-adapter")
pipe = PIAPipeline.from_pretrained("SG161222/Realistic_Vision_V6.0_B1_noVAE", motion_adapter=adapter)

# enable FreeInit
# Refer to the enable_free_init documentation for a full list of configurable parameters
pipe.enable_free_init(method="butterworth", use_fast_sampling=True)

# Memory saving options
pipe.enable_model_cpu_offload()
pipe.enable_vae_slicing()

pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/pix2pix/cat_6.png?download=true"
)
image = image.resize((512, 512))
prompt = "cat in a field"
negative_prompt = "wrong white balance, dark, sketches,worst quality,low quality"

generator = torch.Generator("cpu").manual_seed(0)
output = pipe(image=image, prompt=prompt, negative_prompt=negative_prompt, generator=generator)
frames = output.frames[0]
export_to_gif(frames, "pia-freeinit-animation.gif")
```
*Sample output with FreeInit for the prompt "cat in a field" (image not shown).*
> [!WARNING] > FreeInit is not really free - the improved quality comes at the cost of extra computation. It requires sampling a few extra times depending on the `num_iters` parameter that is set when enabling it. Setting the `use_fast_sampling` parameter to `True` can improve the overall performance (at the cost of lower quality compared to when `use_fast_sampling=False` but still better results than vanilla video generation models). ## PIAPipeline [[autodoc]] PIAPipeline - all - __call__ - enable_freeu - disable_freeu - enable_free_init - disable_free_init - enable_vae_slicing - disable_vae_slicing - enable_vae_tiling - disable_vae_tiling ## PIAPipelineOutput [[autodoc]] pipelines.pia.PIAPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/pix2pix.md ================================================ # InstructPix2Pix
[InstructPix2Pix: Learning to Follow Image Editing Instructions](https://huggingface.co/papers/2211.09800) is by Tim Brooks, Aleksander Holynski and Alexei A. Efros. The abstract from the paper is: *We propose a method for editing images from human instructions: given an input image and a written instruction that tells the model what to do, our model follows these instructions to edit the image. To obtain training data for this problem, we combine the knowledge of two large pretrained models -- a language model (GPT-3) and a text-to-image model (Stable Diffusion) -- to generate a large dataset of image editing examples. Our conditional diffusion model, InstructPix2Pix, is trained on our generated data, and generalizes to real images and user-written instructions at inference time. Since it performs edits in the forward pass and does not require per example fine-tuning or inversion, our model edits images quickly, in a matter of seconds. We show compelling editing results for a diverse collection of input images and written instructions.* You can find additional information about InstructPix2Pix on the [project page](https://www.timothybrooks.com/instruct-pix2pix), [original codebase](https://github.com/timothybrooks/instruct-pix2pix), and try it out in a [demo](https://huggingface.co/spaces/timbrooks/instruct-pix2pix). > [!TIP] > Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. 
## StableDiffusionInstructPix2PixPipeline [[autodoc]] StableDiffusionInstructPix2PixPipeline - __call__ - all - load_textual_inversion - load_lora_weights - save_lora_weights ## StableDiffusionXLInstructPix2PixPipeline [[autodoc]] StableDiffusionXLInstructPix2PixPipeline - __call__ - all ================================================ FILE: docs/source/en/api/pipelines/pixart.md ================================================ # PixArt-α ![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/pixart/header_collage.png) [PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis](https://huggingface.co/papers/2310.00426) is Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. The abstract from the paper is: *The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering the fundamental innovation for the AIGC community while increasing CO2 emissions. This paper introduces PIXART-α, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px resolution with low training cost, as shown in Figure 1 and 2. 
To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) High-informative data: We emphasize the significance of concept density in text-image pairs and leverage a large Vision-Language model to auto-label dense pseudo-captions to assist text-image alignment learning. As a result, PIXART-α's training speed markedly surpasses existing large-scale T2I models, e.g., PIXART-α only takes 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly $300,000 ($26,000 vs. $320,000) and reducing 90% CO2 emissions. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1%. Extensive experiments demonstrate that PIXART-α excels in image quality, artistry, and semantic control. We hope PIXART-α will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch.* You can find the original codebase at [PixArt-alpha/PixArt-alpha](https://github.com/PixArt-alpha/PixArt-alpha) and all the available checkpoints at [PixArt-alpha](https://huggingface.co/PixArt-alpha). Some notes about this pipeline: * It uses a Transformer backbone (instead of a UNet) for denoising. As such it has a similar architecture as [DiT](./dit). * It was trained using text conditions computed from T5. This aspect makes the pipeline better at following complex text prompts with intricate details. * It is good at producing high-resolution images at different aspect ratios. 
To get the best results, the authors recommend some size brackets which can be found [here](https://github.com/PixArt-alpha/PixArt-alpha/blob/08fbbd281ec96866109bdd2cdb75f2f58fb17610/diffusion/data/datasets/utils.py).
* It rivals the quality of state-of-the-art text-to-image generation systems (as of this writing) such as Stable Diffusion XL, Imagen, and DALL-E 2, while being more efficient than them.

> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

## Inference with under 8GB GPU VRAM

Run the [`PixArtAlphaPipeline`] with under 8GB GPU VRAM by loading the text encoder in 8-bit precision. Let's walk through a full-fledged example.

First, install the [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) library:

```bash
pip install -U bitsandbytes
```

Then load the text encoder in 8-bit:

```python
from transformers import T5EncoderModel
from diffusers import PixArtAlphaPipeline
import torch

text_encoder = T5EncoderModel.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS",
    subfolder="text_encoder",
    load_in_8bit=True,
    device_map="auto",
)
pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS",
    text_encoder=text_encoder,
    transformer=None,
    device_map="balanced"
)
```

Now, use the `pipe` to encode a prompt:

```python
with torch.no_grad():
    prompt = "cute cat"
    prompt_embeds, prompt_attention_mask, negative_embeds, negative_prompt_attention_mask = pipe.encode_prompt(prompt)
```

Since the text embeddings have been computed, remove the `text_encoder` and `pipe` from memory, and free up some GPU VRAM:

```python
import gc

def flush():
    gc.collect()
    torch.cuda.empty_cache()

del text_encoder
del pipe
flush()
```

Then compute the latents with the prompt embeddings as inputs:

```python
pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS",
    text_encoder=None,
    torch_dtype=torch.float16,
).to("cuda")

latents = pipe(
    negative_prompt=None,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    prompt_attention_mask=prompt_attention_mask,
    negative_prompt_attention_mask=negative_prompt_attention_mask,
    num_images_per_prompt=1,
    output_type="latent",
).images

del pipe.transformer
flush()
```

> [!TIP]
> Notice that while initializing `pipe`, you're setting `text_encoder` to `None` so that it's not loaded.

Once the latents are computed, pass them to the VAE to decode into a real image:

```python
with torch.no_grad():
    image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor, return_dict=False)[0]
image = pipe.image_processor.postprocess(image, output_type="pil")[0]
image.save("cat.png")
```

By deleting components you aren't using and flushing the GPU VRAM, you should be able to run [`PixArtAlphaPipeline`] with under 8GB GPU VRAM.

![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/pixart/8bits_cat.png)

If you want a report of your memory usage, run this [script](https://gist.github.com/sayakpaul/3ae0f847001d342af27018a96f467e4e).

> [!WARNING]
> Text embeddings computed in 8-bit can impact the quality of the generated images because of the information loss in the representation space caused by the reduced precision. It's recommended to compare the outputs with and without 8-bit.

While loading the `text_encoder`, you set `load_in_8bit` to `True`. You could also specify `load_in_4bit` to bring your memory requirements down even further to under 7GB.
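
As a rough illustration of why lower precision helps, the sketch below estimates the memory needed just to hold the text encoder weights at different bit widths. The parameter count (about 4.7 billion) is an assumption made for illustration and does not come from this page; the estimate also ignores activations, quantization overhead, and the other pipeline components, so it will not reproduce the totals quoted above.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumed parameter count for the text encoder (illustrative only).
ASSUMED_TEXT_ENCODER_PARAMS = 4.7e9

estimates = {
    name: weight_memory_gib(ASSUMED_TEXT_ENCODER_PARAMS, bits)
    for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]
}

for name, gib in estimates.items():
    print(f"{name}: ~{gib:.1f} GiB")
```

Halving the bits per parameter halves the weight footprint, which is the effect `load_in_8bit` and `load_in_4bit` rely on.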
## PixArtAlphaPipeline [[autodoc]] PixArtAlphaPipeline - all - __call__ ================================================ FILE: docs/source/en/api/pipelines/pixart_sigma.md ================================================ # PixArt-Σ ![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/pixart/header_collage_sigma.jpg) [PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation](https://huggingface.co/papers/2403.04692) is Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. The abstract from the paper is: *In this paper, we introduce PixArt-Σ, a Diffusion Transformer model (DiT) capable of directly generating images at 4K resolution. PixArt-Σ represents a significant advancement over its predecessor, PixArt-α, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-Σ is its training efficiency. Leveraging the foundational pre-training of PixArt-α, it evolves from the ‘weaker’ baseline to a ‘stronger’ model via incorporating higher quality data, a process we term “weak-to-strong training”. The advancements in PixArt-Σ are twofold: (1) High-Quality Training Data: PixArt-Σ incorporates superior-quality image data, paired with more precise and detailed image captions. (2) Efficient Token Compression: we propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. Thanks to these improvements, PixArt-Σ achieves superior image quality and user prompt adherence capabilities with significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models, such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). 
Moreover, PixArt-Σ’s capability to generate 4K images supports the creation of high-resolution posters and wallpapers, efficiently bolstering the production of high-quality visual content in industries such as film and gaming.*

You can find the original codebase at [PixArt-alpha/PixArt-sigma](https://github.com/PixArt-alpha/PixArt-sigma) and all the available checkpoints at [PixArt-alpha](https://huggingface.co/PixArt-alpha).

Some notes about this pipeline:

* It uses a Transformer backbone (instead of a UNet) for denoising. As such it has a similar architecture as [DiT](./dit).
* It was trained using text conditions computed from T5. This aspect makes the pipeline better at following complex text prompts with intricate details.
* It is good at producing high-resolution images at different aspect ratios. To get the best results, the authors recommend some size brackets which can be found [here](https://github.com/PixArt-alpha/PixArt-sigma/blob/master/diffusion/data/datasets/utils.py).
* It rivals the quality of state-of-the-art text-to-image generation systems (as of this writing) such as PixArt-α, Stable Diffusion XL, Playground V2.0 and DALL-E 3, while being more efficient than them.
* It can generate very high-resolution images, such as 2048px or even 4K.
* It shows that text-to-image models can grow from a weak model to a stronger one through several improvements (VAEs, datasets, and so on).

> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

> [!TIP]
> You can further improve generation quality by passing the generated image from [`PixArtSigmaPipeline`] to the [SDXL refiner](../../using-diffusers/sdxl#base-to-refiner-model) model.

## Inference with under 8GB GPU VRAM

Run the [`PixArtSigmaPipeline`] with under 8GB GPU VRAM by loading the text encoder in 8-bit precision. Let's walk through a full-fledged example.

First, install the [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) library:

```bash
pip install -U bitsandbytes
```

Then load the text encoder in 8-bit:

```python
from transformers import T5EncoderModel
from diffusers import PixArtSigmaPipeline
import torch

text_encoder = T5EncoderModel.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
    subfolder="text_encoder",
    load_in_8bit=True,
    device_map="auto",
)
pipe = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
    text_encoder=text_encoder,
    transformer=None,
    device_map="balanced"
)
```

Now, use the `pipe` to encode a prompt:

```python
with torch.no_grad():
    prompt = "cute cat"
    prompt_embeds, prompt_attention_mask, negative_embeds, negative_prompt_attention_mask = pipe.encode_prompt(prompt)
```

Since the text embeddings have been computed, remove the `text_encoder` and `pipe` from memory, and free up some GPU VRAM:

```python
import gc

def flush():
    gc.collect()
    torch.cuda.empty_cache()

del text_encoder
del pipe
flush()
```

Then compute the latents with the prompt embeddings as inputs:

```python
pipe = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
    text_encoder=None,
    torch_dtype=torch.float16,
).to("cuda")

latents = pipe(
    negative_prompt=None,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    prompt_attention_mask=prompt_attention_mask,
    negative_prompt_attention_mask=negative_prompt_attention_mask,
    num_images_per_prompt=1,
    output_type="latent",
).images

del pipe.transformer
flush()
```

> [!TIP]
> Notice that while initializing `pipe`, you're setting `text_encoder` to `None` so that it's not loaded.

Once the latents are computed, pass them to the VAE to decode into a real image:

```python
with torch.no_grad():
    image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor, return_dict=False)[0]
image = pipe.image_processor.postprocess(image, output_type="pil")[0]
image.save("cat.png")
```

By deleting components you aren't using and flushing the GPU VRAM, you should be able to run [`PixArtSigmaPipeline`] with under 8GB GPU VRAM.

![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/pixart/8bits_cat.png)

If you want a report of your memory usage, run this [script](https://gist.github.com/sayakpaul/3ae0f847001d342af27018a96f467e4e).

> [!WARNING]
> Text embeddings computed in 8-bit can impact the quality of the generated images because of the information loss in the representation space caused by the reduced precision. It's recommended to compare the outputs with and without 8-bit.

While loading the `text_encoder`, you set `load_in_8bit` to `True`. You could also specify `load_in_4bit` to bring your memory requirements down even further to under 7GB.

## PixArtSigmaPipeline

[[autodoc]] PixArtSigmaPipeline
	- all
	- __call__

================================================
FILE: docs/source/en/api/pipelines/prx.md
================================================
# PRX

PRX generates high-quality images from text using a simplified MMDIT architecture in which text tokens are not updated by the transformer blocks. It employs flow matching with discrete scheduling for efficient sampling and uses Google's T5Gemma-2B-2B-UL2 model for multi-language text encoding. The ~1.3B parameter transformer delivers fast inference without sacrificing quality. You can choose between Flux VAE (8x compression, 16 latent channels) for balanced quality and speed or DC-AE (32x compression, 32 latent channels) for higher latent compression and faster processing.
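
To make the trade-off between the two VAEs concrete, the sketch below computes latent tensor shapes for a 512×512 image from the compression factors and channel counts quoted above. The helper names are illustrative and not part of diffusers.

```python
def latent_shape(image_size, compression, latent_channels):
    """Return (channels, height, width) of the latent for a square image."""
    side = image_size // compression
    return (latent_channels, side, side)


def num_values(shape):
    """Total number of values in a latent of the given shape."""
    channels, height, width = shape
    return channels * height * width


flux_vae = latent_shape(512, compression=8, latent_channels=16)
dc_ae = latent_shape(512, compression=32, latent_channels=32)

print("Flux VAE:", flux_vae, num_values(flux_vae))
print("DC-AE:   ", dc_ae, num_values(dc_ae))
```

Under this arithmetic the DC-AE latent has 16 times fewer spatial positions and 8 times fewer values overall, which is consistent with the faster processing described above.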
## Available models PRX offers multiple variants with different VAE configurations, each optimized for specific resolutions. Base models excel with detailed prompts, capturing complex compositions and subtle details. Fine-tuned models trained on the [Alchemist dataset](https://huggingface.co/datasets/yandex/alchemist) improve aesthetic quality, especially with simpler prompts. | Model | Resolution | Fine-tuned | Distilled | Description | Suggested prompts | Suggested parameters | Recommended dtype | |:-----:|:-----------------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:| | [`Photoroom/prx-256-t2i`](https://huggingface.co/Photoroom/prx-256-t2i)| 256 | No | No | Base model pre-trained at 256 with Flux VAE|Works best with detailed prompts in natural language|28 steps, cfg=5.0| `torch.bfloat16` | | [`Photoroom/prx-256-t2i-sft`](https://huggingface.co/Photoroom/prx-256-t2i-sft)| 512 | Yes | No | Fine-tuned on the [Alchemist dataset](https://huggingface.co/datasets/yandex/alchemist) dataset with Flux VAE | Can handle less detailed prompts|28 steps, cfg=5.0| `torch.bfloat16` | | [`Photoroom/prx-512-t2i`](https://huggingface.co/Photoroom/prx-512-t2i)| 512 | No | No | Base model pre-trained at 512 with Flux VAE |Works best with detailed prompts in natural language|28 steps, cfg=5.0| `torch.bfloat16` | | [`Photoroom/prx-512-t2i-sft`](https://huggingface.co/Photoroom/prx-512-t2i-sft)| 512 | Yes | No | Fine-tuned on the [Alchemist dataset](https://huggingface.co/datasets/yandex/alchemist) dataset with Flux VAE | Can handle less detailed prompts in natural language|28 steps, cfg=5.0| `torch.bfloat16` | | [`Photoroom/prx-512-t2i-sft-distilled`](https://huggingface.co/Photoroom/prx-512-t2i-sft-distilled)| 512 | Yes | Yes | 8-step distilled model from [`Photoroom/prx-512-t2i-sft`](https://huggingface.co/Photoroom/prx-512-t2i-sft) | Can handle less detailed prompts in natural language|8 steps, cfg=1.0| `torch.bfloat16` | | 
[`Photoroom/prx-512-t2i-dc-ae`](https://huggingface.co/Photoroom/prx-512-t2i-dc-ae) | 512 | No | No | Base model pre-trained at 512 with [Deep Compression Autoencoder (DC-AE)](https://hanlab.mit.edu/projects/dc-ae) | Works best with detailed prompts in natural language | 28 steps, cfg=5.0 | `torch.bfloat16` |
| [`Photoroom/prx-512-t2i-dc-ae-sft`](https://huggingface.co/Photoroom/prx-512-t2i-dc-ae-sft) | 512 | Yes | No | Fine-tuned on the [Alchemist dataset](https://huggingface.co/datasets/yandex/alchemist) with [Deep Compression Autoencoder (DC-AE)](https://hanlab.mit.edu/projects/dc-ae) | Can handle less detailed prompts in natural language | 28 steps, cfg=5.0 | `torch.bfloat16` |
| [`Photoroom/prx-512-t2i-dc-ae-sft-distilled`](https://huggingface.co/Photoroom/prx-512-t2i-dc-ae-sft-distilled) | 512 | Yes | Yes | 8-step distilled model from [`Photoroom/prx-512-t2i-dc-ae-sft`](https://huggingface.co/Photoroom/prx-512-t2i-dc-ae-sft) | Can handle less detailed prompts in natural language | 8 steps, cfg=1.0 | `torch.bfloat16` |

Refer to [this](https://huggingface.co/collections/Photoroom/prx-models-68e66254c202ebfab99ad38e) collection for more information.

## Loading the pipeline

Load the pipeline with [`~DiffusionPipeline.from_pretrained`].

```py
import torch
from diffusers.pipelines.prx import PRXPipeline

# Load pipeline - VAE and text encoder will be loaded from HuggingFace
pipe = PRXPipeline.from_pretrained("Photoroom/prx-512-t2i-sft", torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A front-facing portrait of a lion in the golden savanna at sunset."
image = pipe(prompt, num_inference_steps=28, guidance_scale=5.0).images[0]
image.save("prx_output.png")
```

### Manual Component Loading

Load components individually to customize the pipeline, for instance to use quantized models.
```py
import torch
from diffusers.pipelines.prx import PRXPipeline
from diffusers.models import AutoencoderKL, AutoencoderDC
from diffusers.models.transformers.transformer_prx import PRXTransformer2DModel
from diffusers.schedulers import FlowMatchEulerDiscreteScheduler
from transformers import T5GemmaModel, GemmaTokenizerFast
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as BitsAndBytesConfig

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)

# Load transformer
transformer = PRXTransformer2DModel.from_pretrained(
    "checkpoints/prx-512-t2i-sft",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

# Load scheduler
scheduler = FlowMatchEulerDiscreteScheduler.from_pretrained(
    "checkpoints/prx-512-t2i-sft", subfolder="scheduler"
)

# Load T5Gemma text encoder
t5gemma_model = T5GemmaModel.from_pretrained(
    "google/t5gemma-2b-2b-ul2",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
text_encoder = t5gemma_model.encoder.to(dtype=torch.bfloat16)
tokenizer = GemmaTokenizerFast.from_pretrained("google/t5gemma-2b-2b-ul2")
tokenizer.model_max_length = 256

# Load VAE - choose either Flux VAE or DC-AE
# Flux VAE
vae = AutoencoderKL.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="vae",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipe = PRXPipeline(
    transformer=transformer,
    scheduler=scheduler,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    vae=vae,
)
pipe.to("cuda")
```

## Memory Optimization

For memory-constrained environments, enable one of the two CPU offloading modes:

```py
import torch
from diffusers.pipelines.prx import PRXPipeline

pipe = PRXPipeline.from_pretrained("Photoroom/prx-512-t2i-sft", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # Offload components to CPU when not in use

# Or, instead of the call above, use sequential CPU offload for even lower memory
# pipe.enable_sequential_cpu_offload()
```

## PRXPipeline

[[autodoc]] PRXPipeline
  - all
  - __call__

## PRXPipelineOutput
[[autodoc]] pipelines.prx.pipeline_output.PRXPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/qwenimage.md ================================================ # QwenImage
Qwen-Image from the Qwen team is an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. Experiments show strong general capabilities in both image generation and editing, with exceptional performance in text rendering, especially for Chinese. Qwen-Image comes in the following variants: | model type | model id | |:----------:|:--------:| | Qwen-Image | [`Qwen/Qwen-Image`](https://huggingface.co/Qwen/Qwen-Image) | | Qwen-Image-Edit | [`Qwen/Qwen-Image-Edit`](https://huggingface.co/Qwen/Qwen-Image-Edit) | | Qwen-Image-Edit Plus | [Qwen/Qwen-Image-Edit-2509](https://huggingface.co/Qwen/Qwen-Image-Edit-2509) | > [!TIP] > See the [Caching](../../optimization/cache) guide to speed up inference by storing and reusing intermediate outputs. ## LoRA for faster inference Use a LoRA from `lightx2v/Qwen-Image-Lightning` to speed up inference by reducing the number of steps. Refer to the code snippet below:
```py
from diffusers import DiffusionPipeline, FlowMatchEulerDiscreteScheduler
import torch
import math

ckpt_id = "Qwen/Qwen-Image"

# From
# https://github.com/ModelTC/Qwen-Image-Lightning/blob/342260e8f5468d2f24d084ce04f55e101007118b/generate_with_diffusers.py#L82C9-L97C10
scheduler_config = {
    "base_image_seq_len": 256,
    "base_shift": math.log(3),  # We use shift=3 in distillation
    "invert_sigmas": False,
    "max_image_seq_len": 8192,
    "max_shift": math.log(3),  # We use shift=3 in distillation
    "num_train_timesteps": 1000,
    "shift": 1.0,
    "shift_terminal": None,  # set shift_terminal to None
    "stochastic_sampling": False,
    "time_shift_type": "exponential",
    "use_beta_sigmas": False,
    "use_dynamic_shifting": True,
    "use_exponential_sigmas": False,
    "use_karras_sigmas": False,
}
scheduler = FlowMatchEulerDiscreteScheduler.from_config(scheduler_config)

pipe = DiffusionPipeline.from_pretrained(
    ckpt_id, scheduler=scheduler, torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights(
    "lightx2v/Qwen-Image-Lightning", weight_name="Qwen-Image-Lightning-8steps-V1.0.safetensors"
)

prompt = "a tiny astronaut hatching from an egg on the moon, Ultra HD, 4K, cinematic composition."
negative_prompt = " "
image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=1024,
    height=1024,
    num_inference_steps=8,
    true_cfg_scale=1.0,
    generator=torch.manual_seed(0),
).images[0]
image.save("qwen_fewsteps.png")
```
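The scheduler config above sets both `base_shift` and `max_shift` to `math.log(3)` with `time_shift_type="exponential"`, and its comments say this corresponds to "shift=3". The sketch below checks that correspondence numerically. It assumes the exponential time shift has the form `exp(mu) / (exp(mu) + (1 / sigma - 1))` and that a constant shift has the form `shift * sigma / (1 + (shift - 1) * sigma)`. Both formulas are written out by hand as assumptions about the scheduler's behavior rather than imported from Diffusers, and the function names and the sigma grid are illustrative only.

```python
import math


def exponential_time_shift(mu: float, sigma: float) -> float:
    # Assumed form of the "exponential" dynamic shift (exponent fixed at 1.0).
    return math.exp(mu) / (math.exp(mu) + (1.0 / sigma - 1.0))


def static_shift(shift: float, sigma: float) -> float:
    # Assumed form of a constant timestep shift.
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


# With base_shift == max_shift, interpolating between them gives the same mu
# for every image sequence length, so mu = log(3) at any resolution.
mu = math.log(3)

# Illustrative sigma grid (not necessarily the one the scheduler builds).
num_steps = 8
sigmas = [1.0 - i / num_steps for i in range(num_steps)]

for sigma in sigmas:
    dynamic = exponential_time_shift(mu, sigma)
    static = static_shift(3.0, sigma)
    # The two expressions agree because exp(log(3)) == 3.
    assert math.isclose(dynamic, static, rel_tol=1e-9)
    print(f"sigma={sigma:.3f} -> shifted={dynamic:.4f}")
```

If those assumed formulas hold, dynamic shifting with `mu = log(3)` reduces to a fixed shift of 3, independent of the requested width and height.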
> [!TIP]
> The `guidance_scale` parameter is present in the pipeline to support future guidance-distilled models; passing it currently has no effect. To enable classifier-free guidance, pass `true_cfg_scale` together with a `negative_prompt`. Even an empty negative prompt like " " enables the classifier-free guidance computations.

## Multi-image reference with QwenImageEditPlusPipeline

With [`QwenImageEditPlusPipeline`], you can provide multiple images as input references.

```py
import torch
from PIL import Image
from diffusers import QwenImageEditPlusPipeline
from diffusers.utils import load_image

pipe = QwenImageEditPlusPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit-2509", torch_dtype=torch.bfloat16
).to("cuda")

image_1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/grumpy.jpg")
image_2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peng.png")
image = pipe(
    image=[image_1, image_2],
    prompt='''put the penguin and the cat at a game show called "Qwen Edit Plus Games"''',
    num_inference_steps=50
).images[0]
```

## Performance

### torch.compile

Using `torch.compile` on the transformer provides a ~2.4x speedup (A100 80GB: 4.70s → 1.93s):

```python
import torch
from diffusers import QwenImagePipeline

pipe = QwenImagePipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16).to("cuda")
pipe.transformer = torch.compile(pipe.transformer)

# First call triggers compilation (~7s overhead)
# Subsequent calls run ~2.4x faster
image = pipe("a cat", num_inference_steps=50).images[0]
```

### Batched Inference with Variable-Length Prompts

When using classifier-free guidance (CFG) with prompts of different lengths, the pipeline handles padding through attention masking. This ensures padding tokens do not influence the generated output.
```python # CFG with different prompt lengths works correctly image = pipe( prompt="A cat", negative_prompt="blurry, low quality, distorted", true_cfg_scale=3.5, num_inference_steps=50, ).images[0] ``` For detailed benchmark scripts and results, see [this gist](https://gist.github.com/cdutr/bea337e4680268168550292d7819dc2f). ## QwenImagePipeline [[autodoc]] QwenImagePipeline - all - __call__ ## QwenImageImg2ImgPipeline [[autodoc]] QwenImageImg2ImgPipeline - all - __call__ ## QwenImageInpaintPipeline [[autodoc]] QwenImageInpaintPipeline - all - __call__ ## QwenImageEditPipeline [[autodoc]] QwenImageEditPipeline - all - __call__ ## QwenImageEditInpaintPipeline [[autodoc]] QwenImageEditInpaintPipeline - all - __call__ ## QwenImageControlNetPipeline [[autodoc]] QwenImageControlNetPipeline - all - __call__ ## QwenImageEditPlusPipeline [[autodoc]] QwenImageEditPlusPipeline - all - __call__ ## QwenImageLayeredPipeline [[autodoc]] QwenImageLayeredPipeline - all - __call__ ## QwenImagePipelineOutput [[autodoc]] pipelines.qwenimage.pipeline_output.QwenImagePipelineOutput ================================================ FILE: docs/source/en/api/pipelines/sana.md ================================================ # SanaPipeline
[SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers](https://huggingface.co/papers/2410.10629) from NVIDIA and MIT HAN Lab, by Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, Song Han. The abstract from the paper is: *We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replaced T5 with modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is very competitive with modern giant diffusion model (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. Sana enables content creation at low cost. 
Code and model will be publicly released.* > [!TIP] > Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. This pipeline was contributed by [lawrence-cj](https://github.com/lawrence-cj) and [chenjy2003](https://github.com/chenjy2003). The original codebase can be found [here](https://github.com/NVlabs/Sana). The original weights can be found under [hf.co/Efficient-Large-Model](https://huggingface.co/Efficient-Large-Model). Available models: | Model | Recommended dtype | |:-----:|:-----------------:| | [`Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers) | `torch.bfloat16` | | [`Efficient-Large-Model/Sana_1600M_1024px_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_diffusers) | `torch.float16` | | [`Efficient-Large-Model/Sana_1600M_1024px_MultiLing_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_MultiLing_diffusers) | `torch.float16` | | [`Efficient-Large-Model/Sana_1600M_512px_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_512px_diffusers) | `torch.float16` | | [`Efficient-Large-Model/Sana_1600M_512px_MultiLing_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_512px_MultiLing_diffusers) | `torch.float16` | | [`Efficient-Large-Model/Sana_600M_1024px_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_600M_1024px_diffusers) | `torch.float16` | | [`Efficient-Large-Model/Sana_600M_512px_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_600M_512px_diffusers) | `torch.float16` | Refer to [this](https://huggingface.co/collections/Efficient-Large-Model/sana-673efba2a57ed99843f11f9e) collection for 
more information.

Note: The recommended dtype mentioned is for the transformer weights. The text encoder and VAE weights must stay in `torch.bfloat16` or `torch.float32` for the model to work correctly. Please refer to the inference example below to see how to load the model with the recommended dtype.

> [!TIP]
> Make sure to pass the `variant` argument for downloaded checkpoints to use less disk space. Set it to `"fp16"` for models whose recommended dtype is `torch.float16`, and `"bf16"` for models whose recommended dtype is `torch.bfloat16`. By default, `torch.float32` weights are downloaded, which use twice the amount of disk storage. Additionally, `torch.float32` weights can be downcast on the fly by specifying the `torch_dtype` argument. Read about it in the [docs](https://huggingface.co/docs/diffusers/v0.31.0/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pretrained).

## Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on image quality depending on the model.

Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`SanaPipeline`] for inference with bitsandbytes.
```py import torch from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, SanaTransformer2DModel, SanaPipeline from transformers import BitsAndBytesConfig as BitsAndBytesConfig, AutoModel quant_config = BitsAndBytesConfig(load_in_8bit=True) text_encoder_8bit = AutoModel.from_pretrained( "Efficient-Large-Model/Sana_1600M_1024px_diffusers", subfolder="text_encoder", quantization_config=quant_config, torch_dtype=torch.float16, ) quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True) transformer_8bit = SanaTransformer2DModel.from_pretrained( "Efficient-Large-Model/Sana_1600M_1024px_diffusers", subfolder="transformer", quantization_config=quant_config, torch_dtype=torch.float16, ) pipeline = SanaPipeline.from_pretrained( "Efficient-Large-Model/Sana_1600M_1024px_diffusers", text_encoder=text_encoder_8bit, transformer=transformer_8bit, torch_dtype=torch.float16, device_map="balanced", ) prompt = "a tiny astronaut hatching from an egg on the moon" image = pipeline(prompt).images[0] image.save("sana.png") ``` ## SanaPipeline [[autodoc]] SanaPipeline - all - __call__ ## SanaPAGPipeline [[autodoc]] SanaPAGPipeline - all - __call__ ## SanaPipelineOutput [[autodoc]] pipelines.sana.pipeline_output.SanaPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/sana_sprint.md ================================================ # SANA-Sprint
[SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation](https://huggingface.co/papers/2503.09641) from NVIDIA, MIT HAN Lab, and Hugging Face by Junsong Chen, Shuchen Xue, Yuyang Zhao, Jincheng Yu, Sayak Paul, Junyu Chen, Han Cai, Enze Xie, Song Han The abstract from the paper is: *This paper presents SANA-Sprint, an efficient diffusion model for ultra-fast text-to-image (T2I) generation. SANA-Sprint is built on a pre-trained foundation model and augmented with hybrid distillation, dramatically reducing inference steps from 20 to 1-4. We introduce three key innovations: (1) We propose a training-free approach that transforms a pre-trained flow-matching model for continuous-time consistency distillation (sCM), eliminating costly training from scratch and achieving high training efficiency. Our hybrid distillation strategy combines sCM with latent adversarial distillation (LADD): sCM ensures alignment with the teacher model, while LADD enhances single-step generation fidelity. (2) SANA-Sprint is a unified step-adaptive model that achieves high-quality generation in 1-4 steps, eliminating step-specific training and improving efficiency. (3) We integrate ControlNet with SANA-Sprint for real-time interactive image generation, enabling instant visual feedback for user interaction. SANA-Sprint establishes a new Pareto frontier in speed-quality tradeoffs, achieving state-of-the-art performance with 7.59 FID and 0.74 GenEval in only 1 step — outperforming FLUX-schnell (7.94 FID / 0.71 GenEval) while being 10× faster (0.1s vs 1.1s on H100). It also achieves 0.1s (T2I) and 0.25s (ControlNet) latency for 1024×1024 images on H100, and 0.31s (T2I) on an RTX 4090, showcasing its exceptional efficiency and potential for AI-powered consumer applications (AIPC). 
Code and pre-trained models will be open-sourced.*

This pipeline was contributed by [lawrence-cj](https://github.com/lawrence-cj), [Shuchen Xue](https://github.com/scxue) and [Enze Xie](https://github.com/xieenze). The original codebase can be found [here](https://github.com/NVlabs/Sana). The original weights can be found under [hf.co/Efficient-Large-Model](https://huggingface.co/Efficient-Large-Model/).

Available models:

| Model | Recommended dtype |
|:-----:|:-----------------:|
| [`Efficient-Large-Model/Sana_Sprint_1.6B_1024px_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_Sprint_1.6B_1024px_diffusers) | `torch.bfloat16` |
| [`Efficient-Large-Model/Sana_Sprint_0.6B_1024px_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_Sprint_0.6B_1024px_diffusers) | `torch.bfloat16` |

Refer to [this](https://huggingface.co/collections/Efficient-Large-Model/sana-sprint-67d6810d65235085b3b17c76) collection for more information.

Note: The recommended dtype mentioned is for the transformer weights. The text encoder must stay in `torch.bfloat16` and VAE weights must stay in `torch.bfloat16` or `torch.float32` for the model to work correctly. Please refer to the inference example below to see how to load the model with the recommended dtype.

## Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on image quality depending on the model.

Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`SanaSprintPipeline`] for inference with bitsandbytes.
```py import torch from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, SanaTransformer2DModel, SanaSprintPipeline from transformers import BitsAndBytesConfig as BitsAndBytesConfig, AutoModel quant_config = BitsAndBytesConfig(load_in_8bit=True) text_encoder_8bit = AutoModel.from_pretrained( "Efficient-Large-Model/Sana_Sprint_1.6B_1024px_diffusers", subfolder="text_encoder", quantization_config=quant_config, torch_dtype=torch.bfloat16, ) quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True) transformer_8bit = SanaTransformer2DModel.from_pretrained( "Efficient-Large-Model/Sana_Sprint_1.6B_1024px_diffusers", subfolder="transformer", quantization_config=quant_config, torch_dtype=torch.bfloat16, ) pipeline = SanaSprintPipeline.from_pretrained( "Efficient-Large-Model/Sana_Sprint_1.6B_1024px_diffusers", text_encoder=text_encoder_8bit, transformer=transformer_8bit, torch_dtype=torch.bfloat16, device_map="balanced", ) prompt = "a tiny astronaut hatching from an egg on the moon" image = pipeline(prompt).images[0] image.save("sana.png") ``` ## Setting `max_timesteps` Users can tweak the `max_timesteps` value for experimenting with the visual quality of the generated outputs. The default `max_timesteps` value was obtained with an inference-time search process. For more details about it, check out the paper. ## Image to Image The [`SanaSprintImg2ImgPipeline`] is a pipeline for image-to-image generation. It takes an input image and a prompt, and generates a new image based on the input image and the prompt. 
```py import torch from diffusers import SanaSprintImg2ImgPipeline from diffusers.utils.loading_utils import load_image image = load_image( "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/penguin.png" ) pipe = SanaSprintImg2ImgPipeline.from_pretrained( "Efficient-Large-Model/Sana_Sprint_1.6B_1024px_diffusers", torch_dtype=torch.bfloat16) pipe.to("cuda") image = pipe( prompt="a cute pink bear", image=image, strength=0.5, height=832, width=480 ).images[0] image.save("output.png") ``` ## SanaSprintPipeline [[autodoc]] SanaSprintPipeline - all - __call__ ## SanaSprintImg2ImgPipeline [[autodoc]] SanaSprintImg2ImgPipeline - all - __call__ ## SanaPipelineOutput [[autodoc]] pipelines.sana.pipeline_output.SanaPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/sana_video.md ================================================ # Sana-Video
[SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer](https://huggingface.co/papers/2509.24695) from NVIDIA and MIT HAN Lab, by Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie. The abstract from the paper is: *We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on RTX 5090 GPU. Two core designs ensure our efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). 
In summary, SANA-Video enables low-cost, high-quality video generation. [this https URL](https://github.com/NVlabs/SANA).*

This pipeline was contributed by the SANA Team. The original codebase can be found [here](https://github.com/NVlabs/Sana). The original weights can be found under [hf.co/Efficient-Large-Model](https://hf.co/collections/Efficient-Large-Model/sana-video).

Available models:

| Model | Recommended dtype |
|:-----:|:-----------------:|
| [`Efficient-Large-Model/SANA-Video_2B_480p_diffusers`](https://huggingface.co/Efficient-Large-Model/SANA-Video_2B_480p_diffusers) | `torch.bfloat16` |

Refer to [this](https://huggingface.co/collections/Efficient-Large-Model/sana-video) collection for more information.

Note: The recommended dtype mentioned is for the transformer weights. The text encoder and VAE weights must stay in `torch.bfloat16` or `torch.float32` for the model to work correctly. Please refer to the inference example below to see how to load the model with the recommended dtype.

## Generation Pipelines

The example below demonstrates how to use the text-to-video pipeline to generate a video using a text description.

```python
import torch
from diffusers import SanaVideoPipeline
from diffusers.utils import export_to_video

pipe = SanaVideoPipeline.from_pretrained(
    "Efficient-Large-Model/SANA-Video_2B_480p_diffusers",
    torch_dtype=torch.bfloat16,
)
# Keep the text encoder in bfloat16 and the VAE in float32, as noted above
pipe.text_encoder.to(torch.bfloat16)
pipe.vae.to(torch.float32)
pipe.to("cuda")

prompt = "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window."
negative_prompt = "A chaotic sequence with misshapen, deformed limbs in heavy motion blur, sudden disappearance, jump cuts, jerky movements, rapid shot changes, frames out of sync, inconsistent character shapes, temporal artifacts, jitter, and ghosting effects, creating a disorienting visual experience."

# The motion score is appended to the prompt as plain text
motion_scale = 30
motion_prompt = f" motion score: {motion_scale}."
prompt = prompt + motion_prompt

video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=480,
    width=832,
    frames=81,
    guidance_scale=6,
    num_inference_steps=50,
    generator=torch.Generator(device="cuda").manual_seed(0),
).frames[0]

export_to_video(video, "sana_video.mp4", fps=16)
```

The example below demonstrates how to use the image-to-video pipeline to generate a video using a text description and a starting frame.

```python
import torch
from diffusers import FlowMatchEulerDiscreteScheduler, SanaImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = SanaImageToVideoPipeline.from_pretrained(
    "Efficient-Large-Model/SANA-Video_2B_480p_diffusers",
    torch_dtype=torch.bfloat16,
)
pipe.scheduler = FlowMatchEulerDiscreteScheduler.from_config(pipe.scheduler.config, flow_shift=8.0)
pipe.vae.to(torch.float32)
pipe.text_encoder.to(torch.bfloat16)
pipe.to("cuda")

image = load_image("https://raw.githubusercontent.com/NVlabs/Sana/refs/heads/main/asset/samples/i2v-1.png")
prompt = "A woman stands against a stunning sunset backdrop, her long, wavy brown hair gently blowing in the breeze. She wears a sleeveless, light-colored blouse with a deep V-neckline, which accentuates her graceful posture. The warm hues of the setting sun cast a golden glow across her face and hair, creating a serene and ethereal atmosphere. The background features a blurred landscape with soft, rolling hills and scattered clouds, adding depth to the scene. The camera remains steady, capturing the tranquil moment from a medium close-up angle."
negative_prompt = "A chaotic sequence with misshapen, deformed limbs in heavy motion blur, sudden disappearance, jump cuts, jerky movements, rapid shot changes, frames out of sync, inconsistent character shapes, temporal artifacts, jitter, and ghosting effects, creating a disorienting visual experience."

# The motion score is appended to the prompt as plain text
motion_scale = 30
motion_prompt = f" motion score: {motion_scale}."
prompt = prompt + motion_prompt motion_scale = 30.0 video = pipe( image=image, prompt=prompt, negative_prompt=negative_prompt, height=480, width=832, frames=81, guidance_scale=6, num_inference_steps=50, generator=torch.Generator(device="cuda").manual_seed(0), ).frames[0] export_to_video(video, "sana-i2v.mp4", fps=16) ``` ## Quantization Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model. Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`SanaVideoPipeline`] for inference with bitsandbytes. ```py import torch from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, SanaVideoTransformer3DModel, SanaVideoPipeline from transformers import BitsAndBytesConfig as BitsAndBytesConfig, AutoModel quant_config = BitsAndBytesConfig(load_in_8bit=True) text_encoder_8bit = AutoModel.from_pretrained( "Efficient-Large-Model/SANA-Video_2B_480p_diffusers", subfolder="text_encoder", quantization_config=quant_config, torch_dtype=torch.float16, ) quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True) transformer_8bit = SanaVideoTransformer3DModel.from_pretrained( "Efficient-Large-Model/SANA-Video_2B_480p_diffusers", subfolder="transformer", quantization_config=quant_config, torch_dtype=torch.float16, ) pipeline = SanaVideoPipeline.from_pretrained( "Efficient-Large-Model/SANA-Video_2B_480p_diffusers", text_encoder=text_encoder_8bit, transformer=transformer_8bit, torch_dtype=torch.float16, device_map="balanced", ) model_score = 30 prompt = "Evening, backlight, side lighting, soft light, high contrast, mid-shot, centered composition, clean solo shot, warm color. 
A young Caucasian man stands in a forest, golden light glimmers on his hair as sunlight filters through the leaves. He wears a light shirt, wind gently blowing his hair and collar, light dances across his face with his movements. The background is blurred, with dappled light and soft tree shadows in the distance. The camera focuses on his lifted gaze, clear and emotional." negative_prompt = "A chaotic sequence with misshapen, deformed limbs in heavy motion blur, sudden disappearance, jump cuts, jerky movements, rapid shot changes, frames out of sync, inconsistent character shapes, temporal artifacts, jitter, and ghosting effects, creating a disorienting visual experience." motion_prompt = f" motion score: {model_score}." prompt = prompt + motion_prompt output = pipeline( prompt=prompt, negative_prompt=negative_prompt, height=480, width=832, num_frames=81, guidance_scale=6.0, num_inference_steps=50 ).frames[0] export_to_video(output, "sana-video-output.mp4", fps=16) ``` ## SanaVideoPipeline [[autodoc]] SanaVideoPipeline - all - __call__ ## SanaImageToVideoPipeline [[autodoc]] SanaImageToVideoPipeline - all - __call__ ## SanaVideoPipelineOutput [[autodoc]] pipelines.sana_video.pipeline_sana_video.SanaVideoPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/self_attention_guidance.md ================================================ > [!WARNING] > This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. # Self-Attention Guidance [Improving Sample Quality of Diffusion Models Using Self-Attention Guidance](https://huggingface.co/papers/2210.00939) is by Susung Hong et al. The abstract from the paper is: *Denoising diffusion models (DDMs) have attracted attention for their exceptional generation quality and diversity. 
This success is largely attributed to the use of class- or text-conditional diffusion guidance methods, such as classifier and classifier-free guidance. In this paper, we present a more comprehensive perspective that goes beyond the traditional guidance methods. From this generalized perspective, we introduce novel condition- and training-free strategies to enhance the quality of generated images. As a simple solution, blur guidance improves the suitability of intermediate samples for their fine-scale information and structures, enabling diffusion models to generate higher quality samples with a moderate guidance scale. Improving upon this, Self-Attention Guidance (SAG) uses the intermediate self-attention maps of diffusion models to enhance their stability and efficacy. Specifically, SAG adversarially blurs only the regions that diffusion models attend to at each iteration and guides them accordingly. Our experimental results show that our SAG improves the performance of various diffusion models, including ADM, IDDPM, Stable Diffusion, and DiT. Moreover, combining SAG with conventional guidance methods leads to further improvement.* You can find additional information about Self-Attention Guidance on the [project page](https://ku-cvlab.github.io/Self-Attention-Guidance), [original codebase](https://github.com/KU-CVLAB/Self-Attention-Guidance), and try it out in a [demo](https://huggingface.co/spaces/susunghong/Self-Attention-Guidance) or [notebook](https://colab.research.google.com/github/SusungHong/Self-Attention-Guidance/blob/main/SAG_Stable.ipynb). > [!TIP] > Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. 
## StableDiffusionSAGPipeline [[autodoc]] StableDiffusionSAGPipeline - __call__ - all ## StableDiffusionOutput [[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/semantic_stable_diffusion.md ================================================ > [!WARNING] > This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. # Semantic Guidance Semantic Guidance for Diffusion Models was proposed in [SEGA: Instructing Text-to-Image Models using Semantic Guidance](https://huggingface.co/papers/2301.12247) and provides strong semantic control over image generation. Small changes to the text prompt usually result in entirely different output images. However, with SEGA a variety of changes to the image are enabled that can be controlled easily and intuitively, while staying true to the original image composition. The abstract from the paper is: *Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) generalizes to any generative architecture using classifier-free guidance. More importantly, it allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception. 
We demonstrate SEGA's effectiveness on both latent and pixel-based diffusion models such as Stable Diffusion, Paella, and DeepFloyd-IF using a variety of tasks, thus providing strong evidence for its versatility, flexibility, and improvements over existing methods.* > [!TIP] > Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. ## SemanticStableDiffusionPipeline [[autodoc]] SemanticStableDiffusionPipeline - all - __call__ ## SemanticStableDiffusionPipelineOutput [[autodoc]] pipelines.semantic_stable_diffusion.pipeline_output.SemanticStableDiffusionPipelineOutput - all ================================================ FILE: docs/source/en/api/pipelines/shap_e.md ================================================ # Shap-E The Shap-E model was proposed in [Shap-E: Generating Conditional 3D Implicit Functions](https://huggingface.co/papers/2305.02463) by Alex Nichol and Heewoo Jun from [OpenAI](https://github.com/openai). The abstract from the paper is: *We present Shap-E, a conditional generative model for 3D assets. Unlike recent work on 3D generative models which produce a single output representation, Shap-E directly generates the parameters of implicit functions that can be rendered as both textured meshes and neural radiance fields. We train Shap-E in two stages: first, we train an encoder that deterministically maps 3D assets into the parameters of an implicit function; second, we train a conditional diffusion model on outputs of the encoder. When trained on a large dataset of paired 3D and text data, our resulting models are capable of generating complex and diverse 3D assets in a matter of seconds. 
When compared to Point-E, an explicit generative model over point clouds, Shap-E converges faster and reaches comparable or better sample quality despite modeling a higher-dimensional, multi-representation output space.* The original codebase can be found at [openai/shap-e](https://github.com/openai/shap-e). > [!TIP] > See the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. ## ShapEPipeline [[autodoc]] ShapEPipeline - all - __call__ ## ShapEImg2ImgPipeline [[autodoc]] ShapEImg2ImgPipeline - all - __call__ ## ShapEPipelineOutput [[autodoc]] pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/skyreels_v2.md ================================================ # SkyReels-V2: Infinite-length Film Generative model [SkyReels-V2](https://huggingface.co/papers/2504.13074) by the SkyReels Team from Skywork AI. *Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5-10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming from general-purpose MLLMs' inability to interpret cinematic grammar, such as shot composition, actor expressions, and camera motions. These intertwined limitations hinder realistic long-form synthesis and professional film-style generation. To address these limitations, we propose SkyReels-V2, an Infinite-length Film Generative Model, that synergizes Multi-modal Large Language Model (MLLM), Multi-stage Pretraining, Reinforcement Learning, and Diffusion Forcing Framework. 
Firstly, we design a comprehensive structural representation of video that combines the general descriptions by the Multi-modal LLM and the detailed shot language by sub-expert models. Aided with human annotation, we then train a unified Video Captioner, named SkyCaptioner-V1, to efficiently label the video data. Secondly, we establish progressive-resolution pretraining for the fundamental video generation, followed by a four-stage post-training enhancement: Initial concept-balanced Supervised Fine-Tuning (SFT) improves baseline quality; Motion-specific Reinforcement Learning (RL) training with human-annotated and synthetic distortion data addresses dynamic artifacts; Our diffusion forcing framework with non-decreasing noise schedules enables long-video synthesis in an efficient search space; Final high-quality SFT refines visual fidelity. All the code and models are available at [this https URL](https://github.com/SkyworkAI/SkyReels-V2).* You can find all the original SkyReels-V2 checkpoints under the [Skywork](https://huggingface.co/collections/Skywork/skyreels-v2-6801b1b93df627d441d0d0d9) organization. The following SkyReels-V2 models are supported in Diffusers: - [SkyReels-V2 DF 1.3B - 540P](https://huggingface.co/Skywork/SkyReels-V2-DF-1.3B-540P-Diffusers) - [SkyReels-V2 DF 14B - 540P](https://huggingface.co/Skywork/SkyReels-V2-DF-14B-540P-Diffusers) - [SkyReels-V2 DF 14B - 720P](https://huggingface.co/Skywork/SkyReels-V2-DF-14B-720P-Diffusers) - [SkyReels-V2 T2V 14B - 540P](https://huggingface.co/Skywork/SkyReels-V2-T2V-14B-540P-Diffusers) - [SkyReels-V2 T2V 14B - 720P](https://huggingface.co/Skywork/SkyReels-V2-T2V-14B-720P-Diffusers) - [SkyReels-V2 I2V 1.3B - 540P](https://huggingface.co/Skywork/SkyReels-V2-I2V-1.3B-540P-Diffusers) - [SkyReels-V2 I2V 14B - 540P](https://huggingface.co/Skywork/SkyReels-V2-I2V-14B-540P-Diffusers) - [SkyReels-V2 I2V 14B - 720P](https://huggingface.co/Skywork/SkyReels-V2-I2V-14B-720P-Diffusers) This model was contributed by [M. 
Tolga Cangöz](https://github.com/tolgacangoz).

> [!TIP]
> Click on the SkyReels-V2 models in the right sidebar for more examples of video generation.

### A _Visual_ Demonstration

The example below has the following parameters:

- `base_num_frames=97`
- `num_frames=97`
- `num_inference_steps=30`
- `ar_step=5`
- `causal_block_size=5`

With `vae_scale_factor_temporal=4`, expect `5` blocks of `5` frames each as calculated by:

`num_latent_frames: (97-1)//vae_scale_factor_temporal+1 = 25 frames -> 5 blocks of 5 frames each`

And the maximum context length in the latent space is calculated with `base_num_latent_frames`:

`base_num_latent_frames = (97-1)//vae_scale_factor_temporal+1 = 25 -> 25//5 = 5 blocks`

Asynchronous Processing Timeline:
```text
┌─────────────────────────────────────────────────────────────────┐
│ Steps:    1    6   11   16   21   26   31   36   41   46   50   │
│ Block 1: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■]                       │
│ Block 2:      [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■]                  │
│ Block 3:           [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■]             │
│ Block 4:                [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■]        │
│ Block 5:                     [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■]   │
└─────────────────────────────────────────────────────────────────┘
```

For Long Videos (`num_frames` > `base_num_frames`):
`base_num_frames` acts as the "sliding window size" for processing long videos.
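
The window layout can be sketched in a few lines of plain Python. This is a minimal illustration, not the pipeline's actual code: the helper names `latent_frames` and `sliding_windows` are hypothetical, and the sketch assumes that every iteration after the first reuses the last `overlap_history` frames of the previous window and that the final window is cut off at `num_frames`.

```python
def latent_frames(num_frames, vae_scale_factor_temporal=4):
    """Number of latent frames for a given number of video frames."""
    return (num_frames - 1) // vae_scale_factor_temporal + 1


def sliding_windows(num_frames, base_num_frames, overlap_history):
    """Return 1-based inclusive (start, end) frame ranges, one per iteration."""
    if num_frames <= base_num_frames:
        return [(1, num_frames)]  # short video: a single window is enough
    if overlap_history >= base_num_frames:
        raise ValueError("overlap_history must be smaller than base_num_frames")

    windows = [(1, base_num_frames)]
    end = base_num_frames
    while end < num_frames:
        # Assumption: the next window starts with the last `overlap_history`
        # frames that were already generated.
        start = end - overlap_history + 1
        end = min(start + base_num_frames - 1, num_frames)
        windows.append((start, end))
    return windows


# Blocks per window for base_num_frames=97 and causal_block_size=5.
print(latent_frames(97) // 5)  # 5
print(sliding_windows(257, 97, 17))  # [(1, 97), (81, 177), (161, 257)]
```

Each window after the first contributes `base_num_frames - overlap_history` new frames, which is why a larger overlap increases the number of iterations needed for the same video length.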
Example: `257`-frame video with `base_num_frames=97`, `overlap_history=17`
```text
┌──── Iteration 1 (frames 1-97) ────┐
│ Processing window: 97 frames      │ → 5 blocks,
│ Generates: frames 1-97            │   async processing
└───────────────────────────────────┘
            ┌────── Iteration 2 (frames 81-177) ──────┐
            │ Processing window: 97 frames            │
            │ Overlap: 17 frames (81-97) from prev    │ → 5 blocks,
            │ Generates: frames 98-177                │   async processing
            └─────────────────────────────────────────┘
                        ┌────── Iteration 3 (frames 161-257) ──────┐
                        │ Processing window: 97 frames             │
                        │ Overlap: 17 frames (161-177) from prev   │ → 5 blocks,
                        │ Generates: frames 178-257                │   async processing
                        └──────────────────────────────────────────┘
```

Each iteration independently runs the asynchronous processing with its own `5` blocks.
`base_num_frames` controls:
1. Memory usage (larger window = more VRAM)
2. Model context length (must match training constraints)
3. Number of blocks per iteration (`base_num_latent_frames // causal_block_size`)

Each block takes `30` steps to complete denoising.
Block N starts at step: `1 + (N-1) x ar_step`
Total steps: `30 + (5-1) x 5 = 50` steps

Synchronous mode (`ar_step=0`) would process all blocks/frames simultaneously:
```text
┌──────────────────────────────────────────────┐
│ Steps:       1            ...            30  │
│ All blocks: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │
└──────────────────────────────────────────────┘
```
Total steps: `30` steps

An example of how the step matrix is constructed for asynchronous processing:
Given the parameters: (`num_inference_steps=30, flow_shift=8, num_frames=97, ar_step=5, causal_block_size=5`)
```
- num_latent_frames = (97 frames - 1) // (4 temporal downsampling) + 1 = 25
- step_template = [999, 995, 991, 986, 980, 975, 969, 963, 956, 948,
                   941, 932, 922, 912, 901, 888, 874, 859, 841, 822,
                   799, 773, 743, 708, 666, 615, 551, 470, 363, 216]
```

The algorithm creates a `50x25` `step_matrix` where:
```
- Row 1:  [999×5, 999×5, 999×5, 999×5, 999×5]
- Row 2:  [995×5, 999×5, 999×5, 999×5, 999×5]
- Row 3:  [991×5, 999×5, 999×5, 999×5, 999×5]
- ...
- Row 7:  [969×5, 995×5, 999×5, 999×5, 999×5]
- ...
- Row 21: [799×5, 888×5, 941×5, 975×5, 999×5]
- ...
- Row 35: [  0×5, 216×5, 666×5, 822×5, 901×5]
- ...
- Row 42: [  0×5,   0×5,   0×5, 551×5, 773×5]
- ...
- Row 50: [  0×5,   0×5,   0×5,   0×5, 216×5]
```

Detailed Row `6` Analysis:
```
- step_matrix[5]:      [ 975×5,  999×5,   999×5,   999×5,   999×5]
- step_index[5]:       [   6×5,    1×5,     0×5,     0×5,     0×5]
- step_update_mask[5]: [True×5, True×5, False×5, False×5, False×5]
- valid_interval[5]:   (0, 25)
```

Key Pattern: Block `i` lags behind Block `i-1` by exactly `ar_step=5` timesteps, creating the staggered "diffusion forcing" effect where later blocks condition on cleaner earlier blocks.

### Text-to-Video Generation

The example below demonstrates how to generate a video from text. Refer to the [Reduce memory usage](../../optimization/memory) guide for more details about the various memory saving techniques.

From the original repo:
>You can use --ar_step 5 to enable asynchronous inference. When asynchronous inference, --causal_block_size 5 is recommended while it is not supposed to be set for synchronous generation... Asynchronous inference will take more steps to diffuse the whole sequence which means it will be SLOWER than synchronous mode.
In our experiments, asynchronous inference may improve the instruction following and visual consistent performance. ```py import torch from diffusers import AutoModel, SkyReelsV2DiffusionForcingPipeline, UniPCMultistepScheduler from diffusers.utils import export_to_video model_id = "Skywork/SkyReels-V2-DF-1.3B-540P-Diffusers" vae = AutoModel.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32) pipeline = SkyReelsV2DiffusionForcingPipeline.from_pretrained( model_id, vae=vae, torch_dtype=torch.bfloat16, ) pipeline.to("cuda") flow_shift = 8.0 # 8.0 for T2V, 5.0 for I2V pipeline.scheduler = UniPCMultistepScheduler.from_config(pipeline.scheduler.config, flow_shift=flow_shift) prompt = "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window." output = pipeline( prompt=prompt, num_inference_steps=30, height=544, # 720 for 720P width=960, # 1280 for 720P num_frames=97, base_num_frames=97, # 121 for 720P ar_step=5, # Controls asynchronous inference (0 for synchronous mode) causal_block_size=5, # Number of frames in each block for asynchronous processing overlap_history=None, # Number of frames to overlap for smooth transitions in long videos; 17 for long video generations addnoise_condition=20, # Improves consistency in long video generation ).frames[0] export_to_video(output, "video.mp4", fps=24, quality=8) ``` ### First-Last-Frame-to-Video Generation The example below demonstrates how to use the image-to-video pipeline to generate a video using a text description, a starting frame, and an ending frame. 
```python import numpy as np import torch import torchvision.transforms.functional as TF from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingImageToVideoPipeline, UniPCMultistepScheduler from diffusers.utils import export_to_video, load_image model_id = "Skywork/SkyReels-V2-DF-1.3B-720P-Diffusers" vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32) pipeline = SkyReelsV2DiffusionForcingImageToVideoPipeline.from_pretrained( model_id, vae=vae, torch_dtype=torch.bfloat16 ) pipeline.to("cuda") flow_shift = 5.0 # 8.0 for T2V, 5.0 for I2V pipeline.scheduler = UniPCMultistepScheduler.from_config(pipeline.scheduler.config, flow_shift=flow_shift) first_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png") last_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png") def aspect_ratio_resize(image, pipeline, max_area=720 * 1280): aspect_ratio = image.height / image.width mod_value = pipeline.vae_scale_factor_spatial * pipeline.transformer.config.patch_size[1] height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value image = image.resize((width, height)) return image, height, width def center_crop_resize(image, height, width): # Calculate resize ratio to match first frame dimensions resize_ratio = max(width / image.width, height / image.height) # Resize the image width = round(image.width * resize_ratio) height = round(image.height * resize_ratio) size = [width, height] image = TF.center_crop(image, size) return image, height, width first_frame, height, width = aspect_ratio_resize(first_frame, pipeline) if last_frame.size != first_frame.size: last_frame, _, _ = center_crop_resize(last_frame, height, width) prompt = "CG animation style, a small blue bird takes off from the 
ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective." output = pipeline( image=first_frame, last_image=last_frame, prompt=prompt, height=height, width=width, guidance_scale=5.0 ).frames[0] export_to_video(output, "video.mp4", fps=24, quality=8) ``` ### Video-to-Video Generation `SkyReelsV2DiffusionForcingVideoToVideoPipeline` extends a given video. ```python import numpy as np import torch import torchvision.transforms.functional as TF from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingVideoToVideoPipeline, UniPCMultistepScheduler from diffusers.utils import export_to_video, load_video model_id = "Skywork/SkyReels-V2-DF-1.3B-720P-Diffusers" vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32) pipeline = SkyReelsV2DiffusionForcingVideoToVideoPipeline.from_pretrained( model_id, vae=vae, torch_dtype=torch.bfloat16 ) pipeline.to("cuda") flow_shift = 5.0 # 8.0 for T2V, 5.0 for I2V pipeline.scheduler = UniPCMultistepScheduler.from_config(pipeline.scheduler.config, flow_shift=flow_shift) video = load_video("input_video.mp4") prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective." 
output = pipeline(
    video=video,
    prompt=prompt,
    height=720,
    width=1280,
    guidance_scale=5.0,
    overlap_history=17,
    num_inference_steps=30,
    num_frames=257,
    base_num_frames=121,  # ar_step=5, causal_block_size=5,
).frames[0]
export_to_video(output, "video.mp4", fps=24, quality=8)
# Total frames will be the number of frames of the given video + 257
```

## Notes

- SkyReels-V2 supports LoRAs with [`~loaders.SkyReelsV2LoraLoaderMixin.load_lora_weights`]. `SkyReelsV2Pipeline` and `SkyReelsV2ImageToVideoPipeline` are also available without the Diffusion Forcing framework applied.

## SkyReelsV2DiffusionForcingPipeline

[[autodoc]] SkyReelsV2DiffusionForcingPipeline
  - all
  - __call__

## SkyReelsV2DiffusionForcingImageToVideoPipeline

[[autodoc]] SkyReelsV2DiffusionForcingImageToVideoPipeline
  - all
  - __call__

## SkyReelsV2DiffusionForcingVideoToVideoPipeline

[[autodoc]] SkyReelsV2DiffusionForcingVideoToVideoPipeline
  - all
  - __call__

## SkyReelsV2Pipeline

[[autodoc]] SkyReelsV2Pipeline
  - all
  - __call__

## SkyReelsV2ImageToVideoPipeline

[[autodoc]] SkyReelsV2ImageToVideoPipeline
  - all
  - __call__

## SkyReelsV2PipelineOutput

[[autodoc]] pipelines.skyreels_v2.pipeline_output.SkyReelsV2PipelineOutput

================================================
FILE: docs/source/en/api/pipelines/stable_audio.md
================================================

# Stable Audio

Stable Audio was proposed in [Stable Audio Open](https://huggingface.co/papers/2407.14358) by Zach Evans et al. It takes a text prompt as input and predicts the corresponding sound or music sample.

Stable Audio Open generates variable-length (up to 47s) stereo audio at 44.1kHz from text prompts. It comprises three components: an autoencoder that compresses waveforms into a manageable sequence length, a T5-based text embedding for text conditioning, and a transformer-based diffusion (DiT) model that operates in the latent space of the autoencoder.
Stable Audio is trained on a corpus of around 48k audio recordings, where around 47k are from Freesound and the rest are from the Free Music Archive (FMA). All audio files are licensed under CC0, CC BY, or CC Sampling+. This data is used to train the autoencoder and the DiT. The abstract of the paper is the following: *Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FDopenl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1kHz.* This pipeline was contributed by [Yoach Lacombe](https://huggingface.co/ylacombe). The original codebase can be found at [Stability-AI/stable-audio-tools](https://github.com/Stability-AI/stable-audio-tools). ## Tips When constructing a prompt, keep in mind: * Descriptive prompt inputs work best; use adjectives to describe the sound (for example, "high quality" or "clear") and make the prompt context specific where possible (e.g. "melodic techno with a fast beat and synths" works better than "techno"). * Using a *negative prompt* can significantly improve the quality of the generated audio. Try using a negative prompt of "low quality, average quality". During inference: * The _quality_ of the generated audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference. * Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1 to enable. 
Automatic scoring will be performed between the generated waveforms and prompt text, and the audios ranked from best to worst accordingly.

## Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on audio quality depending on the model.

Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`StableAudioPipeline`] for inference with bitsandbytes.

```py
import soundfile as sf  # third-party package, used to write the .wav file
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, StableAudioDiTModel, StableAudioPipeline
from transformers import BitsAndBytesConfig, T5EncoderModel

# Quantize the text encoder with the transformers config
quant_config = BitsAndBytesConfig(load_in_8bit=True)
text_encoder_8bit = T5EncoderModel.from_pretrained(
    "stabilityai/stable-audio-open-1.0",
    subfolder="text_encoder",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

# Quantize the DiT with the diffusers config
quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = StableAudioDiTModel.from_pretrained(
    "stabilityai/stable-audio-open-1.0",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = StableAudioPipeline.from_pretrained(
    "stabilityai/stable-audio-open-1.0",
    text_encoder=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "The sound of a hammer hitting a wooden surface."
negative_prompt = "Low quality."

# Fixed seed for reproducible results
generator = torch.Generator("cuda").manual_seed(0)

audio = pipeline(
    prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=200,
    audio_end_in_s=10.0,
    num_waveforms_per_prompt=3,
    generator=generator,
).audios

# Save the best-ranked waveform
output = audio[0].T.float().cpu().numpy()
sf.write("hammer.wav", output, pipeline.vae.sampling_rate)
```

## StableAudioPipeline

[[autodoc]] StableAudioPipeline
  - all
  - __call__

================================================
FILE: docs/source/en/api/pipelines/stable_cascade.md
================================================

# Stable Cascade

This model is built upon the [Würstchen](https://openreview.net/forum?id=gU58d5QeGv) architecture. Its main difference from other models like Stable Diffusion is that it works in a much smaller latent space. Why is this important? The smaller the latent space, the **faster** you can run inference and the **cheaper** the training becomes. How small is the latent space? Stable Diffusion uses a compression factor of 8, resulting in a 1024x1024 image being encoded to 128x128. Stable Cascade achieves a compression factor of 42, meaning that it is possible to encode a 1024x1024 image to 24x24, while maintaining crisp reconstructions. The text-conditional model is then trained in the highly compressed latent space. Previous versions of this architecture achieved a 16x cost reduction over Stable Diffusion 1.5.

Therefore, this kind of model is well suited for use cases where efficiency is important. Furthermore, all known extensions like finetuning, LoRA, ControlNet, IP-Adapter, LCM etc. are possible with this method as well.

The original codebase can be found at [Stability-AI/StableCascade](https://github.com/Stability-AI/StableCascade).

## Model Overview

Stable Cascade consists of three models: Stage A, Stage B and Stage C, representing a cascade to generate images, hence the name "Stable Cascade".

Stage A & B are used to compress images, similar to the job of the VAE in Stable Diffusion.
However, with this setup, a much higher compression of images can be achieved. While the Stable Diffusion models use a spatial compression factor of 8, encoding an image with resolution of 1024 x 1024 to 128 x 128, Stable Cascade achieves a compression factor of 42. This encodes a 1024 x 1024 image to 24 x 24, while being able to accurately decode the image. This comes with the great benefit of cheaper training and inference. Furthermore, Stage C is responsible for generating the small 24 x 24 latents given a text prompt.

The Stage C model operates on the small 24 x 24 latents and denoises the latents conditioned on text prompts. The model is also the largest component in the Cascade pipeline and is meant to be used with the `StableCascadePriorPipeline`.

The Stage B and Stage A models are used with the `StableCascadeDecoderPipeline` and are responsible for generating the final image given the small 24 x 24 latents.

> [!WARNING]
> There are some restrictions on data types that can be used with the Stable Cascade models. The official checkpoints for the `StableCascadePriorPipeline` do not support the `torch.float16` data type. Please use `torch.bfloat16` instead.
>
> In order to use the `torch.bfloat16` data type with the `StableCascadeDecoderPipeline` you need to have PyTorch 2.2.0 or higher installed. This also means that using the `StableCascadeCombinedPipeline` with `torch.bfloat16` requires PyTorch 2.2.0 or higher, since it calls the `StableCascadeDecoderPipeline` internally.
>
> If it is not possible to install PyTorch 2.2.0 or higher in your environment, the `StableCascadeDecoderPipeline` can be used on its own with the `torch.float16` data type. You can download the full precision or `bf16` variant weights for the pipeline and cast the weights to `torch.float16`.
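
The data type rules in the warning above can be captured in a small helper. This is only a sketch: `pick_cascade_dtypes` is a hypothetical name, not a Diffusers function, and it assumes the version string starts with `major.minor` as reported by `torch.__version__`. It returns dtype names as strings so that it runs without PyTorch installed.

```python
import re


def pick_cascade_dtypes(torch_version):
    """Suggest dtype names for the prior and decoder from a PyTorch version string."""
    match = re.match(r"(\d+)\.(\d+)", torch_version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {torch_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))

    # The official prior checkpoints do not support float16, so always use bfloat16.
    prior = "bfloat16"
    # The decoder needs PyTorch >= 2.2.0 for bfloat16; otherwise fall back to float16.
    decoder = "bfloat16" if (major, minor) >= (2, 2) else "float16"
    return {"prior": prior, "decoder": decoder}


print(pick_cascade_dtypes("2.1.2+cu118"))
print(pick_cascade_dtypes("2.2.0"))
```

With PyTorch available, each name can be turned into a dtype with `getattr(torch, name)` and passed as `torch_dtype` when loading the corresponding pipeline.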
## Usage example ```python import torch from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline prompt = "an image of a shiba inu, donning a spacesuit and helmet" negative_prompt = "" prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", variant="bf16", torch_dtype=torch.bfloat16) decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", variant="bf16", torch_dtype=torch.float16) prior.enable_model_cpu_offload() prior_output = prior( prompt=prompt, height=1024, width=1024, negative_prompt=negative_prompt, guidance_scale=4.0, num_images_per_prompt=1, num_inference_steps=20 ) decoder.enable_model_cpu_offload() decoder_output = decoder( image_embeddings=prior_output.image_embeddings.to(torch.float16), prompt=prompt, negative_prompt=negative_prompt, guidance_scale=0.0, output_type="pil", num_inference_steps=10 ).images[0] decoder_output.save("cascade.png") ``` ## Using the Lite Versions of the Stage B and Stage C models ```python import torch from diffusers import ( StableCascadeDecoderPipeline, StableCascadePriorPipeline, StableCascadeUNet, ) prompt = "an image of a shiba inu, donning a spacesuit and helmet" negative_prompt = "" prior_unet = StableCascadeUNet.from_pretrained("stabilityai/stable-cascade-prior", subfolder="prior_lite") decoder_unet = StableCascadeUNet.from_pretrained("stabilityai/stable-cascade", subfolder="decoder_lite") prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", prior=prior_unet) decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", decoder=decoder_unet) prior.enable_model_cpu_offload() prior_output = prior( prompt=prompt, height=1024, width=1024, negative_prompt=negative_prompt, guidance_scale=4.0, num_images_per_prompt=1, num_inference_steps=20 ) decoder.enable_model_cpu_offload() decoder_output = decoder( image_embeddings=prior_output.image_embeddings, prompt=prompt, 
negative_prompt=negative_prompt, guidance_scale=0.0, output_type="pil", num_inference_steps=10 ).images[0] decoder_output.save("cascade.png") ``` ## Loading original checkpoints with `from_single_file` Loading the original format checkpoints is supported via `from_single_file` method in the StableCascadeUNet. ```python import torch from diffusers import ( StableCascadeDecoderPipeline, StableCascadePriorPipeline, StableCascadeUNet, ) prompt = "an image of a shiba inu, donning a spacesuit and helmet" negative_prompt = "" prior_unet = StableCascadeUNet.from_single_file( "https://huggingface.co/stabilityai/stable-cascade/resolve/main/stage_c_bf16.safetensors", torch_dtype=torch.bfloat16 ) decoder_unet = StableCascadeUNet.from_single_file( "https://huggingface.co/stabilityai/stable-cascade/blob/main/stage_b_bf16.safetensors", torch_dtype=torch.bfloat16 ) prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", prior=prior_unet, torch_dtype=torch.bfloat16) decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", decoder=decoder_unet, torch_dtype=torch.bfloat16) prior.enable_model_cpu_offload() prior_output = prior( prompt=prompt, height=1024, width=1024, negative_prompt=negative_prompt, guidance_scale=4.0, num_images_per_prompt=1, num_inference_steps=20 ) decoder.enable_model_cpu_offload() decoder_output = decoder( image_embeddings=prior_output.image_embeddings, prompt=prompt, negative_prompt=negative_prompt, guidance_scale=0.0, output_type="pil", num_inference_steps=10 ).images[0] decoder_output.save("cascade-single-file.png") ``` ## Uses ### Direct Use The model is intended for research purposes for now. Possible research areas and tasks include - Research on generative models. - Safe deployment of models which have the potential to generate harmful content. - Probing and understanding the limitations and biases of generative models. - Generation of artworks and use in design and other artistic processes. 
- Applications in educational or creative tools. Excluded uses are described below. ### Out-of-Scope Use The model was not trained to be factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model. The model should not be used in any way that violates Stability AI's [Acceptable Use Policy](https://stability.ai/use-policy). ## Limitations and Bias ### Limitations - Faces and people in general may not be generated properly. - The autoencoding part of the model is lossy. ## StableCascadeCombinedPipeline [[autodoc]] StableCascadeCombinedPipeline - all - __call__ ## StableCascadePriorPipeline [[autodoc]] StableCascadePriorPipeline - all - __call__ ## StableCascadePriorPipelineOutput [[autodoc]] pipelines.stable_cascade.pipeline_stable_cascade_prior.StableCascadePriorPipelineOutput ## StableCascadeDecoderPipeline [[autodoc]] StableCascadeDecoderPipeline - all - __call__ ================================================ FILE: docs/source/en/api/pipelines/stable_diffusion/adapter.md ================================================ # T2I-Adapter [T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.08453) by Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, Xiaohu Qie. Using the pretrained models we can provide control images (for example, a depth map) to control Stable Diffusion text-to-image generation so that it follows the structure of the depth image and fills in the details. The abstract of the paper is the following: *The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. 
However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate controlling (e.g., color and structure) is needed. In this paper, we aim to ``dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn simple and lightweight T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters according to different conditions, achieving rich control and editing effects in the color and structure of the generation results. Further, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter has promising generation quality and a wide range of applications.* This model was contributed by the community contributor [HimariO](https://github.com/HimariO) ❤️ . ## StableDiffusionAdapterPipeline [[autodoc]] StableDiffusionAdapterPipeline - all - __call__ - enable_attention_slicing - disable_attention_slicing - enable_vae_slicing - disable_vae_slicing - enable_xformers_memory_efficient_attention - disable_xformers_memory_efficient_attention ## StableDiffusionXLAdapterPipeline [[autodoc]] StableDiffusionXLAdapterPipeline - all - __call__ - enable_attention_slicing - disable_attention_slicing - enable_vae_slicing - disable_vae_slicing - enable_xformers_memory_efficient_attention - disable_xformers_memory_efficient_attention ================================================ FILE: docs/source/en/api/pipelines/stable_diffusion/depth2img.md ================================================ # Depth-to-image
The Stable Diffusion model can also infer depth based on an image using [MiDaS](https://github.com/isl-org/MiDaS). This allows you to pass a text prompt and an initial image to condition the generation of new images as well as a `depth_map` to preserve the image structure. > [!TIP] > Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! > > If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis) and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! ## StableDiffusionDepth2ImgPipeline [[autodoc]] StableDiffusionDepth2ImgPipeline - all - __call__ - enable_attention_slicing - disable_attention_slicing - enable_xformers_memory_efficient_attention - disable_xformers_memory_efficient_attention - load_textual_inversion - load_lora_weights - save_lora_weights ## StableDiffusionPipelineOutput [[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/stable_diffusion/gligen.md ================================================ > [!WARNING] > This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. # GLIGEN (Grounded Language-to-Image Generation) The GLIGEN model was created by researchers and engineers from [University of Wisconsin-Madison, Columbia University, and Microsoft](https://github.com/gligen/GLIGEN). The [`StableDiffusionGLIGENPipeline`] and [`StableDiffusionGLIGENTextImagePipeline`] can generate photorealistic images conditioned on grounding inputs. 
Along with text and bounding boxes with [`StableDiffusionGLIGENPipeline`], if input images are given, [`StableDiffusionGLIGENTextImagePipeline`] can insert objects described by text at the region defined by bounding boxes. Otherwise, it'll generate an image described by the caption/prompt and insert objects described by text at the region defined by bounding boxes. It's trained on COCO2014D and COCO2014CD datasets, and the model uses a frozen CLIP ViT-L/14 text encoder to condition itself on grounding inputs. The abstract from the [paper](https://huggingface.co/papers/2301.07093) is: *Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGEN’s zeroshot performance on COCO and LVIS outperforms existing supervised layout-to-image baselines by a large margin.* > [!TIP] > Make sure to check out the Stable Diffusion [Tips](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality and how to reuse pipeline components efficiently! > > If you want to use one of the official checkpoints for a task, explore the [gligen](https://huggingface.co/gligen) Hub organizations! 
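The grounding inputs described above pair each text phrase with a bounding box. As a minimal sketch, assuming the pipeline's `gligen_boxes` argument expects `[xmin, ymin, xmax, ymax]` values normalized to the range [0, 1] (this page does not state the format, so check the `__call__` reference), the hypothetical helper `normalize_box` below converts pixel coordinates into that form. The phrases and box coordinates are made up for illustration.

```python
def normalize_box(box_px, image_width, image_height):
    """Convert a pixel-space (xmin, ymin, xmax, ymax) box to fractions of the image size.

    Hypothetical helper for illustration; it is not part of Diffusers.
    """
    xmin, ymin, xmax, ymax = box_px
    # Reject boxes that are empty, inverted, or fall outside the canvas.
    if not (0 <= xmin < xmax <= image_width and 0 <= ymin < ymax <= image_height):
        raise ValueError(f"box {box_px} does not fit inside a {image_width}x{image_height} image")
    return [xmin / image_width, ymin / image_height, xmax / image_width, ymax / image_height]


# Two regions on a 512x512 canvas, one per grounded phrase (illustrative values).
phrases = ["a birthday cake", "a candle"]
boxes_px = [(64, 128, 256, 384), (320, 192, 448, 320)]
boxes = [normalize_box(box, 512, 512) for box in boxes_px]

print(boxes)  # [[0.125, 0.25, 0.5, 0.75], [0.625, 0.375, 0.875, 0.625]]
```

Under the same assumption, `phrases` and `boxes` would be passed to the pipeline as `gligen_phrases` and `gligen_boxes`, with one box per phrase.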
[`StableDiffusionGLIGENPipeline`] was contributed by [Nikhil Gajendrakumar](https://github.com/nikhil-masterful) and [`StableDiffusionGLIGENTextImagePipeline`] was contributed by [Nguyễn Công Tú Anh](https://github.com/tuanh123789). ## StableDiffusionGLIGENPipeline [[autodoc]] StableDiffusionGLIGENPipeline - all - __call__ - enable_vae_slicing - disable_vae_slicing - enable_vae_tiling - disable_vae_tiling - enable_model_cpu_offload - prepare_latents - enable_fuser ## StableDiffusionGLIGENTextImagePipeline [[autodoc]] StableDiffusionGLIGENTextImagePipeline - all - __call__ - enable_vae_slicing - disable_vae_slicing - enable_vae_tiling - disable_vae_tiling - enable_model_cpu_offload - prepare_latents - enable_fuser ## StableDiffusionPipelineOutput [[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/stable_diffusion/image_variation.md ================================================ # Image variation The Stable Diffusion model can also generate variations from an input image. It uses a fine-tuned version of a Stable Diffusion model by [Justin Pinkney](https://www.justinpinkney.com/) from [Lambda](https://lambdalabs.com/). The original codebase can be found at [LambdaLabsML/lambda-diffusers](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations) and additional official checkpoints for image variation can be found at [lambdalabs/sd-image-variations-diffusers](https://huggingface.co/lambdalabs/sd-image-variations-diffusers). > [!TIP] > Make sure to check out the Stable Diffusion [Tips](./overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! 
## StableDiffusionImageVariationPipeline [[autodoc]] StableDiffusionImageVariationPipeline - all - __call__ - enable_attention_slicing - disable_attention_slicing - enable_xformers_memory_efficient_attention - disable_xformers_memory_efficient_attention ## StableDiffusionPipelineOutput [[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/stable_diffusion/img2img.md ================================================ # Image-to-image
The Stable Diffusion model can also be applied to image-to-image generation by passing a text prompt and an initial image to condition the generation of new images. The [`StableDiffusionImg2ImgPipeline`] uses the diffusion-denoising mechanism proposed in [SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations](https://huggingface.co/papers/2108.01073) by Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, Stefano Ermon. The abstract from the paper is: *Guided image synthesis enables everyday users to create and edit photo-realistic images with minimum effort. The key challenge is balancing faithfulness to the user input (e.g., hand-drawn colored strokes) and realism of the synthesized image. Existing GAN-based methods attempt to achieve such balance using either conditional GANs or GAN inversions, which are challenging and often require additional training data or loss functions for individual applications. To address these issues, we introduce a new image synthesis and editing method, Stochastic Differential Editing (SDEdit), based on a diffusion model generative prior, which synthesizes realistic images by iteratively denoising through a stochastic differential equation (SDE). Given an input image with user guide of any type, SDEdit first adds noise to the input, then subsequently denoises the resulting image through the SDE prior to increase its realism. SDEdit does not require task-specific training or inversions and can naturally achieve the balance between realism and faithfulness. 
SDEdit significantly outperforms state-of-the-art GAN-based methods by up to 98.09% on realism and 91.72% on overall satisfaction scores, according to a human perception study, on multiple tasks, including stroke-based image synthesis and editing as well as image compositing.* > [!TIP] > Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! ## StableDiffusionImg2ImgPipeline [[autodoc]] StableDiffusionImg2ImgPipeline - all - __call__ - enable_attention_slicing - disable_attention_slicing - enable_xformers_memory_efficient_attention - disable_xformers_memory_efficient_attention - load_textual_inversion - from_single_file - load_lora_weights - save_lora_weights ## StableDiffusionPipelineOutput [[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/stable_diffusion/inpaint.md ================================================ # Inpainting
The Stable Diffusion model can also be applied to inpainting which lets you edit specific parts of an image by providing a mask and a text prompt using Stable Diffusion. ## Tips It is recommended to use this pipeline with checkpoints that have been specifically fine-tuned for inpainting, such as [stable-diffusion-v1-5/stable-diffusion-inpainting](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-inpainting). Default text-to-image Stable Diffusion checkpoints, such as [stable-diffusion-v1-5/stable-diffusion-v1-5](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) are also compatible but they might be less performant. > [!TIP] > Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! > > If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis) and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! ## StableDiffusionInpaintPipeline [[autodoc]] StableDiffusionInpaintPipeline - all - __call__ - enable_attention_slicing - disable_attention_slicing - enable_xformers_memory_efficient_attention - disable_xformers_memory_efficient_attention - load_textual_inversion - load_lora_weights - save_lora_weights ## StableDiffusionPipelineOutput [[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/stable_diffusion/latent_upscale.md ================================================ # Latent upscaler The Stable Diffusion latent upscaler model was created by [Katherine Crowson](https://github.com/crowsonkb/k-diffusion) in collaboration with [Stability AI](https://stability.ai/). 
It is used to enhance the output image resolution by a factor of 2 (see this demo [notebook](https://colab.research.google.com/drive/1o1qYJcFeywzCIdkfKJy7cTpgZTCM2EI4) for a demonstration of the original implementation). > [!TIP] > Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! > > If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis) and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! ## StableDiffusionLatentUpscalePipeline [[autodoc]] StableDiffusionLatentUpscalePipeline - all - __call__ - enable_sequential_cpu_offload - enable_attention_slicing - disable_attention_slicing - enable_xformers_memory_efficient_attention - disable_xformers_memory_efficient_attention ## StableDiffusionPipelineOutput [[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.md ================================================ > [!WARNING] > This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. # Text-to-(RGB, depth)
LDM3D was proposed in [LDM3D: Latent Diffusion Model for 3D](https://huggingface.co/papers/2305.10853) by Gabriela Ben Melech Stan, Diana Wofk, Scottie Fox, Alex Redden, Will Saxton, Jean Yu, Estelle Aflalo, Shao-Yen Tseng, Fabio Nonato, Matthias Muller, and Vasudev Lal. LDM3D generates both an image and a depth map from a given text prompt, unlike existing text-to-image diffusion models such as [Stable Diffusion](./overview), which generate only an image. With almost the same number of parameters, LDM3D manages to create a latent space that can compress both the RGB images and the depth maps.

Two checkpoints are available for use:
- [ldm3d-original](https://huggingface.co/Intel/ldm3d). The original checkpoint used in the [paper](https://huggingface.co/papers/2305.10853).
- [ldm3d-4c](https://huggingface.co/Intel/ldm3d-4c). The new version of LDM3D, which uses 4-channel inputs instead of 6-channel inputs and is fine-tuned on higher-resolution images.

The abstract from the paper is:

*This research paper proposes a Latent Diffusion Model for 3D (LDM3D) that generates both image and depth map data from a given text prompt, allowing users to generate RGBD images from text prompts. The LDM3D model is fine-tuned on a dataset of tuples containing an RGB image, depth map and caption, and validated through extensive experiments. We also develop an application called DepthFusion, which uses the generated RGB images and depth maps to create immersive and interactive 360-degree-view experiences using TouchDesigner. This technology has the potential to transform a wide range of industries, from entertainment and gaming to architecture and design. Overall, this paper presents a significant contribution to the field of generative AI and computer vision, and showcases the potential of LDM3D and DepthFusion to revolutionize content creation and digital experiences. A short video summarizing the approach can be found at [this url](https://t.ly/tdi2).*

> [!TIP]
> Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently!

## StableDiffusionLDM3DPipeline

[[autodoc]] pipelines.stable_diffusion_ldm3d.pipeline_stable_diffusion_ldm3d.StableDiffusionLDM3DPipeline
    - all
    - __call__

## LDM3DPipelineOutput

[[autodoc]] pipelines.stable_diffusion_ldm3d.pipeline_stable_diffusion_ldm3d.LDM3DPipelineOutput
    - all
    - __call__

# Upscaler

[LDM3D-VR](https://huggingface.co/papers/2311.03226) is an extended version of LDM3D.

The abstract from the paper is:

*Latent diffusion models have proven to be state-of-the-art in the creation and manipulation of visual outputs. However, as far as we know, the generation of depth maps jointly with RGB is still limited. We introduce LDM3D-VR, a suite of diffusion models targeting virtual reality development that includes LDM3D-pano and LDM3D-SR. These models enable the generation of panoramic RGBD based on textual prompts and the upscaling of low-resolution inputs to high-resolution RGBD, respectively. Our models are fine-tuned from existing pretrained models on datasets containing panoramic/high-resolution RGB images, depth maps and captions. Both models are evaluated in comparison to existing related methods*

Two checkpoints are available for use:
- [ldm3d-pano](https://huggingface.co/Intel/ldm3d-pano). This checkpoint enables the generation of panoramic images and must be used with the StableDiffusionLDM3DPipeline.
- [ldm3d-sr](https://huggingface.co/Intel/ldm3d-sr). This checkpoint enables the upscaling of RGB and depth images. It can be used in cascade after the original LDM3D pipeline, using the StableDiffusionUpscaleLDM3DPipeline from the community pipelines.
================================================ FILE: docs/source/en/api/pipelines/stable_diffusion/overview.md ================================================ # Stable Diffusion pipelines
Stable Diffusion is a text-to-image latent diffusion model created by the researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/) and [LAION](https://laion.ai/). Latent diffusion applies the diffusion process over a lower dimensional latent space to reduce memory and compute complexity. This specific type of diffusion model was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://huggingface.co/papers/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer. Stable Diffusion is trained on 512x512 images from a subset of the LAION-5B dataset. This model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight and can run on consumer GPUs. For more details about how Stable Diffusion works and how it differs from the base latent diffusion model, take a look at the Stability AI [announcement](https://stability.ai/blog/stable-diffusion-announcement) and our own [blog post](https://huggingface.co/blog/stable_diffusion#how-does-stable-diffusion-work) for more technical details. You can find the original codebase for Stable Diffusion v1.0 at [CompVis/stable-diffusion](https://github.com/CompVis/stable-diffusion) and Stable Diffusion v2.0 at [Stability-AI/stablediffusion](https://github.com/Stability-AI/stablediffusion) as well as their original scripts for various tasks. Additional official checkpoints for the different Stable Diffusion versions and tasks can be found on the [CompVis](https://huggingface.co/CompVis) and [Stability AI](https://huggingface.co/stabilityai) Hub organizations. Explore these organizations to find the best checkpoint for your use-case! The table below summarizes the available Stable Diffusion pipelines, their supported tasks, and an interactive demo:
| Pipeline | Supported tasks |
|---|---|
| StableDiffusion | text-to-image |
| StableDiffusionImg2Img | image-to-image |
| StableDiffusionInpaint | inpainting |
| StableDiffusionDepth2Img | depth-to-image |
| StableDiffusionImageVariation | image variation |
| StableDiffusionPipelineSafe | filtered text-to-image |
| StableDiffusion2 | text-to-image, inpainting, depth-to-image, super-resolution |
| StableDiffusionXL | text-to-image, image-to-image |
| StableDiffusionLatentUpscale | super-resolution |
| StableDiffusionUpscale | super-resolution |
| StableDiffusionLDM3D | text-to-rgb, text-to-depth, text-to-pano |
| StableDiffusionUpscaleLDM3D | ldm3d super-resolution |
## Tips

To help you get the most out of the Stable Diffusion pipelines, here are a few tips for improving performance and usability. These tips apply to all Stable Diffusion pipelines.

### Explore tradeoff between speed and quality

[`StableDiffusionPipeline`] uses the [`PNDMScheduler`] by default, but 🤗 Diffusers provides many other compatible schedulers, some of which are faster or produce better quality. For example, to use the [`EulerDiscreteScheduler`] instead of the default:

```py
from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler

pipeline = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config)

# or
euler_scheduler = EulerDiscreteScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler")
pipeline = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", scheduler=euler_scheduler)
```

### Reuse pipeline components to save memory

To share the same components across multiple pipelines, pass the `.components` property of an existing pipeline to the others. This avoids loading the weights into RAM more than once.

```py
from diffusers import (
    StableDiffusionPipeline,
    StableDiffusionImg2ImgPipeline,
    StableDiffusionInpaintPipeline,
)

text2img = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
img2img = StableDiffusionImg2ImgPipeline(**text2img.components)
inpaint = StableDiffusionInpaintPipeline(**text2img.components)

# now you can use text2img(...), img2img(...), inpaint(...) just like the call methods of each respective pipeline
```

### Create web demos using `gradio`

The Stable Diffusion pipelines are automatically supported in [Gradio](https://github.com/gradio-app/gradio/), a library that makes creating beautiful and user-friendly machine learning apps on the web a breeze.
First, make sure you have Gradio installed:

```sh
pip install -U gradio
```

Then, create a web demo around any Stable Diffusion-based pipeline. For example, you can create an image generation pipeline in a single line of code with Gradio's [`Interface.from_pipeline`](https://www.gradio.app/docs/interface#interface-from-pipeline) function:

```py
from diffusers import StableDiffusionPipeline
import gradio as gr

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

gr.Interface.from_pipeline(pipe).launch()
```

which opens an intuitive drag-and-drop interface in your browser:

![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/gradio-panda.png)

Similarly, you could create a demo for an image-to-image pipeline with:

```py
from diffusers import StableDiffusionImg2ImgPipeline
import gradio as gr

pipe = StableDiffusionImg2ImgPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5")

gr.Interface.from_pipeline(pipe).launch()
```

By default, the web demo runs on a local server. If you'd like to share it with others, you can generate a temporary public link by setting `share=True` in `launch()`. Or, you can host your demo on [Hugging Face Spaces](https://huggingface.co/spaces) for a permanent link.

================================================
FILE: docs/source/en/api/pipelines/stable_diffusion/sdxl_turbo.md
================================================

# SDXL Turbo

Stable Diffusion XL (SDXL) Turbo was proposed in [Adversarial Diffusion Distillation](https://stability.ai/research/adversarial-diffusion-distillation) by Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach.

The abstract from the paper is:

*We introduce Adversarial Diffusion Distillation (ADD), a novel training approach that efficiently samples large-scale foundational image diffusion models in just 1–4 steps while maintaining high image quality.
We use score distillation to leverage large-scale off-the-shelf image diffusion models as a teacher signal in combination with an adversarial loss to ensure high image fidelity even in the low-step regime of one or two sampling steps. Our analyses show that our model clearly outperforms existing few-step methods (GANs,Latent Consistency Models) in a single step and reaches the performance of state-of-the-art diffusion models (SDXL) in only four steps. ADD is the first method to unlock single-step, real-time image synthesis with foundation models.* ## Tips - SDXL Turbo uses the exact same architecture as [SDXL](./stable_diffusion_xl), which means it also has the same API. Please refer to the [SDXL](./stable_diffusion_xl) API reference for more details. - SDXL Turbo should disable guidance scale by setting `guidance_scale=0.0`. - SDXL Turbo should use `timestep_spacing='trailing'` for the scheduler and use between 1 and 4 steps. - SDXL Turbo has been trained to generate images of size 512x512. - SDXL Turbo is open-access, but not open-source meaning that one might have to buy a model license in order to use it for commercial applications. Make sure to read the [official model card](https://huggingface.co/stabilityai/sdxl-turbo) to learn more. > [!TIP] > To learn how to use SDXL Turbo for various tasks, how to optimize performance, and other usage examples, take a look at the [SDXL Turbo](../../../using-diffusers/sdxl_turbo) guide. > > Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the official base and refiner model checkpoints! 
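
The `timestep_spacing='trailing'` tip above matters most at very low step counts. As a rough sketch in plain Python, the two functions below show which timesteps a "trailing" and a "leading" spacing would pick. The function names, the formulas, and the default of 1000 training timesteps are assumptions based on the commonly used definitions of these spacings. They are not copied from the Diffusers schedulers.

```python
def trailing_timesteps(num_inference_steps, num_train_timesteps=1000):
    """Evenly spaced timesteps counted back from the last (noisiest) training timestep."""
    step = num_train_timesteps / num_inference_steps
    return [round(num_train_timesteps - i * step) - 1 for i in range(num_inference_steps)]


def leading_timesteps(num_inference_steps, num_train_timesteps=1000, steps_offset=0):
    """Evenly spaced timesteps counted up from the first (cleanest) training timestep."""
    step = num_train_timesteps // num_inference_steps
    return [i * step + steps_offset for i in reversed(range(num_inference_steps))]


print(trailing_timesteps(4))  # [999, 749, 499, 249]
print(leading_timesteps(4))   # [750, 500, 250, 0]
print(trailing_timesteps(1))  # [999]
print(leading_timesteps(1))   # [0]
```

Under these assumptions, a single-step run with leading spacing would begin at timestep 0, where the model expects an almost clean input. Trailing spacing begins at the noisiest timestep instead, which is what a one-step sampler starting from pure noise needs.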
================================================
FILE: docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.md
================================================

# Stable Diffusion 2

Stable Diffusion 2 is a text-to-image _latent diffusion_ model built upon the work of the original [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release), and it was led by Robin Rombach and Katherine Crowson from [Stability AI](https://stability.ai/) and [LAION](https://laion.ai/).

*The Stable Diffusion 2.0 release includes robust text-to-image models trained using a brand new text encoder (OpenCLIP), developed by LAION with support from Stability AI, which greatly improves the quality of the generated images compared to earlier V1 releases. The text-to-image models in this release can generate images with default resolutions of both 512x512 pixels and 768x768 pixels. These models are trained on an aesthetic subset of the [LAION-5B dataset](https://laion.ai/blog/laion-5b/) created by the DeepFloyd team at Stability AI, which is then further filtered to remove adult content using [LAION’s NSFW filter](https://openreview.net/forum?id=M3Y74vmsMcY).*

For more details about how Stable Diffusion 2 works and how it differs from the original Stable Diffusion, please refer to the official [announcement post](https://stability.ai/blog/stable-diffusion-v2-release).

The architecture of Stable Diffusion 2 is more or less identical to the original [Stable Diffusion model](./text2img), so check out its API documentation for how to use Stable Diffusion 2. We recommend using the [`DPMSolverMultistepScheduler`] as it gives a reasonable speed/quality trade-off and can be run with as few as 20 steps.
Stable Diffusion 2 is available for tasks like text-to-image, inpainting, super-resolution, and depth-to-image: | Task | Repository | |-------------------------|---------------------------------------------------------------------------------------------------------------| | text-to-image (512x512) | [stabilityai/stable-diffusion-2-base](https://huggingface.co/stabilityai/stable-diffusion-2-base) | | text-to-image (768x768) | [stabilityai/stable-diffusion-2](https://huggingface.co/stabilityai/stable-diffusion-2) | | inpainting | [stabilityai/stable-diffusion-2-inpainting](https://huggingface.co/stabilityai/stable-diffusion-2-inpainting) | | super-resolution | [stable-diffusion-x4-upscaler](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler) | | depth-to-image | [stabilityai/stable-diffusion-2-depth](https://huggingface.co/stabilityai/stable-diffusion-2-depth) | Here are some examples for how to use Stable Diffusion 2 for each task: > [!TIP] > Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! > > If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis) and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! 
## Text-to-image

```py
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
import torch

repo_id = "stabilityai/stable-diffusion-2-base"
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, variant="fp16")

pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

prompt = "High quality photo of an astronaut riding a horse in space"
image = pipe(prompt, num_inference_steps=25).images[0]
image
```

## Inpainting

```py
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import load_image, make_image_grid

img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = load_image(img_url).resize((512, 512))
mask_image = load_image(mask_url).resize((512, 512))

repo_id = "stabilityai/stable-diffusion-2-inpainting"
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, variant="fp16")

pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
image = pipe(prompt=prompt, image=init_image, mask_image=mask_image, num_inference_steps=25).images[0]
make_image_grid([init_image, mask_image, image], rows=1, cols=3)
```

## Super-resolution

```py
from diffusers import StableDiffusionUpscalePipeline
from diffusers.utils import load_image, make_image_grid
import torch

# load model and scheduler
model_id = "stabilityai/stable-diffusion-x4-upscaler"
pipeline = StableDiffusionUpscalePipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipeline = pipeline.to("cuda")

# let's download an image
url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-upscale/low_res_cat.png"
low_res_img = load_image(url)
low_res_img = low_res_img.resize((128, 128))

prompt = "a white cat"
upscaled_image = pipeline(prompt=prompt, image=low_res_img).images[0]
make_image_grid([low_res_img.resize((512, 512)), upscaled_image.resize((512, 512))], rows=1, cols=2)
```

## Depth-to-image

```py
import torch
from diffusers import StableDiffusionDepth2ImgPipeline
from diffusers.utils import load_image, make_image_grid

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth",
    torch_dtype=torch.float16,
).to("cuda")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
init_image = load_image(url)

prompt = "two tigers"
negative_prompt = "bad, deformed, ugly, bad anatomy"
image = pipe(prompt=prompt, image=init_image, negative_prompt=negative_prompt, strength=0.7).images[0]
make_image_grid([init_image, image], rows=1, cols=2)
```

================================================
FILE: docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_3.md
================================================

# Stable Diffusion 3
LoRA MPS
Stable Diffusion 3 (SD3) was proposed in [Scaling Rectified Flow Transformers for High-Resolution Image Synthesis](https://huggingface.co/papers/2403.03206) by Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Muller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. The abstract from the paper is: *Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations.* ## Usage Example _As the model is gated, before using it with diffusers you first need to go to the [Stable Diffusion 3 Medium Hugging Face page](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers), fill in the form and accept the gate. 
Once you are in, you need to log in so that your system knows you’ve accepted the gate._ Use the command below to log in: ```bash hf auth login ``` > [!TIP] > The SD3 pipeline uses three text encoders to generate an image. Model offloading is necessary in order for it to run on most commodity hardware. Please use the `torch.float16` data type for additional memory savings. ```python import torch from diffusers import StableDiffusion3Pipeline pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16) pipe.to("cuda") image = pipe( prompt="a photo of a cat holding a sign that says hello world", negative_prompt="", num_inference_steps=28, height=1024, width=1024, guidance_scale=7.0, ).images[0] image.save("sd3_hello_world.png") ``` **Note:** Stable Diffusion 3.5 can also be run using the SD3 pipeline, and all mentioned optimizations and techniques apply to it as well. In total there are three official models in the SD3 family: - [`stabilityai/stable-diffusion-3-medium-diffusers`](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers) - [`stabilityai/stable-diffusion-3.5-large`](https://huggingface.co/stabilityai/stable-diffusion-3.5-large) - [`stabilityai/stable-diffusion-3.5-large-turbo`](https://huggingface.co/stabilityai/stable-diffusion-3.5-large-turbo) ## Image Prompting with IP-Adapters An IP-Adapter lets you prompt SD3 with images, in addition to the text prompt. This is especially useful when you have reference images for complex concepts that are difficult to articulate through text alone. To load and use an IP-Adapter, you need: - `image_encoder`: Pre-trained vision model used to obtain image features, usually a CLIP image encoder. - `feature_extractor`: Image processor that prepares the input image for the chosen `image_encoder`. - `ip_adapter_id`: Checkpoint containing parameters of image cross attention layers and image projection.
IP-Adapters are trained for a specific model architecture, so they also work in finetuned variations of the base model. You can use the [`~SD3IPAdapterMixin.set_ip_adapter_scale`] function to adjust how strongly the output aligns with the image prompt. The higher the value, the more closely the model follows the image prompt. A default value of 0.5 is typically a good balance, ensuring the model considers both the text and image prompts equally. ```python import torch from PIL import Image from diffusers import StableDiffusion3Pipeline from transformers import SiglipVisionModel, SiglipImageProcessor image_encoder_id = "google/siglip-so400m-patch14-384" ip_adapter_id = "InstantX/SD3.5-Large-IP-Adapter" feature_extractor = SiglipImageProcessor.from_pretrained( image_encoder_id, torch_dtype=torch.float16 ) image_encoder = SiglipVisionModel.from_pretrained( image_encoder_id, torch_dtype=torch.float16 ).to( "cuda") pipe = StableDiffusion3Pipeline.from_pretrained( "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.float16, feature_extractor=feature_extractor, image_encoder=image_encoder, ).to("cuda") pipe.load_ip_adapter(ip_adapter_id) pipe.set_ip_adapter_scale(0.6) ref_img = Image.open("image.jpg").convert('RGB') image = pipe( width=1024, height=1024, prompt="a cat", negative_prompt="lowres, low quality, worst quality", num_inference_steps=24, guidance_scale=5.0, ip_adapter_image=ref_img ).images[0] image.save("result.jpg") ```
IP-Adapter examples with prompt "a cat"
> [!TIP] > Check out [IP-Adapter](../../../using-diffusers/ip_adapter) to learn more about how IP-Adapters work. ## Memory Optimisations for SD3 SD3 uses three text encoders, one of which is the very large T5-XXL model. This makes it challenging to run the model on GPUs with less than 24GB of VRAM, even when using `fp16` precision. The following section outlines a few memory optimizations in Diffusers that make it easier to run SD3 on low resource hardware. ### Running Inference with Model Offloading The most basic memory optimization available in Diffusers allows you to offload the components of the model to CPU during inference in order to save memory, while seeing a slight increase in inference latency. Model offloading will only move a model component onto the GPU when it needs to be executed, while keeping the remaining components on the CPU. ```python import torch from diffusers import StableDiffusion3Pipeline pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16) pipe.enable_model_cpu_offload() image = pipe( prompt="a photo of a cat holding a sign that says hello world", negative_prompt="", num_inference_steps=28, height=1024, width=1024, guidance_scale=7.0, ).images[0] image.save("sd3_hello_world.png") ``` ### Dropping the T5 Text Encoder during Inference Removing the memory-intensive 4.7B parameter T5-XXL text encoder during inference can significantly decrease the memory requirements for SD3 with only a slight loss in performance. 
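As a back-of-the-envelope check on why the T5-XXL encoder dominates the memory budget, the weight footprint can be estimated as parameter count times bytes per parameter. The sketch below is plain arithmetic, not a Diffusers API: the helper name `weight_memory_gib` is hypothetical, the 4.7B figure is the one quoted above, and activations, buffers, and framework overhead are ignored, so real usage will be higher.

```python
# Rough estimate of the memory needed just to hold model weights.
# Assumption: every parameter is stored at the given width; activations,
# buffers, and framework overhead are ignored, so real usage is higher.

GIB = 1024**3


def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Return the approximate weight footprint in GiB."""
    return num_params * bytes_per_param / GIB


T5_XXL_PARAMS = 4.7e9  # parameter count quoted in this section

for label, width in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    size = weight_memory_gib(T5_XXL_PARAMS, width)
    print(f"T5-XXL weights in {label}: ~{size:.2f} GiB")
```

Under these assumptions the encoder alone needs close to 9 GiB in `fp16` and roughly half that at 8-bit width, which is why dropping or quantizing it has such a large effect on GPUs with less than 24GB of VRAM.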
```python import torch from diffusers import StableDiffusion3Pipeline pipe = StableDiffusion3Pipeline.from_pretrained( "stabilityai/stable-diffusion-3-medium-diffusers", text_encoder_3=None, tokenizer_3=None, torch_dtype=torch.float16 ) pipe.to("cuda") image = pipe( prompt="a photo of a cat holding a sign that says hello world", negative_prompt="", num_inference_steps=28, height=1024, width=1024, guidance_scale=7.0, ).images[0] image.save("sd3_hello_world-no-T5.png") ``` ### Using a Quantized Version of the T5 Text Encoder We can leverage the `bitsandbytes` library to load and quantize the T5-XXL text encoder to 8-bit precision. This allows you to keep using all three text encoders while only slightly impacting performance. First install the `bitsandbytes` library. ```shell pip install bitsandbytes ``` Then load the T5-XXL model using the `BitsAndBytesConfig`. ```python import torch from diffusers import StableDiffusion3Pipeline from transformers import T5EncoderModel, BitsAndBytesConfig quantization_config = BitsAndBytesConfig(load_in_8bit=True) model_id = "stabilityai/stable-diffusion-3-medium-diffusers" text_encoder = T5EncoderModel.from_pretrained( model_id, subfolder="text_encoder_3", quantization_config=quantization_config, ) pipe = StableDiffusion3Pipeline.from_pretrained( model_id, text_encoder_3=text_encoder, device_map="balanced", torch_dtype=torch.float16 ) image = pipe( prompt="a photo of a cat holding a sign that says hello world", negative_prompt="", num_inference_steps=28, height=1024, width=1024, guidance_scale=7.0, ).images[0] image.save("sd3_hello_world-8bit-T5.png") ``` You can find the end-to-end script [here](https://gist.github.com/sayakpaul/82acb5976509851f2db1a83456e504f1). ## Performance Optimizations for SD3 ### Using Torch Compile to Speed Up Inference Using compiled components in the SD3 pipeline can speed up inference by as much as 4X. 
The following code snippet demonstrates how to compile the Transformer and VAE components of the SD3 pipeline. ```python import torch from diffusers import StableDiffusion3Pipeline torch.set_float32_matmul_precision("high") torch._inductor.config.conv_1x1_as_mm = True torch._inductor.config.coordinate_descent_tuning = True torch._inductor.config.epilogue_fusion = False torch._inductor.config.coordinate_descent_check_all_directions = True pipe = StableDiffusion3Pipeline.from_pretrained( "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16 ).to("cuda") pipe.set_progress_bar_config(disable=True) pipe.transformer.to(memory_format=torch.channels_last) pipe.vae.to(memory_format=torch.channels_last) pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True) pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune", fullgraph=True) # Warm Up prompt = "a photo of a cat holding a sign that says hello world" for _ in range(3): _ = pipe(prompt=prompt, generator=torch.manual_seed(1)) # Run Inference image = pipe(prompt=prompt, generator=torch.manual_seed(1)).images[0] image.save("sd3_hello_world.png") ``` Check out the full script [here](https://gist.github.com/sayakpaul/508d89d7aad4f454900813da5d42ca97). ## Quantization Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on image quality depending on the model. Refer to the [Quantization](../../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`StableDiffusion3Pipeline`] for inference with bitsandbytes.
```py import torch from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, SD3Transformer2DModel, StableDiffusion3Pipeline from transformers import BitsAndBytesConfig, T5EncoderModel quant_config = BitsAndBytesConfig(load_in_8bit=True) text_encoder_8bit = T5EncoderModel.from_pretrained( "stabilityai/stable-diffusion-3.5-large", subfolder="text_encoder_3", quantization_config=quant_config, torch_dtype=torch.float16, ) quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True) transformer_8bit = SD3Transformer2DModel.from_pretrained( "stabilityai/stable-diffusion-3.5-large", subfolder="transformer", quantization_config=quant_config, torch_dtype=torch.float16, ) pipeline = StableDiffusion3Pipeline.from_pretrained( "stabilityai/stable-diffusion-3.5-large", text_encoder_3=text_encoder_8bit, transformer=transformer_8bit, torch_dtype=torch.float16, device_map="balanced", ) prompt = "a tiny astronaut hatching from an egg on the moon" image = pipeline(prompt, num_inference_steps=28, guidance_scale=7.0).images[0] image.save("sd3.png") ``` ## Using Long Prompts with the T5 Text Encoder By default, the T5 Text Encoder uses a maximum sequence length of `256` tokens for the prompt. This can be adjusted by setting `max_sequence_length` to accept fewer or more tokens. Keep in mind that longer sequences require additional resources and result in longer generation times, especially during batch inference. ```python prompt = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus, basking in a river of melted butter amidst a breakfast-themed landscape. It features the distinctive, bulky body shape of a hippo. However, instead of the usual grey skin, the creature’s body resembles a golden-brown, crispy waffle fresh off the griddle. The skin is textured with the familiar grid pattern of a waffle, each square filled with a glistening sheen of syrup.
The environment combines the natural habitat of a hippo with elements of a breakfast table setting, a river of warm, melted butter, with oversized utensils or plates peeking out from the lush, pancake-like foliage in the background, a towering pepper mill standing in for a tree. As the sun rises in this fantastical world, it casts a warm, buttery glow over the scene. The creature, content in its butter river, lets out a yawn. Nearby, a flock of birds take flight" image = pipe( prompt=prompt, negative_prompt="", num_inference_steps=28, guidance_scale=4.5, max_sequence_length=512, ).images[0] ``` ### Sending a different prompt to the T5 Text Encoder You can send a different prompt to the CLIP Text Encoders and the T5 Text Encoder to prevent the prompt from being truncated by the CLIP Text Encoders and to improve generation. > [!TIP] > The prompt with the CLIP Text Encoders is still truncated to the 77 token limit. ```python prompt = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus, basking in a river of melted butter amidst a breakfast-themed landscape. A river of warm, melted butter, pancake-like foliage in the background, a towering pepper mill standing in for a tree." prompt_3 = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus, basking in a river of melted butter amidst a breakfast-themed landscape. It features the distinctive, bulky body shape of a hippo. However, instead of the usual grey skin, the creature’s body resembles a golden-brown, crispy waffle fresh off the griddle. The skin is textured with the familiar grid pattern of a waffle, each square filled with a glistening sheen of syrup. 
The environment combines the natural habitat of a hippo with elements of a breakfast table setting, a river of warm, melted butter, with oversized utensils or plates peeking out from the lush, pancake-like foliage in the background, a towering pepper mill standing in for a tree. As the sun rises in this fantastical world, it casts a warm, buttery glow over the scene. The creature, content in its butter river, lets out a yawn. Nearby, a flock of birds take flight" image = pipe( prompt=prompt, prompt_3=prompt_3, negative_prompt="", num_inference_steps=28, guidance_scale=4.5, max_sequence_length=512, ).images[0] ``` ## Tiny AutoEncoder for Stable Diffusion 3 Tiny AutoEncoder for Stable Diffusion (TAESD3) is a tiny distilled version of Stable Diffusion 3's VAE by [Ollin Boer Bohan](https://github.com/madebyollin/taesd) that can decode [`StableDiffusion3Pipeline`] latents almost instantly. To use with Stable Diffusion 3: ```python import torch from diffusers import StableDiffusion3Pipeline, AutoencoderTiny pipe = StableDiffusion3Pipeline.from_pretrained( "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16 ) pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd3", torch_dtype=torch.float16) pipe = pipe.to("cuda") prompt = "slice of delicious New York-style berry cheesecake" image = pipe(prompt, num_inference_steps=25).images[0] image.save("cheesecake.png") ``` ## Loading the original checkpoints via `from_single_file` The `SD3Transformer2DModel` and `StableDiffusion3Pipeline` classes support loading the original checkpoints via the `from_single_file` method. This method allows you to load the original checkpoint files that were used to train the models. 
## Loading the original checkpoints for the `SD3Transformer2DModel` ```python from diffusers import SD3Transformer2DModel model = SD3Transformer2DModel.from_single_file("https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/sd3_medium.safetensors") ``` ## Loading the single checkpoint for the `StableDiffusion3Pipeline` ### Loading the single file checkpoint without T5 ```python import torch from diffusers import StableDiffusion3Pipeline pipe = StableDiffusion3Pipeline.from_single_file( "https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/sd3_medium_incl_clips.safetensors", torch_dtype=torch.float16, text_encoder_3=None ) pipe.enable_model_cpu_offload() image = pipe("a picture of a cat holding a sign that says hello world").images[0] image.save('sd3-single-file.png') ``` ### Loading the single file checkpoint with T5 > [!TIP] > The following example loads a checkpoint stored in a 8-bit floating point format which requires PyTorch 2.3 or later. ```python import torch from diffusers import StableDiffusion3Pipeline pipe = StableDiffusion3Pipeline.from_single_file( "https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/sd3_medium_incl_clips_t5xxlfp8.safetensors", torch_dtype=torch.float16, ) pipe.enable_model_cpu_offload() image = pipe("a picture of a cat holding a sign that says hello world").images[0] image.save('sd3-single-file-t5-fp8.png') ``` ### Loading the single file checkpoint for the Stable Diffusion 3.5 Transformer Model ```python import torch from diffusers import SD3Transformer2DModel, StableDiffusion3Pipeline transformer = SD3Transformer2DModel.from_single_file( "https://huggingface.co/stabilityai/stable-diffusion-3.5-large-turbo/blob/main/sd3.5_large.safetensors", torch_dtype=torch.bfloat16, ) pipe = StableDiffusion3Pipeline.from_pretrained( "stabilityai/stable-diffusion-3.5-large", transformer=transformer, torch_dtype=torch.bfloat16, ) pipe.enable_model_cpu_offload() image = pipe("a cat holding a sign 
that says hello world").images[0] image.save("sd35.png") ``` ## StableDiffusion3Pipeline [[autodoc]] StableDiffusion3Pipeline - all - __call__ ================================================ FILE: docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_safe.md ================================================ > [!WARNING] > This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, install the last Diffusers version that supported this model. # Safe Stable Diffusion Safe Stable Diffusion was proposed in [Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models](https://huggingface.co/papers/2211.05105). It mitigates the inappropriate degeneration that Stable Diffusion models exhibit because they're trained on unfiltered web-crawled datasets. For instance, Stable Diffusion may unexpectedly generate nudity, violence, images depicting self-harm, and otherwise offensive content. Safe Stable Diffusion is an extension of Stable Diffusion that drastically reduces this type of content. The abstract from the paper is: *Text-conditioned image generation models have recently achieved astonishing results in image quality and text alignment and are consequently employed in a fast-growing number of applications. Since they are highly data-driven, relying on billion-sized datasets randomly scraped from the internet, they also suffer, as we demonstrate, from degenerated and biased human behavior. In turn, they may even reinforce such biases. To help combat these undesired side effects, we present safe latent diffusion (SLD). Specifically, to measure the inappropriate degeneration due to unfiltered and imbalanced training sets, we establish a novel image generation test bed-inappropriate image prompts (I2P)-containing dedicated, real-world image-to-text prompts covering concepts such as nudity and violence.
As our exhaustive empirical evaluation demonstrates, the introduced SLD removes and suppresses inappropriate image parts during the diffusion process, with no additional training required and no adverse effect on overall image quality or text alignment.* ## Tips Use the `safety_concept` property of [`StableDiffusionPipelineSafe`] to check and edit the current safety concept: ```python >>> from diffusers import StableDiffusionPipelineSafe >>> pipeline = StableDiffusionPipelineSafe.from_pretrained("AIML-TUDA/stable-diffusion-safe") >>> pipeline.safety_concept 'an image showing hate, harassment, violence, suffering, humiliation, harm, suicide, sexual, nudity, bodily fluids, blood, obscene gestures, illegal activity, drug use, theft, vandalism, weapons, child abuse, brutality, cruelty' ``` For each image generation the active concept is also contained in [`StableDiffusionSafePipelineOutput`]. There are 4 configurations (`SafetyConfig.WEAK`, `SafetyConfig.MEDIUM`, `SafetyConfig.STRONG`, and `SafetyConfig.MAX`) that can be applied: ```python >>> from diffusers import StableDiffusionPipelineSafe >>> from diffusers.pipelines.stable_diffusion_safe import SafetyConfig >>> pipeline = StableDiffusionPipelineSafe.from_pretrained("AIML-TUDA/stable-diffusion-safe") >>> prompt = "the four horsewomen of the apocalypse, painting by tom of finland, gaston bussiere, craig mullins, j. c. leyendecker" >>> out = pipeline(prompt=prompt, **SafetyConfig.MAX) ``` > [!TIP] > Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! 
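Because the `SafetyConfig` presets are unpacked with `**` in the Tips example, they can be treated as plain keyword mappings and adjusted before the call. The sketch below illustrates that idea only: `build_safety_kwargs` and `generate_safely` are hypothetical helper names, and the `sld_*` keyword names in the usage note are assumed to match the arguments of `StableDiffusionPipelineSafe.__call__`, so check them against the API reference below before relying on them.

```python
def build_safety_kwargs(preset, **overrides):
    """Copy a safety preset and apply overrides for keys it already defines."""
    kwargs = dict(preset)  # copy so the shared preset is never mutated
    unknown = set(overrides) - set(kwargs)
    if unknown:
        raise KeyError(f"Unknown safety argument(s): {sorted(unknown)}")
    kwargs.update(overrides)
    return kwargs


def generate_safely(prompt, preset_name="MEDIUM", **overrides):
    """Run Safe Stable Diffusion with a named preset plus optional overrides."""
    # Imported lazily so build_safety_kwargs can be used without diffusers installed.
    from diffusers import StableDiffusionPipelineSafe
    from diffusers.pipelines.stable_diffusion_safe import SafetyConfig

    pipeline = StableDiffusionPipelineSafe.from_pretrained("AIML-TUDA/stable-diffusion-safe")
    preset = getattr(SafetyConfig, preset_name)  # WEAK, MEDIUM, STRONG, or MAX
    kwargs = build_safety_kwargs(preset, **overrides)
    return pipeline(prompt=prompt, **kwargs).images[0]
```

A call such as `generate_safely("a portrait photo", "STRONG", sld_warmup_steps=5)` would then start from the `STRONG` preset and change a single value, assuming the preset defines a key with that name. Rejecting unknown keys is a design choice of this sketch, so a typo fails early instead of being passed silently to the pipeline.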
## StableDiffusionPipelineSafe [[autodoc]] StableDiffusionPipelineSafe - all - __call__ ## StableDiffusionSafePipelineOutput [[autodoc]] pipelines.stable_diffusion_safe.StableDiffusionSafePipelineOutput - all - __call__ ================================================ FILE: docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md ================================================ # Stable Diffusion XL
LoRA MPS
Stable Diffusion XL (SDXL) was proposed in [SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis](https://huggingface.co/papers/2307.01952) by Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. The abstract from the paper is: *We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators.* ## Tips - Using SDXL with a DPM++ scheduler for fewer than 50 steps is known to produce [visual artifacts](https://github.com/huggingface/diffusers/issues/5433) because the solver becomes numerically unstable. To fix this issue, take a look at this [PR](https://github.com/huggingface/diffusers/pull/5541), which recommends the following for ODE/SDE solvers: - set `use_karras_sigmas=True` or `use_lu_lambdas=True` to improve image quality - set `euler_at_final=True` if you're using a solver with uniform step sizes (DPM++2M or DPM++2M SDE) - Most SDXL checkpoints work best with an image size of 1024x1024. Image sizes of 768x768 and 512x512 are also supported, but the results aren't as good. Anything below 512x512 is not recommended and is unlikely to work well with default checkpoints like [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0).
- SDXL can pass a different prompt for each of the text encoders it was trained on. We can even pass different parts of the same prompt to the text encoders. - SDXL output images can be improved by making use of a refiner model in an image-to-image setting. - SDXL offers `negative_original_size`, `negative_crops_coords_top_left`, and `negative_target_size` to negatively condition the model on image resolution and cropping parameters. > [!TIP] > To learn how to use SDXL for various tasks, how to optimize performance, and other usage examples, take a look at the [Stable Diffusion XL](../../../using-diffusers/sdxl) guide. > > Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the official base and refiner model checkpoints! ## StableDiffusionXLPipeline [[autodoc]] StableDiffusionXLPipeline - all - __call__ ## StableDiffusionXLImg2ImgPipeline [[autodoc]] StableDiffusionXLImg2ImgPipeline - all - __call__ ## StableDiffusionXLInpaintPipeline [[autodoc]] StableDiffusionXLInpaintPipeline - all - __call__ ================================================ FILE: docs/source/en/api/pipelines/stable_diffusion/svd.md ================================================ # Stable Video Diffusion Stable Video Diffusion was proposed in [Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets](https://hf.co/papers/2311.15127) by Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach. The abstract from the paper is: *We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets. 
However, training methods in the literature vary widely, and the field has yet to agree on a unified strategy for curating video data. In this paper, we identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning. Furthermore, we demonstrate the necessity of a well-curated pretraining dataset for generating high-quality videos and present a systematic curation process to train a strong base model, including captioning and filtering strategies. We then explore the impact of finetuning our base model on high-quality data and train a text-to-video model that is competitive with closed-source video generation. We also show that our base model provides a powerful motion representation for downstream tasks such as image-to-video generation and adaptability to camera motion-specific LoRA modules. Finally, we demonstrate that our model provides a strong multi-view 3D-prior and can serve as a base to finetune a multi-view diffusion model that jointly generates multiple views of objects in a feedforward fashion, outperforming image-based methods at a fraction of their compute budget. We release code and model weights at this https URL.* > [!TIP] > To learn how to use Stable Video Diffusion, take a look at the [Stable Video Diffusion](../../../using-diffusers/svd) guide. > >
> > Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the [base](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid) and [extended frame](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt) checkpoints! ## Tips Video generation is memory-intensive. One way to reduce memory usage is to call `enable_forward_chunking()` on the pipeline's UNet so the feedforward layer isn't run on the entire input at once. Processing the input in chunks in a loop lowers peak memory usage. Check out the [Text or image-to-video](../../../using-diffusers/text-img2vid) guide for more details about how certain parameters can affect video generation and how to optimize inference by reducing memory usage. ## StableVideoDiffusionPipeline [[autodoc]] StableVideoDiffusionPipeline ## StableVideoDiffusionPipelineOutput [[autodoc]] pipelines.stable_video_diffusion.StableVideoDiffusionPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/stable_diffusion/text2img.md ================================================ # Text-to-image
LoRA
The Stable Diffusion model was created by researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/), [Runway](https://github.com/runwayml), and [LAION](https://laion.ai/). The [`StableDiffusionPipeline`] is capable of generating photorealistic images given any text input. It's trained on 512x512 images from a subset of the LAION-5B dataset. This model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight and can run on consumer GPUs. Latent diffusion is the research on top of which Stable Diffusion was built. It was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://huggingface.co/papers/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer. The abstract from the paper is: *By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. 
By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs. Code is available at https://github.com/CompVis/latent-diffusion.* > [!TIP] > Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! > > If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis) and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! ## StableDiffusionPipeline [[autodoc]] StableDiffusionPipeline - all - __call__ - enable_attention_slicing - disable_attention_slicing - enable_vae_slicing - disable_vae_slicing - enable_xformers_memory_efficient_attention - disable_xformers_memory_efficient_attention - enable_vae_tiling - disable_vae_tiling - load_textual_inversion - from_single_file - load_lora_weights - save_lora_weights ## StableDiffusionPipelineOutput [[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/stable_diffusion/upscale.md ================================================ # Super-resolution
The Stable Diffusion upscaler diffusion model was created by the researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/), and [LAION](https://laion.ai/). It is used to enhance the resolution of input images by a factor of 4. > [!TIP] > Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! > > If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis) and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! ## StableDiffusionUpscalePipeline [[autodoc]] StableDiffusionUpscalePipeline - all - __call__ - enable_attention_slicing - disable_attention_slicing - enable_xformers_memory_efficient_attention - disable_xformers_memory_efficient_attention ## StableDiffusionPipelineOutput [[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/stable_unclip.md ================================================ # Stable unCLIP
Stable unCLIP checkpoints are finetuned from [Stable Diffusion 2.1](./stable_diffusion/stable_diffusion_2) checkpoints to condition on CLIP image embeddings. Stable unCLIP still conditions on text embeddings. Given the two separate conditionings, stable unCLIP can be used for text guided image variation. When combined with an unCLIP prior, it can also be used for full text to image generation. The abstract from the paper is: *Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.* ## Tips Stable unCLIP takes `noise_level` as input during inference which determines how much noise is added to the image embeddings. A higher `noise_level` increases variation in the final un-noised images. By default, we do not add any additional noise to the image embeddings (`noise_level = 0`). 
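
To build intuition for what `noise_level` does, here is a toy, pure-Python sketch. It is not the pipeline's implementation: the real pipeline relies on its own noising scheduler, and the function name `add_embedding_noise`, the `max_level=1000` range, and the simple linear blend are all assumptions made for illustration.

```python
import math
import random


def add_embedding_noise(embedding, noise_level, max_level=1000, seed=0):
    """Blend an embedding with Gaussian noise (illustrative only).

    `max_level` and the linear blend are hypothetical; the real pipeline
    uses its own noise schedule.
    """
    if not 0 <= noise_level <= max_level:
        raise ValueError("noise_level must be in [0, max_level]")
    # Fraction of the original signal that is kept at this noise level.
    keep = 1.0 - noise_level / max_level
    rng = random.Random(seed)
    return [
        math.sqrt(keep) * value + math.sqrt(1.0 - keep) * rng.gauss(0.0, 1.0)
        for value in embedding
    ]


embedding = [0.5, -1.0, 2.0]
unchanged = add_embedding_noise(embedding, noise_level=0)    # no extra noise
noisy = add_embedding_noise(embedding, noise_level=1000)     # pure noise
```

In this sketch a level of `0` returns the embedding untouched, while larger levels replace more of it with noise, which is why higher values lead to more varied outputs.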
### Text-to-Image Generation Stable unCLIP can be leveraged for text-to-image generation by pipelining it with the prior model of KakaoBrain's open source DALL-E 2 replication [Karlo](https://huggingface.co/kakaobrain/karlo-v1-alpha): ```python import torch from diffusers import UnCLIPScheduler, DDPMScheduler, StableUnCLIPPipeline from diffusers.models import PriorTransformer from transformers import CLIPTokenizer, CLIPTextModelWithProjection prior_model_id = "kakaobrain/karlo-v1-alpha" data_type = torch.float16 prior = PriorTransformer.from_pretrained(prior_model_id, subfolder="prior", torch_dtype=data_type) prior_text_model_id = "openai/clip-vit-large-patch14" prior_tokenizer = CLIPTokenizer.from_pretrained(prior_text_model_id) prior_text_model = CLIPTextModelWithProjection.from_pretrained(prior_text_model_id, torch_dtype=data_type) prior_scheduler = UnCLIPScheduler.from_pretrained(prior_model_id, subfolder="prior_scheduler") prior_scheduler = DDPMScheduler.from_config(prior_scheduler.config) stable_unclip_model_id = "stabilityai/stable-diffusion-2-1-unclip-small" pipe = StableUnCLIPPipeline.from_pretrained( stable_unclip_model_id, torch_dtype=data_type, variant="fp16", prior_tokenizer=prior_tokenizer, prior_text_encoder=prior_text_model, prior=prior, prior_scheduler=prior_scheduler, ) pipe = pipe.to("cuda") wave_prompt = "dramatic wave, the Oceans roar, Strong wave spiral across the oceans as the waves unfurl into roaring crests; perfect wave form; perfect wave shape; dramatic wave shape; wave shape unbelievable; wave; wave shape spectacular" image = pipe(prompt=wave_prompt).images[0] image ``` > [!WARNING] > For text-to-image we use `stabilityai/stable-diffusion-2-1-unclip-small` as it was trained on CLIP ViT-L/14 embedding, the same as the Karlo model prior. [stabilityai/stable-diffusion-2-1-unclip](https://hf.co/stabilityai/stable-diffusion-2-1-unclip) was trained on OpenCLIP ViT-H, so we don't recommend its use. 
### Text guided Image-to-Image Variation

```python
from diffusers import StableUnCLIPImg2ImgPipeline
from diffusers.utils import load_image
import torch

# `variant="fp16"` selects the half-precision weight files
pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16, variant="fp16"
)
pipe = pipe.to("cuda")

url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/tarsila_do_amaral.png"
init_image = load_image(url)

images = pipe(init_image).images
images[0].save("variation_image.png")
```

Optionally, you can also pass a prompt to `pipe` such as:

```python
prompt = "A fantasy landscape, trending on artstation"

image = pipe(init_image, prompt=prompt).images[0]
image
```

> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

## StableUnCLIPPipeline

[[autodoc]] StableUnCLIPPipeline
	- all
	- __call__
	- enable_attention_slicing
	- disable_attention_slicing
	- enable_vae_slicing
	- disable_vae_slicing
	- enable_xformers_memory_efficient_attention
	- disable_xformers_memory_efficient_attention

## StableUnCLIPImg2ImgPipeline

[[autodoc]] StableUnCLIPImg2ImgPipeline
	- all
	- __call__
	- enable_attention_slicing
	- disable_attention_slicing
	- enable_vae_slicing
	- disable_vae_slicing
	- enable_xformers_memory_efficient_attention
	- disable_xformers_memory_efficient_attention

## ImagePipelineOutput
[[autodoc]] pipelines.ImagePipelineOutput

================================================
FILE: docs/source/en/api/pipelines/text_to_video.md
================================================
> [!WARNING]
> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it.
If you run into any issues, reinstall the last Diffusers version that supported this model. # Text-to-video
[ModelScope Text-to-Video Technical Report](https://huggingface.co/papers/2308.06571) is by Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, Shiwei Zhang. The abstract from the paper is: *This paper introduces ModelScopeT2V, a text-to-video synthesis model that evolves from a text-to-image synthesis model (i.e., Stable Diffusion). ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame generation and smooth movement transitions. The model could adapt to varying frame numbers during training and inference, rendering it suitable for both image-text and video-text datasets. ModelScopeT2V brings together three components (i.e., VQGAN, a text encoder, and a denoising UNet), totally comprising 1.7 billion parameters, in which 0.5 billion parameters are dedicated to temporal capabilities. The model demonstrates superior performance over state-of-the-art methods across three evaluation metrics. The code and an online demo are available at https://modelscope.cn/models/damo/text-to-video-synthesis/summary.* You can find additional information about Text-to-Video on the [project page](https://modelscope.cn/models/damo/text-to-video-synthesis/summary), [original codebase](https://github.com/modelscope/modelscope/), and try it out in a [demo](https://huggingface.co/spaces/damo-vilab/modelscope-text-to-video-synthesis). Official checkpoints can be found at [damo-vilab](https://huggingface.co/damo-vilab) and [cerspense](https://huggingface.co/cerspense). 
## Usage example ### `text-to-video-ms-1.7b` Let's start by generating a short video with the default length of 16 frames (2s at 8 fps): ```python import torch from diffusers import DiffusionPipeline from diffusers.utils import export_to_video pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16") pipe = pipe.to("cuda") prompt = "Spiderman is surfing" video_frames = pipe(prompt).frames[0] video_path = export_to_video(video_frames) video_path ``` Diffusers supports different optimization techniques to improve the latency and memory footprint of a pipeline. Since videos are often more memory-heavy than images, we can enable CPU offloading and VAE slicing to keep the memory footprint at bay. Let's generate a video of 8 seconds (64 frames) on the same GPU using CPU offloading and VAE slicing: ```python import torch from diffusers import DiffusionPipeline from diffusers.utils import export_to_video pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16") pipe.enable_model_cpu_offload() # memory optimization pipe.enable_vae_slicing() prompt = "Darth Vader surfing a wave" video_frames = pipe(prompt, num_frames=64).frames[0] video_path = export_to_video(video_frames) video_path ``` It just takes **7 GBs of GPU memory** to generate the 64 video frames using PyTorch 2.0, "fp16" precision and the techniques mentioned above. 
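
The frame counts above follow from simple arithmetic: duration in seconds times frames per second. The helpers below are a small sketch of that relationship. The names `frames_for_duration` and `duration_for_frames` are hypothetical, and the default of 8 fps is taken from the text above rather than from any library default.

```python
import math


def frames_for_duration(seconds: float, fps: int = 8) -> int:
    """Number of frames needed to cover `seconds` of video at `fps`."""
    return math.ceil(seconds * fps)


def duration_for_frames(num_frames: int, fps: int = 8) -> float:
    """Playback length in seconds of `num_frames` frames at `fps`."""
    return num_frames / fps


short_clip = frames_for_duration(2)    # the 16-frame default clip
long_clip = frames_for_duration(8)     # the 64-frame clip generated with offloading
seconds = duration_for_frames(64)
```

The resulting frame count is what you would pass as `num_frames` in the pipeline call. The playback rate is decided later, when the frames are written to a video file.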
We can also use a different scheduler easily, using the same method we'd use for Stable Diffusion: ```python import torch from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler from diffusers.utils import export_to_video pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16") pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) pipe.enable_model_cpu_offload() prompt = "Spiderman is surfing" video_frames = pipe(prompt, num_inference_steps=25).frames[0] video_path = export_to_video(video_frames) video_path ``` Here are some sample outputs:
An astronaut riding a horse.
Darth vader surfing in waves.
### `cerspense/zeroscope_v2_576w` & `cerspense/zeroscope_v2_XL`

The Zeroscope checkpoints are watermark-free and have been trained on specific sizes such as `576x320` and `1024x576`.
First generate a video using the lower-resolution checkpoint [`cerspense/zeroscope_v2_576w`](https://huggingface.co/cerspense/zeroscope_v2_576w) with [`TextToVideoSDPipeline`], then upscale it using [`VideoToVideoSDPipeline`] and [`cerspense/zeroscope_v2_XL`](https://huggingface.co/cerspense/zeroscope_v2_XL).

```py
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video
from PIL import Image

pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_576w", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

# memory optimization
pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
pipe.enable_vae_slicing()

prompt = "Darth Vader surfing a wave"
video_frames = pipe(prompt, num_frames=24).frames[0]
video_path = export_to_video(video_frames)
video_path
```

Now the video can be upscaled:

```py
pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_XL", torch_dtype=torch.float16)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

# memory optimization
pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
pipe.enable_vae_slicing()

# resize the low-resolution frames to the resolution the XL checkpoint was trained on
video = [Image.fromarray(frame).resize((1024, 576)) for frame in video_frames]

video_frames = pipe(prompt, video=video, strength=0.6).frames[0]
video_path = export_to_video(video_frames)
video_path
```

Here are some sample outputs:
Darth vader surfing in waves.
## Tips Video generation is memory-intensive and one way to reduce your memory usage is to set `enable_forward_chunking` on the pipeline's UNet so you don't run the entire feedforward layer at once. Breaking it up into chunks in a loop is more efficient. Check out the [Text or image-to-video](../../using-diffusers/text-img2vid) guide for more details about how certain parameters can affect video generation and how to optimize inference by reducing memory usage. > [!TIP] > Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. ## TextToVideoSDPipeline [[autodoc]] TextToVideoSDPipeline - all - __call__ ## VideoToVideoSDPipeline [[autodoc]] VideoToVideoSDPipeline - all - __call__ ## TextToVideoSDPipelineOutput [[autodoc]] pipelines.text_to_video_synthesis.TextToVideoSDPipelineOutput ================================================ FILE: docs/source/en/api/pipelines/text_to_video_zero.md ================================================ > [!WARNING] > This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. # Text2Video-Zero
[Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators](https://huggingface.co/papers/2303.13439) is by Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, [Zhangyang Wang](https://www.ece.utexas.edu/people/faculty/atlas-wang), Shant Navasardyan, [Humphrey Shi](https://www.humphreyshi.com). Text2Video-Zero enables zero-shot video generation using either: 1. A textual prompt 2. A prompt combined with guidance from poses or edges 3. Video Instruct-Pix2Pix (instruction-guided video editing) Results are temporally consistent and closely follow the guidance and textual prompts. ![teaser-img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/t2v_zero_teaser.png) The abstract from the paper is: *Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. In this paper, we introduce a new task of zero-shot text-to-video generation and propose a low-cost approach (without any training or optimization) by leveraging the power of existing text-to-image synthesis methods (e.g., Stable Diffusion), making them suitable for the video domain. Our key modifications include (i) enriching the latent codes of the generated frames with motion dynamics to keep the global scene and the background time consistent; and (ii) reprogramming frame-level self-attention using a new cross-frame attention of each frame on the first frame, to preserve the context, appearance, and identity of the foreground object. Experiments show that this leads to low overhead, yet high-quality and remarkably consistent video generation. Moreover, our approach is not limited to text-to-video synthesis but is also applicable to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing. 
As experiments show, our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.* You can find additional information about Text2Video-Zero on the [project page](https://text2video-zero.github.io/), [paper](https://huggingface.co/papers/2303.13439), and [original codebase](https://github.com/Picsart-AI-Research/Text2Video-Zero). ## Usage example ### Text-To-Video To generate a video from prompt, run the following Python code: ```python import torch from diffusers import TextToVideoZeroPipeline import imageio model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5" pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda") prompt = "A panda is playing guitar on times square" result = pipe(prompt=prompt).images result = [(r * 255).astype("uint8") for r in result] imageio.mimsave("video.mp4", result, fps=4) ``` You can change these parameters in the pipeline call: * Motion field strength (see the [paper](https://huggingface.co/papers/2303.13439), Sect. 3.3.1): * `motion_field_strength_x` and `motion_field_strength_y`. Default: `motion_field_strength_x=12`, `motion_field_strength_y=12` * `T` and `T'` (see the [paper](https://huggingface.co/papers/2303.13439), Sect. 3.3.1) * `t0` and `t1` in the range `{0, ..., num_inference_steps}`. Default: `t0=45`, `t1=48` * Video length: * `video_length`, the number of frames video_length to be generated. 
Default: `video_length=8`

We can also generate longer videos by doing the processing in a chunk-by-chunk manner:

```python
import torch
from diffusers import TextToVideoZeroPipeline
import imageio
import numpy as np

model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"
pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
seed = 0
video_length = 24  # 24 frames ÷ 4 fps = 6 seconds
chunk_size = 8
prompt = "A panda is playing guitar on times square"

# Generate the video chunk-by-chunk
result = []
chunk_ids = np.arange(0, video_length, chunk_size - 1)

generator = torch.Generator(device="cuda")
for i in range(len(chunk_ids)):
    print(f"Processing chunk {i + 1} / {len(chunk_ids)}")
    ch_start = chunk_ids[i]
    ch_end = video_length if i == len(chunk_ids) - 1 else chunk_ids[i + 1]
    # Attach the first frame for Cross Frame Attention
    frame_ids = [0] + list(range(ch_start, ch_end))
    # Fix the seed for the temporal consistency
    generator.manual_seed(seed)
    output = pipe(prompt=prompt, video_length=len(frame_ids), generator=generator, frame_ids=frame_ids)
    # Drop the prepended first frame before collecting the chunk
    result.append(output.images[1:])

# Concatenate chunks and save
result = np.concatenate(result)
result = [(r * 255).astype("uint8") for r in result]
imageio.mimsave("video.mp4", result, fps=4)
```

- #### SDXL Support
In order to use the SDXL model when generating a video from prompt, use the `TextToVideoZeroSDXLPipeline` pipeline:

```python
import torch
from diffusers import TextToVideoZeroSDXLPipeline

model_id = "stabilityai/stable-diffusion-xl-base-1.0"
pipe = TextToVideoZeroSDXLPipeline.from_pretrained(
    model_id, torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")
```

### Text-To-Video with Pose Control

To generate a video from prompt with additional pose control:

1. Download a demo video

    ```python
    from huggingface_hub import hf_hub_download

    filename = "__assets__/poses_skeleton_gifs/dance1_corr.mp4"
    repo_id = "PAIR/Text2Video-Zero"
    video_path = hf_hub_download(repo_type="space", repo_id=repo_id, filename=filename)
    ```

2. Read video containing extracted pose images

    ```python
    from PIL import Image
    import imageio

    reader = imageio.get_reader(video_path, "ffmpeg")
    frame_count = 8
    pose_images = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)]
    ```

    To extract pose from actual video, read [ControlNet documentation](controlnet).

3. Run `StableDiffusionControlNetPipeline` with our custom attention processor

    ```python
    import torch
    from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
    from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor

    model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"
    controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        model_id, controlnet=controlnet, torch_dtype=torch.float16
    ).to("cuda")

    # Set the attention processor
    pipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
    pipe.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))

    # fix latents for all frames
    latents = torch.randn((1, 4, 64, 64), device="cuda", dtype=torch.float16).repeat(len(pose_images), 1, 1, 1)

    prompt = "Darth Vader dancing in a desert"
    result = pipe(prompt=[prompt] * len(pose_images), image=pose_images, latents=latents).images
    imageio.mimsave("video.mp4", result, fps=4)
    ```

- #### SDXL Support

    Since our attention processor also works with SDXL, it can be utilized to generate a video from prompt using ControlNet models powered by SDXL:

    ```python
    import torch
    from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel
    from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor

    controlnet_model_id = 'thibaud/controlnet-openpose-sdxl-1.0'
    model_id = 'stabilityai/stable-diffusion-xl-base-1.0'

    controlnet = ControlNetModel.from_pretrained(controlnet_model_id, torch_dtype=torch.float16)
    # Use the SDXL ControlNet pipeline imported above
    pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
        model_id, controlnet=controlnet, torch_dtype=torch.float16
    ).to('cuda')

    # Set the attention processor
    pipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
    pipe.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))

    # fix latents for all frames
    latents = torch.randn((1, 4, 128, 128), device="cuda", dtype=torch.float16).repeat(len(pose_images), 1, 1, 1)

    prompt = "Darth Vader dancing in a desert"
    result = pipe(prompt=[prompt] * len(pose_images), image=pose_images, latents=latents).images
    imageio.mimsave("video.mp4", result, fps=4)
    ```

### Text-To-Video with Edge Control

To generate a video from prompt with additional Canny edge control, follow the same steps described above for pose-guided generation using [Canny edge ControlNet model](https://huggingface.co/lllyasviel/sd-controlnet-canny).

### Video Instruct-Pix2Pix

To perform text-guided video editing (with [InstructPix2Pix](pix2pix)):

1. Download a demo video

    ```python
    from huggingface_hub import hf_hub_download

    filename = "__assets__/pix2pix video/camel.mp4"
    repo_id = "PAIR/Text2Video-Zero"
    video_path = hf_hub_download(repo_type="space", repo_id=repo_id, filename=filename)
    ```

2. Read video from path

    ```python
    from PIL import Image
    import imageio

    reader = imageio.get_reader(video_path, "ffmpeg")
    frame_count = 8
    video = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)]
    ```

3.
Run `StableDiffusionInstructPix2PixPipeline` with our custom attention processor ```python import torch from diffusers import StableDiffusionInstructPix2PixPipeline from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor model_id = "timbrooks/instruct-pix2pix" pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda") pipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=3)) prompt = "make it Van Gogh Starry Night style" result = pipe(prompt=[prompt] * len(video), image=video).images imageio.mimsave("edited_video.mp4", result, fps=4) ``` ### DreamBooth specialization Methods **Text-To-Video**, **Text-To-Video with Pose Control** and **Text-To-Video with Edge Control** can run with custom [DreamBooth](../../training/dreambooth) models, as shown below for [Canny edge ControlNet model](https://huggingface.co/lllyasviel/sd-controlnet-canny) and [Avatar style DreamBooth](https://huggingface.co/PAIR/text2video-zero-controlnet-canny-avatar) model: 1. Download a demo video ```python from huggingface_hub import hf_hub_download filename = "__assets__/canny_videos_mp4/girl_turning.mp4" repo_id = "PAIR/Text2Video-Zero" video_path = hf_hub_download(repo_type="space", repo_id=repo_id, filename=filename) ``` 2. Read video from path ```python from PIL import Image import imageio reader = imageio.get_reader(video_path, "ffmpeg") frame_count = 8 canny_edges = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)] ``` 3. 
Run `StableDiffusionControlNetPipeline` with custom trained DreamBooth model ```python import torch from diffusers import StableDiffusionControlNetPipeline, ControlNetModel from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor # set model id to custom model model_id = "PAIR/text2video-zero-controlnet-canny-avatar" controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16) pipe = StableDiffusionControlNetPipeline.from_pretrained( model_id, controlnet=controlnet, torch_dtype=torch.float16 ).to("cuda") # Set the attention processor pipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2)) pipe.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2)) # fix latents for all frames latents = torch.randn((1, 4, 64, 64), device="cuda", dtype=torch.float16).repeat(len(canny_edges), 1, 1, 1) prompt = "oil painting of a beautiful girl avatar style" result = pipe(prompt=[prompt] * len(canny_edges), image=canny_edges, latents=latents).images imageio.mimsave("video.mp4", result, fps=4) ``` You can filter out some available DreamBooth-trained models with [this link](https://huggingface.co/models?search=dreambooth). > [!TIP] > Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. 
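
The chunk-by-chunk loop from the Text-To-Video section can be easier to follow once the index bookkeeping is separated from the model calls. The sketch below reproduces only that bookkeeping in plain Python. `plan_chunks` is a hypothetical helper name, not part of Diffusers, and it assumes the same convention as the loop above: frame `0` is prepended to every chunk for cross-frame attention and dropped again from each chunk's output.

```python
def plan_chunks(video_length: int, chunk_size: int) -> list[list[int]]:
    """Return the `frame_ids` list for each chunk.

    Every chunk starts with frame 0 (the anchor for cross-frame attention),
    so each chunk contributes at most `chunk_size - 1` new frames.
    """
    starts = list(range(0, video_length, chunk_size - 1))
    plans = []
    for i, start in enumerate(starts):
        # The last chunk runs to the end of the video
        end = video_length if i == len(starts) - 1 else starts[i + 1]
        plans.append([0] + list(range(start, end)))
    return plans


plans = plan_chunks(video_length=24, chunk_size=8)
for frame_ids in plans:
    print(len(frame_ids), frame_ids)

# Frames kept after dropping the prepended anchor from every chunk
kept = sum(len(frame_ids) - 1 for frame_ids in plans)
```

With `video_length=24` and `chunk_size=8`, this produces four chunks, and the kept frames add up to the full video length, so no frame is generated twice in the final result.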
## TextToVideoZeroPipeline
[[autodoc]] TextToVideoZeroPipeline
	- all
	- __call__

## TextToVideoZeroSDXLPipeline
[[autodoc]] TextToVideoZeroSDXLPipeline
	- all
	- __call__

## TextToVideoPipelineOutput
[[autodoc]] pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.TextToVideoPipelineOutput

================================================
FILE: docs/source/en/api/pipelines/unclip.md
================================================
> [!WARNING]
> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model.

# unCLIP

[Hierarchical Text-Conditional Image Generation with CLIP Latents](https://huggingface.co/papers/2204.06125) is by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen. The unCLIP model in 🤗 Diffusers comes from kakaobrain's [karlo](https://github.com/kakaobrain/karlo).

The abstract from the paper is:

*Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion.
We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.* You can find lucidrains' DALL-E 2 recreation at [lucidrains/DALLE2-pytorch](https://github.com/lucidrains/DALLE2-pytorch). > [!TIP] > Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. ## UnCLIPPipeline [[autodoc]] UnCLIPPipeline - all - __call__ ## UnCLIPImageVariationPipeline [[autodoc]] UnCLIPImageVariationPipeline - all - __call__ ## ImagePipelineOutput [[autodoc]] pipelines.ImagePipelineOutput ================================================ FILE: docs/source/en/api/pipelines/unidiffuser.md ================================================ > [!WARNING] > This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model. # UniDiffuser
The UniDiffuser model was proposed in [One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale](https://huggingface.co/papers/2303.06555) by Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, Jun Zhu. The abstract from the paper is: *This paper proposes a unified diffusion framework (dubbed UniDiffuser) to fit all distributions relevant to a set of multi-modal data in one model. Our key insight is -- learning diffusion models for marginal, conditional, and joint distributions can be unified as predicting the noise in the perturbed data, where the perturbation levels (i.e. timesteps) can be different for different modalities. Inspired by the unified view, UniDiffuser learns all distributions simultaneously with a minimal modification to the original diffusion model -- perturbs data in all modalities instead of a single modality, inputs individual timesteps in different modalities, and predicts the noise of all modalities instead of a single modality. UniDiffuser is parameterized by a transformer for diffusion models to handle input types of different modalities. Implemented on large-scale paired image-text data, UniDiffuser is able to perform image, text, text-to-image, image-to-text, and image-text pair generation by setting proper timesteps without additional overhead. In particular, UniDiffuser is able to produce perceptually realistic samples in all tasks and its quantitative results (e.g., the FID and CLIP score) are not only superior to existing general-purpose models but also comparable to the bespoken models (e.g., Stable Diffusion and DALL-E 2) in representative tasks (e.g., text-to-image generation).* You can find the original codebase at [thu-ml/unidiffuser](https://github.com/thu-ml/unidiffuser) and additional checkpoints at [thu-ml](https://huggingface.co/thu-ml). > [!WARNING] > There is currently an issue on PyTorch 1.X where the output images are all black or the pixel values become `NaNs`. 
This issue can be mitigated by switching to PyTorch 2.X.

This pipeline was contributed by [dg845](https://github.com/dg845). ❤️

## Usage Examples

Because the UniDiffuser model is trained to model the joint distribution of (image, text) pairs, it is capable of performing a diverse range of generation tasks:

### Unconditional Image and Text Generation

Unconditional generation (where we start from only latents sampled from a standard Gaussian prior) from a [`UniDiffuserPipeline`] will produce an (image, text) pair:

```python
import torch

from diffusers import UniDiffuserPipeline

device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Unconditional image and text generation. The generation task is automatically inferred.
sample = pipe(num_inference_steps=20, guidance_scale=8.0)
image = sample.images[0]
text = sample.text[0]
image.save("unidiffuser_joint_sample_image.png")
print(text)
```

This is also called "joint" generation in the UniDiffuser paper, since we are sampling from the joint image-text distribution.

Note that the generation task is inferred from the inputs used when calling the pipeline.
It is also possible to specify the unconditional generation task ("mode") manually with [`UniDiffuserPipeline.set_joint_mode`]:

```python
# Equivalent to the above.
pipe.set_joint_mode()
sample = pipe(num_inference_steps=20, guidance_scale=8.0)
```

When the mode is set manually, subsequent calls to the pipeline will use the set mode without attempting to infer the mode.
You can reset the mode with [`UniDiffuserPipeline.reset_mode`], after which the pipeline will once again infer the mode.
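
The interaction between inferred and manually set modes can be summarized with a small stand-in class. This is a simplified sketch for illustration, not the pipeline's actual code: `ModeSelector` and its methods are hypothetical names, and the choice to let a prompt take precedence when both a prompt and an image are passed is an assumption.

```python
from typing import Any, Optional


class ModeSelector:
    """Toy model of the mode handling described above (not Diffusers code)."""

    def __init__(self) -> None:
        # None means "infer the mode from the call inputs"
        self.mode: Optional[str] = None

    def set_mode(self, mode: str) -> None:
        """Pin the mode; later calls skip inference."""
        self.mode = mode

    def reset_mode(self) -> None:
        """Return to inferring the mode from the inputs."""
        self.mode = None

    def infer(self, prompt: Any = None, prompt_embeds: Any = None, image: Any = None) -> str:
        if self.mode is not None:
            return self.mode          # a manually set mode always wins
        if prompt is not None or prompt_embeds is not None:
            return "text2img"         # assumption: text input takes precedence
        if image is not None:
            return "img2text"
        return "joint"                # no conditioning input at all


selector = ModeSelector()
print(selector.infer())                        # no inputs
print(selector.infer(prompt="an elephant"))    # text conditioning
selector.set_mode("joint")
print(selector.infer(prompt="an elephant"))    # pinned mode is used instead
selector.reset_mode()
```

The point of the sketch is the precedence order: a pinned mode first, then whatever the inputs imply, with joint generation as the fallback.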
You can also generate only an image or only text (which the UniDiffuser paper calls "marginal" generation since we sample from the marginal distribution of images and text, respectively):

```python
# Unlike other generation tasks, image-only and text-only generation don't use classifier-free guidance
# Image-only generation
pipe.set_image_mode()
sample_image = pipe(num_inference_steps=20).images[0]
# Text-only generation
pipe.set_text_mode()
sample_text = pipe(num_inference_steps=20).text[0]
```

### Text-to-Image Generation

UniDiffuser is also capable of sampling from conditional distributions; that is, the distribution of images conditioned on a text prompt or the distribution of texts conditioned on an image.
Here is an example of sampling from the conditional image distribution (text-to-image generation or text-conditioned image generation):

```python
import torch

from diffusers import UniDiffuserPipeline

device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Text-to-image generation
prompt = "an elephant under the sea"

sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0)
t2i_image = sample.images[0]
t2i_image
```

The `text2img` mode requires that either an input `prompt` or `prompt_embeds` be supplied. You can set the `text2img` mode manually with [`UniDiffuserPipeline.set_text_to_image_mode`].
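The "either `prompt` or `prompt_embeds`" requirement amounts to an at-least-one-of check. Below is a minimal sketch of such a check, written as plain Python rather than taken from the pipeline: `CONDITIONING_INPUTS` and `has_required_inputs` are hypothetical names, and the `img2text` entry mirrors the requirement stated in the image-to-text section of this page.

```python
from typing import Any, Dict, Tuple

# For each conditional mode, at least one of the listed inputs must be given.
# Modes that are not listed are assumed to need no conditioning input.
CONDITIONING_INPUTS: Dict[str, Tuple[str, ...]] = {
    "text2img": ("prompt", "prompt_embeds"),
    "img2text": ("image",),
}


def has_required_inputs(mode: str, **inputs: Any) -> bool:
    """Return True if the inputs satisfy the mode's conditioning requirement."""
    accepted = CONDITIONING_INPUTS.get(mode, ())
    if not accepted:
        return True
    return any(inputs.get(name) is not None for name in accepted)


print(has_required_inputs("text2img", prompt="an elephant under the sea"))
print(has_required_inputs("text2img"))
```

A check like this is most useful when a mode has been set manually, since in that case the inputs no longer determine the mode and a missing prompt would otherwise go unnoticed until sampling.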
### Image-to-Text Generation

Similarly, UniDiffuser can also produce text samples given an image (image-to-text or image-conditioned text generation):

```python
import torch

from diffusers import UniDiffuserPipeline
from diffusers.utils import load_image

device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Image-to-text generation
image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
init_image = load_image(image_url).resize((512, 512))

sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
i2t_text = sample.text[0]
print(i2t_text)
```

The `img2text` mode requires that an input `image` be supplied. You can set the `img2text` mode manually with [`UniDiffuserPipeline.set_image_to_text_mode`].

### Image Variation

The UniDiffuser authors suggest performing image variation through a "round-trip" generation method, where given an input image, we first perform an image-to-text generation, and then perform a text-to-image generation on the outputs of the first generation.
This produces a new image which is semantically similar to the input image:

```python
import torch

from diffusers import UniDiffuserPipeline
from diffusers.utils import load_image

device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Image variation can be performed with an image-to-text generation followed by a text-to-image generation:
# 1. Image-to-text generation
image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
init_image = load_image(image_url).resize((512, 512))

sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
i2t_text = sample.text[0]
print(i2t_text)

# 2. Text-to-image generation
sample = pipe(prompt=i2t_text, num_inference_steps=20, guidance_scale=8.0)
final_image = sample.images[0]
final_image.save("unidiffuser_image_variation_sample.png")
```

### Text Variation

Similarly, text variation can be performed on an input prompt with a text-to-image generation followed by an image-to-text generation:

```python
import torch

from diffusers import UniDiffuserPipeline

device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Text variation can be performed with a text-to-image generation followed by an image-to-text generation:
# 1. Text-to-image generation
prompt = "an elephant under the sea"

sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0)
t2i_image = sample.images[0]
t2i_image.save("unidiffuser_text2img_sample_image.png")

# 2. Image-to-text generation
sample = pipe(image=t2i_image, num_inference_steps=20, guidance_scale=8.0)
final_prompt = sample.text[0]
print(final_prompt)
```

> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

## UniDiffuserPipeline
[[autodoc]] UniDiffuserPipeline
	- all
	- __call__

## ImageTextPipelineOutput
[[autodoc]] pipelines.ImageTextPipelineOutput

================================================
FILE: docs/source/en/api/pipelines/value_guided_sampling.md
================================================

# Value-guided planning

> [!WARNING]
> 🧪 This is an experimental pipeline for reinforcement learning!

This pipeline is based on the [Planning with Diffusion for Flexible Behavior Synthesis](https://huggingface.co/papers/2205.09991) paper by Michael Janner, Yilun Du, Joshua B. Tenenbaum, Sergey Levine.

The abstract from the paper is:

*Model-based reinforcement learning methods often use learning only for the purpose of estimating an approximate dynamics model, offloading the rest of the decision-making work to classical trajectory optimizers. While conceptually simple, this combination has a number of empirical shortcomings, suggesting that learned models may not be well-suited to standard trajectory optimization. In this paper, we consider what it would look like to fold as much of the trajectory optimization pipeline as possible into the modeling problem, such that sampling from the model and planning with it become nearly identical. The core of our technical approach lies in a diffusion probabilistic model that plans by iteratively denoising trajectories. We show how classifier-guided sampling and image inpainting can be reinterpreted as coherent planning strategies, explore the unusual and useful properties of diffusion-based planning methods, and demonstrate the effectiveness of our framework in control settings that emphasize long-horizon decision-making and test-time flexibility.*

You can find additional information about the model on the [project page](https://diffusion-planning.github.io/), the [original codebase](https://github.com/jannerm/diffuser), or try it out in a demo [notebook](https://colab.research.google.com/drive/1rXm8CX4ZdN5qivjJ2lhwhkOmt_m0CvU0#scrollTo=6HXJvhyqcITc&uniqifier=1).

The script to run the model is available [here](https://github.com/huggingface/diffusers/tree/main/examples/reinforcement_learning).

> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
## ValueGuidedRLPipeline [[autodoc]] diffusers.experimental.ValueGuidedRLPipeline ================================================ FILE: docs/source/en/api/pipelines/visualcloze.md ================================================ # VisualCloze [VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning](https://huggingface.co/papers/2504.07960) is an innovative in-context learning based universal image generation framework that offers key capabilities: 1. Support for various in-domain tasks 2. Generalization to unseen tasks through in-context learning 3. Unify multiple tasks into one step and generate both target image and intermediate results 4. Support reverse-engineering conditions from target images ## Overview The abstract from the paper is: *Recent progress in diffusion models significantly advances various image generation tasks. However, the current mainstream approach remains focused on building task-specific models, which have limited efficiency when supporting a wide range of different needs. While universal models attempt to address this limitation, they face critical challenges, including generalizable task instruction, appropriate task distributions, and unified architectural design. To tackle these challenges, we propose VisualCloze, a universal image generation framework, which supports a wide range of in-domain tasks, generalization to unseen ones, unseen unification of multiple tasks, and reverse generation. Unlike existing methods that rely on language-based task instruction, leading to task ambiguity and weak generalization, we integrate visual in-context learning, allowing models to identify tasks from visual demonstrations. Meanwhile, the inherent sparsity of visual task distributions hampers the learning of transferable knowledge across tasks. To this end, we introduce Graph200K, a graph-structured dataset that establishes various interrelated tasks, enhancing task density and transferable knowledge. 
Furthermore, we uncover that our unified image generation formulation shared a consistent objective with image infilling, enabling us to leverage the strong generative priors of pre-trained infilling models without modifying the architectures. The codes, dataset, and models are available at https://visualcloze.github.io.*

## Inference

### Model loading

VisualCloze is a two-stage cascade pipeline, containing `VisualClozeGenerationPipeline` and `VisualClozeUpsamplingPipeline`.
- In `VisualClozeGenerationPipeline`, each image is downsampled before concatenating images into a grid layout, avoiding excessively high resolutions. VisualCloze releases two models suitable for diffusers, i.e., [VisualClozePipeline-384](https://huggingface.co/VisualCloze/VisualClozePipeline-384) and [VisualClozePipeline-512](https://huggingface.co/VisualCloze/VisualClozePipeline-512), which downsample images to resolutions of 384 and 512, respectively.
- `VisualClozeUpsamplingPipeline` uses [SDEdit](https://huggingface.co/papers/2108.01073) to enable high-resolution image synthesis.

The `VisualClozePipeline` integrates both stages to support convenient end-to-end sampling, while also allowing users to utilize each pipeline independently as needed.
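A rough pixel count illustrates why the first stage works on downsampled images. The sketch below is only back-of-the-envelope arithmetic under stated assumptions: each grid cell is treated as a square of `side` x `side` pixels, the grid is 2 x 2 (one in-context row plus the query row, two images each), and the final resolution is taken to be three times the first-stage side length. `grid_pixels` is a hypothetical helper, not part of Diffusers.

```python
def grid_pixels(rows: int, cols: int, side: int) -> int:
    """Total pixels in a rows x cols grid whose cells are side x side squares."""
    return rows * cols * side * side


# First stage: every cell is downsampled to the model resolution (384 here).
low = grid_pixels(rows=2, cols=2, side=384)

# Hypothetical alternative: build the same grid directly at 3x the side length.
high = grid_pixels(rows=2, cols=2, side=384 * 3)

print(low)          # 589824
print(high)         # 5308416
print(high // low)  # 9
```

Under these assumptions the full-resolution grid would hold nine times as many pixels, which is the cost the cascade avoids by generating small first and upsampling only the target image afterwards.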
### Input Specifications

#### Task and Content Prompts

- Task prompt: Required to describe the generation task intention
- Content prompt: Optional description or caption of the target image
- When content prompt is not needed, pass `None`
- For batch inference, pass `List[str|None]`

#### Image Input Format

- Format: `List[List[Image|None]]`
- Structure:
  - All rows except the last represent in-context examples
  - Last row represents the current query (target image set to `None`)
- For batch inference, pass `List[List[List[Image|None]]]`

#### Resolution Control

- Default behavior:
  - Initial generation in the first stage: area of ${pipe.resolution}^2$
  - Upsampling in the second stage: 3x factor
- Custom resolution: Adjust using `upsampling_height` and `upsampling_width` parameters

### Examples

For comprehensive examples covering a wide range of tasks, please refer to the [Online Demo](https://huggingface.co/spaces/VisualCloze/VisualCloze) and [GitHub Repository](https://github.com/lzyhha/VisualCloze).

Below are simple examples for three cases: mask-to-image conversion, edge detection, and subject-driven generation.
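Before turning to those examples, the layout rules from the Image Input Format section can be sanity-checked with a few lines of plain Python. This is a minimal sketch, not a Diffusers utility: `check_grid` is a hypothetical helper, the equal-row-length rule is an assumption that follows from the grid layout, and strings stand in for PIL images so the snippet runs without any files.

```python
from typing import Optional, Sequence


def check_grid(grid: Sequence[Sequence[Optional[object]]]) -> None:
    """Raise ValueError if `grid` does not follow the row layout described above."""
    if not grid:
        raise ValueError("grid must contain at least the query row")
    width = len(grid[0])
    # Assumption of this sketch: every row holds the same number of images.
    if any(len(row) != width for row in grid):
        raise ValueError("all rows must have the same length")
    # In-context rows are complete demonstrations, so no cell may be missing.
    for index, row in enumerate(grid[:-1]):
        if any(cell is None for cell in row):
            raise ValueError(f"in-context row {index} must not contain None")
    # The query row marks the image to be generated with None.
    if not any(cell is None for cell in grid[-1]):
        raise ValueError("query row must mark the target image with None")


grid = [
    ["example_mask", "example_photo"],  # in-context example
    ["query_mask", None],               # query row, target left as None
]
check_grid(grid)  # passes silently for a well-formed grid
```

For batch inference the same check would be applied to each element of the outer list, since a batch is a list of such grids.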
#### Example for mask2image ```python import torch from diffusers import VisualClozePipeline from diffusers.utils import load_image pipe = VisualClozePipeline.from_pretrained("VisualCloze/VisualClozePipeline-384", resolution=384, torch_dtype=torch.bfloat16) pipe.to("cuda") # Load in-context images (make sure the paths are correct and accessible) image_paths = [ # in-context examples [ load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_incontext-example-1_mask.jpg'), load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_incontext-example-1_image.jpg'), ], # query with the target image [ load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_query_mask.jpg'), None, # No image needed for the target image ], ] # Task and content prompt task_prompt = "In each row, a logical task is demonstrated to achieve [IMAGE2] an aesthetically pleasing photograph based on [IMAGE1] sam 2-generated masks with rich color coding." content_prompt = """Majestic photo of a golden eagle perched on a rocky outcrop in a mountainous landscape. The eagle is positioned in the right foreground, facing left, with its sharp beak and keen eyes prominently visible. Its plumage is a mix of dark brown and golden hues, with intricate feather details. The background features a soft-focus view of snow-capped mountains under a cloudy sky, creating a serene and grandiose atmosphere. The foreground includes rugged rocks and patches of green moss. 
Photorealistic, medium depth of field, soft natural lighting, cool color palette, high contrast, sharp focus on the eagle, blurred background, tranquil, majestic, wildlife photography.""" # Run the pipeline image_result = pipe( task_prompt=task_prompt, content_prompt=content_prompt, image=image_paths, upsampling_width=1344, upsampling_height=768, upsampling_strength=0.4, guidance_scale=30, num_inference_steps=30, max_sequence_length=512, generator=torch.Generator("cpu").manual_seed(0) ).images[0][0] # Save the resulting image image_result.save("visualcloze.png") ``` #### Example for edge-detection ```python import torch from diffusers import VisualClozePipeline from diffusers.utils import load_image pipe = VisualClozePipeline.from_pretrained("VisualCloze/VisualClozePipeline-384", resolution=384, torch_dtype=torch.bfloat16) pipe.to("cuda") # Load in-context images (make sure the paths are correct and accessible) image_paths = [ # in-context examples [ load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_edgedetection_incontext-example-1_image.jpg'), load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_edgedetection_incontext-example-1_edge.jpg'), ], [ load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_edgedetection_incontext-example-2_image.jpg'), load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_edgedetection_incontext-example-2_edge.jpg'), ], # query with the target image [ load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_edgedetection_query_image.jpg'), None, # No image needed for the target image ], ] # Task and content prompt task_prompt = "Each row illustrates a pathway from [IMAGE1] a sharp and 
beautifully composed photograph to [IMAGE2] edge map with natural well-connected outlines using a clear logical task." content_prompt = "" # Run the pipeline image_result = pipe( task_prompt=task_prompt, content_prompt=content_prompt, image=image_paths, upsampling_width=864, upsampling_height=1152, upsampling_strength=0.4, guidance_scale=30, num_inference_steps=30, max_sequence_length=512, generator=torch.Generator("cpu").manual_seed(0) ).images[0][0] # Save the resulting image image_result.save("visualcloze.png") ``` #### Example for subject-driven generation ```python import torch from diffusers import VisualClozePipeline from diffusers.utils import load_image pipe = VisualClozePipeline.from_pretrained("VisualCloze/VisualClozePipeline-384", resolution=384, torch_dtype=torch.bfloat16) pipe.to("cuda") # Load in-context images (make sure the paths are correct and accessible) image_paths = [ # in-context examples [ load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_incontext-example-1_reference.jpg'), load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_incontext-example-1_depth.jpg'), load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_incontext-example-1_image.jpg'), ], [ load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_incontext-example-2_reference.jpg'), load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_incontext-example-2_depth.jpg'), load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_incontext-example-2_image.jpg'), ], # query with the target image 
[ load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_query_reference.jpg'), load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_query_depth.jpg'), None, # No image needed for the target image ], ] # Task and content prompt task_prompt = """Each row describes a process that begins with [IMAGE1] an image containing the key object, [IMAGE2] depth map revealing gray-toned spatial layers and results in [IMAGE3] an image with artistic qualitya high-quality image with exceptional detail.""" content_prompt = """A vintage porcelain collector's item. Beneath a blossoming cherry tree in early spring, this treasure is photographed up close, with soft pink petals drifting through the air and vibrant blossoms framing the scene.""" # Run the pipeline image_result = pipe( task_prompt=task_prompt, content_prompt=content_prompt, image=image_paths, upsampling_width=1024, upsampling_height=1024, upsampling_strength=0.2, guidance_scale=30, num_inference_steps=30, max_sequence_length=512, generator=torch.Generator("cpu").manual_seed(0) ).images[0][0] # Save the resulting image image_result.save("visualcloze.png") ``` #### Utilize each pipeline independently ```python import torch from diffusers import VisualClozeGenerationPipeline, FluxFillPipeline as VisualClozeUpsamplingPipeline from diffusers.utils import load_image from PIL import Image pipe = VisualClozeGenerationPipeline.from_pretrained( "VisualCloze/VisualClozePipeline-384", resolution=384, torch_dtype=torch.bfloat16 ) pipe.to("cuda") image_paths = [ # in-context examples [ load_image( "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_incontext-example-1_mask.jpg" ), load_image( 
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_incontext-example-1_image.jpg" ), ], # query with the target image [ load_image( "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_query_mask.jpg" ), None, # No image needed for the target image ], ] task_prompt = "In each row, a logical task is demonstrated to achieve [IMAGE2] an aesthetically pleasing photograph based on [IMAGE1] sam 2-generated masks with rich color coding." content_prompt = "Majestic photo of a golden eagle perched on a rocky outcrop in a mountainous landscape. The eagle is positioned in the right foreground, facing left, with its sharp beak and keen eyes prominently visible. Its plumage is a mix of dark brown and golden hues, with intricate feather details. The background features a soft-focus view of snow-capped mountains under a cloudy sky, creating a serene and grandiose atmosphere. The foreground includes rugged rocks and patches of green moss. Photorealistic, medium depth of field, soft natural lighting, cool color palette, high contrast, sharp focus on the eagle, blurred background, tranquil, majestic, wildlife photography." 
# Stage 1: Generate initial image image = pipe( task_prompt=task_prompt, content_prompt=content_prompt, image=image_paths, guidance_scale=30, num_inference_steps=30, max_sequence_length=512, generator=torch.Generator("cpu").manual_seed(0), ).images[0][0] # Stage 2 (optional): Upsample the generated image pipe_upsample = VisualClozeUpsamplingPipeline.from_pipe(pipe) pipe_upsample.to("cuda") mask_image = Image.new("RGB", image.size, (255, 255, 255)) image = pipe_upsample( image=image, mask_image=mask_image, prompt=content_prompt, width=1344, height=768, strength=0.4, guidance_scale=30, num_inference_steps=30, max_sequence_length=512, generator=torch.Generator("cpu").manual_seed(0), ).images[0] image.save("visualcloze.png") ``` ## VisualClozePipeline [[autodoc]] VisualClozePipeline - all - __call__ ## VisualClozeGenerationPipeline [[autodoc]] VisualClozeGenerationPipeline - all - __call__ ================================================ FILE: docs/source/en/api/pipelines/wan.md ================================================ # Wan [Wan-2.1](https://huggingface.co/papers/2503.20314) by the Wan Team. *This report presents Wan, a comprehensive and open suite of video foundation models designed to push the boundaries of video generation. Built upon the mainstream diffusion transformer paradigm, Wan achieves significant advancements in generative capabilities through a series of innovations, including our novel VAE, scalable pre-training strategies, large-scale data curation, and automated evaluation metrics. These contributions collectively enhance the model's performance and versatility. Specifically, Wan is characterized by four key features: Leading Performance: The 14B model of Wan, trained on a vast dataset comprising billions of images and videos, demonstrates the scaling laws of video generation with respect to both data and model size. 
It consistently outperforms the existing open-source models as well as state-of-the-art commercial solutions across multiple internal and external benchmarks, demonstrating a clear and significant performance superiority. Comprehensiveness: Wan offers two capable models, i.e., 1.3B and 14B parameters, for efficiency and effectiveness respectively. It also covers multiple downstream applications, including image-to-video, instruction-guided video editing, and personal video generation, encompassing up to eight tasks. Consumer-Grade Efficiency: The 1.3B model demonstrates exceptional resource efficiency, requiring only 8.19 GB VRAM, making it compatible with a wide range of consumer-grade GPUs. Openness: We open-source the entire series of Wan, including source code and all models, with the goal of fostering the growth of the video generation community. This openness seeks to significantly expand the creative possibilities of video production in the industry and provide academia with high-quality video foundation models. All the code and models are available at [this https URL](https://github.com/Wan-Video/Wan2.1).* You can find all the original Wan2.1 checkpoints under the [Wan-AI](https://huggingface.co/Wan-AI) organization. 
The following Wan models are supported in Diffusers:

- [Wan 2.1 T2V 1.3B](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B-Diffusers)
- [Wan 2.1 T2V 14B](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B-Diffusers)
- [Wan 2.1 I2V 14B - 480P](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P-Diffusers)
- [Wan 2.1 I2V 14B - 720P](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-720P-Diffusers)
- [Wan 2.1 FLF2V 14B - 720P](https://huggingface.co/Wan-AI/Wan2.1-FLF2V-14B-720P-diffusers)
- [Wan 2.1 VACE 1.3B](https://huggingface.co/Wan-AI/Wan2.1-VACE-1.3B-diffusers)
- [Wan 2.1 VACE 14B](https://huggingface.co/Wan-AI/Wan2.1-VACE-14B-diffusers)
- [Wan 2.2 T2V 14B](https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B-Diffusers)
- [Wan 2.2 I2V 14B](https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B-Diffusers)
- [Wan 2.2 TI2V 5B](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B-Diffusers)
- [Wan 2.2 Animate 14B](https://huggingface.co/Wan-AI/Wan2.2-Animate-14B-Diffusers)

> [!TIP]
> Click on the Wan models in the right sidebar for more examples of video generation.

### Text-to-Video Generation

The example below demonstrates how to generate a video from text optimized for memory or inference speed.

Refer to the [Reduce memory usage](../../optimization/memory) guide for more details about the various memory saving techniques.

The Wan2.1 text-to-video model below requires ~13GB of VRAM.
```py # pip install ftfy import torch import numpy as np from diffusers import AutoModel, WanPipeline from diffusers.quantizers import PipelineQuantizationConfig from diffusers.hooks.group_offloading import apply_group_offloading from diffusers.utils import export_to_video, load_image from transformers import UMT5EncoderModel text_encoder = UMT5EncoderModel.from_pretrained("Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="text_encoder", torch_dtype=torch.bfloat16) vae = AutoModel.from_pretrained("Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="vae", torch_dtype=torch.float32) transformer = AutoModel.from_pretrained("Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16) # group-offloading onload_device = torch.device("cuda") offload_device = torch.device("cpu") apply_group_offloading(text_encoder, onload_device=onload_device, offload_device=offload_device, offload_type="block_level", num_blocks_per_group=4 ) transformer.enable_group_offload( onload_device=onload_device, offload_device=offload_device, offload_type="leaf_level", use_stream=True ) pipeline = WanPipeline.from_pretrained( "Wan-AI/Wan2.1-T2V-14B-Diffusers", vae=vae, transformer=transformer, text_encoder=text_encoder, torch_dtype=torch.bfloat16 ) pipeline.to("cuda") prompt = """ The camera rushes from far to near in a low-angle shot, revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground. Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts dynamic shadows and warm highlights. Medium composition, front view, low angle, with depth of field. 
""" negative_prompt = """ Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards """ output = pipeline( prompt=prompt, negative_prompt=negative_prompt, num_frames=81, guidance_scale=5.0, ).frames[0] export_to_video(output, "output.mp4", fps=16) ``` [Compilation](../../optimization/fp16#torchcompile) is slow the first time but subsequent calls to the pipeline are faster. [Caching](../../optimization/cache) may also speed up inference by storing and reusing intermediate outputs. ```py # pip install ftfy import torch import numpy as np from diffusers import AutoModel, WanPipeline from diffusers.hooks.group_offloading import apply_group_offloading from diffusers.utils import export_to_video, load_image from transformers import UMT5EncoderModel text_encoder = UMT5EncoderModel.from_pretrained("Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="text_encoder", torch_dtype=torch.bfloat16) vae = AutoModel.from_pretrained("Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="vae", torch_dtype=torch.float32) transformer = AutoModel.from_pretrained("Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16) pipeline = WanPipeline.from_pretrained( "Wan-AI/Wan2.1-T2V-14B-Diffusers", vae=vae, transformer=transformer, text_encoder=text_encoder, torch_dtype=torch.bfloat16 ) pipeline.to("cuda") # torch.compile pipeline.transformer.to(memory_format=torch.channels_last) pipeline.transformer = torch.compile( pipeline.transformer, mode="max-autotune", fullgraph=True ) prompt = """ The camera rushes from far to near in a low-angle shot, revealing a white ferret on a log. 
It plays, leaps into the water, and emerges, as the camera zooms in for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground. Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts dynamic shadows and warm highlights. Medium composition, front view, low angle, with depth of field. """ negative_prompt = """ Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards """ output = pipeline( prompt=prompt, negative_prompt=negative_prompt, num_frames=81, guidance_scale=5.0, ).frames[0] export_to_video(output, "output.mp4", fps=16) ``` ### First-Last-Frame-to-Video Generation The example below demonstrates how to use the image-to-video pipeline to generate a video using a text description, a starting frame, and an ending frame. 
```python
import numpy as np
import torch
import torchvision.transforms.functional as TF
from diffusers import AutoencoderKLWan, WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image
from transformers import CLIPVisionModel


model_id = "Wan-AI/Wan2.1-FLF2V-14B-720P-diffusers"
image_encoder = CLIPVisionModel.from_pretrained(model_id, subfolder="image_encoder", torch_dtype=torch.float32)
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanImageToVideoPipeline.from_pretrained(
    model_id, vae=vae, image_encoder=image_encoder, torch_dtype=torch.bfloat16
)
pipe.to("cuda")

first_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png")
last_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png")

def aspect_ratio_resize(image, pipe, max_area=720 * 1280):
    # Pick the largest size within max_area that keeps the aspect ratio and is
    # divisible by the VAE scale factor times the transformer patch size
    aspect_ratio = image.height / image.width
    mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
    height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
    width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
    image = image.resize((width, height))
    return image, height, width

def center_crop_resize(image, height, width):
    # Calculate resize ratio so the image fully covers the first frame dimensions
    resize_ratio = max(width / image.width, height / image.height)

    # Resize the image, preserving its aspect ratio
    resized_width = round(image.width * resize_ratio)
    resized_height = round(image.height * resize_ratio)
    image = image.resize((resized_width, resized_height))

    # Center-crop to the target size; torchvision expects [height, width]
    image = TF.center_crop(image, [height, width])

    return image, height, width

first_frame, height, width = aspect_ratio_resize(first_frame, pipe)
if last_frame.size != first_frame.size:
    last_frame, _, _ = center_crop_resize(last_frame, height, width)

prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."

output = pipe(
    image=first_frame, last_image=last_frame, prompt=prompt, height=height, width=width, guidance_scale=5.5
).frames[0]
export_to_video(output, "output.mp4", fps=16)
```

### Any-to-Video Controllable Generation

Wan VACE supports various generation techniques which achieve controllable video generation. Some of the capabilities include:
- Control to Video (Depth, Pose, Sketch, Flow, Grayscale, Scribble, Layout, Bounding Box, etc.). Recommended library for preprocessing videos to obtain control videos: [huggingface/controlnet_aux]()
- Image/Video to Video (first frame, last frame, starting clip, ending clip, random clips)
- Inpainting and Outpainting
- Subject to Video (faces, object, characters, etc.)
- Composition to Video (reference anything, animate anything, swap anything, expand anything, move anything, etc.)

The code snippets available in [this](https://github.com/huggingface/diffusers/pull/11582) pull request demonstrate some examples of how videos can be generated with controllability signals.

The general rule of thumb to keep in mind when preparing inputs for the VACE pipeline is that the input images, or frames of a video that you want to use for conditioning, should have a corresponding mask that is black in color. The black mask signifies that the model will not generate new content for that area, and only use those parts for conditioning the generation process. For parts/frames that should be generated by the model, the mask should be white in color.

### Wan-Animate: Unified Character Animation and Replacement with Holistic Replication

[Wan-Animate](https://huggingface.co/papers/2509.14055) by the Wan Team.
*We introduce Wan-Animate, a unified framework for character animation and replacement. Given a character image and a reference video, Wan-Animate can animate the character by precisely replicating the expressions and movements of the character in the video to generate high-fidelity character videos. Alternatively, it can integrate the animated character into the reference video to replace the original character, replicating the scene's lighting and color tone to achieve seamless environmental integration. Wan-Animate is built upon the Wan model. To adapt it for character animation tasks, we employ a modified input paradigm to differentiate between reference conditions and regions for generation. This design unifies multiple tasks into a common symbolic representation. We use spatially-aligned skeleton signals to replicate body motion and implicit facial features extracted from source images to reenact expressions, enabling the generation of character videos with high controllability and expressiveness. Furthermore, to enhance environmental integration during character replacement, we develop an auxiliary Relighting LoRA. This module preserves the character's appearance consistency while applying the appropriate environmental lighting and color tone. Experimental results demonstrate that Wan-Animate achieves state-of-the-art performance. We are committed to open-sourcing the model weights and its source code.* The project page: https://humanaigc.github.io/wan-animate This model was mostly contributed by [M. Tolga Cangöz](https://github.com/tolgacangoz). #### Usage The Wan-Animate pipeline supports two modes of operation: 1. **Animation Mode** (default): Animates a character image based on motion and expression from reference videos 2. 
**Replacement Mode**: Replaces a character in a background video with a new character while preserving the scene ##### Prerequisites Before using the pipeline, you need to preprocess your reference video to extract: - **Pose video**: Contains skeletal keypoints representing body motion - **Face video**: Contains facial feature representations for expression control For replacement mode, you additionally need: - **Background video**: The original video containing the scene - **Mask video**: A mask indicating where to generate content (white) vs. preserve original (black) > [!NOTE] > Raw videos should not be used for inputs such as `pose_video`, which the pipeline expects to be preprocessed to extract the proper information. Preprocessing scripts to prepare these inputs are available in the [original Wan-Animate repository](https://github.com/Wan-Video/Wan2.2?tab=readme-ov-file#1-preprocessing). Integration of these preprocessing steps into Diffusers is planned for a future release. The example below demonstrates how to use the Wan-Animate pipeline: ```python import numpy as np import torch from diffusers import AutoencoderKLWan, WanAnimatePipeline from diffusers.utils import export_to_video, load_image, load_video model_id = "Wan-AI/Wan2.2-Animate-14B-Diffusers" vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32) pipe = WanAnimatePipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16) pipe.to("cuda") # Load character image and preprocessed videos image = load_image("path/to/character.jpg") pose_video = load_video("path/to/pose_video.mp4") # Preprocessed skeletal keypoints face_video = load_video("path/to/face_video.mp4") # Preprocessed facial features # Resize image to match VAE constraints def aspect_ratio_resize(image, pipe, max_area=720 * 1280): aspect_ratio = image.height / image.width mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1] height = round(np.sqrt(max_area * 
aspect_ratio)) // mod_value * mod_value width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value image = image.resize((width, height)) return image, height, width image, height, width = aspect_ratio_resize(image, pipe) prompt = "A person dancing energetically in a studio with dynamic lighting and professional camera work" negative_prompt = "blurry, low quality, distorted, deformed, static, poorly drawn" # Generate animated video output = pipe( image=image, pose_video=pose_video, face_video=face_video, prompt=prompt, negative_prompt=negative_prompt, height=height, width=width, segment_frame_length=77, guidance_scale=1.0, mode="animate", # Animation mode (default) ).frames[0] export_to_video(output, "animated_character.mp4", fps=30) ``` ```python import numpy as np import torch from diffusers import AutoencoderKLWan, WanAnimatePipeline from diffusers.utils import export_to_video, load_image, load_video model_id = "Wan-AI/Wan2.2-Animate-14B-Diffusers" vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32) pipe = WanAnimatePipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16) pipe.to("cuda") # Load all required inputs for replacement mode image = load_image("path/to/new_character.jpg") pose_video = load_video("path/to/pose_video.mp4") # Preprocessed skeletal keypoints face_video = load_video("path/to/face_video.mp4") # Preprocessed facial features background_video = load_video("path/to/background_video.mp4") # Original scene mask_video = load_video("path/to/mask_video.mp4") # Black: preserve, White: generate # Resize image to match video dimensions def aspect_ratio_resize(image, pipe, max_area=720 * 1280): aspect_ratio = image.height / image.width mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1] height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value image = 
image.resize((width, height)) return image, height, width image, height, width = aspect_ratio_resize(image, pipe) prompt = "A person seamlessly integrated into the scene with consistent lighting and environment" negative_prompt = "blurry, low quality, inconsistent lighting, floating, disconnected from scene" # Replace character in background video output = pipe( image=image, pose_video=pose_video, face_video=face_video, background_video=background_video, mask_video=mask_video, prompt=prompt, negative_prompt=negative_prompt, height=height, width=width, segment_frame_length=77, guidance_scale=1.0, mode="replace", # Replacement mode ).frames[0] export_to_video(output, "character_replaced.mp4", fps=30) ``` ```python import numpy as np import torch from diffusers import AutoencoderKLWan, WanAnimatePipeline from diffusers.utils import export_to_video, load_image, load_video model_id = "Wan-AI/Wan2.2-Animate-14B-Diffusers" vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32) pipe = WanAnimatePipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16) pipe.to("cuda") image = load_image("path/to/character.jpg") pose_video = load_video("path/to/pose_video.mp4") face_video = load_video("path/to/face_video.mp4") def aspect_ratio_resize(image, pipe, max_area=720 * 1280): aspect_ratio = image.height / image.width mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1] height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value image = image.resize((width, height)) return image, height, width image, height, width = aspect_ratio_resize(image, pipe) prompt = "A person dancing energetically in a studio" negative_prompt = "blurry, low quality" # Advanced: Use temporal guidance and custom callback def callback_fn(pipe, step_index, timestep, callback_kwargs): # You can modify latents or other tensors here print(f"Step
{step_index}, Timestep {timestep}") return callback_kwargs output = pipe( image=image, pose_video=pose_video, face_video=face_video, prompt=prompt, negative_prompt=negative_prompt, height=height, width=width, segment_frame_length=77, num_inference_steps=50, guidance_scale=5.0, prev_segment_conditioning_frames=5, # Use 5 frames for temporal guidance (1 or 5 recommended) callback_on_step_end=callback_fn, callback_on_step_end_tensor_inputs=["latents"], ).frames[0] export_to_video(output, "animated_advanced.mp4", fps=30) ``` #### Key Parameters - **mode**: Choose between `"animate"` (default) or `"replace"` - **prev_segment_conditioning_frames**: Number of frames for temporal guidance (1 or 5 recommended). Using 5 provides better temporal consistency but requires more memory - **guidance_scale**: Controls how closely the output follows the text prompt. Higher values (5-7) produce results more aligned with the prompt. For Wan-Animate, CFG is disabled by default (`guidance_scale=1.0`) but can be enabled to support negative prompts and finer control over facial expressions. (Note that CFG will only target the text prompt and face conditioning.) ## Notes - Wan2.1 supports LoRAs with [`~loaders.WanLoraLoaderMixin.load_lora_weights`].

  ```py
  # pip install ftfy
  import torch
  from diffusers import AutoModel, WanPipeline
  from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
  from diffusers.utils import export_to_video

  vae = AutoModel.from_pretrained(
      "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", subfolder="vae", torch_dtype=torch.float32
  )
  pipeline = WanPipeline.from_pretrained(
      "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", vae=vae, torch_dtype=torch.bfloat16
  )
  pipeline.scheduler = UniPCMultistepScheduler.from_config(
      pipeline.scheduler.config, flow_shift=5.0
  )
  pipeline.to("cuda")

  pipeline.load_lora_weights("benjamin-paine/steamboat-willie-1.3b", adapter_name="steamboat-willie")
  pipeline.set_adapters("steamboat-willie")

  pipeline.enable_model_cpu_offload()

  # use "steamboat willie style" to trigger the LoRA
  prompt = """
  steamboat willie style, golden era animation, The camera rushes from far to near in a low-angle shot,
  revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in
  for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground.
  Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts dynamic
  shadows and warm highlights. Medium composition, front view, low angle, with depth of field.
  """

  output = pipeline(
      prompt=prompt,
      num_frames=81,
      guidance_scale=5.0,
  ).frames[0]
  export_to_video(output, "output.mp4", fps=16)
  ```

- [`WanTransformer3DModel`] and [`AutoencoderKLWan`] support loading from single files with [`~loaders.FromSingleFileMixin.from_single_file`].

  ```py
  # pip install ftfy
  import torch
  from diffusers import WanPipeline, WanTransformer3DModel, AutoencoderKLWan

  vae = AutoencoderKLWan.from_single_file(
      "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/blob/main/split_files/vae/wan_2.1_vae.safetensors"
  )
  transformer = WanTransformer3DModel.from_single_file(
      "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/blob/main/split_files/diffusion_models/wan2.1_t2v_1.3B_bf16.safetensors",
      torch_dtype=torch.bfloat16
  )
  pipeline = WanPipeline.from_pretrained(
      "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
      vae=vae,
      transformer=transformer,
      torch_dtype=torch.bfloat16
  )
  ```

- Set the [`AutoencoderKLWan`] dtype to `torch.float32` for better decoding quality.
- The number of frames (`num_frames`) should be of the form `4 * k + 1` for an integer `k`, for example `81` for `k = 20`.
- Try lower `shift` values (`2.0` to `5.0`) for lower resolution videos and higher `shift` values (`7.0` to `12.0`) for higher resolution images.
- Wan 2.1 and 2.2 support using [LightX2V LoRAs](https://huggingface.co/Kijai/WanVideo_comfy/tree/main/Lightx2v) to speed up inference. Using them on Wan 2.2 is slightly more involved. Refer to [this code snippet](https://github.com/huggingface/diffusers/pull/12040#issuecomment-3144185272) to learn more.
- Wan 2.2 has two denoisers. By default, LoRAs are only loaded into the first denoiser. Set `load_into_transformer_2=True` to load LoRAs into the second denoiser. Refer to [this example](https://github.com/huggingface/diffusers/pull/12074#issue-3292620048) and [this one](https://github.com/huggingface/diffusers/pull/12074#issuecomment-3155896144) to learn more.

## WanPipeline

[[autodoc]] WanPipeline
  - all
  - __call__

## WanImageToVideoPipeline

[[autodoc]] WanImageToVideoPipeline
  - all
  - __call__

## WanVACEPipeline

[[autodoc]] WanVACEPipeline
  - all
  - __call__

## WanVideoToVideoPipeline

[[autodoc]] WanVideoToVideoPipeline
  - all
  - __call__

## WanAnimatePipeline

[[autodoc]] WanAnimatePipeline
  - all
  - __call__

## WanPipelineOutput

[[autodoc]] pipelines.wan.pipeline_output.WanPipelineOutput

================================================
FILE: docs/source/en/api/pipelines/wuerstchen.md
================================================
# Würstchen

> [!WARNING]
> This pipeline is deprecated but it can still be used. However, we won't test the pipeline anymore and won't accept any changes to it. If you run into any issues, reinstall the last Diffusers version that supported this model.
[Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models](https://huggingface.co/papers/2306.00637) is by Pablo Pernias, Dominic Rampas, Mats L. Richter, Christopher Pal, and Marc Aubreville.

The abstract from the paper is:

*We introduce Würstchen, a novel architecture for text-to-image synthesis that combines competitive performance with unprecedented cost-effectiveness for large-scale text-to-image diffusion models. A key contribution of our work is to develop a latent diffusion technique in which we learn a detailed but extremely compact semantic image representation used to guide the diffusion process. This highly compressed representation of an image provides much more detailed guidance compared to latent representations of language and this significantly reduces the computational requirements to achieve state-of-the-art results. Our approach also improves the quality of text-conditioned image generation based on our user preference study. The training requirements of our approach consists of 24,602 A100-GPU hours - compared to Stable Diffusion 2.1's 200,000 GPU hours. Our approach also requires less training data to achieve these results. Furthermore, our compact latent representations allows us to perform inference over twice as fast, slashing the usual costs and carbon footprint of a state-of-the-art (SOTA) diffusion model significantly, without compromising the end performance. In a broader comparison against SOTA models our approach is substantially more efficient and compares favorably in terms of image quality. We believe that this work motivates more emphasis on the prioritization of both performance and computational accessibility.*

## Würstchen Overview

Würstchen is a diffusion model whose text-conditional model works in a highly compressed latent space of images. Why is this important? Compressing data can reduce computational costs for both training and inference by orders of magnitude.
Training on 1024x1024 images is far more expensive than training on 32x32. Usually, other works make use of a relatively small compression, in the range of 4x - 8x spatial compression. Würstchen takes this to an extreme. Through its novel design, we achieve a 42x spatial compression. This was unseen before because common methods fail to faithfully reconstruct detailed images after 16x spatial compression. Würstchen employs a two-stage compression, which we call Stage A and Stage B. Stage A is a VQGAN, and Stage B is a Diffusion Autoencoder (more details can be found in the [paper](https://huggingface.co/papers/2306.00637)). A third model, Stage C, is learned in that highly compressed latent space. This training requires fractions of the compute used for current top-performing models, while also allowing cheaper and faster inference.

## Würstchen v2 comes to Diffusers

After the initial paper release, we have improved numerous things in the architecture, training and sampling, making Würstchen competitive with current state-of-the-art models in many ways. We are excited to release this new version together with Diffusers. Here is a list of the improvements.

- Higher resolution (1024x1024 up to 2048x2048)
- Faster inference
- Multi Aspect Resolution Sampling
- Better quality

We are releasing 3 checkpoints for the text-conditional image generation model (Stage C). Those are:

- v2-base
- v2-aesthetic
- **(default)** v2-interpolated (50% interpolation between v2-base and v2-aesthetic)

We recommend using v2-interpolated, as it has a nice touch of both photorealism and aesthetics. Use v2-base for finetunings as it does not have a style bias and use v2-aesthetic for very artistic generations.

## Text-to-Image Generation

For the sake of usability, Würstchen can be used with a single pipeline.
This pipeline can be used as follows: ```python import torch from diffusers import AutoPipelineForText2Image from diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS pipe = AutoPipelineForText2Image.from_pretrained("warp-ai/wuerstchen", torch_dtype=torch.float16).to("cuda") caption = "Anthropomorphic cat dressed as a fire fighter" images = pipe( caption, width=1024, height=1536, prior_timesteps=DEFAULT_STAGE_C_TIMESTEPS, prior_guidance_scale=4.0, num_images_per_prompt=2, ).images ``` For explanation purposes, we can also initialize the two main pipelines of Würstchen individually. Würstchen consists of 3 stages: Stage C, Stage B, Stage A. They all have different jobs and work only together. When generating text-conditional images, Stage C will first generate the latents in a very compressed latent space. This is what happens in the `prior_pipeline`. Afterwards, the generated latents will be passed to Stage B, which decompresses the latents into a bigger latent space of a VQGAN. These latents can then be decoded by Stage A, which is a VQGAN, into the pixel-space. Stage B & Stage A are both encapsulated in the `decoder_pipeline`. For more details, take a look at the [paper](https://huggingface.co/papers/2306.00637). 
```python import torch from diffusers import WuerstchenDecoderPipeline, WuerstchenPriorPipeline from diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS device = "cuda" dtype = torch.float16 num_images_per_prompt = 2 prior_pipeline = WuerstchenPriorPipeline.from_pretrained( "warp-ai/wuerstchen-prior", torch_dtype=dtype ).to(device) decoder_pipeline = WuerstchenDecoderPipeline.from_pretrained( "warp-ai/wuerstchen", torch_dtype=dtype ).to(device) caption = "Anthropomorphic cat dressed as a fire fighter" negative_prompt = "" prior_output = prior_pipeline( prompt=caption, height=1024, width=1536, timesteps=DEFAULT_STAGE_C_TIMESTEPS, negative_prompt=negative_prompt, guidance_scale=4.0, num_images_per_prompt=num_images_per_prompt, ) decoder_output = decoder_pipeline( image_embeddings=prior_output.image_embeddings, prompt=caption, negative_prompt=negative_prompt, guidance_scale=0.0, output_type="pil", ).images[0] decoder_output ``` ## Speed-Up Inference You can make use of `torch.compile` function and gain a speed-up of about 2-3x: ```python prior_pipeline.prior = torch.compile(prior_pipeline.prior, mode="reduce-overhead", fullgraph=True) decoder_pipeline.decoder = torch.compile(decoder_pipeline.decoder, mode="reduce-overhead", fullgraph=True) ``` ## Limitations - Due to the high compression employed by Würstchen, generations can lack a good amount of detail. To our human eye, this is especially noticeable in faces, hands etc. - **Images can only be generated in 128-pixel steps**, e.g. the next higher resolution after 1024x1024 is 1152x1152 - The model lacks the ability to render correct text in images - The model often does not achieve photorealism - Difficult compositional prompts are hard for the model The original codebase, as well as experimental ideas, can be found at [dome272/Wuerstchen](https://github.com/dome272/Wuerstchen). 
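
The 128-pixel step constraint listed in the limitations can be handled before calling the pipeline. The helper below is a minimal sketch and not part of Diffusers: `snap_to_multiple` and `snap_resolution` are hypothetical names, and rounding to the nearest multiple (rather than always up or down) is an assumption you may want to change.

```python
def snap_to_multiple(value: int, multiple: int = 128) -> int:
    """Round `value` to the nearest positive multiple of `multiple` (ties round up)."""
    if value <= 0 or multiple <= 0:
        raise ValueError("value and multiple must be positive")
    # Integer arithmetic avoids floating-point rounding surprises.
    snapped = (value + multiple // 2) // multiple * multiple
    # Never return 0: the smallest valid size is one full step.
    return max(multiple, snapped)


def snap_resolution(height: int, width: int, multiple: int = 128) -> tuple[int, int]:
    """Snap a requested (height, width) pair to sizes aligned to `multiple`."""
    return snap_to_multiple(height, multiple), snap_to_multiple(width, multiple)


# Hypothetical requested size that is not aligned to 128 pixels.
height, width = snap_resolution(1000, 1500)
```

The snapped `height` and `width` can then be passed to the pipeline call in place of the requested values.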
## WuerstchenCombinedPipeline [[autodoc]] WuerstchenCombinedPipeline - all - __call__ ## WuerstchenPriorPipeline [[autodoc]] WuerstchenPriorPipeline - all - __call__ ## WuerstchenPriorPipelineOutput [[autodoc]] pipelines.wuerstchen.pipeline_wuerstchen_prior.WuerstchenPriorPipelineOutput ## WuerstchenDecoderPipeline [[autodoc]] WuerstchenDecoderPipeline - all - __call__ ## Citation ```bibtex @misc{pernias2023wuerstchen, title={Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models}, author={Pablo Pernias and Dominic Rampas and Mats L. Richter and Christopher J. Pal and Marc Aubreville}, year={2023}, eprint={2306.00637}, archivePrefix={arXiv}, primaryClass={cs.CV} } ``` ================================================ FILE: docs/source/en/api/pipelines/z_image.md ================================================ # Z-Image
[Z-Image](https://huggingface.co/papers/2511.22699) is a powerful and highly efficient image generation model with 6B parameters. Currently there's only one model with two more to be released: |Model|Hugging Face| |---|---| |Z-Image-Turbo|https://huggingface.co/Tongyi-MAI/Z-Image-Turbo| ## Z-Image-Turbo Z-Image-Turbo is a distilled version of Z-Image that matches or exceeds leading competitors with only 8 NFEs (Number of Function Evaluations). It offers sub-second inference latency on enterprise-grade H800 GPUs and fits comfortably within 16G VRAM consumer devices. It excels in photorealistic image generation, bilingual text rendering (English & Chinese), and robust instruction adherence. ## Image-to-image Use [`ZImageImg2ImgPipeline`] to transform an existing image based on a text prompt. ```python import torch from diffusers import ZImageImg2ImgPipeline from diffusers.utils import load_image pipe = ZImageImg2ImgPipeline.from_pretrained("Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16) pipe.to("cuda") url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" init_image = load_image(url).resize((1024, 1024)) prompt = "A fantasy landscape with mountains and a river, detailed, vibrant colors" image = pipe( prompt, image=init_image, strength=0.6, num_inference_steps=9, guidance_scale=0.0, generator=torch.Generator("cuda").manual_seed(42), ).images[0] image.save("zimage_img2img.png") ``` ## Inpainting Use [`ZImageInpaintPipeline`] to inpaint specific regions of an image based on a text prompt and mask. 
```python import torch import numpy as np from PIL import Image from diffusers import ZImageInpaintPipeline from diffusers.utils import load_image pipe = ZImageInpaintPipeline.from_pretrained("Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16) pipe.to("cuda") url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" init_image = load_image(url).resize((1024, 1024)) # Create a mask (white = inpaint, black = preserve) mask = np.zeros((1024, 1024), dtype=np.uint8) mask[256:768, 256:768] = 255 # Inpaint center region mask_image = Image.fromarray(mask) prompt = "A beautiful lake with mountains in the background" image = pipe( prompt, image=init_image, mask_image=mask_image, strength=1.0, num_inference_steps=9, guidance_scale=0.0, generator=torch.Generator("cuda").manual_seed(42), ).images[0] image.save("zimage_inpaint.png") ``` ## ZImagePipeline [[autodoc]] ZImagePipeline - all - __call__ ## ZImageImg2ImgPipeline [[autodoc]] ZImageImg2ImgPipeline - all - __call__ ## ZImageInpaintPipeline [[autodoc]] ZImageInpaintPipeline - all - __call__ ================================================ FILE: docs/source/en/api/quantization.md ================================================ # Quantization Quantization techniques reduce memory and computational costs by representing weights and activations with lower-precision data types like 8-bit integers (int8). This enables loading larger models you normally wouldn't be able to fit into memory, and speeding up inference. > [!TIP] > Learn how to quantize models in the [Quantization](../quantization/overview) guide. 
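
To make the idea of lower-precision storage concrete, here is a minimal pure-Python sketch of per-tensor absmax int8 quantization. It is an illustration only: the function names are hypothetical, and the backends documented below (bitsandbytes, GGUF, Quanto, torchao) each use their own, more sophisticated schemes.

```python
def quantize_absmax_int8(weights):
    """Map floats to integers in [-127, 127] using one scale for the whole tensor."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)

# Each restored value differs from the original by at most half a quantization step.
max_error = max(abs(r - w) for r, w in zip(restored, weights))
```

Each weight is stored in one byte instead of four (for float32), at the cost of a rounding error bounded by `scale / 2`.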
## PipelineQuantizationConfig [[autodoc]] quantizers.PipelineQuantizationConfig ## BitsAndBytesConfig [[autodoc]] quantizers.quantization_config.BitsAndBytesConfig ## GGUFQuantizationConfig [[autodoc]] quantizers.quantization_config.GGUFQuantizationConfig ## QuantoConfig [[autodoc]] quantizers.quantization_config.QuantoConfig ## TorchAoConfig [[autodoc]] quantizers.quantization_config.TorchAoConfig ## DiffusersQuantizer [[autodoc]] quantizers.base.DiffusersQuantizer ================================================ FILE: docs/source/en/api/schedulers/cm_stochastic_iterative.md ================================================ # CMStochasticIterativeScheduler [Consistency Models](https://huggingface.co/papers/2303.01469) by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever introduced a multistep and onestep scheduler (Algorithm 1) that is capable of generating good samples in one or a small number of steps. The abstract from the paper is: *Diffusion models have significantly advanced the fields of image, audio, and video generation, but they depend on an iterative sampling process that causes slow generation. To overcome this limitation, we propose consistency models, a new family of models that generate high quality samples by directly mapping noise to data. They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality. They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either by distilling pre-trained diffusion models, or as standalone generative models altogether. Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step sampling, achieving the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step generation. 
When trained in isolation, consistency models become a new family of generative models that can outperform existing one-step, non-adversarial generative models on standard benchmarks such as CIFAR-10, ImageNet 64x64 and LSUN 256x256.*

The original codebase can be found at [openai/consistency_models](https://github.com/openai/consistency_models).

## CMStochasticIterativeScheduler

[[autodoc]] CMStochasticIterativeScheduler

## CMStochasticIterativeSchedulerOutput

[[autodoc]] schedulers.scheduling_consistency_models.CMStochasticIterativeSchedulerOutput

================================================
FILE: docs/source/en/api/schedulers/consistency_decoder.md
================================================
# ConsistencyDecoderScheduler

This scheduler is a part of the [`ConsistencyDecoderPipeline`] and was introduced in [DALL-E 3](https://openai.com/dall-e-3).

The original codebase can be found at [openai/consistency_models](https://github.com/openai/consistency_models).

## ConsistencyDecoderScheduler

[[autodoc]] schedulers.scheduling_consistency_decoder.ConsistencyDecoderScheduler

================================================
FILE: docs/source/en/api/schedulers/cosine_dpm.md
================================================
# CosineDPMSolverMultistepScheduler

The [`CosineDPMSolverMultistepScheduler`] is a variant of [`DPMSolverMultistepScheduler`] with cosine schedule, proposed by Nichol and Dhariwal (2021). It is used in the [Stable Audio Open](https://huggingface.co/papers/2407.14358) paper and the [Stability-AI/stable-audio-tools](https://github.com/Stability-AI/stable-audio-tools) codebase.

This scheduler was contributed by [Yoach Lacombe](https://huggingface.co/ylacombe).
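
For intuition about what a cosine schedule looks like, the sketch below evaluates the cumulative signal level from the Nichol and Dhariwal paper, `alpha_bar(t) = f(t) / f(0)` with `f(t) = cos²(((t + s) / (1 + s)) · π / 2)` and the paper's offset `s = 0.008`. This illustrates the cited formulation only; `cosine_alpha_bar` is a hypothetical helper, and the scheduler's internal sigma parameterization may well differ from it.

```python
import math


def cosine_alpha_bar(t: float, s: float = 0.008) -> float:
    """Cumulative signal level at normalized time t in [0, 1] (1.0 = clean, 0.0 = pure noise)."""

    def f(u: float) -> float:
        return math.cos((u + s) / (1 + s) * math.pi / 2) ** 2

    return f(t) / f(0.0)


# Sample the schedule on a coarse grid from t=0 (data) to t=1 (noise).
grid = [i / 10 for i in range(11)]
levels = [cosine_alpha_bar(t) for t in grid]
```

The values start at 1, decay slowly near both ends, and fall fastest in the middle of the trajectory, which is the property that motivated the schedule in the paper.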
## CosineDPMSolverMultistepScheduler [[autodoc]] CosineDPMSolverMultistepScheduler ## SchedulerOutput [[autodoc]] schedulers.scheduling_utils.SchedulerOutput ================================================ FILE: docs/source/en/api/schedulers/ddim.md ================================================ # DDIMScheduler [Denoising Diffusion Implicit Models](https://huggingface.co/papers/2010.02502) (DDIM) by Jiaming Song, Chenlin Meng and Stefano Ermon. The abstract from the paper is: *Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from. We empirically demonstrate that DDIMs can produce high quality samples 10× to 50× faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space.* The original codebase of this paper can be found at [ermongroup/ddim](https://github.com/ermongroup/ddim), and you can contact the author on [tsong.me](https://tsong.me/). ## Tips The paper [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://huggingface.co/papers/2305.08891) claims that a mismatch between the training and inference settings leads to suboptimal inference generation results for Stable Diffusion. To fix this, the authors propose: > [!WARNING] > 🧪 This is an experimental feature! 1. 
rescale the noise schedule to enforce zero terminal signal-to-noise ratio (SNR) ```py pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, rescale_betas_zero_snr=True) ``` 2. train a model with `v_prediction` (add the following argument to the [train_text_to_image.py](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py) or [train_text_to_image_lora.py](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py) scripts) ```bash --prediction_type="v_prediction" ``` 3. change the sampler to always start from the last timestep ```py pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing") ``` 4. rescale classifier-free guidance to prevent over-exposure ```py image = pipe(prompt, guidance_rescale=0.7).images[0] ``` For example: ```py from diffusers import DiffusionPipeline, DDIMScheduler import torch pipe = DiffusionPipeline.from_pretrained("ptx0/pseudo-journey-v2", torch_dtype=torch.float16) pipe.scheduler = DDIMScheduler.from_config( pipe.scheduler.config, rescale_betas_zero_snr=True, timestep_spacing="trailing" ) pipe.to("cuda") prompt = "A lion in galaxies, spirals, nebulae, stars, smoke, iridescent, intricate detail, octane render, 8k" image = pipe(prompt, guidance_rescale=0.7).images[0] image ``` ## DDIMScheduler [[autodoc]] DDIMScheduler ## DDIMSchedulerOutput [[autodoc]] schedulers.scheduling_ddim.DDIMSchedulerOutput ================================================ FILE: docs/source/en/api/schedulers/ddim_cogvideox.md ================================================ # CogVideoXDDIMScheduler `CogVideoXDDIMScheduler` is based on [Denoising Diffusion Implicit Models](https://huggingface.co/papers/2010.02502), specifically for CogVideoX models. 
## CogVideoXDDIMScheduler [[autodoc]] CogVideoXDDIMScheduler ================================================ FILE: docs/source/en/api/schedulers/ddim_inverse.md ================================================ # DDIMInverseScheduler `DDIMInverseScheduler` is the inverted scheduler from [Denoising Diffusion Implicit Models](https://huggingface.co/papers/2010.02502) (DDIM) by Jiaming Song, Chenlin Meng and Stefano Ermon. The implementation is mostly based on the DDIM inversion definition from [Null-text Inversion for Editing Real Images using Guided Diffusion Models](https://huggingface.co/papers/2211.09794). ## DDIMInverseScheduler [[autodoc]] DDIMInverseScheduler ================================================ FILE: docs/source/en/api/schedulers/ddpm.md ================================================ # DDPMScheduler [Denoising Diffusion Probabilistic Models](https://huggingface.co/papers/2006.11239) (DDPM) by Jonathan Ho, Ajay Jain and Pieter Abbeel proposes a diffusion based model of the same name. In the context of the 🤗 Diffusers library, DDPM refers to the discrete denoising scheduler from the paper as well as the pipeline. The abstract from the paper is: *We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN. 
Our implementation is available at [this https URL](https://github.com/hojonathanho/diffusion).* ## DDPMScheduler [[autodoc]] DDPMScheduler ## DDPMSchedulerOutput [[autodoc]] schedulers.scheduling_ddpm.DDPMSchedulerOutput ================================================ FILE: docs/source/en/api/schedulers/deis.md ================================================ # DEISMultistepScheduler Diffusion Exponential Integrator Sampler (DEIS) was proposed in [Fast Sampling of Diffusion Models with Exponential Integrator](https://huggingface.co/papers/2204.13902) by Qinsheng Zhang and Yongxin Chen. `DEISMultistepScheduler` is a fast high-order solver for diffusion ordinary differential equations (ODEs). This implementation modifies the polynomial fitting formula in log-rho space instead of the original linear `t` space in the DEIS paper. The modification yields closed-form coefficients for the exponential multistep update instead of relying on a numerical solver. The abstract from the paper is: *The past few years have witnessed the great success of Diffusion models~(DMs) in generating high-fidelity samples in generative modeling tasks. A major limitation of the DM is its notoriously slow sampling procedure which normally requires hundreds to thousands of time discretization steps of the learned diffusion process to reach the desired accuracy. Our goal is to develop a fast sampling method for DMs with a much less number of steps while retaining high sample quality. To this end, we systematically analyze the sampling procedure in DMs and identify key factors that affect the sample quality, among which the method of discretization is most crucial. By carefully examining the learned diffusion process, we propose Diffusion Exponential Integrator Sampler~(DEIS). It is based on the Exponential Integrator designed for discretizing ordinary differential equations (ODEs) and leverages a semilinear structure of the learned diffusion process to reduce the discretization error.
The proposed method can be applied to any DMs and can generate high-fidelity samples in as few as 10 steps. In our experiments, it takes about 3 minutes on one A6000 GPU to generate 50k images from CIFAR10. Moreover, by directly using pre-trained DMs, we achieve the state-of-art sampling performance when the number of score function evaluation~(NFE) is limited, e.g., 4.17 FID with 10 NFEs, 3.37 FID, and 9.74 IS with only 15 NFEs on CIFAR10. Code is available at [this https URL](https://github.com/qsh-zh/deis).* ## Tips It is recommended to set `solver_order` to 2 or 3, while `solver_order=1` is equivalent to [`DDIMScheduler`]. Dynamic thresholding from [Imagen](https://huggingface.co/papers/2205.11487) is supported, and for pixel-space diffusion models, you can set `thresholding=True` to use the dynamic thresholding. ## DEISMultistepScheduler [[autodoc]] DEISMultistepScheduler ## SchedulerOutput [[autodoc]] schedulers.scheduling_utils.SchedulerOutput ================================================ FILE: docs/source/en/api/schedulers/dpm_discrete.md ================================================ # KDPM2DiscreteScheduler The `KDPM2DiscreteScheduler` is inspired by the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) paper, and the scheduler is ported from and created by [Katherine Crowson](https://github.com/crowsonkb/). The original codebase can be found at [crowsonkb/k-diffusion](https://github.com/crowsonkb/k-diffusion). 
## KDPM2DiscreteScheduler [[autodoc]] KDPM2DiscreteScheduler ## SchedulerOutput [[autodoc]] schedulers.scheduling_utils.SchedulerOutput ================================================ FILE: docs/source/en/api/schedulers/dpm_discrete_ancestral.md ================================================ # KDPM2AncestralDiscreteScheduler The `KDPM2DiscreteScheduler` with ancestral sampling is inspired by the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) paper, and the scheduler is ported from and created by [Katherine Crowson](https://github.com/crowsonkb/). The original codebase can be found at [crowsonkb/k-diffusion](https://github.com/crowsonkb/k-diffusion). ## KDPM2AncestralDiscreteScheduler [[autodoc]] KDPM2AncestralDiscreteScheduler ## SchedulerOutput [[autodoc]] schedulers.scheduling_utils.SchedulerOutput ================================================ FILE: docs/source/en/api/schedulers/dpm_sde.md ================================================ # DPMSolverSDEScheduler The `DPMSolverSDEScheduler` is inspired by the stochastic sampler from the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) paper, and the scheduler is ported from and created by [Katherine Crowson](https://github.com/crowsonkb/). ## DPMSolverSDEScheduler [[autodoc]] DPMSolverSDEScheduler ## SchedulerOutput [[autodoc]] schedulers.scheduling_utils.SchedulerOutput ================================================ FILE: docs/source/en/api/schedulers/edm_euler.md ================================================ # EDMEulerScheduler The Karras formulation of the Euler scheduler (Algorithm 2) from the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) paper by Karras et al. This is a fast scheduler which can often generate good outputs in 20-30 steps. 
The scheduler is based on the original [k-diffusion](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L51) implementation by [Katherine Crowson](https://github.com/crowsonkb/). ## EDMEulerScheduler [[autodoc]] EDMEulerScheduler ## EDMEulerSchedulerOutput [[autodoc]] schedulers.scheduling_edm_euler.EDMEulerSchedulerOutput ================================================ FILE: docs/source/en/api/schedulers/edm_multistep_dpm_solver.md ================================================ # EDMDPMSolverMultistepScheduler `EDMDPMSolverMultistepScheduler` is a [Karras formulation](https://huggingface.co/papers/2206.00364) of `DPMSolverMultistepScheduler`, a multistep scheduler from [DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps](https://huggingface.co/papers/2206.00927) and [DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models](https://huggingface.co/papers/2211.01095) by Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPMSolver (and the improved version DPMSolver++) is a fast dedicated high-order solver for diffusion ODEs with convergence order guarantee. Empirically, DPMSolver sampling with only 20 steps can generate high-quality samples, and it can generate quite good samples even in 10 steps. ## EDMDPMSolverMultistepScheduler [[autodoc]] EDMDPMSolverMultistepScheduler ## SchedulerOutput [[autodoc]] schedulers.scheduling_utils.SchedulerOutput ================================================ FILE: docs/source/en/api/schedulers/euler.md ================================================ # EulerDiscreteScheduler The Euler scheduler (Algorithm 2) is from the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) paper by Karras et al. This is a fast scheduler which can often generate good outputs in 20-30 steps. 
The scheduler is based on the original [k-diffusion](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L51) implementation by [Katherine Crowson](https://github.com/crowsonkb/). ## EulerDiscreteScheduler [[autodoc]] EulerDiscreteScheduler ## EulerDiscreteSchedulerOutput [[autodoc]] schedulers.scheduling_euler_discrete.EulerDiscreteSchedulerOutput ================================================ FILE: docs/source/en/api/schedulers/euler_ancestral.md ================================================ # EulerAncestralDiscreteScheduler A scheduler that uses ancestral sampling with Euler method steps. This is a fast scheduler which can often generate good outputs in 20-30 steps. The scheduler is based on the original [k-diffusion](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L72) implementation by [Katherine Crowson](https://github.com/crowsonkb/). ## EulerAncestralDiscreteScheduler [[autodoc]] EulerAncestralDiscreteScheduler ## EulerAncestralDiscreteSchedulerOutput [[autodoc]] schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteSchedulerOutput ================================================ FILE: docs/source/en/api/schedulers/flow_match_euler_discrete.md ================================================ # FlowMatchEulerDiscreteScheduler `FlowMatchEulerDiscreteScheduler` is based on the flow-matching sampling introduced in [Stable Diffusion 3](https://huggingface.co/papers/2403.03206). ## FlowMatchEulerDiscreteScheduler [[autodoc]] FlowMatchEulerDiscreteScheduler ================================================ FILE: docs/source/en/api/schedulers/flow_match_heun_discrete.md ================================================ # FlowMatchHeunDiscreteScheduler `FlowMatchHeunDiscreteScheduler` is based on the flow-matching sampling introduced in [EDM](https://huggingface.co/papers/2403.03206). 
## FlowMatchHeunDiscreteScheduler [[autodoc]] FlowMatchHeunDiscreteScheduler ================================================ FILE: docs/source/en/api/schedulers/helios.md ================================================ # HeliosScheduler `HeliosScheduler` is based on the pyramidal flow-matching sampling introduced in [Helios](https://huggingface.co/papers). ## HeliosScheduler [[autodoc]] HeliosScheduler scheduling_helios ================================================ FILE: docs/source/en/api/schedulers/helios_dmd.md ================================================ # HeliosDMDScheduler `HeliosDMDScheduler` is based on the pyramidal flow-matching sampling introduced in [Helios](https://huggingface.co/papers). ## HeliosDMDScheduler [[autodoc]] HeliosDMDScheduler scheduling_helios_dmd ================================================ FILE: docs/source/en/api/schedulers/heun.md ================================================ # HeunDiscreteScheduler The Heun scheduler (Algorithm 1) is from the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) paper by Karras et al. The scheduler is ported from the [k-diffusion](https://github.com/crowsonkb/k-diffusion) library and created by [Katherine Crowson](https://github.com/crowsonkb/). ## HeunDiscreteScheduler [[autodoc]] HeunDiscreteScheduler ## SchedulerOutput [[autodoc]] schedulers.scheduling_utils.SchedulerOutput ================================================ FILE: docs/source/en/api/schedulers/ipndm.md ================================================ # IPNDMScheduler `IPNDMScheduler` is a fourth-order Improved Pseudo Linear Multistep scheduler. The original implementation can be found at [crowsonkb/v-diffusion-pytorch](https://github.com/crowsonkb/v-diffusion-pytorch/blob/987f8985e38208345c1959b0ea767a625831cc9b/diffusion/sampling.py#L296). 
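To give a feel for what "fourth-order linear multistep" means, here is a minimal sketch that blends the most recent model outputs with the classical Adams-Bashforth weights, falling back to lower orders while the history is still short. The function name is hypothetical and the code is an illustration under that assumption; check the scheduler source for the exact update it performs.

```python
def multistep_combination(history):
    """Blend up to four past estimates (oldest first, newest last)
    using Adams-Bashforth weights of order 1 through 4."""
    if not history:
        raise ValueError("history must contain at least one estimate")
    e = history[-4:][::-1]  # e[0] is the newest estimate
    if len(e) == 1:
        return e[0]
    if len(e) == 2:
        return (3 * e[0] - e[1]) / 2
    if len(e) == 3:
        return (23 * e[0] - 16 * e[1] + 5 * e[2]) / 12
    return (55 * e[0] - 59 * e[1] + 37 * e[2] - 9 * e[3]) / 24


# A constant history is reproduced exactly, because each weight set sums to 1.
print(multistep_combination([2.0, 2.0, 2.0, 2.0]))  # 2.0
```

Each set of weights sums to one, so the combination is a weighted extrapolation of the recent outputs. Reusing stored outputs this way raises the order of accuracy without extra model evaluations per step.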
## IPNDMScheduler [[autodoc]] IPNDMScheduler ## SchedulerOutput [[autodoc]] schedulers.scheduling_utils.SchedulerOutput ================================================ FILE: docs/source/en/api/schedulers/lcm.md ================================================ # Latent Consistency Model Multistep Scheduler ## Overview Multistep and onestep scheduler (Algorithm 3) introduced alongside latent consistency models in the paper [Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference](https://huggingface.co/papers/2310.04378) by Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. This scheduler should be able to generate good samples from [`LatentConsistencyModelPipeline`] in 1-8 steps. ## LCMScheduler [[autodoc]] LCMScheduler ================================================ FILE: docs/source/en/api/schedulers/lms_discrete.md ================================================ # LMSDiscreteScheduler `LMSDiscreteScheduler` is a linear multistep scheduler for discrete beta schedules. The scheduler is ported from and created by [Katherine Crowson](https://github.com/crowsonkb/), and the original implementation can be found at [crowsonkb/k-diffusion](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L181). 
## LMSDiscreteScheduler [[autodoc]] LMSDiscreteScheduler ## LMSDiscreteSchedulerOutput [[autodoc]] schedulers.scheduling_lms_discrete.LMSDiscreteSchedulerOutput ================================================ FILE: docs/source/en/api/schedulers/multistep_dpm_solver.md ================================================ # DPMSolverMultistepScheduler `DPMSolverMultistepScheduler` is a multistep scheduler from [DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps](https://huggingface.co/papers/2206.00927) and [DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models](https://huggingface.co/papers/2211.01095) by Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPMSolver (and the improved version DPMSolver++) is a fast dedicated high-order solver for diffusion ODEs with convergence order guarantee. Empirically, DPMSolver sampling with only 20 steps can generate high-quality samples, and it can generate quite good samples even in 10 steps. ## Tips It is recommended to set `solver_order` to 2 for guided sampling, and `solver_order=3` for unconditional sampling. Dynamic thresholding from [Imagen](https://huggingface.co/papers/2205.11487) is supported, and for pixel-space diffusion models, you can set both `algorithm_type="dpmsolver++"` and `thresholding=True` to use dynamic thresholding. This thresholding method is unsuitable for latent-space diffusion models such as Stable Diffusion. The SDE variant of DPMSolver and DPM-Solver++ is also supported, but only for the first and second-order solvers. This is a fast SDE solver for the reverse diffusion SDE. It is recommended to use the second-order `sde-dpmsolver++`.
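The dynamic thresholding mentioned in the tips can be sketched in a few lines. The version below works on a flat list of predicted pixel values: it picks a high quantile `s` of the absolute values, never lets `s` drop below 1, clips to `[-s, s]`, and rescales by `s`. The function names, the `0.995` default, and the optional cap are illustrative assumptions, not the library's implementation.

```python
def quantile(values, q):
    """Linearly interpolated q-quantile (0 <= q <= 1) of a non-empty list."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def dynamic_threshold(sample, ratio=0.995, max_value=None):
    """Clip a predicted clean sample to [-s, s] and divide by s,
    where s is a high quantile of |sample| kept at or above 1."""
    s = quantile([abs(v) for v in sample], ratio)
    s = max(s, 1.0)  # values already inside [-1, 1] are left untouched
    if max_value is not None:
        s = min(s, max_value)  # optional cap on the threshold
    return [max(-s, min(s, v)) / s for v in sample]


print(dynamic_threshold([0.5, -0.2, 3.0, -4.0], ratio=1.0))  # [0.125, -0.05, 0.75, -1.0]
```

The result always lies in `[-1, 1]`, which suits pixel-space models whose data is normalized to that range. Latent-space models have no such range, which is consistent with the tip that this method is unsuitable for them.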
## DPMSolverMultistepScheduler [[autodoc]] DPMSolverMultistepScheduler ## SchedulerOutput [[autodoc]] schedulers.scheduling_utils.SchedulerOutput ================================================ FILE: docs/source/en/api/schedulers/multistep_dpm_solver_cogvideox.md ================================================ # CogVideoXDPMScheduler `CogVideoXDPMScheduler` is based on [DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps](https://huggingface.co/papers/2206.00927) and [DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models](https://huggingface.co/papers/2211.01095), specifically for CogVideoX models. ## CogVideoXDPMScheduler [[autodoc]] CogVideoXDPMScheduler ================================================ FILE: docs/source/en/api/schedulers/multistep_dpm_solver_inverse.md ================================================ # DPMSolverMultistepInverse `DPMSolverMultistepInverse` is the inverted scheduler from [DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps](https://huggingface.co/papers/2206.00927) and [DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models](https://huggingface.co/papers/2211.01095) by Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. The implementation is mostly based on the DDIM inversion definition of [Null-text Inversion for Editing Real Images using Guided Diffusion Models](https://huggingface.co/papers/2211.09794) and notebook implementation of the [`DiffEdit`] latent inversion from [Xiang-cd/DiffEdit-stable-diffusion](https://github.com/Xiang-cd/DiffEdit-stable-diffusion/blob/main/diffedit.ipynb). ## Tips Dynamic thresholding from [Imagen](https://huggingface.co/papers/2205.11487) is supported, and for pixel-space diffusion models, you can set both `algorithm_type="dpmsolver++"` and `thresholding=True` to use the dynamic thresholding. 
This thresholding method is unsuitable for latent-space diffusion models such as Stable Diffusion. ## DPMSolverMultistepInverseScheduler [[autodoc]] DPMSolverMultistepInverseScheduler ## SchedulerOutput [[autodoc]] schedulers.scheduling_utils.SchedulerOutput ================================================ FILE: docs/source/en/api/schedulers/overview.md ================================================ # Schedulers 🤗 Diffusers provides many scheduler functions for the diffusion process. A scheduler takes a model's output (the sample which the diffusion process is iterating on) and a timestep to return a denoised sample. The timestep is important because it dictates where in the diffusion process the step is; data is generated by iterating forward *n* timesteps and inference occurs by propagating backward through the timesteps. Based on the timestep, a scheduler may be *discrete* in which case the timestep is an `int` or *continuous* in which case the timestep is a `float`. Depending on the context, a scheduler defines how to iteratively add noise to an image or how to update a sample based on a model's output: - during *training*, a scheduler adds noise (there are different algorithms for how to add noise) to a sample to train a diffusion model - during *inference*, a scheduler defines how to update a sample based on a pretrained model's output Many schedulers are implemented from the [k-diffusion](https://github.com/crowsonkb/k-diffusion) library by [Katherine Crowson](https://github.com/crowsonkb/), and they're also widely used in A1111. 
To help you map the schedulers from k-diffusion and A1111 to the schedulers in 🤗 Diffusers, take a look at the table below:

| A1111/k-diffusion   | 🤗 Diffusers                         | Usage                                                                        |
|---------------------|-------------------------------------|------------------------------------------------------------------------------|
| DPM++ 2M            | [`DPMSolverMultistepScheduler`]     |                                                                              |
| DPM++ 2M Karras     | [`DPMSolverMultistepScheduler`]     | init with `use_karras_sigmas=True`                                           |
| DPM++ 2M SDE        | [`DPMSolverMultistepScheduler`]     | init with `algorithm_type="sde-dpmsolver++"`                                 |
| DPM++ 2M SDE Karras | [`DPMSolverMultistepScheduler`]     | init with `use_karras_sigmas=True` and `algorithm_type="sde-dpmsolver++"`    |
| DPM++ 2S a          | N/A                                 | very similar to `DPMSolverSinglestepScheduler`                               |
| DPM++ 2S a Karras   | N/A                                 | very similar to `DPMSolverSinglestepScheduler(use_karras_sigmas=True, ...)`  |
| DPM++ SDE           | [`DPMSolverSinglestepScheduler`]    |                                                                              |
| DPM++ SDE Karras    | [`DPMSolverSinglestepScheduler`]    | init with `use_karras_sigmas=True`                                           |
| DPM2                | [`KDPM2DiscreteScheduler`]          |                                                                              |
| DPM2 Karras         | [`KDPM2DiscreteScheduler`]          | init with `use_karras_sigmas=True`                                           |
| DPM2 a              | [`KDPM2AncestralDiscreteScheduler`] |                                                                              |
| DPM2 a Karras       | [`KDPM2AncestralDiscreteScheduler`] | init with `use_karras_sigmas=True`                                           |
| DPM adaptive        | N/A                                 |                                                                              |
| DPM fast            | N/A                                 |                                                                              |
| Euler               | [`EulerDiscreteScheduler`]          |                                                                              |
| Euler a             | [`EulerAncestralDiscreteScheduler`] |                                                                              |
| Heun                | [`HeunDiscreteScheduler`]           |                                                                              |
| LMS                 | [`LMSDiscreteScheduler`]            |                                                                              |
| LMS Karras          | [`LMSDiscreteScheduler`]            | init with `use_karras_sigmas=True`                                           |
| N/A                 | [`DEISMultistepScheduler`]          |                                                                              |
| N/A                 | [`UniPCMultistepScheduler`]         |                                                                              |

## Noise schedules and schedule types

| A1111/k-diffusion | 🤗 Diffusers                                                            |
|-------------------|------------------------------------------------------------------------|
| Karras            | init with `use_karras_sigmas=True`                                     |
| sgm_uniform       | init with `timestep_spacing="trailing"`                                |
| simple            | init with `timestep_spacing="trailing"`                                |
| exponential       | init with `timestep_spacing="linspace"`, `use_exponential_sigmas=True` |
| beta              | init with `timestep_spacing="linspace"`, `use_beta_sigmas=True`        |

All schedulers are built from the base [`SchedulerMixin`] class, which implements low-level utilities shared by all schedulers.

## SchedulerMixin
[[autodoc]] SchedulerMixin

## SchedulerOutput
[[autodoc]] schedulers.scheduling_utils.SchedulerOutput

## KarrasDiffusionSchedulers

[`KarrasDiffusionSchedulers`] are a broad generalization of schedulers in 🤗 Diffusers. The schedulers in this class are distinguished at a high level by their noise sampling strategy, the type of network and scaling, the training strategy, and how the loss is weighted.

The different schedulers in this class, depending on the ordinary differential equation (ODE) solver type, fall into the above taxonomy and provide a good abstraction for the design of the main schedulers implemented in 🤗 Diffusers. The schedulers in this class are given [here](https://github.com/huggingface/diffusers/blob/a69754bb879ed55b9b6dc9dd0b3cf4fa4124c765/src/diffusers/schedulers/scheduling_utils.py#L32).

## PushToHubMixin

[[autodoc]] utils.PushToHubMixin

================================================
FILE: docs/source/en/api/schedulers/pndm.md
================================================
# PNDMScheduler

`PNDMScheduler`, or pseudo numerical methods for diffusion models, uses more advanced ODE integration techniques like the Runge-Kutta and linear multi-step method. The original implementation can be found at [crowsonkb/k-diffusion](https://github.com/crowsonkb/k-diffusion/blob/481677d114f6ea445aa009cf5bd7a9cdee909e47/k_diffusion/sampling.py#L181).
## PNDMScheduler [[autodoc]] PNDMScheduler ## SchedulerOutput [[autodoc]] schedulers.scheduling_utils.SchedulerOutput ================================================ FILE: docs/source/en/api/schedulers/repaint.md ================================================ # RePaintScheduler `RePaintScheduler` is a DDPM-based inpainting scheduler for unsupervised inpainting with extreme masks. It is designed to be used with the [`RePaintPipeline`], and it is based on the paper [RePaint: Inpainting using Denoising Diffusion Probabilistic Models](https://huggingface.co/papers/2201.09865) by Andreas Lugmayr et al. The abstract from the paper is: *Free-form inpainting is the task of adding new content to an image in the regions specified by an arbitrary binary mask. Most existing approaches train for a certain distribution of masks, which limits their generalization capabilities to unseen mask types. Furthermore, training with pixel-wise and perceptual losses often leads to simple textural extensions towards the missing areas instead of semantically meaningful generation. In this work, we propose RePaint: A Denoising Diffusion Probabilistic Model (DDPM) based inpainting approach that is applicable to even extreme masks. We employ a pretrained unconditional DDPM as the generative prior. To condition the generation process, we only alter the reverse diffusion iterations by sampling the unmasked regions using the given image information. Since this technique does not modify or condition the original DDPM network itself, the model produces high-quality and diverse output images for any inpainting form. We validate our method for both faces and general-purpose image inpainting using standard and extreme masks. RePaint outperforms state-of-the-art Autoregressive, and GAN approaches for at least five out of six mask distributions. GitHub Repository: [this http URL](http://git.io/RePaint).* The original implementation can be found at [andreas128/RePaint](https://github.com/andreas128/). 
## RePaintScheduler [[autodoc]] RePaintScheduler ## RePaintSchedulerOutput [[autodoc]] schedulers.scheduling_repaint.RePaintSchedulerOutput ================================================ FILE: docs/source/en/api/schedulers/score_sde_ve.md ================================================ # ScoreSdeVeScheduler `ScoreSdeVeScheduler` is a variance exploding stochastic differential equation (SDE) scheduler. It was introduced in the [Score-Based Generative Modeling through Stochastic Differential Equations](https://huggingface.co/papers/2011.13456) paper by Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, Ben Poole. The abstract from the paper is: *Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the reverse-time SDE depends only on the time-dependent gradient field (\aka, score) of the perturbed data distribution. By leveraging advances in score-based generative modeling, we can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. We show that this framework encapsulates previous approaches in score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities. In particular, we introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE. We also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency. 
In addition, we provide a new way to solve inverse problems with score-based models, as demonstrated with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, we achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 2.99 bits/dim, and demonstrate high fidelity generation of 1024 x 1024 images for the first time from a score-based generative model.* ## ScoreSdeVeScheduler [[autodoc]] ScoreSdeVeScheduler ## SdeVeOutput [[autodoc]] schedulers.scheduling_sde_ve.SdeVeOutput ================================================ FILE: docs/source/en/api/schedulers/score_sde_vp.md ================================================ # ScoreSdeVpScheduler `ScoreSdeVpScheduler` is a variance preserving stochastic differential equation (SDE) scheduler. It was introduced in the [Score-Based Generative Modeling through Stochastic Differential Equations](https://huggingface.co/papers/2011.13456) paper by Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, Ben Poole. The abstract from the paper is: *Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the reverse-time SDE depends only on the time-dependent gradient field (\aka, score) of the perturbed data distribution. By leveraging advances in score-based generative modeling, we can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. 
We show that this framework encapsulates previous approaches in score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities. In particular, we introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE. We also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency. In addition, we provide a new way to solve inverse problems with score-based models, as demonstrated with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, we achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 2.99 bits/dim, and demonstrate high fidelity generation of 1024 x 1024 images for the first time from a score-based generative model.* > [!WARNING] > 🚧 This scheduler is under construction! ## ScoreSdeVpScheduler [[autodoc]] schedulers.deprecated.scheduling_sde_vp.ScoreSdeVpScheduler ================================================ FILE: docs/source/en/api/schedulers/singlestep_dpm_solver.md ================================================ # DPMSolverSinglestepScheduler `DPMSolverSinglestepScheduler` is a single step scheduler from [DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps](https://huggingface.co/papers/2206.00927) and [DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models](https://huggingface.co/papers/2211.01095) by Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPMSolver (and the improved version DPMSolver++) is a fast dedicated high-order solver for diffusion ODEs with convergence order guarantee. 
Empirically, DPMSolver sampling with only 20 steps can generate high-quality samples, and it can generate quite good samples even in 10 steps. The original implementation can be found at [LuChengTHU/dpm-solver](https://github.com/LuChengTHU/dpm-solver). ## Tips It is recommended to set `solver_order` to 2 for guided sampling, and `solver_order=3` for unconditional sampling. Dynamic thresholding from [Imagen](https://huggingface.co/papers/2205.11487) is supported, and for pixel-space diffusion models, you can set both `algorithm_type="dpmsolver++"` and `thresholding=True` to use dynamic thresholding. This thresholding method is unsuitable for latent-space diffusion models such as Stable Diffusion. ## DPMSolverSinglestepScheduler [[autodoc]] DPMSolverSinglestepScheduler ## SchedulerOutput [[autodoc]] schedulers.scheduling_utils.SchedulerOutput ================================================ FILE: docs/source/en/api/schedulers/stochastic_karras_ve.md ================================================ # KarrasVeScheduler `KarrasVeScheduler` is a stochastic sampler tailored to variance-exploding (VE) models. It is based on the [Elucidating the Design Space of Diffusion-Based Generative Models](https://huggingface.co/papers/2206.00364) and [Score-based generative modeling through stochastic differential equations](https://huggingface.co/papers/2011.13456) papers. ## KarrasVeScheduler [[autodoc]] KarrasVeScheduler ## KarrasVeOutput [[autodoc]] schedulers.deprecated.scheduling_karras_ve.KarrasVeOutput ================================================ FILE: docs/source/en/api/schedulers/tcd.md ================================================ # TCDScheduler [Trajectory Consistency Distillation](https://huggingface.co/papers/2402.19159) by Jianbin Zheng, Minghui Hu, Zhongyi Fan, Chaoyue Wang, Changxing Ding, Dacheng Tao and Tat-Jen Cham introduced Strategic Stochastic Sampling (Algorithm 4), which can generate good samples in a small number of steps.
As an advanced iteration of the multistep scheduler (Algorithm 1) in [Consistency Models](https://huggingface.co/papers/2303.01469), Strategic Stochastic Sampling is specifically tailored to the trajectory consistency function. The abstract from the paper is: *Latent Consistency Model (LCM) extends the Consistency Model to the latent space and leverages the guided consistency distillation technique to achieve impressive performance in accelerating text-to-image synthesis. However, we observed that LCM struggles to generate images with both clarity and detailed intricacy. To address this limitation, we initially delve into and elucidate the underlying causes. Our investigation identifies that the primary issue stems from errors in three distinct areas. Consequently, we introduce Trajectory Consistency Distillation (TCD), which encompasses trajectory consistency function and strategic stochastic sampling. The trajectory consistency function diminishes the distillation errors by broadening the scope of the self-consistency boundary condition and endowing the TCD with the ability to accurately trace the entire trajectory of the Probability Flow ODE. Additionally, strategic stochastic sampling is specifically designed to circumvent the accumulated errors inherent in multi-step consistency sampling, which is meticulously tailored to complement the TCD model. Experiments demonstrate that TCD not only significantly enhances image quality at low NFEs but also yields more detailed results compared to the teacher model at high NFEs.* The original codebase can be found at [jabir-zheng/TCD](https://github.com/jabir-zheng/TCD).
## TCDScheduler [[autodoc]] TCDScheduler ## TCDSchedulerOutput [[autodoc]] schedulers.scheduling_tcd.TCDSchedulerOutput ================================================ FILE: docs/source/en/api/schedulers/unipc.md ================================================ # UniPCMultistepScheduler `UniPCMultistepScheduler` is a training-free framework designed for fast sampling of diffusion models. It was introduced in [UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models](https://huggingface.co/papers/2302.04867) by Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, Jiwen Lu. It consists of a corrector (UniC) and a predictor (UniP) that share a unified analytical form and support arbitrary orders. UniPC is by design model-agnostic, supporting pixel-space/latent-space DPMs on unconditional/conditional sampling. It can also be applied to both noise prediction and data prediction models. The corrector UniC can be also applied after any off-the-shelf solvers to increase the order of accuracy. The abstract from the paper is: *Diffusion probabilistic models (DPMs) have demonstrated a very promising ability in high-resolution image synthesis. However, sampling from a pre-trained DPM is time-consuming due to the multiple evaluations of the denoising network, making it more and more important to accelerate the sampling of DPMs. Despite recent progress in designing fast samplers, existing methods still cannot generate satisfying images in many applications where fewer steps (e.g., <10) are favored. In this paper, we develop a unified corrector (UniC) that can be applied after any existing DPM sampler to increase the order of accuracy without extra model evaluations, and derive a unified predictor (UniP) that supports arbitrary order as a byproduct. 
Combining UniP and UniC, we propose a unified predictor-corrector framework called UniPC for the fast sampling of DPMs, which has a unified analytical form for any order and can significantly improve the sampling quality over previous methods, especially in extremely few steps. We evaluate our methods through extensive experiments including both unconditional and conditional sampling using pixel-space and latent-space DPMs. Our UniPC can achieve 3.87 FID on CIFAR10 (unconditional) and 7.51 FID on ImageNet 256×256 (conditional) with only 10 function evaluations. Code is available at [this https URL](https://github.com/wl-zhao/UniPC).*

## Tips

It is recommended to set `solver_order` to 2 for guided sampling, and `solver_order=3` for unconditional sampling.

Dynamic thresholding from [Imagen](https://huggingface.co/papers/2205.11487) is supported, and for pixel-space diffusion models, you can set both `predict_x0=True` and `thresholding=True` to use dynamic thresholding. This thresholding method is unsuitable for latent-space diffusion models such as Stable Diffusion.

## UniPCMultistepScheduler
[[autodoc]] UniPCMultistepScheduler

## SchedulerOutput
[[autodoc]] schedulers.scheduling_utils.SchedulerOutput

================================================
FILE: docs/source/en/api/schedulers/vq_diffusion.md
================================================
# VQDiffusionScheduler

`VQDiffusionScheduler` converts the transformer model's output into a sample for the unnoised image at the previous diffusion timestep. It was introduced in [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://huggingface.co/papers/2111.14822) by Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, Baining Guo.

The abstract from the paper is:

*We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation.
This method is based on a vector quantized variational autoencoder (VQ-VAE) whose latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model (DDPM). We find that this latent-space method is well-suited for text-to-image generation tasks because it not only eliminates the unidirectional bias with existing methods but also allows us to incorporate a mask-and-replace diffusion strategy to avoid the accumulation of errors, which is a serious problem with existing methods. Our experiments show that the VQ-Diffusion produces significantly better text-to-image generation results when compared with conventional autoregressive (AR) models with similar numbers of parameters. Compared with previous GAN-based text-to-image methods, our VQ-Diffusion can handle more complex scenes and improve the synthesized image quality by a large margin. Finally, we show that the image generation computation in our method can be made highly efficient by reparameterization. With traditional AR methods, the text-to-image generation time increases linearly with the output image resolution and hence is quite time consuming even for normal size images. The VQ-Diffusion allows us to achieve a better trade-off between quality and speed. Our experiments indicate that the VQ-Diffusion model with the reparameterization is fifteen times faster than traditional AR methods while achieving a better image quality.* ## VQDiffusionScheduler [[autodoc]] VQDiffusionScheduler ## VQDiffusionSchedulerOutput [[autodoc]] schedulers.scheduling_vq_diffusion.VQDiffusionSchedulerOutput ================================================ FILE: docs/source/en/api/utilities.md ================================================ # Utilities Utility and helper functions for working with 🤗 Diffusers. 
## numpy_to_pil [[autodoc]] utils.numpy_to_pil ## pt_to_pil [[autodoc]] utils.pt_to_pil ## load_image [[autodoc]] utils.load_image ## load_video [[autodoc]] utils.load_video ## export_to_gif [[autodoc]] utils.export_to_gif ## export_to_video [[autodoc]] utils.export_to_video ## make_image_grid [[autodoc]] utils.make_image_grid ## randn_tensor [[autodoc]] utils.torch_utils.randn_tensor ## apply_layerwise_casting [[autodoc]] hooks.layerwise_casting.apply_layerwise_casting ## apply_group_offloading [[autodoc]] hooks.group_offloading.apply_group_offloading ================================================ FILE: docs/source/en/api/video_processor.md ================================================ # Video Processor The [`VideoProcessor`] provides a unified API for video pipelines to prepare inputs for VAE encoding and post-processing outputs once they're decoded. The class inherits [`VaeImageProcessor`] so it includes transformations such as resizing, normalization, and conversion between PIL Image, PyTorch, and NumPy arrays. ## VideoProcessor [[autodoc]] video_processor.VideoProcessor.preprocess_video [[autodoc]] video_processor.VideoProcessor.postprocess_video ================================================ FILE: docs/source/en/community_projects.md ================================================ # Community Projects Welcome to Community Projects. This space is dedicated to showcasing the incredible work and innovative applications created by our vibrant community using the `diffusers` library. This section aims to: - Highlight diverse and inspiring projects built with `diffusers` - Foster knowledge sharing within our community - Provide real-world examples of how `diffusers` can be leveraged Happy exploring, and thank you for being part of the Diffusers community!

| Project Name | Description |
|---|---|
| dream-textures | Stable Diffusion built-in to Blender |
| HiDiffusion | Increases the resolution and speed of your diffusion model by only adding a single line of code |
| IC-Light | IC-Light is a project to manipulate the illumination of images |
| InstantID | InstantID : Zero-shot Identity-Preserving Generation in Seconds |
| IOPaint | Image inpainting tool powered by SOTA AI Model. Remove any unwanted object, defect, people from your pictures or erase and replace(powered by stable diffusion) any thing on your pictures. |
| Kohya | Gradio GUI for Kohya's Stable Diffusion trainers |
| MagicAnimate | MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model |
| OOTDiffusion | Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on |
| SD.Next | SD.Next: Advanced Implementation of Stable Diffusion and other Diffusion-based generative image models |
| stable-dreamfusion | Text-to-3D & Image-to-3D & Mesh Exportation with NeRF + Diffusion |
| StoryDiffusion | StoryDiffusion can create a magic story by generating consistent images and videos. |
| StreamDiffusion | A Pipeline-Level Solution for Real-Time Interactive Generation |
| Stable Diffusion Server | A server configured for Inpainting/Generation/img2img with one stable diffusion model |
| Model Search | Search models on Civitai and Hugging Face |
| Skrample | Fully modular scheduler functions with 1st class diffusers integration. |

================================================ FILE: docs/source/en/conceptual/contribution.md ================================================ # How to contribute to Diffusers 🧨 We ❤️ contributions from the open-source community! Everyone is welcome, and all types of participation –not just code– are valued and appreciated. Answering questions, helping others, reaching out, and improving the documentation are all immensely valuable to the community, so don't be afraid and get involved if you're up for it! Everyone is encouraged to start by saying 👋 in our public Discord channel. We discuss the latest trends in diffusion models, ask questions, show off personal projects, help each other with contributions, or just hang out ☕. Join us on Discord Whichever way you choose to contribute, we strive to be part of an open, welcoming, and kind community. Please, read our [code of conduct](https://github.com/huggingface/diffusers/blob/main/CODE_OF_CONDUCT.md) and be mindful to respect it during your interactions. We also recommend you become familiar with the [ethical guidelines](https://huggingface.co/docs/diffusers/conceptual/ethical_guidelines) that guide our project and ask you to adhere to the same principles of transparency and responsibility. We enormously value feedback from the community, so please do not be afraid to speak up if you believe you have valuable feedback that can help improve the library - every message, comment, issue, and pull request (PR) is read and considered. ## Overview You can contribute in many ways ranging from answering questions on issues and discussions to adding new diffusion models to the core library. In the following, we give an overview of different ways to contribute, ranked by difficulty in ascending order. All of them are valuable to the community. * 1. 
Asking and answering questions on [the Diffusers discussion forum](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers) or on [Discord](https://discord.gg/G7tWnz98XR).
* 2. Opening new issues on [the GitHub Issues tab](https://github.com/huggingface/diffusers/issues/new/choose) or new discussions on [the GitHub Discussions tab](https://github.com/huggingface/diffusers/discussions/new/choose).
* 3. Answering issues on [the GitHub Issues tab](https://github.com/huggingface/diffusers/issues) or discussions on [the GitHub Discussions tab](https://github.com/huggingface/diffusers/discussions).
* 4. Fix a simple issue, marked by the "Good first issue" label, see [here](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22).
* 5. Contribute to the [documentation](https://github.com/huggingface/diffusers/tree/main/docs/source).
* 6. Contribute a [Community Pipeline](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3Acommunity-examples).
* 7. Contribute to the [examples](https://github.com/huggingface/diffusers/tree/main/examples).
* 8. Fix a more difficult issue, marked by the "Good second issue" label, see [here](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22Good+second+issue%22).
* 9. Add a new pipeline, model, or scheduler, see ["New Pipeline/Model"](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+pipeline%2Fmodel%22) and ["New scheduler"](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+scheduler%22) issues. For this contribution, please have a look at [Design Philosophy](https://github.com/huggingface/diffusers/blob/main/PHILOSOPHY.md).

As said before, **all contributions are valuable to the community**. In the following, we will explain each contribution in a bit more detail.

For all contributions 4-9, you will need to open a PR.
It is explained in detail how to do so in [Opening a pull request](#how-to-open-a-pr). ### 1. Asking and answering questions on the Diffusers discussion forum or on the Diffusers Discord Any question or comment related to the Diffusers library can be asked on the [discussion forum](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/) or on [Discord](https://discord.gg/G7tWnz98XR). Such questions and comments include (but are not limited to): - Reports of training or inference experiments in an attempt to share knowledge - Presentation of personal projects - Questions to non-official training examples - Project proposals - General feedback - Paper summaries - Asking for help on personal projects that build on top of the Diffusers library - General questions - Ethical questions regarding diffusion models - ... Every question that is asked on the forum or on Discord actively encourages the community to publicly share knowledge and might very well help a beginner in the future who has the same question you're having. Please do pose any questions you might have. In the same spirit, you are of immense help to the community by answering such questions because this way you are publicly documenting knowledge for everybody to learn from. **Please** keep in mind that the more effort you put into asking or answering a question, the higher the quality of the publicly documented knowledge. In the same way, well-posed and well-answered questions create a high-quality knowledge database accessible to everybody, while badly posed questions or answers reduce the overall quality of the public knowledge database. In short, a high quality question or answer is *precise*, *concise*, *relevant*, *easy-to-understand*, *accessible*, and *well-formatted/well-posed*. For more information, please have a look through the [How to write a good issue](#how-to-write-a-good-issue) section. 
**NOTE about channels**: [*The forum*](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63) is much better indexed by search engines, such as Google. Posts are ranked by popularity rather than chronologically. Hence, it's easier to look up questions and answers that we posted some time ago. In addition, questions and answers posted in the forum can easily be linked to. In contrast, *Discord* has a chat-like format that invites fast back-and-forth communication. While it will most likely take less time for you to get an answer to your question on Discord, your question won't be visible anymore over time. Also, it's much harder to find information that was posted a while back on Discord. We therefore strongly recommend using the forum for high-quality questions and answers in an attempt to create long-lasting knowledge for the community. If discussions on Discord lead to very interesting answers and conclusions, we recommend posting the results on the forum to make the information more available for future readers. ### 2. Opening new issues on the GitHub issues tab The 🧨 Diffusers library is robust and reliable thanks to the users who notify us of the problems they encounter. So thank you for reporting an issue. Remember, GitHub issues are reserved for technical questions directly related to the Diffusers library, bug reports, feature requests, or feedback on the library design. In a nutshell, this means that everything that is **not** related to the **code of the Diffusers library** (including the documentation) should **not** be asked on GitHub, but rather on either the [forum](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63) or [Discord](https://discord.gg/G7tWnz98XR). **Please consider the following guidelines when opening a new issue**: - Make sure you have searched whether your issue has already been asked before (use the search bar on GitHub under Issues). 
- Please never report a new issue on another (related) issue. If another issue is highly related, please open a new issue nevertheless and link to the related issue.
- Make sure your issue is written in English. Please use one of the great, free online translation services, such as [DeepL](https://www.deepl.com/translator), to translate from your native language to English if you are not comfortable in English.
- Check whether your issue might be solved by updating to the newest Diffusers version. Before posting your issue, please make sure that the version printed by `python -c "import diffusers; print(diffusers.__version__)"` is equal to or higher than the latest released Diffusers version.
- Remember that the more effort you put into opening a new issue, the higher the quality of the answers you receive will be and the better the overall quality of the Diffusers issues.

New issues usually include the following.

#### 2.1. Reproducible, minimal bug reports

A bug report should always have a reproducible code snippet and be as minimal and concise as possible. In more detail, this means:
- Narrow the bug down as much as you can, **do not just dump your whole code file**.
- Format your code.
- Do not include any external libraries other than Diffusers and the libraries it depends on.
- **Always** provide all necessary information about your environment; for this, you can run `diffusers-cli env` in your shell and copy-paste the displayed information to the issue.
- Explain the issue. If the reader doesn't know what the issue is and why it is an issue, they cannot solve it.
- **Always** make sure the reader can reproduce your issue with as little effort as possible. If your code snippet cannot be run because of missing libraries or undefined variables, the reader cannot help you. Make sure your reproducible code snippet is as minimal as possible and can be copy-pasted into a simple Python shell.
- If in order to reproduce your issue a model and/or dataset is required, make sure the reader has access to that model or dataset.
You can always upload your model or dataset to the [Hub](https://huggingface.co) to make it easily downloadable. Try to keep your model and dataset as small as possible, to make the reproduction of your issue as effortless as possible. For more information, please have a look through the [How to write a good issue](#how-to-write-a-good-issue) section. You can open a bug report [here](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=bug&projects=&template=bug-report.yml). #### 2.2. Feature requests A world-class feature request addresses the following points: 1. Motivation first: * Is it related to a problem/frustration with the library? If so, please explain why. Providing a code snippet that demonstrates the problem is best. * Is it related to something you would need for a project? We'd love to hear about it! * Is it something you worked on and think could benefit the community? Awesome! Tell us what problem it solved for you. 2. Write a *full paragraph* describing the feature; 3. Provide a **code snippet** that demonstrates its future use; 4. In case this is related to a paper, please attach a link; 5. Attach any additional information (drawings, screenshots, etc.) you think may help. You can open a feature request [here](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feature_request.md&title=). #### 2.3 Feedback Feedback about the library design and why it is good or not good helps the core maintainers immensely to build a user-friendly library. To understand the philosophy behind the current design philosophy, please have a look [here](https://huggingface.co/docs/diffusers/conceptual/philosophy). If you feel like a certain design choice does not fit with the current design philosophy, please explain why and how it should be changed. If a certain design choice follows the design philosophy too much, hence restricting use cases, explain why and how it should be changed. 
If a certain design choice is very useful for you, please also leave a note as this is great feedback for future design decisions. You can open an issue about feedback [here](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feedback.md&title=). #### 2.4 Technical questions Technical questions are mainly about why certain code of the library was written in a certain way, or what a certain part of the code does. Please make sure to link to the code in question and please provide details on why this part of the code is difficult to understand. You can open an issue about a technical question [here](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=bug&template=bug-report.yml). #### 2.5 Proposal to add a new model, scheduler, or pipeline If the diffusion model community released a new model, pipeline, or scheduler that you would like to see in the Diffusers library, please provide the following information: * Short description of the diffusion pipeline, model, or scheduler and link to the paper or public release. * Link to any of its open-source implementation(s). * Link to the model weights if they are available. If you are willing to contribute to the model yourself, let us know so we can best guide you. Also, don't forget to tag the original author of the component (model, scheduler, pipeline, etc.) by GitHub handle if you can find it. You can open a request for a model/pipeline/scheduler [here](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=New+model%2Fpipeline%2Fscheduler&template=new-model-addition.yml). ### 3. Answering issues on the GitHub issues tab Answering issues on GitHub might require some technical knowledge of Diffusers, but we encourage everybody to give it a try even if you are not 100% certain that your answer is correct. Some tips to give a high-quality answer to an issue: - Be as concise and minimal as possible. - Stay on topic. 
An answer to the issue should concern the issue and only the issue. - Provide links to code, papers, or other sources that prove or encourage your point. - Answer in code. If a simple code snippet is the answer to the issue or shows how the issue can be solved, please provide a fully reproducible code snippet. Also, many issues tend to be simply off-topic, duplicates of other issues, or irrelevant. It is of great help to the maintainers if you can answer such issues, encouraging the author of the issue to be more precise, provide the link to a duplicated issue or redirect them to [the forum](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63) or [Discord](https://discord.gg/G7tWnz98XR). If you have verified that the issued bug report is correct and requires a correction in the source code, please have a look at the next sections. For all of the following contributions, you will need to open a PR. It is explained in detail how to do so in the [Opening a pull request](#how-to-open-a-pr) section. ### 4. Fixing a "Good first issue" *Good first issues* are marked by the [Good first issue](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) label. Usually, the issue already explains how a potential solution should look so that it is easier to fix. If the issue hasn't been closed and you would like to try to fix this issue, you can just leave a message "I would like to try this issue.". There are usually three scenarios: - a.) The issue description already proposes a fix. In this case and if the solution makes sense to you, you can open a PR or draft PR to fix it. - b.) The issue description does not propose a fix. In this case, you can ask what a proposed fix could look like and someone from the Diffusers team should answer shortly. If you have a good idea of how to fix it, feel free to directly open a PR. - c.) 
There is already an open PR to fix the issue, but the issue hasn't been closed yet. If the PR has gone stale, you can simply open a new PR and link to the stale PR. PRs often go stale if the original contributor who wanted to fix the issue suddenly cannot find the time anymore to proceed. This often happens in open-source and is very normal. In this case, the community will be very happy if you give it a new try and leverage the knowledge of the existing PR. If there is already a PR and it is active, you can help the author by giving suggestions, reviewing the PR or even asking whether you can contribute to the PR.

### 5. Contribute to the documentation

A good library **always** has good documentation! The official documentation is often one of the first points of contact for new users of the library, and therefore contributing to the documentation is a **highly valuable contribution**.

Contributing to the documentation can take many forms:

- Correcting spelling or grammatical errors.
- Correcting incorrect formatting of a docstring. If you see that the official documentation is weirdly displayed or a link is broken, we would be very happy if you take some time to correct it.
- Correcting the shape or dimensions of a docstring input or output tensor.
- Clarifying documentation that is hard to understand or incorrect.
- Updating outdated code examples.
- Translating the documentation to another language.

Anything displayed on [the official Diffusers doc page](https://huggingface.co/docs/diffusers/index) is part of the official documentation and can be corrected or adjusted in the respective [documentation source](https://github.com/huggingface/diffusers/tree/main/docs/source).

Please have a look at [this page](https://github.com/huggingface/diffusers/tree/main/docs) on how to verify changes made to the documentation locally.

### 6. Contribute a community pipeline

> [!TIP]
> Read the [Community pipelines](../using-diffusers/custom_pipeline_overview#community-pipelines) guide to learn more about the difference between a GitHub and Hugging Face Hub community pipeline. If you're interested in why we have community pipelines, take a look at GitHub Issue [#841](https://github.com/huggingface/diffusers/issues/841) (basically, we can't maintain all the possible ways diffusion models can be used for inference but we also don't want to prevent the community from building them).

Contributing a community pipeline is a great way to share your creativity and work with the community. It lets you build on top of the [`DiffusionPipeline`] so that anyone can load and use it by setting the `custom_pipeline` parameter. This section will walk you through how to create a simple pipeline where the UNet only does a single forward pass and calls the scheduler once (a "one-step" pipeline).

1. Create a one_step_unet.py file for your community pipeline. This file can import whatever packages you want to use, as long as the user has them installed. Make sure you only have one pipeline class that inherits from [`DiffusionPipeline`] to load model weights and the scheduler configuration from the Hub. Add a UNet and scheduler to the `__init__` function.

    You should also add the `register_modules` function to ensure your pipeline and its components can be saved with [`~DiffusionPipeline.save_pretrained`].

```py
from diffusers import DiffusionPipeline
import torch

class UnetSchedulerOneForwardPipeline(DiffusionPipeline):
    def __init__(self, unet, scheduler):
        super().__init__()

        # Register the components so they are saved with `save_pretrained`.
        self.register_modules(unet=unet, scheduler=scheduler)
```

2. In the forward pass (which we recommend defining as `__call__`), you can add any feature you'd like. For the "one-step" pipeline, create a random image and call the UNet and scheduler once by setting `timestep=1`.
```py
from diffusers import DiffusionPipeline
import torch

class UnetSchedulerOneForwardPipeline(DiffusionPipeline):
    def __init__(self, unet, scheduler):
        super().__init__()

        self.register_modules(unet=unet, scheduler=scheduler)

    def __call__(self):
        image = torch.randn(
            (1, self.unet.config.in_channels, self.unet.config.sample_size, self.unet.config.sample_size),
        )
        timestep = 1

        model_output = self.unet(image, timestep).sample
        scheduler_output = self.scheduler.step(model_output, timestep, image).prev_sample

        return scheduler_output
```

Now you can run the pipeline by passing a UNet and scheduler to it or load pretrained weights if the pipeline structure is identical.

```py
from diffusers import DDPMScheduler, UNet2DModel

scheduler = DDPMScheduler()
unet = UNet2DModel()

pipeline = UnetSchedulerOneForwardPipeline(unet=unet, scheduler=scheduler)
output = pipeline()
# load pretrained weights
pipeline = UnetSchedulerOneForwardPipeline.from_pretrained("google/ddpm-cifar10-32", use_safetensors=True)
output = pipeline()
```

You can either share your pipeline as a GitHub community pipeline or Hub community pipeline.

Share your GitHub pipeline by opening a pull request on the Diffusers [repository](https://github.com/huggingface/diffusers) and add the one_step_unet.py file to the [examples/community](https://github.com/huggingface/diffusers/tree/main/examples/community) subfolder.

Share your Hub pipeline by creating a model repository on the Hub and uploading the one_step_unet.py file to it.

### 7. Contribute to training examples

Diffusers examples are a collection of training scripts that reside in [examples](https://github.com/huggingface/diffusers/tree/main/examples).
We support two types of training examples: - Official training examples - Research training examples Research training examples are located in [examples/research_projects](https://github.com/huggingface/diffusers/tree/main/examples/research_projects) whereas official training examples include all folders under [examples](https://github.com/huggingface/diffusers/tree/main/examples) except the `research_projects` and `community` folders. The official training examples are maintained by the Diffusers' core maintainers whereas the research training examples are maintained by the community. This is because of the same reasons put forward in [6. Contribute a community pipeline](#6-contribute-a-community-pipeline) for official pipelines vs. community pipelines: It is not feasible for the core maintainers to maintain all possible training methods for diffusion models. If the Diffusers core maintainers and the community consider a certain training paradigm to be too experimental or not popular enough, the corresponding training code should be put in the `research_projects` folder and maintained by the author. Both official training and research examples consist of a directory that contains one or more training scripts, a `requirements.txt` file, and a `README.md` file. In order for the user to make use of the training examples, it is required to clone the repository: ```bash git clone https://github.com/huggingface/diffusers ``` as well as to install all additional dependencies required for training: ```bash cd diffusers pip install -r examples//requirements.txt ``` Therefore when adding an example, the `requirements.txt` file shall define all pip dependencies required for your training example so that once all those are installed, the user can run the example's training script. See, for example, the [DreamBooth `requirements.txt` file](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/requirements.txt). 
Training examples of the Diffusers library should adhere to the following philosophy:
- All the code necessary to run the examples should be found in a single Python file.
- One should be able to run the example from the command line with `python .py --args`.
- Examples should be kept simple and serve as **an example** on how to use Diffusers for training. The purpose of example scripts is **not** to create state-of-the-art diffusion models, but rather to reproduce known training schemes without adding too much custom logic. As a byproduct of this point, our examples also strive to serve as good educational materials.

To contribute an example, it is highly recommended to look at already existing examples such as [dreambooth](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth.py) to get an idea of what they should look like. We strongly advise contributors to make use of the [Accelerate library](https://github.com/huggingface/accelerate) as it's tightly integrated with Diffusers. Once an example script works, please make sure to add a comprehensive `README.md` that states exactly how to use the example. This README should include:
- An example command on how to run the example script as shown [here](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth#running-locally-with-pytorch).
- A link to some training results (logs, models, etc.) that show what the user can expect as shown [here](https://api.wandb.ai/report/patrickvonplaten/xm6cd5q5).
- If you are adding a non-official/research training example, **please don't forget** to add a sentence stating that you are maintaining this training example, including your GitHub handle, as shown [here](https://github.com/huggingface/diffusers/tree/main/examples/research_projects/intel_opts#diffusers-examples-with-intel-optimizations).
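
As a rough illustration of the single-file, command-line philosophy described above, a training example could be laid out along the following lines. This is a minimal sketch and not an actual Diffusers example: the file name `train_toy.py`, the argument names, and the placeholder loop are all hypothetical, and a real script would build its model, optimizer, and dataloader (ideally with Accelerate) where the comments indicate.

```python
# train_toy.py -- hypothetical single-file layout for a training example.
import argparse


def parse_args(argv=None):
    """Collect every option on the command line so the script runs standalone."""
    parser = argparse.ArgumentParser(description="Toy single-file training example.")
    parser.add_argument("--output_dir", type=str, default="toy-model")
    parser.add_argument("--learning_rate", type=float, default=1e-4)
    parser.add_argument("--max_train_steps", type=int, default=3)
    return parser.parse_args(argv)


def train(args):
    """Placeholder loop; a real example would create the model, optimizer,
    and dataloader here and run the actual forward/backward passes."""
    losses = []
    for step in range(args.max_train_steps):
        loss = 1.0 / (step + 1)  # stand-in value, not a real training loss
        losses.append(loss)
    return losses


def main(argv=None):
    args = parse_args(argv)
    losses = train(args)
    print(f"finished {len(losses)} steps, would save to {args.output_dir}")
    return losses


if __name__ == "__main__":
    main()
```

Keeping argument parsing, the training loop, and the entry point in one file means a reader can follow the whole example top to bottom without jumping between modules.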
If you are contributing to the official training examples, please also make sure to add a test to its folder such as [examples/dreambooth/test_dreambooth.py](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/test_dreambooth.py). This is not necessary for non-official training examples. ### 8. Fixing a "Good second issue" *Good second issues* are marked by the [Good second issue](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22Good+second+issue%22) label. Good second issues are usually more complicated to solve than [Good first issues](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22). The issue description usually gives less guidance on how to fix the issue and requires a decent understanding of the library by the interested contributor. If you are interested in tackling a good second issue, feel free to open a PR to fix it and link the PR to the issue. If you see that a PR has already been opened for this issue but did not get merged, have a look to understand why it wasn't merged and try to open an improved PR. Good second issues are usually more difficult to get merged compared to good first issues, so don't hesitate to ask for help from the core maintainers. If your PR is almost finished the core maintainers can also jump into your PR and commit to it in order to get it merged. ### 9. Adding pipelines, models, schedulers Pipelines, models, and schedulers are the most important pieces of the Diffusers library. They provide easy access to state-of-the-art diffusion technologies and thus allow the community to build powerful generative AI applications. By adding a new model, pipeline, or scheduler you might enable a new powerful use case for any of the user interfaces relying on Diffusers which can be of immense value for the whole generative AI ecosystem. 
Diffusers has a couple of open feature requests for all three components - feel free to look over them if you don't know yet what specific component you would like to add:
- [Model or pipeline](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+pipeline%2Fmodel%22)
- [Scheduler](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+scheduler%22)

Before adding any of the three components, it is strongly recommended that you give the [Philosophy guide](philosophy) a read to better understand the design of any of the three components. Please be aware that we cannot merge model, scheduler, or pipeline additions that strongly diverge from our design philosophy as it will lead to API inconsistencies. If you fundamentally disagree with a design choice, please open a [Feedback issue](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feedback.md&title=) instead so that it can be discussed whether a certain design pattern/design choice shall be changed everywhere in the library and whether we shall update our design philosophy. Consistency across the library is very important for us.

Please make sure to add links to the original codebase/paper to the PR and ideally also ping the original author directly on the PR so that they can follow the progress and potentially help with questions.

If you are unsure or stuck in the PR, don't hesitate to leave a message to ask for a first review or help.

#### Copied from mechanism

A unique and important feature to understand when adding any pipeline, model or scheduler code is the `# Copied from` mechanism. You'll see this all over the Diffusers codebase, and the reason we use it is to keep the codebase easy to understand and maintain. Marking code with the `# Copied from` mechanism forces the marked code to be identical to the code it was copied from.
This makes it easy to update and propagate changes across many files whenever you run `make fix-copies`.

For example, in the code example below, [`~diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is the original code and `AltDiffusionPipelineOutput` uses the `# Copied from` mechanism to copy it. The only difference is changing the class prefix from `Stable` to `Alt`.

```py
# Copied from diffusers.pipelines.stable_diffusion.pipeline_output.StableDiffusionPipelineOutput with Stable->Alt
class AltDiffusionPipelineOutput(BaseOutput):
    """
    Output class for Alt Diffusion pipelines.

    Args:
        images (`List[PIL.Image.Image]` or `np.ndarray`)
            List of denoised PIL images of length `batch_size` or NumPy array of shape `(batch_size, height, width, num_channels)`.
        nsfw_content_detected (`List[bool]`)
            List indicating whether the corresponding generated image contains "not-safe-for-work" (nsfw) content or `None` if safety checking could not be performed.
    """
```

To learn more, read this section of the [~Don't~ Repeat Yourself*](https://huggingface.co/blog/transformers-design-philosophy#4-machine-learning-models-are-static) blog post.

## How to write a good issue

**The better your issue is written, the higher the chances that it will be quickly resolved.**

1. Make sure that you've used the correct template for your issue. You can pick between *Bug Report*, *Feature Request*, *Feedback about API Design*, *New model/pipeline/scheduler addition*, *Forum*, or a blank issue. Make sure to pick the correct one when opening [a new issue](https://github.com/huggingface/diffusers/issues/new/choose).
2. **Be precise**: Give your issue a fitting title. Try to formulate your issue description as simply as possible. The more precise you are when submitting an issue, the less time it takes to understand the issue and potentially solve it. Make sure each issue you open covers one problem only and not several. If you found multiple problems, simply open multiple issues.
If your issue is a bug, try to be as precise as possible about what bug it is - you should not just write "Error in diffusers". 3. **Reproducibility**: No reproducible code snippet == no solution. If you encounter a bug, maintainers **have to be able to reproduce** it. Make sure that you include a code snippet that can be copy-pasted into a Python interpreter to reproduce the issue. Make sure that your code snippet works, *i.e.* that there are no missing imports or missing links to images, ... Your issue should contain an error message **and** a code snippet that can be copy-pasted without any changes to reproduce the exact same error message. If your issue is using local model weights or local data that cannot be accessed by the reader, the issue cannot be solved. If you cannot share your data or model, try to make a dummy model or dummy data. 4. **Minimalistic**: Try to help the reader as much as you can to understand the issue as quickly as possible by staying as concise as possible. Remove all code / all information that is irrelevant to the issue. If you have found a bug, try to create the easiest code example you can to demonstrate your issue, do not just dump your whole workflow into the issue as soon as you have found a bug. E.g., if you train a model and get an error at some point during the training, you should first try to understand what part of the training code is responsible for the error and try to reproduce it with a couple of lines. Try to use dummy data instead of full datasets. 5. Add links. If you are referring to a certain naming, method, or model make sure to provide a link so that the reader can better understand what you mean. If you are referring to a specific PR or issue, make sure to link it to your issue. Do not assume that the reader knows what you are talking about. The more links you add to your issue the better. 6. Formatting. 
Make sure to nicely format your issue by formatting code into Python code syntax, and error messages into normal code syntax. See the [official GitHub formatting docs](https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax) for more information. 7. Think of your issue not as a ticket to be solved, but rather as a beautiful entry to a well-written encyclopedia. Every added issue is a contribution to publicly available knowledge. By adding a nicely written issue you not only make it easier for maintainers to solve your issue, but you are helping the whole community to better understand a certain aspect of the library. ## How to write a good PR 1. Be a chameleon. Understand existing design patterns and syntax and make sure your code additions flow seamlessly into the existing code base. Pull requests that significantly diverge from existing design patterns or user interfaces will not be merged. 2. Be laser focused. A pull request should solve one problem and one problem only. Make sure to not fall into the trap of "also fixing another problem while we're adding it". It is much more difficult to review pull requests that solve multiple, unrelated problems at once. 3. If helpful, try to add a code snippet that displays an example of how your addition can be used. 4. The title of your pull request should be a summary of its contribution. 5. If your pull request addresses an issue, please mention the issue number in the pull request description to make sure they are linked (and people consulting the issue know you are working on it); 6. To indicate a work in progress please prefix the title with `[WIP]`. These are useful to avoid duplicated work, and to differentiate it from PRs ready to be merged; 7. Try to formulate and format your text as explained in [How to write a good issue](#how-to-write-a-good-issue). 8. Make sure existing tests pass; 9. Add high-coverage tests. 
No quality testing = no merge. - If you are adding new `@slow` tests, make sure they pass using `RUN_SLOW=1 python -m pytest tests/test_my_new_model.py`. CircleCI does not run the slow tests, but GitHub Actions does every night! 10. All public methods must have informative docstrings that work nicely with markdown. See [`pipeline_latent_diffusion.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/latent_diffusion/pipeline_latent_diffusion.py) for an example. 11. Due to the rapidly growing repository, it is important to make sure that no files that would significantly weigh down the repository are added. This includes images, videos, and other non-text files. We prefer to leverage a hf.co hosted `dataset` like [`hf-internal-testing`](https://huggingface.co/hf-internal-testing) or [huggingface/documentation-images](https://huggingface.co/datasets/huggingface/documentation-images) to place these files. If an external contribution, feel free to add the images to your PR and ask a Hugging Face member to migrate your images to this dataset. ## How to open a PR Before writing code, we strongly advise you to search through the existing PRs or issues to make sure that nobody is already working on the same thing. If you are unsure, it is always a good idea to open an issue to get some feedback. You will need basic `git` proficiency to be able to contribute to 🧨 Diffusers. `git` is not the easiest tool to use but it has the greatest manual. Type `git --help` in a shell and enjoy. If you prefer books, [Pro Git](https://git-scm.com/book/en/v2) is a very good reference. Follow these steps to start contributing ([supported Python versions](https://github.com/huggingface/diffusers/blob/83bc6c94eaeb6f7704a2a428931cf2d9ad973ae9/setup.py#L270)): 1. Fork the [repository](https://github.com/huggingface/diffusers) by clicking on the 'Fork' button on the repository's page. This creates a copy of the code under your GitHub user account. 2. 
   Clone your fork to your local disk, and add the base repository as a remote:

   ```bash
   $ git clone git@github.com:/diffusers.git
   $ cd diffusers
   $ git remote add upstream https://github.com/huggingface/diffusers.git
   ```

3. Create a new branch to hold your development changes:

   ```bash
   $ git checkout -b a-descriptive-name-for-my-changes
   ```

   **Do not** work on the `main` branch.

4. Set up a development environment by running the following command in a virtual environment:

   ```bash
   $ pip install -e ".[dev]"
   ```

   If you have already cloned the repo, you might need to `git pull` to get the most recent changes in the library.

5. Develop the features on your branch.

   As you work on the features, you should make sure that the test suite passes. You should run the tests impacted by your changes like this:

   ```bash
   $ pytest tests/.py
   ```

   Before you run the tests, please make sure you install the dependencies required for testing. You can do so with this command:

   ```bash
   $ pip install -e ".[test]"
   ```

   You can also run the full test suite with the following command, but it takes a beefy machine to produce a result in a decent amount of time now that Diffusers has grown a lot. Here is the command for it:

   ```bash
   $ make test
   ```

   🧨 Diffusers relies on `black` and `isort` to format its source code consistently. After you make changes, apply automatic style corrections and code verifications that can't be automated in one go with:

   ```bash
   $ make style
   ```

   🧨 Diffusers also uses `ruff` and a few custom scripts to check for coding mistakes. Quality control runs in CI; however, you can also run the same checks with:

   ```bash
   $ make quality
   ```

   Once you're happy with your changes, add changed files using `git add` and make a commit with `git commit` to record your changes locally:

   ```bash
   $ git add modified_file.py
   $ git commit -m "A descriptive message about your changes."
   ```

   It is a good idea to sync your copy of the code with the original repository regularly.
This way you can quickly account for changes: ```bash $ git pull upstream main ``` Push the changes to your account using: ```bash $ git push -u origin a-descriptive-name-for-my-changes ``` 6. Once you are satisfied, go to the webpage of your fork on GitHub. Click on 'Pull request' to send your changes to the project maintainers for review. 7. It's OK if maintainers ask you for changes. It happens to core contributors too! So everyone can see the changes in the Pull request, work in your local branch and push the changes to your fork. They will automatically appear in the pull request. ### Tests An extensive test suite is included to test the library behavior and several examples. Library tests can be found in the [tests folder](https://github.com/huggingface/diffusers/tree/main/tests). We like `pytest` and `pytest-xdist` because it's faster. From the root of the repository, here's how to run tests with `pytest` for the library: ```bash $ python -m pytest -n auto --dist=loadfile -s -v ./tests/ ``` In fact, that's how `make test` is implemented! You can specify a smaller set of tests in order to test only the feature you're working on. By default, slow tests are skipped. Set the `RUN_SLOW` environment variable to `yes` to run them. This will download many gigabytes of models — make sure you have enough disk space and a good Internet connection, or a lot of patience! ```bash $ RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./tests/ ``` `unittest` is fully supported, here's how to run tests with it: ```bash $ python -m unittest discover -s tests -t . -v $ python -m unittest discover -s examples -t examples -v ``` ### Syncing forked main with upstream (HuggingFace) main To avoid pinging the upstream repository which adds reference notes to each upstream PR and sends unnecessary notifications to the developers involved in these PRs, when syncing the main branch of a forked repository, please, follow these steps: 1. 
When possible, avoid syncing with the upstream using a branch and PR on the forked repository. Instead, merge directly into the forked main. 2. If a PR is absolutely necessary, use the following steps after checking out your branch: ```bash $ git checkout -b your-branch-for-syncing $ git pull --squash --no-commit upstream main $ git commit -m '' $ git push --set-upstream origin your-branch-for-syncing ``` ### Style guide For documentation strings, 🧨 Diffusers follows the [Google style](https://google.github.io/styleguide/pyguide.html). ## Coding with AI agents The repository keeps AI-agent configuration in `.ai/` and exposes local agent files via symlinks. - **Source of truth** — edit files under `.ai/` (`AGENTS.md` for coding guidelines, `skills/` for on-demand task knowledge) - **Don't edit** generated root-level `AGENTS.md`, `CLAUDE.md`, or `.agents/skills`/`.claude/skills` — they are symlinks - Setup commands: - `make codex` — symlink guidelines + skills for OpenAI Codex - `make claude` — symlink guidelines + skills for Claude Code - `make clean-ai` — remove all generated symlinks ================================================ FILE: docs/source/en/conceptual/ethical_guidelines.md ================================================ # 🧨 Diffusers’ Ethical Guidelines ## Preamble [Diffusers](https://huggingface.co/docs/diffusers/index) provides pre-trained diffusion models and serves as a modular toolbox for inference and training. Given its real case applications in the world and potential negative impacts on society, we think it is important to provide the project with ethical guidelines to guide the development, users’ contributions, and usage of the Diffusers library. 
The risks associated with using this technology are still being examined, but to name a few: copyright issues for artists; deep-fake exploitation; sexual content generation in inappropriate contexts; non-consensual impersonation; harmful social biases perpetuating the oppression of marginalized groups.
We will keep tracking risks and adapt the following guidelines based on the community's responsiveness and valuable feedback.

## Scope

The Diffusers community will apply the following ethical guidelines to the project’s development and help coordinate how the community will integrate the contributions, especially concerning sensitive topics related to ethical concerns.

## Ethical guidelines

The following ethical guidelines apply generally, but we will primarily implement them when dealing with ethically sensitive issues while making a technical choice. Furthermore, we commit to adapting those ethical principles over time following emerging harms related to the state of the art of the technology in question.

- **Transparency**: we are committed to being transparent in managing PRs, explaining our choices to users, and making technical decisions.
- **Consistency**: we are committed to guaranteeing our users the same level of attention in project management, keeping it technically stable and consistent.
- **Simplicity**: with a desire to make it easy to use and exploit the Diffusers library, we are committed to keeping the project’s goals lean and coherent.
- **Accessibility**: the Diffusers project helps lower the entry bar for contributors who can help run it even without technical expertise. Doing so makes research artifacts more accessible to the community.
- **Reproducibility**: we aim to be transparent about the reproducibility of upstream code, models, and datasets when made available through the Diffusers library.
- **Responsibility**: as a community and through teamwork, we hold a collective responsibility to our users by anticipating and mitigating this technology's potential risks and dangers.

## Examples of implementations: Safety features and Mechanisms

The team works daily to make the technical and non-technical tools available to deal with the potential ethical and social risks associated with diffusion technology. Moreover, the community's input is invaluable in ensuring these features' implementation and raising awareness with us.

- [**Community tab**](https://huggingface.co/docs/hub/repositories-pull-requests-discussions): it enables the community to discuss and better collaborate on a project.
- **Bias exploration and evaluation**: the Hugging Face team provides a [space](https://huggingface.co/spaces/society-ethics/DiffusionBiasExplorer) to demonstrate the biases in Stable Diffusion interactively. In this sense, we support and encourage bias explorers and evaluations.
- **Encouraging safety in deployment**
  - [**Safe Stable Diffusion**](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_safe): It mitigates the well-known issue that models, like Stable Diffusion, that are trained on unfiltered, web-crawled datasets tend to suffer from inappropriate degeneration. Related paper: [Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models](https://huggingface.co/papers/2211.05105).
  - [**Safety Checker**](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/safety_checker.py): It checks and compares the class probability of a set of hard-coded harmful concepts in the embedding space against an image after it has been generated. The harmful concepts are intentionally hidden to prevent reverse engineering of the checker.
- **Staged release on the Hub**: in particularly sensitive situations, access to some repositories should be restricted.
  This staged release is an intermediary step that allows the repository’s authors to have more control over its use.
- **Licensing**: [OpenRAILs](https://huggingface.co/blog/open_rail), a new type of licensing, allow us to ensure free access while having a set of restrictions that ensure more responsible use.

================================================
FILE: docs/source/en/conceptual/evaluation.md
================================================
# Evaluating Diffusion Models

> [!TIP]
> This document has now grown outdated given the emergence of existing evaluation frameworks for diffusion models for image generation. Please check
> out works like [HEIM](https://crfm.stanford.edu/helm/heim/latest/), [T2I-Compbench](https://huggingface.co/papers/2307.06350),
> [GenEval](https://huggingface.co/papers/2310.11513).

Evaluation of generative models like [Stable Diffusion](https://huggingface.co/docs/diffusers/stable_diffusion) is subjective in nature. But as practitioners and researchers, we often have to make careful choices amongst many different possibilities. So, when working with different generative models (like GANs, Diffusion, etc.), how do we choose one over the other?

Qualitative evaluation of such models can be error-prone and might incorrectly influence a decision.
However, quantitative metrics don't necessarily correspond to image quality. So, usually, a combination
of both qualitative and quantitative evaluations provides a stronger signal when choosing one model
over the other.

In this document, we provide a non-exhaustive overview of qualitative and quantitative methods to evaluate Diffusion models. For quantitative methods, we specifically focus on how to implement them alongside `diffusers`.

The methods shown in this document can also be used to evaluate different [noise schedulers](https://huggingface.co/docs/diffusers/main/en/api/schedulers/overview) keeping the underlying generation model fixed.
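As a rough illustration of comparing schedulers while the model stays fixed, the loop below swaps only the scheduler on one pipeline object and scores each variant with the same metric. It is a sketch under stated assumptions: `evaluate_schedulers`, `generate`, and `score` are hypothetical names introduced here, not part of Diffusers. The only library behavior it relies on is that a scheduler class can be rebuilt from an existing scheduler's config via `from_config`.

```python
from typing import Callable, Dict, Iterable, Sequence


def evaluate_schedulers(
    pipeline,
    scheduler_classes: Iterable[type],
    prompts: Sequence[str],
    generate: Callable,
    score: Callable,
) -> Dict[str, float]:
    """Score one pipeline under several schedulers, changing nothing but the scheduler.

    `generate(pipeline, prompts)` and `score(images, prompts)` are hypothetical
    callables supplied by the caller; they are not part of Diffusers.
    """
    # Keep the starting config so every candidate is built from the same settings.
    base_config = pipeline.scheduler.config
    results = {}
    for scheduler_cls in scheduler_classes:
        # Swap only the scheduler; the model weights are left untouched.
        pipeline.scheduler = scheduler_cls.from_config(base_config)
        images = generate(pipeline, prompts)
        results[scheduler_cls.__name__] = score(images, prompts)
    return results
```

With Diffusers installed, `scheduler_classes` could be something like `[EulerDiscreteScheduler, DPMSolverMultistepScheduler]`, and `generate` would typically create a freshly seeded generator on each call so that every scheduler starts from the same noise.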
## Scenarios We cover Diffusion models with the following pipelines: - Text-guided image generation (such as the [`StableDiffusionPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img)). - Text-guided image generation, additionally conditioned on an input image (such as the [`StableDiffusionImg2ImgPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/img2img) and [`StableDiffusionInstructPix2PixPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/pix2pix)). - Class-conditioned image generation models (such as the [`DiTPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/dit)). ## Qualitative Evaluation Qualitative evaluation typically involves human assessment of generated images. Quality is measured across aspects such as compositionality, image-text alignment, and spatial relations. Common prompts provide a degree of uniformity for subjective metrics. DrawBench and PartiPrompts are prompt datasets used for qualitative benchmarking. DrawBench and PartiPrompts were introduced by [Imagen](https://imagen.research.google/) and [Parti](https://parti.research.google/) respectively. From the [official Parti website](https://parti.research.google/): > PartiPrompts (P2) is a rich set of over 1600 prompts in English that we release as part of this work. P2 can be used to measure model capabilities across various categories and challenge aspects. ![parti-prompts](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/evaluation_diffusion_models/parti-prompts.png) PartiPrompts has the following columns: - Prompt - Category of the prompt (such as “Abstract”, “World Knowledge”, etc.) - Challenge reflecting the difficulty (such as “Basic”, “Complex”, “Writing & Symbols”, etc.) These benchmarks allow for side-by-side human evaluation of different image generation models. 
For this, the 🧨 Diffusers team has built **Open Parti Prompts**, which is a community-driven qualitative benchmark based on Parti Prompts to compare state-of-the-art open-source diffusion models: - [Open Parti Prompts Game](https://huggingface.co/spaces/OpenGenAI/open-parti-prompts): For 10 parti prompts, 4 generated images are shown and the user selects the image that suits the prompt best. - [Open Parti Prompts Leaderboard](https://huggingface.co/spaces/OpenGenAI/parti-prompts-leaderboard): The leaderboard comparing the currently best open-sourced diffusion models to each other. To manually compare images, let’s see how we can use `diffusers` on a couple of PartiPrompts. Below we show some prompts sampled across different challenges: Basic, Complex, Linguistic Structures, Imagination, and Writing & Symbols. Here we are using PartiPrompts as a [dataset](https://huggingface.co/datasets/nateraw/parti-prompts). ```python from datasets import load_dataset # prompts = load_dataset("nateraw/parti-prompts", split="train") # prompts = prompts.shuffle() # sample_prompts = [prompts[i]["Prompt"] for i in range(5)] # Fixing these sample prompts in the interest of reproducibility. sample_prompts = [ "a corgi", "a hot air balloon with a yin-yang symbol, with the moon visible in the daytime sky", "a car with no windows", "a cube made of porcupine", 'The saying "BE EXCELLENT TO EACH OTHER" written on a red brick wall with a graffiti image of a green alien wearing a tuxedo. 
A yellow fire hydrant is on a sidewalk in the foreground.', ] ``` Now we can use these prompts to generate some images using Stable Diffusion ([v1-4 checkpoint](https://huggingface.co/CompVis/stable-diffusion-v1-4)): ```python import torch seed = 0 generator = torch.manual_seed(seed) images = sd_pipeline(sample_prompts, num_images_per_prompt=1, generator=generator).images ``` ![parti-prompts-14](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/evaluation_diffusion_models/parti-prompts-14.png) We can also set `num_images_per_prompt` accordingly to compare different images for the same prompt. Running the same pipeline but with a different checkpoint ([v1-5](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5)), yields: ![parti-prompts-15](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/evaluation_diffusion_models/parti-prompts-15.png) Once several images are generated from all the prompts using multiple models (under evaluation), these results are presented to human evaluators for scoring. For more details on the DrawBench and PartiPrompts benchmarks, refer to their respective papers. > [!TIP] > It is useful to look at some inference samples while a model is training to measure the > training progress. In our [training scripts](https://github.com/huggingface/diffusers/tree/main/examples/), we support this utility with additional support for > logging to TensorBoard and Weights & Biases. ## Quantitative Evaluation In this section, we will walk you through how to evaluate three different diffusion pipelines using: - CLIP score - CLIP directional similarity - FID ### Text-guided image generation [CLIP score](https://huggingface.co/papers/2104.08718) measures the compatibility of image-caption pairs. Higher CLIP scores imply higher compatibility 🔼. The CLIP score is a quantitative measurement of the qualitative concept "compatibility". 
Image-caption pair compatibility can also be thought of as the semantic similarity between the image and the caption. CLIP score was found to have high correlation with human judgement.

Let's first load a [`StableDiffusionPipeline`]:

```python
from diffusers import StableDiffusionPipeline
import torch

model_ckpt = "CompVis/stable-diffusion-v1-4"
sd_pipeline = StableDiffusionPipeline.from_pretrained(model_ckpt, torch_dtype=torch.float16).to("cuda")
```

Generate some images with multiple prompts:

```python
prompts = [
    "a photo of an astronaut riding a horse on mars",
    "A high tech solarpunk utopia in the Amazon rainforest",
    "A pikachu fine dining with a view to the Eiffel Tower",
    "A mecha robot in a favela in expressionist style",
    "an insect robot preparing a delicious meal",
    "A small cabin on top of a snowy mountain in the style of Disney, artstation",
]

images = sd_pipeline(prompts, num_images_per_prompt=1, output_type="np").images

print(images.shape)
# (6, 512, 512, 3)
```

And then, we calculate the CLIP score.

```python
from torchmetrics.functional.multimodal import clip_score
from functools import partial

clip_score_fn = partial(clip_score, model_name_or_path="openai/clip-vit-base-patch16")

def calculate_clip_score(images, prompts):
    images_int = (images * 255).astype("uint8")
    clip_score = clip_score_fn(torch.from_numpy(images_int).permute(0, 3, 1, 2), prompts).detach()
    return round(float(clip_score), 4)

sd_clip_score = calculate_clip_score(images, prompts)
print(f"CLIP score: {sd_clip_score}")
# CLIP score: 35.7038
```

In the above example, we generated one image per prompt. If we generated multiple images per prompt, we would have to take the average score from the generated images per prompt.

Now, if we wanted to compare two checkpoints compatible with the [`StableDiffusionPipeline`] we should pass a generator while calling the pipeline.
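If several images are generated per prompt when making such a comparison, the per-prompt averaging mentioned above can be sketched as follows. The helper names and the numbers are hypothetical and only illustrate the bookkeeping. The sketch assumes a score has already been computed for each image individually, and taking the mean of the per-prompt means as the overall score is an assumption of this sketch rather than something the metric prescribes.

```python
from statistics import mean
from typing import Dict, List


def average_scores_per_prompt(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Collapse the scores of all images generated for a prompt into one mean per prompt."""
    return {prompt: mean(values) for prompt, values in scores.items()}


def overall_score(scores: Dict[str, List[float]]) -> float:
    """Mean of the per-prompt means, so each prompt counts equally regardless of image count."""
    return mean(average_scores_per_prompt(scores).values())


# Hypothetical per-image scores: two images for the first prompt, three for the second.
example_scores = {
    "a corgi": [30.0, 34.0],
    "a car with no windows": [20.0, 22.0, 24.0],
}
print(average_scores_per_prompt(example_scores))
print(overall_score(example_scores))
```

Averaging within each prompt first keeps a prompt with many generated images from dominating the final number.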
First, we generate images with a fixed seed with the [v1-4 Stable Diffusion checkpoint](https://huggingface.co/CompVis/stable-diffusion-v1-4):

```python
seed = 0
generator = torch.manual_seed(seed)

images = sd_pipeline(prompts, num_images_per_prompt=1, generator=generator, output_type="np").images
```

Then we load the [v1-5 checkpoint](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) to generate images:

```python
model_ckpt_1_5 = "stable-diffusion-v1-5/stable-diffusion-v1-5"
sd_pipeline_1_5 = StableDiffusionPipeline.from_pretrained(model_ckpt_1_5, torch_dtype=torch.float16).to("cuda")

images_1_5 = sd_pipeline_1_5(prompts, num_images_per_prompt=1, generator=generator, output_type="np").images
```

And finally, we compare their CLIP scores:

```python
sd_clip_score_1_4 = calculate_clip_score(images, prompts)
print(f"CLIP Score with v-1-4: {sd_clip_score_1_4}")
# CLIP Score with v-1-4: 34.9102

sd_clip_score_1_5 = calculate_clip_score(images_1_5, prompts)
print(f"CLIP Score with v-1-5: {sd_clip_score_1_5}")
# CLIP Score with v-1-5: 36.2137
```

It seems like the [v1-5](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) checkpoint performs better than its predecessor. Note, however, that the number of prompts we used to compute the CLIP scores is quite low. For a more practical evaluation, this number should be much higher, and the prompts should be diverse.

> [!WARNING]
> By construction, there are some limitations in this score. The captions in the training dataset
> were crawled from the web and extracted from `alt` and similar tags associated with an image on the internet.
> They are not necessarily representative of what a human being would use to describe an image. Hence we
> had to "engineer" some prompts here.

### Image-conditioned text-to-image generation

In this case, we condition the generation pipeline with an input image as well as a text prompt. Let's take the [`StableDiffusionInstructPix2PixPipeline`] as an example.
It takes an edit instruction as an input prompt and an input image to be edited. Here is one example: ![edit-instruction](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/evaluation_diffusion_models/edit-instruction.png) One strategy to evaluate such a model is to measure the consistency of the change between the two images (in [CLIP](https://huggingface.co/docs/transformers/model_doc/clip) space) with the change between the two image captions (as shown in [CLIP-Guided Domain Adaptation of Image Generators](https://huggingface.co/papers/2108.00946)). This is referred to as the "**CLIP directional similarity**". - Caption 1 corresponds to the input image (image 1) that is to be edited. - Caption 2 corresponds to the edited image (image 2). It should reflect the edit instruction. Following is a pictorial overview: ![edit-consistency](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/evaluation_diffusion_models/edit-consistency.png) We have prepared a mini dataset to implement this metric. Let's first load the dataset. ```python from datasets import load_dataset dataset = load_dataset("sayakpaul/instructpix2pix-demo", split="train") dataset.features ``` ```bash {'input': Value(dtype='string', id=None), 'edit': Value(dtype='string', id=None), 'output': Value(dtype='string', id=None), 'image': Image(decode=True, id=None)} ``` Here we have: - `input` is a caption corresponding to the `image`. - `edit` denotes the edit instruction. - `output` denotes the modified caption reflecting the `edit` instruction. Let's take a look at a sample. ```python idx = 0 print(f"Original caption: {dataset[idx]['input']}") print(f"Edit instruction: {dataset[idx]['edit']}") print(f"Modified caption: {dataset[idx]['output']}") ``` ```bash Original caption: 2. FAROE ISLANDS: An archipelago of 18 mountainous isles in the North Atlantic Ocean between Norway and Iceland, the Faroe Islands has 'everything you could hope for', according to Big 7 Travel. 
It boasts 'crystal clear waterfalls, rocky cliffs that seem to jut out of nowhere and velvety green hills' Edit instruction: make the isles all white marble Modified caption: 2. WHITE MARBLE ISLANDS: An archipelago of 18 mountainous white marble isles in the North Atlantic Ocean between Norway and Iceland, the White Marble Islands has 'everything you could hope for', according to Big 7 Travel. It boasts 'crystal clear waterfalls, rocky cliffs that seem to jut out of nowhere and velvety green hills' ``` And here is the image: ```python dataset[idx]["image"] ``` ![edit-dataset](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/evaluation_diffusion_models/edit-dataset.png) We will first edit the images of our dataset with the edit instruction and compute the directional similarity. Let's first load the [`StableDiffusionInstructPix2PixPipeline`]: ```python from diffusers import StableDiffusionInstructPix2PixPipeline instruct_pix2pix_pipeline = StableDiffusionInstructPix2PixPipeline.from_pretrained( "timbrooks/instruct-pix2pix", torch_dtype=torch.float16 ).to("cuda") ``` Now, we perform the edits: ```python import numpy as np def edit_image(input_image, instruction): image = instruct_pix2pix_pipeline( instruction, image=input_image, output_type="np", generator=generator, ).images[0] return image input_images = [] original_captions = [] modified_captions = [] edited_images = [] for idx in range(len(dataset)): input_image = dataset[idx]["image"] edit_instruction = dataset[idx]["edit"] edited_image = edit_image(input_image, edit_instruction) input_images.append(np.array(input_image)) original_captions.append(dataset[idx]["input"]) modified_captions.append(dataset[idx]["output"]) edited_images.append(edited_image) ``` To measure the directional similarity, we first load CLIP's image and text encoders: ```python from transformers import ( CLIPTokenizer, CLIPTextModelWithProjection, CLIPVisionModelWithProjection, CLIPImageProcessor, ) clip_id = 
"openai/clip-vit-large-patch14" tokenizer = CLIPTokenizer.from_pretrained(clip_id) text_encoder = CLIPTextModelWithProjection.from_pretrained(clip_id).to("cuda") image_processor = CLIPImageProcessor.from_pretrained(clip_id) image_encoder = CLIPVisionModelWithProjection.from_pretrained(clip_id).to("cuda") ``` Notice that we are using a particular CLIP checkpoint, i.e., `openai/clip-vit-large-patch14`. This is because the Stable Diffusion pre-training was performed with this CLIP variant. For more details, refer to the [documentation](https://huggingface.co/docs/transformers/model_doc/clip). Next, we prepare a PyTorch `nn.Module` to compute directional similarity: ```python import torch.nn as nn import torch.nn.functional as F class DirectionalSimilarity(nn.Module): def __init__(self, tokenizer, text_encoder, image_processor, image_encoder): super().__init__() self.tokenizer = tokenizer self.text_encoder = text_encoder self.image_processor = image_processor self.image_encoder = image_encoder def preprocess_image(self, image): image = self.image_processor(image, return_tensors="pt")["pixel_values"] return {"pixel_values": image.to("cuda")} def tokenize_text(self, text): inputs = self.tokenizer( text, max_length=self.tokenizer.model_max_length, padding="max_length", truncation=True, return_tensors="pt", ) return {"input_ids": inputs.input_ids.to("cuda")} def encode_image(self, image): preprocessed_image = self.preprocess_image(image) image_features = self.image_encoder(**preprocessed_image).image_embeds image_features = image_features / image_features.norm(dim=1, keepdim=True) return image_features def encode_text(self, text): tokenized_text = self.tokenize_text(text) text_features = self.text_encoder(**tokenized_text).text_embeds text_features = text_features / text_features.norm(dim=1, keepdim=True) return text_features def compute_directional_similarity(self, img_feat_one, img_feat_two, text_feat_one, text_feat_two): sim_direction = F.cosine_similarity(img_feat_two 
- img_feat_one, text_feat_two - text_feat_one) return sim_direction def forward(self, image_one, image_two, caption_one, caption_two): img_feat_one = self.encode_image(image_one) img_feat_two = self.encode_image(image_two) text_feat_one = self.encode_text(caption_one) text_feat_two = self.encode_text(caption_two) directional_similarity = self.compute_directional_similarity( img_feat_one, img_feat_two, text_feat_one, text_feat_two ) return directional_similarity ``` Let's put `DirectionalSimilarity` to use now. ```python dir_similarity = DirectionalSimilarity(tokenizer, text_encoder, image_processor, image_encoder) scores = [] for i in range(len(input_images)): original_image = input_images[i] original_caption = original_captions[i] edited_image = edited_images[i] modified_caption = modified_captions[i] similarity_score = dir_similarity(original_image, edited_image, original_caption, modified_caption) scores.append(float(similarity_score.detach().cpu())) print(f"CLIP directional similarity: {np.mean(scores)}") # CLIP directional similarity: 0.0797976553440094 ``` Like the CLIP Score, the higher the CLIP directional similarity, the better it is. It should be noted that the `StableDiffusionInstructPix2PixPipeline` exposes two arguments, namely, `image_guidance_scale` and `guidance_scale` that let you control the quality of the final edited image. We encourage you to experiment with these two arguments and see the impact of that on the directional similarity. We can extend the idea of this metric to measure how similar the original image and edited version are. To do that, we can just do `F.cosine_similarity(img_feat_two, img_feat_one)`. For these kinds of edits, we would still want the primary semantics of the images to be preserved as much as possible, i.e., a high similarity score. 
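As a minimal sketch of the image-to-image similarity described above, the snippet below computes a cosine similarity on plain Python lists that stand in for CLIP image embeddings. The vectors and the `cosine_similarity` helper are hypothetical illustrations, not outputs of a real encoder; in practice you would reuse the `img_feat_one` and `img_feat_two` tensors produced by `DirectionalSimilarity.encode_image`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-d stand-ins for the embeddings of the original and edited image.
img_feat_one = [1.0, 0.0, 0.0]
img_feat_two = [0.8, 0.6, 0.0]

image_similarity = cosine_similarity(img_feat_two, img_feat_one)
print(f"Image similarity: {image_similarity:.2f}")
```

A value close to 1 suggests that the edit preserved most of the image semantics, while a value close to 0 suggests that the edited image drifted far from the original.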
We can use these metrics for similar pipelines such as the [`StableDiffusionPix2PixZeroPipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/pix2pix_zero#diffusers.StableDiffusionPix2PixZeroPipeline).

> [!TIP]
> Both CLIP score and CLIP directional similarity rely on the CLIP model, which can make the evaluations biased.

***Extending metrics like IS, FID (discussed later), or KID can be difficult*** when the model under evaluation was pre-trained on a large image-captioning dataset (such as the [LAION-5B dataset](https://laion.ai/blog/laion-5b/)). This is because underlying these metrics is an InceptionNet (pre-trained on the ImageNet-1k dataset) used for extracting intermediate image features. The pre-training dataset of Stable Diffusion may have limited overlap with the pre-training dataset of InceptionNet, so it is not a good candidate here for feature extraction.

***Using the above metrics helps evaluate models that are class-conditioned, for example [DiT](https://huggingface.co/docs/diffusers/main/en/api/pipelines/dit), which was pre-trained conditioned on the ImageNet-1k classes.***

### Class-conditioned image generation

Class-conditioned generative models are usually pre-trained on a class-labeled dataset such as [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k). Popular metrics for evaluating these models include Fréchet Inception Distance (FID), Kernel Inception Distance (KID), and Inception Score (IS). In this document, we focus on FID ([Heusel et al.](https://huggingface.co/papers/1706.08500)). We show how to compute it with the [`DiTPipeline`](https://huggingface.co/docs/diffusers/api/pipelines/dit), which uses the [DiT model](https://huggingface.co/papers/2212.09748) under the hood.

FID aims to measure how similar two datasets of images are. As per [this resource](https://mmgeneration.readthedocs.io/en/latest/quick_run.html#fid):

> Fréchet Inception Distance is a measure of similarity between two datasets of images.
It was shown to correlate well with the human judgment of visual quality and is most often used to evaluate the quality of samples of Generative Adversarial Networks. FID is calculated by computing the Fréchet distance between two Gaussians fitted to feature representations of the Inception network. These two datasets are essentially the dataset of real images and the dataset of fake images (generated images in our case). FID is usually calculated with two large datasets. However, for this document, we will work with two mini datasets. Let's first download a few images from the ImageNet-1k training set: ```python from zipfile import ZipFile import requests def download(url, local_filepath): r = requests.get(url) with open(local_filepath, "wb") as f: f.write(r.content) return local_filepath dummy_dataset_url = "https://hf.co/datasets/sayakpaul/sample-datasets/resolve/main/sample-imagenet-images.zip" local_filepath = download(dummy_dataset_url, dummy_dataset_url.split("/")[-1]) with ZipFile(local_filepath, "r") as zipper: zipper.extractall(".") ``` ```python from PIL import Image import os import numpy as np dataset_path = "sample-imagenet-images" image_paths = sorted([os.path.join(dataset_path, x) for x in os.listdir(dataset_path)]) real_images = [np.array(Image.open(path).convert("RGB")) for path in image_paths] ``` These are 10 images from the following ImageNet-1k classes: "cassette_player", "chain_saw" (x2), "church", "gas_pump" (x3), "parachute" (x2), and "tench".

*Figure: Real images.*

Now that the images are loaded, let's apply some lightweight pre-processing on them to use them for FID calculation. ```python from torchvision.transforms import functional as F import torch def preprocess_image(image): image = torch.tensor(image).unsqueeze(0) image = image.permute(0, 3, 1, 2) / 255.0 return F.center_crop(image, (256, 256)) real_images = torch.cat([preprocess_image(image) for image in real_images]) print(real_images.shape) # torch.Size([10, 3, 256, 256]) ``` We now load the [`DiTPipeline`](https://huggingface.co/docs/diffusers/api/pipelines/dit) to generate images conditioned on the above-mentioned classes. ```python from diffusers import DiTPipeline, DPMSolverMultistepScheduler dit_pipeline = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256", torch_dtype=torch.float16) dit_pipeline.scheduler = DPMSolverMultistepScheduler.from_config(dit_pipeline.scheduler.config) dit_pipeline = dit_pipeline.to("cuda") seed = 0 generator = torch.manual_seed(seed) words = [ "cassette player", "chainsaw", "chainsaw", "church", "gas pump", "gas pump", "gas pump", "parachute", "parachute", "tench", ] class_ids = dit_pipeline.get_label_ids(words) output = dit_pipeline(class_labels=class_ids, generator=generator, output_type="np") fake_images = output.images fake_images = torch.tensor(fake_images) fake_images = fake_images.permute(0, 3, 1, 2) print(fake_images.shape) # torch.Size([10, 3, 256, 256]) ``` Now, we can compute the FID using [`torchmetrics`](https://torchmetrics.readthedocs.io/). ```python from torchmetrics.image.fid import FrechetInceptionDistance fid = FrechetInceptionDistance(normalize=True) fid.update(real_images, real=True) fid.update(fake_images, real=False) print(f"FID: {float(fid.compute())}") # FID: 177.7147216796875 ``` The lower the FID, the better it is. 
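To make the "Fréchet distance between two Gaussians" more concrete, here is a simplified sketch. It assumes diagonal covariances, in which case the trace term reduces to a per-dimension sum; the real FID uses full covariance matrices of Inception features and a matrix square root, which `torchmetrics` handles for you. The `frechet_distance_diagonal` helper and the toy features are hypothetical.

```python
import math
from statistics import fmean, pvariance


def frechet_distance_diagonal(feats_a, feats_b):
    """Squared Fréchet distance between Gaussians fitted per dimension.

    Simplifying assumption: feature dimensions are uncorrelated (diagonal
    covariance), so Tr(S1 + S2 - 2 * sqrt(S1 @ S2)) becomes sum((s1 - s2)^2),
    where s1 and s2 are the per-dimension standard deviations.
    """
    distance = 0.0
    # zip(*feats) transposes a list of feature vectors into per-dimension tuples.
    for dim_a, dim_b in zip(zip(*feats_a), zip(*feats_b)):
        mu_a, mu_b = fmean(dim_a), fmean(dim_b)
        std_a = math.sqrt(pvariance(dim_a))
        std_b = math.sqrt(pvariance(dim_b))
        distance += (mu_a - mu_b) ** 2 + (std_a - std_b) ** 2
    return distance


# Toy 2-d "features" for real and generated images (made-up numbers).
real_feats = [[0.0, 1.0], [2.0, 3.0]]
fake_feats = [[1.0, 1.0], [3.0, 3.0]]

same_set_distance = frechet_distance_diagonal(real_feats, real_feats)
shifted_distance = frechet_distance_diagonal(real_feats, fake_feats)
print(same_set_distance, shifted_distance)
```

Identical feature sets give a distance of zero, and the distance grows as the means or spreads of the two sets move apart, which is why a lower FID indicates generated images that are statistically closer to the real ones.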
Several things can influence FID here: - Number of images (both real and fake) - Randomness induced in the diffusion process - Number of inference steps in the diffusion process - The scheduler being used in the diffusion process For the last two points, it is, therefore, a good practice to run the evaluation across different seeds and inference steps, and then report an average result. > [!WARNING] > FID results tend to be fragile as they depend on a lot of factors: > > * The specific Inception model used during computation. > * The implementation accuracy of the computation. > * The image format (not the same if we start from PNGs vs JPGs). > > Keeping that in mind, FID is often most useful when comparing similar runs, but it is > hard to reproduce paper results unless the authors carefully disclose the FID > measurement code. > > These points apply to other related metrics too, such as KID and IS. As a final step, let's visually inspect the `fake_images`.

*Figure: Fake images.*
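The earlier recommendation to run the evaluation across several seeds and report an average can be sketched as follows. `compute_fid_for_seed` is a hypothetical placeholder that returns made-up numbers for illustration only; in a real run it would regenerate `fake_images` with `torch.manual_seed(seed)` and return `float(fid.compute())`.

```python
from statistics import mean, stdev


def compute_fid_for_seed(seed):
    """Hypothetical stand-in that returns fabricated scores for illustration."""
    placeholder_scores = {0: 180.0, 1: 175.0, 2: 185.0}
    return placeholder_scores[seed]


seeds = [0, 1, 2]
fid_scores = [compute_fid_for_seed(seed) for seed in seeds]

# Report the spread alongside the mean so that seed sensitivity stays visible.
fid_mean = mean(fid_scores)
fid_std = stdev(fid_scores)
print(f"FID over {len(seeds)} seeds: {fid_mean:.1f} +/- {fid_std:.1f}")
```

Reporting the standard deviation next to the mean makes it easier to judge whether a difference between two runs is larger than the seed-to-seed noise.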

================================================ FILE: docs/source/en/conceptual/philosophy.md ================================================ # Philosophy 🧨 Diffusers provides **state-of-the-art** pretrained diffusion models across multiple modalities. Its purpose is to serve as a **modular toolbox** for both inference and training. We aim at building a library that stands the test of time and therefore take API design very seriously. In a nutshell, Diffusers is built to be a natural extension of PyTorch. Therefore, most of our design choices are based on [PyTorch's Design Principles](https://pytorch.org/docs/stable/community/design.html#pytorch-design-philosophy). Let's go over the most important ones: ## Usability over Performance - While Diffusers has many built-in performance-enhancing features (see [Memory and Speed](https://huggingface.co/docs/diffusers/optimization/fp16)), models are always loaded with the highest precision and lowest optimization. Therefore, by default diffusion pipelines are always instantiated on CPU with float32 precision if not otherwise defined by the user. This ensures usability across different platforms and accelerators and means that no complex installations are required to run the library. - Diffusers aims to be a **light-weight** package and therefore has very few required dependencies, but many soft dependencies that can improve performance (such as `accelerate`, `safetensors`, `onnx`, etc...). We strive to keep the library as lightweight as possible so that it can be added without much concern as a dependency on other packages. - Diffusers prefers simple, self-explainable code over condensed, magic code. This means that short-hand code syntaxes such as lambda functions, and advanced PyTorch operators are often not desired. ## Simple over easy As PyTorch states, **explicit is better than implicit** and **simple is better than complex**. 
This design philosophy is reflected in multiple parts of the library: - We follow PyTorch's API with methods like [`DiffusionPipeline.to`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.to) to let the user handle device management. - Raising concise error messages is preferred to silently correct erroneous input. Diffusers aims at teaching the user, rather than making the library as easy to use as possible. - Complex model vs. scheduler logic is exposed instead of magically handled inside. Schedulers/Samplers are separated from diffusion models with minimal dependencies on each other. This forces the user to write the unrolled denoising loop. However, the separation allows for easier debugging and gives the user more control over adapting the denoising process or switching out diffusion models or schedulers. - Separately trained components of the diffusion pipeline, *e.g.* the text encoder, the unet, and the variational autoencoder, each have their own model class. This forces the user to handle the interaction between the different model components, and the serialization format separates the model components into different files. However, this allows for easier debugging and customization. DreamBooth or Textual Inversion training is very simple thanks to Diffusers' ability to separate single components of the diffusion pipeline. ## Tweakable, contributor-friendly over abstraction For large parts of the library, Diffusers adopts an important design principle of the [Transformers library](https://github.com/huggingface/transformers), which is to prefer copy-pasted code over hasty abstractions. This design principle is very opinionated and stands in stark contrast to popular design principles such as [Don't repeat yourself (DRY)](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself). 
In short, just like Transformers does for modeling files, Diffusers prefers to keep an extremely low level of abstraction and very self-contained code for pipelines and schedulers. Functions, long code blocks, and even classes can be copied across multiple files which at first can look like a bad, sloppy design choice that makes the library unmaintainable. **However**, this design has proven to be extremely successful for Transformers and makes a lot of sense for community-driven, open-source machine learning libraries because: - Machine Learning is an extremely fast-moving field in which paradigms, model architectures, and algorithms are changing rapidly, which therefore makes it very difficult to define long-lasting code abstractions. - Machine Learning practitioners like to be able to quickly tweak existing code for ideation and research and therefore prefer self-contained code over one that contains many abstractions. - Open-source libraries rely on community contributions and therefore must build a library that is easy to contribute to. The more abstract the code, the more dependencies, the harder to read, and the harder to contribute to. Contributors simply stop contributing to very abstract libraries out of fear of breaking vital functionality. If contributing to a library cannot break other fundamental code, not only is it more inviting for potential new contributors, but it is also easier to review and contribute to multiple parts in parallel. At Hugging Face, we call this design the **single-file policy** which means that almost all of the code of a certain class should be written in a single, self-contained file. To read more about the philosophy, you can have a look at [this blog post](https://huggingface.co/blog/transformers-design-philosophy). In Diffusers, we follow this philosophy for both pipelines and schedulers, but only partly for diffusion models. 
The reason we don't follow this design fully for diffusion models is because almost all diffusion pipelines, such as [DDPM](https://huggingface.co/docs/diffusers/api/pipelines/ddpm), [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview#stable-diffusion-pipelines), [unCLIP (DALL·E 2)](https://huggingface.co/docs/diffusers/api/pipelines/unclip) and [Imagen](https://imagen.research.google/) all rely on the same diffusion model, the [UNet](https://huggingface.co/docs/diffusers/api/models/unet2d-cond). Great, now you should have generally understood why 🧨 Diffusers is designed the way it is 🤗. We try to apply these design principles consistently across the library. Nevertheless, there are some minor exceptions to the philosophy or some unlucky design choices. If you have feedback regarding the design, we would ❤️ to hear it [directly on GitHub](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feedback.md&title=). ## Design Philosophy in Details Now, let's look a bit into the nitty-gritty details of the design philosophy. Diffusers essentially consists of three major classes: [pipelines](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines), [models](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models), and [schedulers](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers). Let's walk through more in-detail design decisions for each class. ### Pipelines Pipelines are designed to be easy to use (therefore do not follow [*Simple over easy*](#simple-over-easy) 100%), are not feature complete, and should loosely be seen as examples of how to use [models](#models) and [schedulers](#schedulers) for inference. The following design principles are followed: - Pipelines follow the single-file policy. All pipelines can be found in individual directories under src/diffusers/pipelines. 
One pipeline folder corresponds to one diffusion paper/project/release. Multiple pipeline files can be gathered in one pipeline folder, as is done for [`src/diffusers/pipelines/stable_diffusion`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/stable_diffusion). If pipelines share similar functionality, one can make use of the [# Copied from mechanism](https://github.com/huggingface/diffusers/blob/125d783076e5bd9785beb05367a2d2566843a271/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py#L251).
- Pipelines all inherit from [`DiffusionPipeline`].
- Every pipeline consists of different model and scheduler components that are documented in the [`model_index.json` file](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/blob/main/model_index.json), are accessible under the same name as attributes of the pipeline, and can be shared between pipelines with the [`DiffusionPipeline.components`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.components) property.
- Every pipeline should be loadable via the [`DiffusionPipeline.from_pretrained`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained) function.
- Pipelines should be used **only** for inference.
- Pipelines should be very readable, self-explanatory, and easy to tweak.
- Pipelines should be designed to build on top of each other and be easy to integrate into higher-level APIs.
- Pipelines are **not** intended to be feature-complete user interfaces. For feature-complete user interfaces one should rather have a look at [InvokeAI](https://github.com/invoke-ai/InvokeAI), [Diffuzers](https://github.com/abhishekkrthakur/diffuzers), and [lama-cleaner](https://github.com/Sanster/lama-cleaner).
- Every pipeline should have one and only one way to run it via a `__call__` method.
The naming of the `__call__` arguments should be shared across all pipelines. - Pipelines should be named after the task they are intended to solve. - In almost all cases, novel diffusion pipelines shall be implemented in a new pipeline folder/file. ### Models Models are designed as configurable toolboxes that are natural extensions of [PyTorch's Module class](https://pytorch.org/docs/stable/generated/torch.nn.Module.html). They only partly follow the **single-file policy**. The following design principles are followed: - Models correspond to **a type of model architecture**. *E.g.* the [`UNet2DConditionModel`] class is used for all UNet variations that expect 2D image inputs and are conditioned on some context. - All models can be found in [`src/diffusers/models`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models) and every model architecture shall be defined in its file, e.g. [`unets/unet_2d_condition.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unets/unet_2d_condition.py), [`transformers/transformer_2d.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformers/transformer_2d.py), etc... - Models **do not** follow the single-file policy and should make use of smaller model building blocks, such as [`attention.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention.py), [`resnet.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/resnet.py), [`embeddings.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/embeddings.py), etc... **Note**: This is in stark contrast to Transformers' modeling files and shows that models do not really follow the single-file policy. - Models intend to expose complexity, just like PyTorch's `Module` class, and give clear error messages. - Models all inherit from `ModelMixin` and `ConfigMixin`. 
- Models can be optimized for performance when it doesn’t demand major code changes, keeps backward compatibility, and gives significant memory or compute gain.
- Models should by default have the highest precision and lowest performance setting.
- To integrate new model checkpoints whose general architecture can be classified as an architecture that already exists in Diffusers, the existing model architecture shall be adapted to make it work with the new checkpoint. One should only create a new file if the model architecture is fundamentally different.
- Models should be designed to be easily extendable to future changes. This can be achieved by limiting public function arguments, configuration arguments, and "foreseeing" future changes, *e.g.* it is usually better to add `string` "...type" arguments that can easily be extended to new future types instead of boolean `is_..._type` arguments. Only the minimum amount of changes shall be made to existing architectures to make a new model checkpoint work.
- The model design is a difficult trade-off between keeping code readable and concise and supporting many model checkpoints. For most parts of the modeling code, classes shall be adapted for new model checkpoints, while there are some exceptions where it is preferred to add new classes to make sure the code is kept concise and readable long-term, such as [UNet blocks](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unets/unet_2d_blocks.py) and [Attention processors](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).

### Schedulers

Schedulers are responsible for guiding the denoising process for inference as well as for defining a noise schedule for training. They are designed as individual classes with loadable configuration files and strongly follow the **single-file policy**.
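To illustrate what a scheduler-driven denoising loop looks like, here is a toy sketch in pure Python. `ToyScheduler` and `toy_model` are hypothetical and do not implement a real diffusion algorithm; only the overall shape (set the timesteps once, loop over `timesteps`, and call `step` with the model output and the current sample) is meant to mirror how schedulers are typically used.

```python
class ToyScheduler:
    """Hypothetical scheduler that linearly removes the predicted noise."""

    def __init__(self, num_train_timesteps=1000):
        self.num_train_timesteps = num_train_timesteps
        self.timesteps = []

    def set_timesteps(self, num_inference_steps):
        # Evenly spaced timesteps, ordered from high noise to low noise.
        stride = self.num_train_timesteps // num_inference_steps
        self.timesteps = [stride * i for i in reversed(range(num_inference_steps))]

    def step(self, model_output, timestep, sample):
        # Return the "previous", slightly less noisy sample.
        step_size = 1.0 / len(self.timesteps)
        return [x - step_size * eps for x, eps in zip(sample, model_output)]


def toy_model(sample, timestep):
    """Hypothetical noise prediction that treats the whole sample as noise."""
    return list(sample)


scheduler = ToyScheduler()
scheduler.set_timesteps(num_inference_steps=4)

sample = [1.0, -2.0]  # stand-in for a noisy latent
for t in scheduler.timesteps:
    noise_pred = toy_model(sample, t)
    sample = scheduler.step(noise_pred, t, sample)

print(scheduler.timesteps)
print(sample)
```

Because the model and the scheduler only communicate through the model output and the sample, either one can be swapped out without touching the other, which is the separation the design principles aim for.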
The following design principles are followed:
- All schedulers are found in [`src/diffusers/schedulers`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers).
- Schedulers are **not** allowed to import from large utils files and shall be kept very self-contained.
- One scheduler Python file corresponds to one scheduler algorithm (as might be defined in a paper).
- If schedulers share similar functionalities, we can make use of the `# Copied from` mechanism.
- Schedulers all inherit from `SchedulerMixin` and `ConfigMixin`.
- Schedulers can be easily swapped out with the [`ConfigMixin.from_config`](https://huggingface.co/docs/diffusers/main/en/api/configuration#diffusers.ConfigMixin.from_config) method as explained in detail [here](../using-diffusers/schedulers).
- Every scheduler has to have a `set_timesteps` and a `step` function. `set_timesteps(...)` has to be called before every denoising process, *i.e.* before `step(...)` is called.
- Every scheduler exposes the timesteps to be "looped over" via a `timesteps` attribute, which is an array of timesteps the model will be called upon.
- The `step(...)` function takes a predicted model output and the "current" sample (x_t) and returns the "previous", slightly more denoised sample (x_t-1).
- Given the complexity of diffusion schedulers, the `step` function does not expose all the complexity and can be a bit of a "black box".
- In almost all cases, novel schedulers shall be implemented in a new scheduling file.

================================================
FILE: docs/source/en/hybrid_inference/api_reference.md
================================================
# Remote inference

Remote inference provides access to an [Inference Endpoint](https://huggingface.co/docs/inference-endpoints/index) to offload local generation requirements for decoding and encoding.
## remote_decode

[[autodoc]] utils.remote_utils.remote_decode

## remote_encode

[[autodoc]] utils.remote_utils.remote_encode

================================================
FILE: docs/source/en/hybrid_inference/overview.md
================================================
# Remote inference

> [!TIP]
> This is currently an experimental feature, and if you have any feedback, please feel free to leave it [here](https://github.com/huggingface/diffusers/issues/new?template=remote-vae-pilot-feedback.yml).

Remote inference offloads the decoding and encoding process to a remote endpoint to relax the memory requirements for local inference with large models. This feature is powered by [Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index). Refer to the table below for the supported models and endpoints.

| Model | Endpoint | Checkpoint | Support |
|---|---|---|---|
| Stable Diffusion v1 | https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud | [stabilityai/sd-vae-ft-mse](https://huggingface.co/stabilityai/sd-vae-ft-mse) | encode/decode |
| Stable Diffusion XL | https://x2dmsqunjd6k9prw.us-east-1.aws.endpoints.huggingface.cloud | [madebyollin/sdxl-vae-fp16-fix](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix) | encode/decode |
| Flux | https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud | [black-forest-labs/FLUX.1-schnell](https://huggingface.co/black-forest-labs/FLUX.1-schnell) | encode/decode |
| HunyuanVideo | https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud | [hunyuanvideo-community/HunyuanVideo](https://huggingface.co/hunyuanvideo-community/HunyuanVideo) | decode |

This guide will show you how to encode and decode latents with remote inference.

## Encoding

Encoding converts images and videos into latent representations. Refer to the table above for the supported VAEs.

Pass an image to [`~utils.remote_encode`] to encode it.
The specific `scaling_factor` and `shift_factor` values for each model can be found in the [Remote inference](../hybrid_inference/api_reference) API reference.

```py
import torch
from diffusers import FluxPipeline
from diffusers.utils import load_image
from diffusers.utils.remote_utils import remote_encode

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.float16,
    vae=None,
    device_map="cuda"
)

init_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
)
init_image = init_image.resize((768, 512))

init_latent = remote_encode(
    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud",
    image=init_image,
    scaling_factor=0.3611,
    shift_factor=0.1159
)
```

## Decoding

Decoding converts latent representations back into images or videos. Refer to the table above for the available and supported VAEs.

Set the output type to `"latent"` in the pipeline and set the `vae` to `None`. Pass the latents to the [`~utils.remote_decode`] function. For Flux, the latents are packed so the `height` and `width` also need to be passed. The specific `scaling_factor` and `shift_factor` values for each model can be found in the [Remote inference](../hybrid_inference/api_reference) API reference.

```py
import torch
from diffusers import FluxPipeline
from diffusers.utils.remote_utils import remote_decode

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,
    vae=None,
    device_map="cuda"
)

prompt = """
A photorealistic Apollo-era photograph of a cat in a small astronaut suit with a bubble helmet, standing on the Moon and holding a flagpole planted in the dusty lunar soil.
The flag shows a colorful paw-print emblem.
Earth glows in the black sky above the stark gray surface, with sharp shadows and high-contrast lighting like vintage NASA photos.
"""

latent = pipeline(
    prompt=prompt,
    guidance_scale=0.0,
    num_inference_steps=4,
    output_type="latent",
).images
image = remote_decode(
    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=latent,
    height=1024,
    width=1024,
    scaling_factor=0.3611,
    shift_factor=0.1159,
)
image.save("image.jpg")
```

The same pattern applies to video models such as HunyuanVideo. Set `output_type="mp4"` in [`~utils.remote_decode`] to request the decoded video as an MP4.

```py
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils.remote_utils import remote_decode

model_id = "hunyuanvideo-community/HunyuanVideo"

transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipeline = HunyuanVideoPipeline.from_pretrained(
    model_id, transformer=transformer, vae=None, torch_dtype=torch.float16, device_map="cuda"
)

latent = pipeline(
    prompt="A cat walks on the grass, realistic",
    height=320,
    width=512,
    num_frames=61,
    num_inference_steps=30,
    output_type="latent",
).frames

video = remote_decode(
    endpoint="https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=latent,
    output_type="mp4",
)

# The endpoint returns the encoded video as raw bytes
if isinstance(video, bytes):
    with open("video.mp4", "wb") as f:
        f.write(video)
```

## Queuing

Remote inference supports queuing to process multiple generation requests. While the current latent is being decoded, you can queue the next prompt.

```py
import queue
import threading

import torch
from IPython.display import display
from diffusers import StableDiffusionXLPipeline
from diffusers.utils.remote_utils import remote_decode

def decode_worker(q: queue.Queue):
    # Decode latents as they arrive until a `None` sentinel is received
    while True:
        item = q.get()
        if item is None:
            break
        image = remote_decode(
            endpoint="https://x2dmsqunjd6k9prw.us-east-1.aws.endpoints.huggingface.cloud/",
            tensor=item,
            scaling_factor=0.13025,
        )
        display(image)
        q.task_done()

q = queue.Queue()
thread = threading.Thread(target=decode_worker, args=(q,), daemon=True)
thread.start()

def decode(latent: torch.Tensor):
    q.put(latent)

prompts = [
    "A grainy Apollo-era style photograph of a cat in a snug astronaut suit with a bubble helmet, standing on the lunar surface and gripping a flag with a paw-print emblem. The gray Moon landscape stretches behind it, Earth glowing vividly in the black sky, shadows crisp and high-contrast.",
    "A vintage 1960s sci-fi pulp magazine cover illustration of a heroic cat astronaut planting a flag on the Moon. Bold, saturated colors, exaggerated space gear, playful typography floating in the background, Earth painted in bright blues and greens.",
    "A hyper-detailed cinematic shot of a cat astronaut on the Moon holding a fluttering flag, fur visible through the helmet glass, lunar dust scattering under its feet. The vastness of space and Earth in the distance create an epic, awe-inspiring tone.",
    "A colorful cartoon drawing of a happy cat wearing a chunky, oversized spacesuit, proudly holding a flag with a big paw print on it. The Moon’s surface is simplified with craters drawn like doodles, and Earth in the sky has a smiling face.",
    "A monochrome 1969-style press photo of a “first cat on the Moon” moment. The cat, in a tiny astronaut suit, stands by a planted flag, with grainy textures, scratches, and a blurred Earth in the background, mimicking old archival space photos."
]

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    vae=None,
    device_map="cuda"
)

pipeline.unet = pipeline.unet.to(memory_format=torch.channels_last)
pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)

# Warm-up run so compilation does not delay the first queued request
_ = pipeline(
    prompt=prompts[0],
    output_type="latent",
)

for prompt in prompts:
    latent = pipeline(
        prompt=prompt,
        output_type="latent",
    ).images
    decode(latent)

# Signal the worker to stop and wait for it to finish
q.put(None)
thread.join()
```

## Benchmarks

The tables demonstrate the memory requirements for encoding and decoding with Stable Diffusion v1.5 and SDXL on different GPUs. For the majority of these GPUs, the memory usage dictates whether other models (text encoders, UNet/transformer) need to be offloaded or whether tiled encoding is required.
The latter two techniques increase inference time and impact quality.

Encoding - Stable Diffusion v1.5

| GPU | Resolution | Time (seconds) | Memory (%) | Tiled Time (seconds) | Tiled Memory (%) |
|:---|:---|---:|---:|---:|---:|
| NVIDIA GeForce RTX 4090 | 512x512 | 0.015 | 3.51901 | 0.015 | 3.51901 |
| NVIDIA GeForce RTX 4090 | 256x256 | 0.004 | 1.3154 | 0.005 | 1.3154 |
| NVIDIA GeForce RTX 4090 | 2048x2048 | 0.402 | 47.1852 | 0.496 | 3.51901 |
| NVIDIA GeForce RTX 4090 | 1024x1024 | 0.078 | 12.2658 | 0.094 | 3.51901 |
| NVIDIA GeForce RTX 4080 SUPER | 512x512 | 0.023 | 5.30105 | 0.023 | 5.30105 |
| NVIDIA GeForce RTX 4080 SUPER | 256x256 | 0.006 | 1.98152 | 0.006 | 1.98152 |
| NVIDIA GeForce RTX 4080 SUPER | 2048x2048 | 0.574 | 71.08 | 0.656 | 5.30105 |
| NVIDIA GeForce RTX 4080 SUPER | 1024x1024 | 0.111 | 18.4772 | 0.14 | 5.30105 |
| NVIDIA GeForce RTX 3090 | 512x512 | 0.032 | 3.52782 | 0.032 | 3.52782 |
| NVIDIA GeForce RTX 3090 | 256x256 | 0.01 | 1.31869 | 0.009 | 1.31869 |
| NVIDIA GeForce RTX 3090 | 2048x2048 | 0.742 | 47.3033 | 0.954 | 3.52782 |
| NVIDIA GeForce RTX 3090 | 1024x1024 | 0.136 | 12.2965 | 0.207 | 3.52782 |
| NVIDIA GeForce RTX 3080 | 512x512 | 0.036 | 8.51761 | 0.036 | 8.51761 |
| NVIDIA GeForce RTX 3080 | 256x256 | 0.01 | 3.18387 | 0.01 | 3.18387 |
| NVIDIA GeForce RTX 3080 | 2048x2048 | 0.863 | 86.7424 | 1.191 | 8.51761 |
| NVIDIA GeForce RTX 3080 | 1024x1024 | 0.157 | 29.6888 | 0.227 | 8.51761 |
| NVIDIA GeForce RTX 3070 | 512x512 | 0.051 | 10.6941 | 0.051 | 10.6941 |
| NVIDIA GeForce RTX 3070 | 256x256 | 0.015 | 3.99743 | 0.015 | 3.99743 |
| NVIDIA GeForce RTX 3070 | 2048x2048 | 1.217 | 96.054 | 1.482 | 10.6941 |
| NVIDIA GeForce RTX 3070 | 1024x1024 | 0.223 | 37.2751 | 0.327 | 10.6941 |
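
The `Memory (%)` values appear to be fractions of each card's total VRAM, which is why the same 512x512 encode shows a different percentage on each GPU. As a rough sketch, the percentages can be converted to absolute memory. The VRAM capacities below are nominal values assumed for illustration (they are not part of the benchmark data), and `percent_to_gb` is a hypothetical helper, not a Diffusers function.

```python
# Nominal VRAM per card in GB. These are assumed values for illustration,
# not figures taken from the benchmark itself.
ASSUMED_VRAM_GB = {
    "NVIDIA GeForce RTX 4090": 24,
    "NVIDIA GeForce RTX 3080": 10,
    "NVIDIA GeForce RTX 3070": 8,
}


def percent_to_gb(percent: float, total_gb: float) -> float:
    """Convert a memory percentage into GB for a card with `total_gb` of VRAM."""
    return percent / 100.0 * total_gb


# 512x512 encoding rows copied from the Stable Diffusion v1.5 table above.
rows = [
    ("NVIDIA GeForce RTX 4090", 3.51901),
    ("NVIDIA GeForce RTX 3080", 8.51761),
    ("NVIDIA GeForce RTX 3070", 10.6941),
]

for gpu, percent in rows:
    used_gb = percent_to_gb(percent, ASSUMED_VRAM_GB[gpu])
    print(f"{gpu}: {used_gb:.2f} GB")
```

Under these assumptions, each row works out to roughly 0.85 GB, which suggests the absolute cost of the encode is about the same on every card and only the share of available memory changes.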

Encoding - SDXL

| GPU | Resolution | Time (seconds) | Memory (%) | Tiled Time (seconds) | Tiled Memory (%) |
|:---|:---|---:|---:|---:|---:|
| NVIDIA GeForce RTX 4090 | 512x512 | 0.029 | 4.95707 | 0.029 | 4.95707 |
| NVIDIA GeForce RTX 4090 | 256x256 | 0.007 | 2.29666 | 0.007 | 2.29666 |
| NVIDIA GeForce RTX 4090 | 2048x2048 | 0.873 | 66.3452 | 0.863 | 15.5649 |
| NVIDIA GeForce RTX 4090 | 1024x1024 | 0.142 | 15.5479 | 0.143 | 15.5479 |
| NVIDIA GeForce RTX 4080 SUPER | 512x512 | 0.044 | 7.46735 | 0.044 | 7.46735 |
| NVIDIA GeForce RTX 4080 SUPER | 256x256 | 0.01 | 3.4597 | 0.01 | 3.4597 |
| NVIDIA GeForce RTX 4080 SUPER | 2048x2048 | 1.317 | 87.1615 | 1.291 | 23.447 |
| NVIDIA GeForce RTX 4080 SUPER | 1024x1024 | 0.213 | 23.4215 | 0.214 | 23.4215 |
| NVIDIA GeForce RTX 3090 | 512x512 | 0.058 | 5.65638 | 0.058 | 5.65638 |
| NVIDIA GeForce RTX 3090 | 256x256 | 0.016 | 2.45081 | 0.016 | 2.45081 |
| NVIDIA GeForce RTX 3090 | 2048x2048 | 1.755 | 77.8239 | 1.614 | 18.4193 |
| NVIDIA GeForce RTX 3090 | 1024x1024 | 0.265 | 18.4023 | 0.265 | 18.4023 |
| NVIDIA GeForce RTX 3080 | 512x512 | 0.064 | 13.6568 | 0.064 | 13.6568 |
| NVIDIA GeForce RTX 3080 | 256x256 | 0.018 | 5.91728 | 0.018 | 5.91728 |
| NVIDIA GeForce RTX 3080 | 2048x2048 | OOM | OOM | 1.866 | 44.4717 |
| NVIDIA GeForce RTX 3080 | 1024x1024 | 0.302 | 44.4308 | 0.302 | 44.4308 |
| NVIDIA GeForce RTX 3070 | 512x512 | 0.093 | 17.1465 | 0.093 | 17.1465 |
| NVIDIA GeForce RTX 3070 | 256x256 | 0.025 | 7.42931 | 0.026 | 7.42931 |
| NVIDIA GeForce RTX 3070 | 2048x2048 | OOM | OOM | 2.674 | 55.8355 |
| NVIDIA GeForce RTX 3070 | 1024x1024 | 0.443 | 55.7841 | 0.443 | 55.7841 |

Decoding - Stable Diffusion v1.5

| GPU | Resolution | Time (seconds) | Memory (%) | Tiled Time (seconds) | Tiled Memory (%) |
| --- | --- | --- | --- | --- | --- |
| NVIDIA GeForce RTX 4090 | 512x512 | 0.031 | 5.60% | 0.031 (0%) | 5.60% |
| NVIDIA GeForce RTX 4090 | 1024x1024 | 0.148 | 20.00% | 0.301 (+103%) | 5.60% |
| NVIDIA GeForce RTX 4080 | 512x512 | 0.05 | 8.40% | 0.050 (0%) | 8.40% |
| NVIDIA GeForce RTX 4080 | 1024x1024 | 0.224 | 30.00% | 0.356 (+59%) | 8.40% |
| NVIDIA GeForce RTX 4070 Ti | 512x512 | 0.066 | 11.30% | 0.066 (0%) | 11.30% |
| NVIDIA GeForce RTX 4070 Ti | 1024x1024 | 0.284 | 40.50% | 0.454 (+60%) | 11.40% |
| NVIDIA GeForce RTX 3090 | 512x512 | 0.062 | 5.20% | 0.062 (0%) | 5.20% |
| NVIDIA GeForce RTX 3090 | 1024x1024 | 0.253 | 18.50% | 0.464 (+83%) | 5.20% |
| NVIDIA GeForce RTX 3080 | 512x512 | 0.07 | 12.80% | 0.070 (0%) | 12.80% |
| NVIDIA GeForce RTX 3080 | 1024x1024 | 0.286 | 45.30% | 0.466 (+63%) | 12.90% |
| NVIDIA GeForce RTX 3070 | 512x512 | 0.102 | 15.90% | 0.102 (0%) | 15.90% |
| NVIDIA GeForce RTX 3070 | 1024x1024 | 0.421 | 56.30% | 0.746 (+77%) | 16.00% |
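
The percentages in parentheses in the `Tiled Time` column read as the relative change in decode time when tiling is enabled. A minimal sketch of that arithmetic, using a hypothetical helper named `tiled_overhead_percent` and values copied from the 1024x1024 rows of the table above:

```python
def tiled_overhead_percent(base_seconds: float, tiled_seconds: float) -> float:
    """Relative change in time when tiling is enabled, as a percentage."""
    return (tiled_seconds - base_seconds) / base_seconds * 100.0


# (GPU, untiled seconds, tiled seconds) for 1024x1024 decoding.
rows = [
    ("NVIDIA GeForce RTX 4090", 0.148, 0.301),
    ("NVIDIA GeForce RTX 4080", 0.224, 0.356),
    ("NVIDIA GeForce RTX 3090", 0.253, 0.464),
]

for gpu, base, tiled in rows:
    print(f"{gpu}: {tiled_overhead_percent(base, tiled):+.0f}%")
```

For these rows the computed values round to +103%, +59%, and +83%, matching the table. Tiling roughly doubles decode time on the RTX 4090 at 1024x1024 while cutting memory from 20.00% to 5.60%.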

Decoding - SDXL

| GPU | Resolution | Time (seconds) | Memory (%) | Tiled Time (seconds) | Tiled Memory (%) |
| --- | --- | --- | --- | --- | --- |
| NVIDIA GeForce RTX 4090 | 512x512 | 0.057 | 10.00% | 0.057 (0%) | 10.00% |
| NVIDIA GeForce RTX 4090 | 1024x1024 | 0.256 | 35.50% | 0.257 (+0.4%) | 35.50% |
| NVIDIA GeForce RTX 4080 | 512x512 | 0.092 | 15.00% | 0.092 (0%) | 15.00% |
| NVIDIA GeForce RTX 4080 | 1024x1024 | 0.406 | 53.30% | 0.406 (0%) | 53.30% |
| NVIDIA GeForce RTX 4070 Ti | 512x512 | 0.121 | 20.20% | 0.120 (-0.8%) | 20.20% |
| NVIDIA GeForce RTX 4070 Ti | 1024x1024 | 0.519 | 72.00% | 0.519 (0%) | 72.00% |
| NVIDIA GeForce RTX 3090 | 512x512 | 0.107 | 10.50% | 0.107 (0%) | 10.50% |
| NVIDIA GeForce RTX 3090 | 1024x1024 | 0.459 | 38.00% | 0.460 (+0.2%) | 38.00% |
| NVIDIA GeForce RTX 3080 | 512x512 | 0.121 | 25.60% | 0.121 (0%) | 25.60% |
| NVIDIA GeForce RTX 3080 | 1024x1024 | 0.524 | 93.00% | 0.524 (0%) | 93.00% |
| NVIDIA GeForce RTX 3070 | 512x512 | 0.183 | 31.80% | 0.183 (0%) | 31.80% |
| NVIDIA GeForce RTX 3070 | 1024x1024 | 0.794 | 96.40% | 0.794 (0%) | 96.40% |

## Resources - Remote inference is also supported in [SD.Next](https://github.com/vladmandic/sdnext) and [ComfyUI-HFRemoteVae](https://github.com/kijai/ComfyUI-HFRemoteVae). - Refer to the [Remote VAEs for decoding with Inference Endpoints](https://huggingface.co/blog/remote_vae) blog post to learn more. ================================================ FILE: docs/source/en/index.md ================================================



# Diffusers Diffusers is a library of state-of-the-art pretrained diffusion models for generating videos, images, and audio. The library revolves around the [`DiffusionPipeline`], an API designed for: - easy inference with only a few lines of code - flexibility to mix-and-match pipeline components (models, schedulers) - loading and using adapters like LoRA Diffusers also comes with optimizations - such as offloading and quantization - to ensure even the largest models are accessible on memory-constrained devices. If memory is not an issue, Diffusers supports torch.compile to boost inference speed. Get started right away with a Diffusers model on the [Hub](https://huggingface.co/models?library=diffusers&sort=trending) today! ## Learn If you're a beginner, we recommend starting with the [Hugging Face Diffusion Models Course](https://huggingface.co/learn/diffusion-course/unit0/1). You'll learn the theory behind diffusion models, and learn how to use the Diffusers library to generate images, fine-tune your own models, and more. ================================================ FILE: docs/source/en/installation.md ================================================ # Installation Diffusers is tested on Python 3.8+ and PyTorch 1.4+. Install [PyTorch](https://pytorch.org/get-started/locally/) according to your system and setup. Create a [virtual environment](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/) for easier management of separate projects and to avoid compatibility issues between dependencies. Use [uv](https://docs.astral.sh/uv/), a Rust-based Python package and project manager, to create a virtual environment and install Diffusers. ```bash uv venv my-env source my-env/bin/activate ``` Install Diffusers with one of the following methods. PyTorch only supports Python 3.8 - 3.11 on Windows. 
```bash uv pip install diffusers["torch"] transformers ``` ```bash conda install -c conda-forge diffusers ``` A source install installs the `main` version instead of the latest `stable` version. The `main` version is useful for staying updated with the latest changes but it may not always be stable. If you run into a problem, open an [Issue](https://github.com/huggingface/diffusers/issues/new/choose) and we will try to resolve it as soon as possible. Make sure [Accelerate](https://huggingface.co/docs/accelerate/index) is installed. ```bash uv pip install accelerate ``` Install Diffusers from source with the command below. ```bash uv pip install git+https://github.com/huggingface/diffusers ``` ## Editable install An editable install is recommended for development workflows or if you're using the `main` version of the source code. A special link is created between the cloned repository and the Python library paths. This avoids reinstalling a package after every change. Clone the repository and install Diffusers with the following commands. ```bash git clone https://github.com/huggingface/diffusers.git cd diffusers uv pip install -e ".[torch]" ``` > [!WARNING] > You must keep the `diffusers` folder if you want to keep using the library with the editable install. Update your cloned repository to the latest version of Diffusers with the command below. ```bash cd ~/diffusers/ git pull ``` ## Cache Model weights and files are downloaded from the Hub to a cache, which is usually your home directory. Change the cache location with the [HF_HOME](https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#hfhome) or [HF_HUB_CACHE](https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#hfhubcache) environment variables or configuring the `cache_dir` parameter in methods like [`~DiffusionPipeline.from_pretrained`]. 
```bash export HF_HOME="/path/to/your/cache" export HF_HUB_CACHE="/path/to/your/hub/cache" ``` ```py from diffusers import DiffusionPipeline pipeline = DiffusionPipeline.from_pretrained( "black-forest-labs/FLUX.1-dev", cache_dir="/path/to/your/cache" ) ``` Cached files allow you to use Diffusers offline. Set the [HF_HUB_OFFLINE](https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#hfhuboffline) environment variable to `1` to prevent Diffusers from connecting to the internet. ```shell export HF_HUB_OFFLINE=1 ``` For more details about managing and cleaning the cache, take a look at the [Understand caching](https://huggingface.co/docs/huggingface_hub/guides/manage-cache) guide. ## Telemetry logging Diffusers gathers telemetry information during [`~DiffusionPipeline.from_pretrained`] requests. The data gathered includes the Diffusers and PyTorch version, the requested model or pipeline class, and the path to a pretrained checkpoint if it is hosted on the Hub. This usage data helps us debug issues and prioritize new features. Telemetry is only sent when loading models and pipelines from the Hub, and it is not collected if you're loading local files. Opt-out and disable telemetry collection with the [HF_HUB_DISABLE_TELEMETRY](https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#hfhubdisabletelemetry) environment variable. ```bash export HF_HUB_DISABLE_TELEMETRY=1 ``` ```bash set HF_HUB_DISABLE_TELEMETRY=1 ``` ================================================ FILE: docs/source/en/modular_diffusers/auto_pipeline_blocks.md ================================================ # AutoPipelineBlocks [`~modular_pipelines.AutoPipelineBlocks`] are a multi-block type containing blocks that support different workflows. It automatically selects which sub-blocks to run based on the input provided at runtime. 
This is typically used to package multiple workflows - text-to-image, image-to-image, inpaint - into a single pipeline for convenience. This guide shows how to create [`~modular_pipelines.AutoPipelineBlocks`]. Create three [`~modular_pipelines.ModularPipelineBlocks`] for text-to-image, image-to-image, and inpainting. These represent the different workflows available in the pipeline. ```py import torch from diffusers.modular_pipelines import ModularPipelineBlocks, InputParam, OutputParam class TextToImageBlock(ModularPipelineBlocks): model_name = "text2img" @property def inputs(self): return [InputParam(name="prompt")] @property def intermediate_outputs(self): return [] @property def description(self): return "I'm a text-to-image workflow!" def __call__(self, components, state): block_state = self.get_block_state(state) print("running the text-to-image workflow") # Add your text-to-image logic here # For example: generate image from prompt self.set_block_state(state, block_state) return components, state ``` ```py class ImageToImageBlock(ModularPipelineBlocks): model_name = "img2img" @property def inputs(self): return [InputParam(name="prompt"), InputParam(name="image")] @property def intermediate_outputs(self): return [] @property def description(self): return "I'm an image-to-image workflow!" def __call__(self, components, state): block_state = self.get_block_state(state) print("running the image-to-image workflow") # Add your image-to-image logic here # For example: transform input image based on prompt self.set_block_state(state, block_state) return components, state ``` ```py class InpaintBlock(ModularPipelineBlocks): model_name = "inpaint" @property def inputs(self): return [InputParam(name="prompt"), InputParam(name="image"), InputParam(name="mask")] @property def intermediate_outputs(self): return [] @property def description(self): return "I'm an inpaint workflow!" 
def __call__(self, components, state): block_state = self.get_block_state(state) print("running the inpaint workflow") # Add your inpainting logic here # For example: fill masked areas based on prompt self.set_block_state(state, block_state) return components, state ``` Create an [`~modular_pipelines.AutoPipelineBlocks`] class that includes a list of the sub-block classes and their corresponding block names. You also need to include `block_trigger_inputs`, a list of input names that trigger the corresponding block. If a trigger input is provided at runtime, then that block is selected to run. Use `None` to specify the default block to run if no trigger inputs are detected. Lastly, it is important to include a `description` that clearly explains which inputs trigger which workflow. This helps users understand how to run specific workflows. ```py from diffusers.modular_pipelines import AutoPipelineBlocks class AutoImageBlocks(AutoPipelineBlocks): # List of sub-block classes to choose from block_classes = [InpaintBlock, ImageToImageBlock, TextToImageBlock] # Names for each block in the same order block_names = ["inpaint", "img2img", "text2img"] # Trigger inputs that determine which block to run # - "mask" triggers inpaint workflow # - "image" triggers img2img workflow (but only if mask is not provided) # - if none of above, runs the text2img workflow (default) block_trigger_inputs = ["mask", "image", None] @property def description(self): return ( "Pipeline generates images given different types of conditions!\n" + "This is an auto pipeline block that works for text2img, img2img and inpainting tasks.\n" + " - inpaint workflow is run when `mask` is provided.\n" + " - img2img workflow is run when `image` is provided (but only when `mask` is not provided).\n" + " - text2img workflow is run when neither `image` nor `mask` is provided.\n" ) ``` It is **very** important to include a `description` to avoid any confusion over how to run a block and what inputs are required. 
While [`~modular_pipelines.AutoPipelineBlocks`] is convenient, its conditional logic can be difficult to follow if it isn't properly explained.

Create an instance of `AutoImageBlocks`.

```py
auto_blocks = AutoImageBlocks()
```

For more complex compositions, such as nested [`~modular_pipelines.AutoPipelineBlocks`] used as sub-blocks in larger pipelines, use the [`~modular_pipelines.SequentialPipelineBlocks.get_execution_blocks`] method to extract the block that actually runs for a given input.

```py
auto_blocks.get_execution_blocks(mask=True)
```

## ConditionalPipelineBlocks

[`~modular_pipelines.AutoPipelineBlocks`] is a special case of [`~modular_pipelines.ConditionalPipelineBlocks`]. While [`~modular_pipelines.AutoPipelineBlocks`] selects blocks based on whether a trigger input is provided or not, [`~modular_pipelines.ConditionalPipelineBlocks`] selects a block with custom logic defined in the `select_block` method.

Here is the same example written using [`~modular_pipelines.ConditionalPipelineBlocks`] directly:

```py
from diffusers.modular_pipelines import ConditionalPipelineBlocks

class AutoImageBlocks(ConditionalPipelineBlocks):
    block_classes = [InpaintBlock, ImageToImageBlock, TextToImageBlock]
    block_names = ["inpaint", "img2img", "text2img"]
    block_trigger_inputs = ["mask", "image"]
    default_block_name = "text2img"

    @property
    def description(self):
        return (
            "Pipeline generates images given different types of conditions!\n"
            + "This is an auto pipeline block that works for text2img, img2img and inpainting tasks.\n"
            + " - inpaint workflow is run when `mask` is provided.\n"
            + " - img2img workflow is run when `image` is provided (but only when `mask` is not provided).\n"
            + " - text2img workflow is run when neither `image` nor `mask` is provided.\n"
        )

    def select_block(self, mask=None, image=None) -> str | None:
        if mask is not None:
            return "inpaint"
        if image is not None:
            return "img2img"
        return None  # falls back to default_block_name ("text2img")
```

The inputs listed in `block_trigger_inputs` are passed as keyword arguments to `select_block()`. When `select_block` returns `None`, it falls back to `default_block_name`. If `default_block_name` is also `None`, the entire conditional block is skipped. This is useful for optional processing steps that should only run when specific inputs are provided.

## Workflows

Pipelines that contain conditional blocks ([`~modular_pipelines.AutoPipelineBlocks`] or [`~modular_pipelines.ConditionalPipelineBlocks`]) can support multiple workflows. For example, our SDXL modular pipeline supports a dozen workflows in a single pipeline. This also means it can be hard for users to know which workflows are supported and how to run them. For pipeline builders, it's useful to be able to extract only the blocks relevant to a specific workflow.

We recommend defining a `_workflow_map` to give each workflow a name and explicitly list the inputs it requires.

```py
from diffusers.modular_pipelines import SequentialPipelineBlocks

class MyPipelineBlocks(SequentialPipelineBlocks):
    block_classes = [TextEncoderBlock, AutoImageBlocks, DecodeBlock]
    block_names = ["text_encoder", "auto_image", "decode"]

    _workflow_map = {
        "text2image": {"prompt": True},
        "image2image": {"image": True, "prompt": True},
        "inpaint": {"mask": True, "image": True, "prompt": True},
    }
```

All of our built-in modular pipelines come with pre-defined workflows. The `available_workflows` property lists all supported workflows:

```py
pipeline_blocks = MyPipelineBlocks()
pipeline_blocks.available_workflows
# ['text2image', 'image2image', 'inpaint']
```

Retrieve a specific workflow with `get_workflow` to inspect and debug the blocks that execute it.
```py pipeline_blocks.get_workflow("inpaint") ``` ================================================ FILE: docs/source/en/modular_diffusers/components_manager.md ================================================ # ComponentsManager The [`ComponentsManager`] is a model registry and management system for Modular Diffusers. It adds and tracks models, stores useful metadata (model size, device placement, adapters), and supports offloading. This guide will show you how to use [`ComponentsManager`] to manage components and device memory. ## Connect to a pipeline Create a [`ComponentsManager`] and pass it to a [`ModularPipeline`] with either [`~ModularPipeline.from_pretrained`] or [`~ModularPipelineBlocks.init_pipeline`]. ```py from diffusers import ModularPipeline, ComponentsManager import torch manager = ComponentsManager() pipe = ModularPipeline.from_pretrained("Tongyi-MAI/Z-Image-Turbo", components_manager=manager) pipe.load_components(torch_dtype=torch.bfloat16) ``` ```py from diffusers import ModularPipelineBlocks, ComponentsManager import torch manager = ComponentsManager() blocks = ModularPipelineBlocks.from_pretrained("diffusers/Florence2-image-Annotator", trust_remote_code=True) pipe= blocks.init_pipeline(components_manager=manager) pipe.load_components(torch_dtype=torch.bfloat16) ``` Components loaded by the pipeline are automatically registered in the manager. You can inspect them right away. ## Inspect components Print the [`ComponentsManager`] to see all registered components, including their class, device placement, dtype, memory size, and load ID. The output below corresponds to the `from_pretrained` example above. 
```py Components: ============================================================================================================================= Models: ----------------------------------------------------------------------------------------------------------------------------- Name_ID | Class | Device: act(exec) | Dtype | Size (GB) | Load ID ----------------------------------------------------------------------------------------------------------------------------- text_encoder_140458257514752 | Qwen3Model | cpu | torch.bfloat16 | 7.49 | Tongyi-MAI/Z-Image-Turbo|text_encoder|null|null vae_140458257515376 | AutoencoderKL | cpu | torch.bfloat16 | 0.16 | Tongyi-MAI/Z-Image-Turbo|vae|null|null transformer_140458257515616 | ZImageTransformer2DModel | cpu | torch.bfloat16 | 11.46 | Tongyi-MAI/Z-Image-Turbo|transformer|null|null ----------------------------------------------------------------------------------------------------------------------------- Other Components: ----------------------------------------------------------------------------------------------------------------------------- ID | Class | Collection ----------------------------------------------------------------------------------------------------------------------------- scheduler_140461023555264 | FlowMatchEulerDiscreteScheduler | N/A tokenizer_140458256346432 | Qwen2Tokenizer | N/A ----------------------------------------------------------------------------------------------------------------------------- ``` The table shows models (with device, dtype, and memory info) separately from other components like schedulers and tokenizers. If any models have LoRA adapters, IP-Adapters, or quantization applied, that information is displayed in an additional section at the bottom. ## Offloading The [`~ComponentsManager.enable_auto_cpu_offload`] method is a global offloading strategy that works across all models regardless of which pipeline is using them. 
Once enabled, you don't need to worry about device placement if you add or remove components. ```py manager.enable_auto_cpu_offload(device="cuda") ``` All models begin on the CPU and [`ComponentsManager`] moves them to the appropriate device right before they're needed, and moves other models back to the CPU when GPU memory is low. Call [`~ComponentsManager.disable_auto_cpu_offload`] to disable offloading. ```py manager.disable_auto_cpu_offload() ``` ================================================ FILE: docs/source/en/modular_diffusers/custom_blocks.md ================================================ # Building Custom Blocks [ModularPipelineBlocks](./pipeline_block) are the fundamental building blocks of a [`ModularPipeline`]. You can create custom blocks by defining their inputs, outputs, and computation logic. This guide demonstrates how to create and use a custom block. > [!TIP] > Explore the [Modular Diffusers Custom Blocks](https://huggingface.co/collections/diffusers/modular-diffusers-custom-blocks) collection for official custom blocks. ## Project Structure Your custom block project should use the following structure: ```shell . ├── block.py └── modular_config.json ``` - `block.py` contains the custom block implementation - `modular_config.json` contains the metadata needed to load the block ## Quick Start with Template The fastest way to create a custom block is to start from our template. The template provides a pre-configured project structure with `block.py` and `modular_config.json` files, plus commented examples showing how to define components, inputs, outputs, and the `__call__` method—so you can focus on your custom logic instead of boilerplate setup. 
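
Before loading a local block, it can be useful to confirm that the folder follows the two-file layout described in the Project Structure section. The sketch below is a plain-Python check and not part of Diffusers. `missing_block_files` is a hypothetical helper name, and it only verifies that the files exist, not that their contents are valid.

```python
from pathlib import Path

# File names taken from the project structure described above.
REQUIRED_FILES = ("block.py", "modular_config.json")


def missing_block_files(project_dir: str) -> list[str]:
    """Return the required files that are absent from `project_dir`."""
    root = Path(project_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_block_files(".")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Project structure looks complete.")
```

Running the check from inside the project folder reports any missing file before `from_pretrained` is called on the local path.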
### Download the template ```python from diffusers import ModularPipelineBlocks model_id = "diffusers/custom-block-template" local_dir = model_id.split("/")[-1] blocks = ModularPipelineBlocks.from_pretrained( model_id, trust_remote_code=True, local_dir=local_dir ) ``` This saves the template files to `custom-block-template/` locally or you could use `local_dir` to save to a specific location. ### Edit locally Open `block.py` and implement your custom block. The template includes commented examples showing how to define each property. See the [Florence-2 example](#example-florence-2-image-annotator) below for a complete implementation. ### Test your block ```python from diffusers import ModularPipelineBlocks blocks = ModularPipelineBlocks.from_pretrained(local_dir, trust_remote_code=True) pipeline = blocks.init_pipeline() output = pipeline(...) # your inputs here ``` ### Upload to the Hub ```python pipeline.save_pretrained(local_dir, repo_id="your-username/your-block-name", push_to_hub=True) ``` ## Example: Florence-2 Image Annotator This example creates a custom block with [Florence-2](https://huggingface.co/docs/transformers/model_doc/florence2) to process an input image and generate a mask for inpainting. ### Define components Define the components the block needs, `Florence2ForConditionalGeneration` and its processor. When defining components, specify the `name` (how you'll access it in code), `type_hint` (the model class), and `pretrained_model_name_or_path` (where to load weights from). 
```python # Inside block.py from diffusers.modular_pipelines import ModularPipelineBlocks, ComponentSpec from transformers import AutoProcessor, Florence2ForConditionalGeneration class Florence2ImageAnnotatorBlock(ModularPipelineBlocks): @property def expected_components(self): return [ ComponentSpec( name="image_annotator", type_hint=Florence2ForConditionalGeneration, pretrained_model_name_or_path="florence-community/Florence-2-base-ft", ), ComponentSpec( name="image_annotator_processor", type_hint=AutoProcessor, pretrained_model_name_or_path="florence-community/Florence-2-base-ft", ), ] ``` ### Define inputs and outputs Inputs include the image, annotation task, and prompt. Outputs include the generated mask and annotations. ```python from typing import List, Union from PIL import Image from diffusers.modular_pipelines import InputParam, OutputParam class Florence2ImageAnnotatorBlock(ModularPipelineBlocks): # ... expected_components from above ... @property def inputs(self) -> List[InputParam]: return [ InputParam( "image", type_hint=Union[Image.Image, List[Image.Image]], required=True, description="Image(s) to annotate", ), InputParam( "annotation_task", type_hint=str, default="", description="Annotation task to perform (e.g., , , )", ), InputParam( "annotation_prompt", type_hint=str, required=True, description="Prompt to provide context for the annotation task", ), InputParam( "annotation_output_type", type_hint=str, default="mask_image", description="Output type: 'mask_image', 'mask_overlay', or 'bounding_box'", ), ] @property def intermediate_outputs(self) -> List[OutputParam]: return [ OutputParam( "mask_image", type_hint=Image.Image, description="Inpainting mask for the input image", ), OutputParam( "annotations", type_hint=dict, description="Raw annotation predictions", ), OutputParam( "image", type_hint=Image.Image, description="Annotated image", ), ] ``` ### Implement the `__call__` method The `__call__` method contains the block's logic. 
Access inputs via `block_state`, run your computation, and set outputs back to `block_state`. ```python import torch from diffusers.modular_pipelines import PipelineState class Florence2ImageAnnotatorBlock(ModularPipelineBlocks): # ... expected_components, inputs, intermediate_outputs from above ... @torch.no_grad() def __call__(self, components, state: PipelineState) -> PipelineState: block_state = self.get_block_state(state) images, annotation_task_prompt = self.prepare_inputs( block_state.image, block_state.annotation_prompt ) task = block_state.annotation_task fill = block_state.fill annotations = self.get_annotations( components, images, annotation_task_prompt, task ) block_state.annotations = annotations if block_state.annotation_output_type == "mask_image": block_state.mask_image = self.prepare_mask(images, annotations) else: block_state.mask_image = None if block_state.annotation_output_type == "mask_overlay": block_state.image = self.prepare_mask(images, annotations, overlay=True, fill=fill) elif block_state.annotation_output_type == "bounding_box": block_state.image = self.prepare_bounding_boxes(images, annotations) self.set_block_state(state, block_state) return components, state # Helper methods for mask/bounding box generation... ``` > [!TIP] > See the complete implementation at [diffusers/Florence2-image-Annotator](https://huggingface.co/diffusers/Florence2-image-Annotator). ## Using Custom Blocks Load a custom block with [`~ModularPipeline.from_pretrained`] and set `trust_remote_code=True`. 
```py
import torch
from diffusers import ModularPipeline
from diffusers.utils import load_image

# Load the Florence-2 annotator pipeline
image_annotator = ModularPipeline.from_pretrained(
    "diffusers/Florence2-image-Annotator",
    trust_remote_code=True
)

# Check the docstring to see inputs/outputs
print(image_annotator.blocks.doc)
```

Use the block to generate a mask:

```python
image_annotator.load_components(torch_dtype=torch.bfloat16)
image_annotator.to("cuda")

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg")
image = image.resize((1024, 1024))

prompt = ["A red car"]
annotation_task = ""
annotation_prompt = ["the car"]

# Call the pipeline loaded above
mask_image = image_annotator(
    prompt=prompt,
    image=image,
    annotation_task=annotation_task,
    annotation_prompt=annotation_prompt,
    annotation_output_type="mask_image",
).images
mask_image[0].save("car-mask.png")
```

Compose it with other blocks to create a new pipeline:

```python
# Get the annotator block
annotator_block = image_annotator.blocks

# Get an inpainting workflow and insert the annotator at the beginning
inpaint_blocks = ModularPipeline.from_pretrained("Qwen/Qwen-Image").blocks.get_workflow("inpainting")
inpaint_blocks.sub_blocks.insert("image_annotator", annotator_block, 0)

# Initialize the combined pipeline
pipe = inpaint_blocks.init_pipeline()
pipe.load_components(torch_dtype=torch.float16, device="cuda")

# Now the pipeline automatically generates masks from prompts
output = pipe(
    prompt=prompt,
    image=image,
    annotation_task=annotation_task,
    annotation_prompt=annotation_prompt,
    annotation_output_type="mask_image",
    num_inference_steps=35,
    guidance_scale=7.5,
    strength=0.95,
    output="images"
)
output[0].save("florence-inpainting.png")
```

## Editing custom blocks

Edit a custom block by downloading it locally. This is the same workflow as the [Quick Start with Template](#quick-start-with-template), but starting from an existing block instead of the template.
Use the `local_dir` argument to download a custom block to a specific folder:

```python
from diffusers import ModularPipelineBlocks

# Download to a local folder for editing
annotator_block = ModularPipelineBlocks.from_pretrained(
    "diffusers/Florence2-image-Annotator",
    trust_remote_code=True,
    local_dir="./my-florence-block"
)
```

Any changes made to the block files in this folder will be reflected when you load the block again. When you're ready to share your changes, upload to a new repository:

```python
pipeline = annotator_block.init_pipeline()
pipeline.save_pretrained("./my-florence-block", repo_id="your-username/my-custom-florence", push_to_hub=True)
```

## Next Steps

This guide covered creating a single custom block. Learn how to compose multiple blocks together:

- [SequentialPipelineBlocks](./sequential_pipeline_blocks): Chain blocks to execute in sequence
- [ConditionalPipelineBlocks](./auto_pipeline_blocks): Create conditional blocks that select different execution paths
- [LoopSequentialPipelineBlocks](./loop_sequential_pipeline_blocks): Define iterative workflows like the denoising loop

Make your custom block work with Mellon's visual interface. See the [Mellon Custom Blocks](./mellon) guide.

Browse the [Modular Diffusers Custom Blocks](https://huggingface.co/collections/diffusers/modular-diffusers-custom-blocks) collection for inspiration and ready-to-use blocks.

## Dependencies

Declaring package dependencies in custom blocks prevents runtime import errors later on. Diffusers validates the dependencies and emits a warning if a package is missing or incompatible.

Set a `_requirements` attribute in your block class, mapping package names to version specifiers.

```py
from diffusers.modular_pipelines import PipelineBlock

class MyCustomBlock(PipelineBlock):
    _requirements = {
        "transformers": ">=4.44.0",
        "sentencepiece": ">=0.2.0"
    }
```

When blocks with different requirements are combined, Diffusers merges their requirements.
```py
from diffusers.modular_pipelines import PipelineBlock, SequentialPipelineBlocks

class BlockA(PipelineBlock):
    _requirements = {"transformers": ">=4.44.0"}
    # ...

class BlockB(PipelineBlock):
    _requirements = {"sentencepiece": ">=0.2.0"}
    # ...

pipe = SequentialPipelineBlocks.from_blocks_dict({
    "block_a": BlockA,
    "block_b": BlockB,
})
```

When this block is saved with [`~ModularPipeline.save_pretrained`], the requirements are saved to the `modular_config.json` file. When this block is loaded, Diffusers checks each requirement against the current environment. If there is a mismatch or a package isn't found, Diffusers emits one of the following warnings.

```md
# missing package
xyz-package was specified in the requirements but wasn't found in the current environment.

# version mismatch
xyz requirement 'specific-version' is not satisfied by the installed version 'actual-version'. Things might work unexpected.
```

================================================
FILE: docs/source/en/modular_diffusers/loop_sequential_pipeline_blocks.md
================================================

# LoopSequentialPipelineBlocks

[`~modular_pipelines.LoopSequentialPipelineBlocks`] are a multi-block type that composes other [`~modular_pipelines.ModularPipelineBlocks`] together in a loop. Data flows circularly, using `inputs` and `intermediate_outputs`, and each block is run on every iteration. This is typically used to create a denoising loop, which is iterative by nature.

This guide shows you how to create [`~modular_pipelines.LoopSequentialPipelineBlocks`].

## Loop wrapper

[`~modular_pipelines.LoopSequentialPipelineBlocks`] is also known as the *loop wrapper* because it defines the loop structure, iteration variables, and configuration. Within the loop wrapper, you need to define the following.

- `loop_inputs` are user-provided values and equivalent to [`~modular_pipelines.ModularPipelineBlocks.inputs`].
- `loop_intermediate_outputs` are new intermediate variables created by the block and added to the [`~modular_pipelines.PipelineState`]. It is equivalent to [`~modular_pipelines.ModularPipelineBlocks.intermediate_outputs`]. - `__call__` method defines the loop structure and iteration logic. ```py import torch from diffusers.modular_pipelines import LoopSequentialPipelineBlocks, ModularPipelineBlocks, InputParam, OutputParam class LoopWrapper(LoopSequentialPipelineBlocks): model_name = "test" @property def description(self): return "I'm a loop!!" @property def loop_inputs(self): return [InputParam(name="num_steps")] @torch.no_grad() def __call__(self, components, state): block_state = self.get_block_state(state) # Loop structure - can be customized to your needs for i in range(block_state.num_steps): # loop_step executes all registered blocks in sequence components, block_state = self.loop_step(components, block_state, i=i) self.set_block_state(state, block_state) return components, state ``` The loop wrapper can pass additional arguments, like current iteration index, to the loop blocks. ## Loop blocks A loop block is a [`~modular_pipelines.ModularPipelineBlocks`], but the `__call__` method behaves differently. - It receives the iteration variable from the loop wrapper. - It works directly with the [`~modular_pipelines.BlockState`] instead of the [`~modular_pipelines.PipelineState`]. - It doesn't require retrieving or updating the [`~modular_pipelines.BlockState`]. Loop blocks share the same [`~modular_pipelines.BlockState`] to allow values to accumulate and change for each iteration in the loop. 
```py
class LoopBlock(ModularPipelineBlocks):
    model_name = "test"

    @property
    def inputs(self):
        return [InputParam(name="x")]

    @property
    def intermediate_outputs(self):
        # outputs produced by this block
        return [OutputParam(name="x")]

    @property
    def description(self):
        return "I'm a block used inside the `LoopWrapper` class"

    def __call__(self, components, block_state, i: int):
        block_state.x += 1
        return components, block_state
```

## LoopSequentialPipelineBlocks

Use the [`~modular_pipelines.LoopSequentialPipelineBlocks.from_blocks_dict`] method to add the loop block to the loop wrapper to create [`~modular_pipelines.LoopSequentialPipelineBlocks`].

```py
loop = LoopWrapper.from_blocks_dict({"block1": LoopBlock})
```

Add more loop blocks to run within each iteration with [`~modular_pipelines.LoopSequentialPipelineBlocks.from_blocks_dict`]. This allows you to modify the blocks without changing the loop logic itself.

```py
loop = LoopWrapper.from_blocks_dict({"block1": LoopBlock, "block2": LoopBlock})
```

================================================
FILE: docs/source/en/modular_diffusers/mellon.md
================================================

## Using Custom Blocks with Mellon

[Mellon](https://github.com/cubiq/Mellon) is a visual workflow interface that integrates with Modular Diffusers and is designed for node-based workflows.

> [!WARNING]
> Mellon is in early development and not ready for production use yet. Consider this a sneak peek of how the integration works!

Custom blocks work in Mellon out of the box. You just need to add a `mellon_pipeline_config.json` to your repository. This config file tells Mellon how to render your block's parameters as UI components.
Here's what it looks like in action with the [Gemini Prompt Expander](https://huggingface.co/diffusers/gemini-prompt-expander-mellon) block: ![Mellon custom block demo](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/modular_demo_dynamic.gif) To use a modular diffusers custom block in Mellon: 1. Drag a **Dynamic Block Node** from the ModularDiffusers section 2. Enter the `repo_id` (e.g., `diffusers/gemini-prompt-expander-mellon`) 3. Click **Load Custom Block** 4. The node transforms to show your block's inputs and outputs Now let's walk through how to create this config for your own custom block. ## Steps to create a Mellon config 1. **Specify Mellon types for your parameters** - Each `InputParam`/`OutputParam` needs a type that tells Mellon what UI component to render (e.g., `"textbox"`, `"dropdown"`, `"image"`). 2. **Generate `mellon_pipeline_config.json`** - Use our utility to generate a config template and push it to your Hub repository. 3. **(Optional) Manually adjust the config** - Fine-tune the generated config for your specific needs. ## Specify Mellon types for parameters Mellon types determine how each parameter renders in the UI. If you don't specify a type for a parameter, it will default to `"custom"`, which renders as a simple connection dot. You can always adjust this later in the generated config. | Type | Input/Output | Description | |------|--------------|-------------| | `image` | Both | Image (PIL Image) | | `video` | Both | Video | | `text` | Both | Text display | | `textbox` | Input | Text input | | `dropdown` | Input | Dropdown selection menu | | `slider` | Input | Slider for numeric values | | `number` | Input | Numeric input | | `checkbox` | Input | Boolean toggle | For parameters that need more configuration (like dropdowns with options, or sliders with min/max values), pass a `MellonParam` instance directly instead of a string. 
You can use one of the class methods below, or create a fully custom one with `MellonParam(name, label, type, ...)`. | Method | Description | |--------|-------------| | `MellonParam.Input.image(name)` | Image input | | `MellonParam.Input.textbox(name, default)` | Text input as textarea | | `MellonParam.Input.dropdown(name, options, default)` | Dropdown selection | | `MellonParam.Input.slider(name, default, min, max, step)` | Slider for numeric values | | `MellonParam.Input.number(name, default, min, max, step)` | Numeric input (no slider) | | `MellonParam.Input.seed(name, default)` | Seed input with randomize button | | `MellonParam.Input.checkbox(name, default)` | Boolean checkbox | | `MellonParam.Input.model(name)` | Model input for diffusers components | | `MellonParam.Output.image(name)` | Image output | | `MellonParam.Output.video(name)` | Video output | | `MellonParam.Output.text(name)` | Text output | | `MellonParam.Output.model(name)` | Model output for diffusers components | Choose one of the methods below to specify a Mellon type. ### Using `metadata` in block definitions If you're defining a custom block from scratch, add `metadata={"mellon": ""}` directly to your `InputParam` and `OutputParam` definitions. If you're editing an existing custom block from the Hub, see [Editing custom blocks](./custom_blocks#editing-custom-blocks) for how to download it locally. 
```python
from typing import List

from diffusers.modular_pipelines import InputParam, ModularPipelineBlocks, OutputParam


class GeminiPromptExpander(ModularPipelineBlocks):

    @property
    def inputs(self) -> List[InputParam]:
        return [
            InputParam(
                "prompt",
                type_hint=str,
                required=True,
                description="Prompt to use",
                metadata={"mellon": "textbox"},  # Text input
            )
        ]

    @property
    def intermediate_outputs(self) -> List[OutputParam]:
        return [
            OutputParam(
                "prompt",
                type_hint=str,
                description="Expanded prompt by the LLM",
                metadata={"mellon": "text"},  # Text output
            ),
            OutputParam(
                "old_prompt",
                type_hint=str,
                description="Old prompt provided by the user",
                # No metadata - we don't want to render this in UI
            )
        ]
```

For full control over UI configuration, pass a `MellonParam` instance directly:

```python
from diffusers.modular_pipelines import InputParam
from diffusers.modular_pipelines.mellon_node_utils import MellonParam

InputParam(
    "mode",
    type_hint=str,
    default="balanced",
    metadata={"mellon": MellonParam.Input.dropdown("mode", options=["fast", "balanced", "quality"])},
)
```

### Using `input_types` and `output_types` when generating the config

If you're working with an existing pipeline or prefer to keep your block definitions clean, specify types when generating the config using the `input_types` and `output_types` arguments:

```python
from diffusers.modular_pipelines.mellon_node_utils import MellonPipelineConfig

# `blocks` is the custom block you loaded or defined earlier
mellon_config = MellonPipelineConfig.from_custom_block(
    blocks,
    input_types={"prompt": "textbox"},
    output_types={"prompt": "text"}
)
```

> [!NOTE]
> When both `metadata` and `input_types`/`output_types` are specified, the arguments override `metadata`.
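The precedence rule can be pictured with a small, library-free sketch. The helper `resolve_mellon_types` below is hypothetical and is not part of Diffusers or Mellon; it only illustrates the behavior described in this guide, assuming that explicit arguments win over `metadata` and that anything left unspecified falls back to `"custom"`.

```python
from typing import Dict, List, Optional


def resolve_mellon_types(
    param_names: List[str],
    metadata_types: Dict[str, str],
    override_types: Optional[Dict[str, str]] = None,
    default: str = "custom",
) -> Dict[str, str]:
    """Hypothetical helper: pick a Mellon type for each parameter name."""
    override_types = override_types or {}
    resolved = {}
    for name in param_names:
        if name in override_types:
            # Explicit input_types/output_types arguments take precedence
            resolved[name] = override_types[name]
        elif name in metadata_types:
            # Otherwise fall back to the type declared in `metadata`
            resolved[name] = metadata_types[name]
        else:
            # Unspecified parameters render as a plain connection dot
            resolved[name] = default
    return resolved


resolved = resolve_mellon_types(
    ["prompt", "old_prompt"],
    metadata_types={"prompt": "textbox"},
    override_types={"prompt": "dropdown"},
)
print(resolved)
```

Here `prompt` ends up with the type passed as an argument, while `old_prompt`, which has neither metadata nor an override, falls back to the default.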
## Generate and push the Mellon config After adding metadata to your block, generate the default Mellon configuration template and push it to the Hub: ```python from diffusers import ModularPipelineBlocks from diffusers.modular_pipelines.mellon_node_utils import MellonPipelineConfig # load your custom blocks from your local dir blocks = ModularPipelineBlocks.from_pretrained("/path/local/folder", trust_remote_code=True) # Generate the default config template mellon_config = MellonPipelineConfig.from_custom_block(blocks) # push the default template to `repo_id`, you will need to pass the same local folder path so that it will save the config locally first mellon_config.save( local_dir="/path/local/folder", repo_id= repo_id, push_to_hub=True ) ``` This creates a `mellon_pipeline_config.json` file in your repository. ## Review and adjust the config The generated template is a starting point - you may want to adjust it for your needs. Let's walk through the generated config for the Gemini Prompt Expander: ```json { "label": "Gemini Prompt Expander", "default_repo": "", "default_dtype": "", "node_params": { "custom": { "params": { "prompt": { "label": "Prompt", "type": "string", "display": "textarea", "default": "" }, "out_prompt": { "label": "Prompt", "type": "string", "display": "output" }, "old_prompt": { "label": "Old Prompt", "type": "custom", "display": "output" }, "doc": { "label": "Doc", "type": "string", "display": "output" } }, "input_names": ["prompt"], "model_input_names": [], "output_names": ["out_prompt", "old_prompt", "doc"], "block_name": "custom", "node_type": "custom" } } } ``` ### Understanding the Structure The `params` dict defines how each UI element renders. 
The `input_names`, `model_input_names`, and `output_names` lists map these UI elements to the underlying [`ModularPipelineBlocks`]'s I/O interface: | Mellon Config | ModularPipelineBlocks | |---------------|----------------------| | `input_names` | `inputs` property | | `model_input_names` | `expected_components` property | | `output_names` | `intermediate_outputs` property | In this example: `prompt` is the only input. There are no model components, and outputs include `out_prompt`, `old_prompt`, and `doc`. Now let's look at the `params` dict: - **`prompt`**: An input parameter with `display: "textarea"` (renders as a text input box), `label: "Prompt"` (shown in the UI), and `default: ""` (starts empty). The `type: "string"` field is important in Mellon because it determines which nodes can connect together - only matching types can be linked with "noodles". - **`out_prompt`**: The expanded prompt output. The `out_` prefix was automatically added because the input and output share the same name (`prompt`), avoiding naming conflicts in the config. It has `display: "output"` which renders as an output socket. - **`old_prompt`**: Has `type: "custom"` because we didn't specify metadata. This renders as a simple dot in the UI. Since we don't actually want to expose this in the UI, we can remove it. - **`doc`**: The documentation output, automatically added to all custom blocks. ### Making Adjustments Remove `old_prompt` from both `params` and `output_names` because you won't need to use it. 
```json { "label": "Gemini Prompt Expander", "default_repo": "", "default_dtype": "", "node_params": { "custom": { "params": { "prompt": { "label": "Prompt", "type": "string", "display": "textarea", "default": "" }, "out_prompt": { "label": "Prompt", "type": "string", "display": "output" }, "doc": { "label": "Doc", "type": "string", "display": "output" } }, "input_names": ["prompt"], "model_input_names": [], "output_names": ["out_prompt", "doc"], "block_name": "custom", "node_type": "custom" } } } ``` See the final config at [diffusers/gemini-prompt-expander-mellon](https://huggingface.co/diffusers/gemini-prompt-expander-mellon). ================================================ FILE: docs/source/en/modular_diffusers/modular_diffusers_states.md ================================================ # States Blocks rely on the [`~modular_pipelines.PipelineState`] and [`~modular_pipelines.BlockState`] data structures for communicating and sharing data. | State | Description | |-------|-------------| | [`~modular_pipelines.PipelineState`] | Maintains the overall data required for a pipeline's execution and allows blocks to read and update its data. | | [`~modular_pipelines.BlockState`] | Allows each block to perform its computation with the necessary data from `inputs`| This guide explains how states work and how they connect blocks. ## PipelineState The [`~modular_pipelines.PipelineState`] is a global state container for all blocks. It maintains the complete runtime state of the pipeline and provides a structured way for blocks to read from and write to shared data. [`~modular_pipelines.PipelineState`] stores all data in a `values` dict, which is a **mutable** state containing user provided input values and intermediate output values generated by blocks. If a block modifies an `input`, it will be reflected in the `values` dict after calling `set_block_state`. 
```py
PipelineState(
  values={
    'prompt': 'a cat'
    'guidance_scale': 7.0
    'num_inference_steps': 25
    'prompt_embeds': Tensor(dtype=torch.float32, shape=torch.Size([1, 1, 1, 1]))
    'negative_prompt_embeds': None
  },
)
```

## BlockState

The [`~modular_pipelines.BlockState`] is a local view of the relevant variables an individual block needs from [`~modular_pipelines.PipelineState`] for performing its computations.

Access these variables directly as attributes like `block_state.image`.

```py
BlockState(
    image:
)
```

When a block's `__call__` method is executed, it retrieves the [`BlockState`] with `self.get_block_state(state)`, performs its operations, and updates [`~modular_pipelines.PipelineState`] with `self.set_block_state(state, block_state)`.

```py
def __call__(self, components, state):
    # retrieve BlockState
    block_state = self.get_block_state(state)

    # computation logic on inputs

    # update PipelineState
    self.set_block_state(state, block_state)
    return components, state
```

## State interaction

[`~modular_pipelines.PipelineState`] and [`~modular_pipelines.BlockState`] interaction is defined by a block's `inputs` and `intermediate_outputs`.

- `inputs`: a block can modify an input, like `block_state.image`, and the change is propagated globally to [`~modular_pipelines.PipelineState`] by calling `set_block_state`.
- `intermediate_outputs`: a new variable that a block creates. It is added to the [`~modular_pipelines.PipelineState`]'s `values` dict and is available to subsequent blocks, or to users as a final output from the pipeline.

================================================
FILE: docs/source/en/modular_diffusers/modular_pipeline.md
================================================

# ModularPipeline

[`ModularPipeline`] converts [`~modular_pipelines.ModularPipelineBlocks`] into an executable pipeline that loads models and performs the computation steps defined in the blocks.
It is the main interface for running a pipeline and the API is very similar to [`DiffusionPipeline`] but with a few key differences. - **Loading is lazy.** With [`DiffusionPipeline`], [`~DiffusionPipeline.from_pretrained`] creates the pipeline and loads all models at the same time. With [`ModularPipeline`], creating and loading are two separate steps: [`~ModularPipeline.from_pretrained`] reads the configuration and knows where to load each component from, but doesn't actually load the model weights. You load the models later with [`~ModularPipeline.load_components`], which is where you pass loading arguments like `torch_dtype` and `quantization_config`. - **Two ways to create a pipeline.** You can use [`~ModularPipeline.from_pretrained`] with an existing diffusers model repository — it automatically maps to the default pipeline blocks and then converts to a [`ModularPipeline`] with no extra setup. You can check the [modular_pipelines_directory](https://github.com/huggingface/diffusers/tree/main/src/diffusers/modular_pipelines) to see which models are currently supported. You can also assemble your own pipeline from [`ModularPipelineBlocks`] and convert it with the [`~ModularPipelineBlocks.init_pipeline`] method (see [Creating a pipeline](#creating-a-pipeline) for more details). - **Running the pipeline is the same.** Once loaded, you call the pipeline with the same arguments you're used to. A single [`ModularPipeline`] can support multiple workflows (text-to-image, image-to-image, inpainting, etc.) when the pipeline blocks use [`AutoPipelineBlocks`](./auto_pipeline_blocks) to automatically select the workflow based on your inputs. Below are complete examples for text-to-image, image-to-image, and inpainting with SDXL. 
```py import torch from diffusers import ModularPipeline pipeline = ModularPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0") pipeline.load_components(torch_dtype=torch.float16) pipeline.to("cuda") image = pipeline(prompt="Astronaut in a jungle, cold color palette, muted colors, detailed, 8k").images[0] image.save("modular_t2i_out.png") ``` ```py import torch from diffusers import ModularPipeline from diffusers.utils import load_image pipeline = ModularPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0") pipeline.load_components(torch_dtype=torch.float16) pipeline.to("cuda") url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png" init_image = load_image(url) prompt = "a dog catching a frisbee in the jungle" image = pipeline(prompt=prompt, image=init_image, strength=0.8).images[0] image.save("modular_i2i_out.png") ``` ```py import torch from diffusers import ModularPipeline from diffusers.utils import load_image pipeline = ModularPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0") pipeline.load_components(torch_dtype=torch.float16) pipeline.to("cuda") img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png" mask_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-inpaint-mask.png" init_image = load_image(img_url) mask_image = load_image(mask_url) prompt = "A deep sea diver floating" image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.85).images[0] image.save("modular_inpaint_out.png") ``` This guide will show you how to create a [`ModularPipeline`], manage its components, and run the pipeline. ## Creating a pipeline There are two ways to create a [`ModularPipeline`]. 
Assemble and create a pipeline from [`ModularPipelineBlocks`] with [`~ModularPipelineBlocks.init_pipeline`], or load an existing pipeline with [`~ModularPipeline.from_pretrained`]. You can also initialize a [`ComponentsManager`](./components_manager) to handle device placement and memory management. If you don't need automatic offloading, you can skip this and move the pipeline to your device manually with `pipeline.to("cuda")`. > [!TIP] > Refer to the [ComponentsManager](./components_manager) doc for more details about how it can help manage components across different workflows. ### init_pipeline [`~ModularPipelineBlocks.init_pipeline`] converts any [`ModularPipelineBlocks`] into a [`ModularPipeline`]. Let's define a minimal block to see how it works: ```py from transformers import CLIPTextModel from diffusers.modular_pipelines import ( ComponentSpec, ModularPipelineBlocks, PipelineState, ) class MyBlock(ModularPipelineBlocks): @property def expected_components(self): return [ ComponentSpec( name="text_encoder", type_hint=CLIPTextModel, pretrained_model_name_or_path="openai/clip-vit-large-patch14", ), ] def __call__(self, components, state: PipelineState) -> PipelineState: return components, state ``` Call [`~ModularPipelineBlocks.init_pipeline`] to convert it into a pipeline. The `blocks` attribute on the pipeline is the blocks it was created from — it determines the expected inputs, outputs, and computation logic. ```py block = MyBlock() pipe = block.init_pipeline() pipe.blocks ``` ``` MyBlock { "_class_name": "MyBlock", "_diffusers_version": "0.37.0.dev0" } ``` > [!WARNING] > Blocks are mutable — you can freely add, remove, or swap blocks before creating a pipeline. However, once a pipeline is created, modifying `pipeline.blocks` won't affect the pipeline because it returns a copy. If you want a different block structure, create a new pipeline after modifying the blocks. 
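The copy semantics in the warning above can be illustrated without Diffusers at all. The `ToyPipeline` class below is a hypothetical stand-in, not the actual `ModularPipeline` implementation; it only assumes, as the warning states, that the `blocks` attribute hands back a copy rather than the internal object.

```python
import copy


class ToyPipeline:
    """Hypothetical stand-in for a pipeline whose `blocks` property returns a copy."""

    def __init__(self, blocks):
        self._blocks = blocks

    @property
    def blocks(self):
        # Each access returns a fresh deep copy, so callers never touch
        # the internal block structure directly.
        return copy.deepcopy(self._blocks)


pipe = ToyPipeline({"text_encoder": "TextEncoderBlock", "denoise": "DenoiseBlock"})

# Editing the returned copy...
edited = pipe.blocks
edited.pop("text_encoder")

# ...leaves the pipeline's own blocks untouched.
remaining = list(pipe.blocks)
print(remaining)
print(list(edited))
```

To actually change the structure in this toy model, you would build a new `ToyPipeline` from `edited`, which mirrors the advice to create a new pipeline after modifying the blocks.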
When you call [`~ModularPipelineBlocks.init_pipeline`] without a repository, it uses the `pretrained_model_name_or_path` defined in the block's [`ComponentSpec`] to determine where to load each component from. Printing the pipeline shows the component loading configuration. ```py pipe ModularPipeline { "_blocks_class_name": "MyBlock", "_class_name": "ModularPipeline", "_diffusers_version": "0.37.0.dev0", "text_encoder": [ null, null, { "pretrained_model_name_or_path": "openai/clip-vit-large-patch14", "revision": null, "subfolder": "", "type_hint": [ "transformers", "CLIPTextModel" ], "variant": null } ] } ``` If you pass a repository to [`~ModularPipelineBlocks.init_pipeline`], it overrides the loading path by matching your block's components against the pipeline config in that repository (`model_index.json` or `modular_model_index.json`). In the example below, the `pretrained_model_name_or_path` will be updated to `"stabilityai/stable-diffusion-xl-base-1.0"`. ```py pipe = block.init_pipeline("stabilityai/stable-diffusion-xl-base-1.0") pipe ModularPipeline { "_blocks_class_name": "MyBlock", "_class_name": "ModularPipeline", "_diffusers_version": "0.37.0.dev0", "text_encoder": [ null, null, { "pretrained_model_name_or_path": "stabilityai/stable-diffusion-xl-base-1.0", "revision": null, "subfolder": "text_encoder", "type_hint": [ "transformers", "CLIPTextModel" ], "variant": null } ] } ``` If a component in your block doesn't exist in the repository, it remains `null` and is skipped during [`~ModularPipeline.load_components`]. ### from_pretrained [`~ModularPipeline.from_pretrained`] is a convenient way to create a [`ModularPipeline`] without defining blocks yourself. It works with three types of repositories. **A regular diffusers repository.** Pass any supported model repository and it automatically maps to the default pipeline blocks. Currently supported models include SDXL, Wan, Qwen, Z-Image, Flux, and Flux2. 
```py from diffusers import ModularPipeline, ComponentsManager components = ComponentsManager() pipeline = ModularPipeline.from_pretrained( "stabilityai/stable-diffusion-xl-base-1.0", components_manager=components ) ``` **A modular repository.** These repositories contain a `modular_model_index.json` that specifies where to load each component from — the components can come from different repositories and the modular repository itself may not contain any model weights. For example, [diffusers/flux2-bnb-4bit-modular](https://huggingface.co/diffusers/flux2-bnb-4bit-modular) loads a quantized transformer from one repository and the remaining components from another. See [Modular repository](#modular-repository) for more details on the format. ```py from diffusers import ModularPipeline, ComponentsManager components = ComponentsManager() pipeline = ModularPipeline.from_pretrained( "diffusers/flux2-bnb-4bit-modular", components_manager=components ) ``` **A modular repository with custom code.** Some repositories include custom pipeline blocks alongside the loading configuration. Add `trust_remote_code=True` to load them. See [Custom blocks](./custom_blocks) for how to create your own. ```py from diffusers import ModularPipeline, ComponentsManager components = ComponentsManager() pipeline = ModularPipeline.from_pretrained( "diffusers/Florence2-image-Annotator", trust_remote_code=True, components_manager=components ) ``` ## Loading components A [`ModularPipeline`] doesn't automatically instantiate with components. It only loads the configuration and component specifications. You can load components with [`~ModularPipeline.load_components`]. This will load all the components that have a valid loading spec. ```py import torch pipeline.load_components(torch_dtype=torch.float16) ``` You can also load specific components by name. The example below only loads the `text_encoder`. 
```py pipeline.load_components(names=["text_encoder"], torch_dtype=torch.float16) ``` After loading, printing the pipeline shows which components are loaded — the first two fields change from `null` to the component's library and class. ```py pipeline ``` ``` # text_encoder is loaded - shows library and class "text_encoder": [ "transformers", "CLIPTextModel", { ... } ] # unet is not loaded yet - still null "unet": [ null, null, { ... } ] ``` Loading keyword arguments like `torch_dtype`, `variant`, `revision`, and `quantization_config` are passed through to `from_pretrained()` for each component. You can pass a single value to apply to all components, or a dict to set per-component values. ```py # apply bfloat16 to all components pipeline.load_components(torch_dtype=torch.bfloat16) # different dtypes per component pipeline.load_components(torch_dtype={"transformer": torch.bfloat16, "default": torch.float32}) ``` [`~ModularPipeline.load_components`] only loads components that haven't been loaded yet and have a valid loading spec. This means if you've already set a component on the pipeline, calling [`~ModularPipeline.load_components`] again won't reload it. ## Updating components [`~ModularPipeline.update_components`] replaces a component on the pipeline with a new one. When a component is updated, the loading specifications are also updated in the pipeline config and [`~ModularPipeline.load_components`] will skip it on subsequent calls. ### From AutoModel You can pass a model object loaded with `AutoModel.from_pretrained()`. Models loaded this way are automatically tagged with their loading information. ```py from diffusers import AutoModel unet = AutoModel.from_pretrained( "RunDiffusion/Juggernaut-XL-v9", subfolder="unet", variant="fp16", torch_dtype=torch.float16 ) pipeline.update_components(unet=unet) ``` ### From ComponentSpec Use [`~ModularPipeline.get_component_spec`] to get a copy of the current component specification, modify it, and load a new component. 
```py unet_spec = pipeline.get_component_spec("unet") # modify to load from a different repository unet_spec.pretrained_model_name_or_path = "RunDiffusion/Juggernaut-XL-v9" # load and update unet = unet_spec.load(torch_dtype=torch.float16) pipeline.update_components(unet=unet) ``` You can also create a [`ComponentSpec`] from scratch. Not all components are loaded from pretrained weights — some are created from a config (listed under `pipeline.config_component_names`). For these, use [`~ComponentSpec.create`] instead of [`~ComponentSpec.load`]. ```py guider_spec = pipeline.get_component_spec("guider") guider_spec.config = {"guidance_scale": 5.0} guider = guider_spec.create() pipeline.update_components(guider=guider) ``` Or simply pass the object directly. ```py from diffusers.guiders import ClassifierFreeGuidance guider = ClassifierFreeGuidance(guidance_scale=5.0) pipeline.update_components(guider=guider) ``` See the [Guiders](../using-diffusers/guiders) guide for more details on available guiders and how to configure them. ## Splitting a pipeline into stages Since blocks are composable, you can take a pipeline apart and reconstruct it into separate pipelines for each stage. The example below shows how we can separate the text encoder block from the rest of the pipeline, so you can encode the prompt independently and pass the embeddings to the main pipeline. 
```py from diffusers import ModularPipeline, ComponentsManager import torch device = "cuda" dtype = torch.bfloat16 repo_id = "black-forest-labs/FLUX.2-klein-4B" # get the blocks and separate out the text encoder blocks = ModularPipeline.from_pretrained(repo_id).blocks text_block = blocks.sub_blocks.pop("text_encoder") # use ComponentsManager to handle offloading across multiple pipelines manager = ComponentsManager() manager.enable_auto_cpu_offload(device=device) # create separate pipelines for each stage text_encoder_pipeline = text_block.init_pipeline(repo_id, components_manager=manager) pipeline = blocks.init_pipeline(repo_id, components_manager=manager) # encode text text_encoder_pipeline.load_components(torch_dtype=dtype) text_embeddings = text_encoder_pipeline(prompt="a cat").get_by_kwargs("denoiser_input_fields") # denoise and decode pipeline.load_components(torch_dtype=dtype) output = pipeline( **text_embeddings, num_inference_steps=4, ).images[0] ``` [`ComponentsManager`] handles memory across multiple pipelines. Unlike the offloading strategies in [`DiffusionPipeline`] that follow a fixed order, [`ComponentsManager`] makes offloading decisions dynamically each time a model forward pass runs, based on the current memory situation. This means it works regardless of how many pipelines you create or what order you run them in. See the [ComponentsManager](./components_manager) guide for more details. If pipeline stages share components (e.g., the same VAE used for encoding and decoding), you can use [`~ModularPipeline.update_components`] to pass an already-loaded component to another pipeline instead of loading it again. ## Modular repository A repository is required if the pipeline blocks use *pretrained components*. The repository supplies loading specifications and metadata. [`ModularPipeline`] works with regular diffusers repositories out of the box. However, you can also create a *modular repository* for more flexibility. 
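As a rough illustration of what a modular repository records, the sketch below parses an index in plain Python and reports which components are loaded. It assumes the three-element entry layout shown in the printed pipeline output earlier in this guide (library, class, and a dict of loading specs). The sample JSON, the `_class_name` value, and the helper name `component_status` are made up for illustration and are not part of the Diffusers API.

```python
import json

# Hypothetical index contents, modeled on the printed pipeline output in this guide.
SAMPLE_INDEX = """
{
  "_class_name": "ExampleModularPipeline",
  "text_encoder": [
    "transformers",
    "CLIPTextModel",
    {"pretrained_model_name_or_path": "stabilityai/stable-diffusion-xl-base-1.0", "subfolder": "text_encoder"}
  ],
  "unet": [
    null,
    null,
    {"pretrained_model_name_or_path": "stabilityai/stable-diffusion-xl-base-1.0", "subfolder": "unet"}
  ]
}
"""


def component_status(index):
    """Summarize each component entry (hypothetical helper, not a Diffusers function)."""
    status = {}
    for name, value in index.items():
        # Component entries are assumed to be [library, class, loading specs];
        # anything else (such as "_class_name") is treated as metadata and skipped.
        if isinstance(value, list) and len(value) == 3 and isinstance(value[2], dict):
            library, class_name, specs = value
            status[name] = {
                "loaded": library is not None and class_name is not None,
                "repo": specs.get("pretrained_model_name_or_path"),
                "subfolder": specs.get("subfolder"),
            }
    return status


status = component_status(json.loads(SAMPLE_INDEX))
for name, info in status.items():
    state = "loaded" if info["loaded"] else "not loaded"
    print(f"{name}: {state} (from {info['repo']}, subfolder={info['subfolder']})")
```

Because every entry carries its own repository and subfolder, nothing in this layout requires the components to live in the same repository.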
A modular repository contains a `modular_model_index.json` file that records the following 3 elements for each component. - `library` and `class` show which library the component was loaded from and its class. If `null`, the component hasn't been loaded yet. - `loading_specs_dict` contains the information required to load the component such as the repository and subfolder it is loaded from. The key advantage of a modular repository is that components can be loaded from different repositories. For example, [diffusers/flux2-bnb-4bit-modular](https://huggingface.co/diffusers/flux2-bnb-4bit-modular) loads a quantized transformer from `diffusers/FLUX.2-dev-bnb-4bit` while loading the remaining components from `black-forest-labs/FLUX.2-dev`. To convert a regular diffusers repository into a modular one, create the pipeline using the regular repository, and then push to the Hub. The saved repository will contain a `modular_model_index.json` with all the loading specifications. ```py from diffusers import ModularPipeline # load from a regular repo pipeline = ModularPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0") # push as a modular repository pipeline.save_pretrained("local/path", repo_id="my-username/sdxl-modular", push_to_hub=True) ``` A modular repository can also include custom pipeline blocks as Python code. This allows you to share specialized blocks that aren't native to Diffusers.
For example, [diffusers/Florence2-image-Annotator](https://huggingface.co/diffusers/Florence2-image-Annotator) contains custom blocks alongside the loading configuration: ``` Florence2-image-Annotator/ ├── block.py # Custom pipeline blocks implementation ├── config.json # Pipeline configuration and auto_map ├── mellon_config.json # UI configuration for Mellon └── modular_model_index.json # Component loading specifications ``` The `config.json` file contains an `auto_map` key that tells [`ModularPipeline`] where to find the custom blocks: ```json { "_class_name": "Florence2AnnotatorBlocks", "auto_map": { "ModularPipelineBlocks": "block.Florence2AnnotatorBlocks" } } ``` Load custom code repositories with `trust_remote_code=True` as shown in [from_pretrained](#from_pretrained). See [Custom blocks](./custom_blocks) for how to create and share your own. ================================================ FILE: docs/source/en/modular_diffusers/overview.md ================================================ # Overview > [!WARNING] > Modular Diffusers is under active development and its API may change. Modular Diffusers is a unified pipeline system that simplifies your workflow with *pipeline blocks*. - Blocks are reusable and you only need to create new blocks that are unique to your pipeline. - Blocks can be mixed and matched to adapt to or create a pipeline for a specific workflow or multiple workflows. The Modular Diffusers docs are organized as shown below. ## Quickstart - The [quickstart](./quickstart) shows you how to run a modular pipeline, understand its structure, and customize it by modifying the blocks that compose it. ## ModularPipelineBlocks - [States](./modular_diffusers_states) explains how data is shared and communicated between blocks and [`ModularPipeline`]. - [ModularPipelineBlocks](./pipeline_block) is the most basic unit of a [`ModularPipeline`] and this guide shows you how to create one.
- [SequentialPipelineBlocks](./sequential_pipeline_blocks) is a type of block that chains multiple blocks so they run one after another, passing data along the chain. This guide shows you how to create [`~modular_pipelines.SequentialPipelineBlocks`] and how they connect and work together. - [LoopSequentialPipelineBlocks](./loop_sequential_pipeline_blocks) is a type of block that runs a series of blocks in a loop. This guide shows you how to create [`~modular_pipelines.LoopSequentialPipelineBlocks`]. - [AutoPipelineBlocks](./auto_pipeline_blocks) is a type of block that automatically chooses which blocks to run based on the input. This guide shows you how to create [`~modular_pipelines.AutoPipelineBlocks`]. - [Building Custom Blocks](./custom_blocks) shows you how to create your own custom blocks and share them on the Hub. ## ModularPipeline - [ModularPipeline](./modular_pipeline) shows you how to create and convert pipeline blocks into an executable [`ModularPipeline`]. - [ComponentsManager](./components_manager) shows you how to manage and reuse components across multiple pipelines. - [Guiders](../using-diffusers/guiders) shows you how to use different guidance methods in the pipeline. ## Mellon Integration - [Using Custom Blocks with Mellon](./mellon) shows you how to make your custom blocks work with [Mellon](https://github.com/cubiq/Mellon), a visual node-based interface for building workflows. ================================================ FILE: docs/source/en/modular_diffusers/pipeline_block.md ================================================ # ModularPipelineBlocks [`~modular_pipelines.ModularPipelineBlocks`] is the basic block for building a [`ModularPipeline`]. It defines what components, inputs/outputs, and computation a block should perform for a specific step in a pipeline. A [`~modular_pipelines.ModularPipelineBlocks`] connects with other blocks, using [state](./modular_diffusers_states), to enable the modular construction of workflows. 
A [`~modular_pipelines.ModularPipelineBlocks`] on its own can't be executed. It is a blueprint for what a step should do in a pipeline. To actually run and execute a pipeline, the [`~modular_pipelines.ModularPipelineBlocks`] needs to be converted into a [`ModularPipeline`]. This guide will show you how to create a [`~modular_pipelines.ModularPipelineBlocks`]. ## Inputs and outputs > [!TIP] > Refer to the [States](./modular_diffusers_states) guide if you aren't familiar with how state works in Modular Diffusers. A [`~modular_pipelines.ModularPipelineBlocks`] requires `inputs` and `intermediate_outputs`. - `inputs` are values a block reads from the [`~modular_pipelines.PipelineState`] to perform its computation. These can be values provided by a user (like a prompt or image) or values produced by a previous block (like encoded `image_latents`). Use `InputParam` to define `inputs`. ```py class ImageEncodeStep(ModularPipelineBlocks): ... @property def inputs(self): return [ InputParam(name="image", type_hint="PIL.Image", required=True, description="raw input image to process"), ] ... ``` - `intermediate_outputs` are new values created by a block and added to the [`~modular_pipelines.PipelineState`]. The `intermediate_outputs` are available as `inputs` for subsequent blocks or available as the final output from running the pipeline. Use `OutputParam` to define `intermediate_outputs`. ```py class ImageEncodeStep(ModularPipelineBlocks): ... @property def intermediate_outputs(self): return [ OutputParam(name="image_latents", description="latents representing the image"), ] ... ``` The intermediate inputs and outputs share data to connect blocks. They are accessible at any point, allowing you to track the workflow's progress. ## Components and configs The components and pipeline-level configs a block needs are specified in [`ComponentSpec`] and [`~modular_pipelines.ConfigSpec`]. - [`ComponentSpec`] contains the expected components used by a block.
You need the `name` of the component and ideally a `type_hint` that specifies exactly what the component is. - [`~modular_pipelines.ConfigSpec`] contains pipeline-level settings that control behavior across all blocks. ```py class ImageEncodeStep(ModularPipelineBlocks): ... @property def expected_components(self): return [ ComponentSpec(name="vae", type_hint=AutoencoderKL), ] @property def expected_configs(self): return [ ConfigSpec("force_zeros_for_empty_prompt", True), ] ... ``` When the blocks are converted into a pipeline, the components become available to the block as the first argument in `__call__`. ## Computation logic The computation a block performs is defined in the `__call__` method and it follows a specific structure. 1. Retrieve the [`~modular_pipelines.BlockState`] to get a local view of the `inputs`. 2. Implement the computation logic on the `inputs`. 3. Update [`~modular_pipelines.PipelineState`] to push changes from the local [`~modular_pipelines.BlockState`] back to the global [`~modular_pipelines.PipelineState`]. 4. Return the components and state which becomes available to the next block. ```py class ImageEncodeStep(ModularPipelineBlocks): def __call__(self, components, state): # Get a local view of the state variables this block needs block_state = self.get_block_state(state) # Your computation logic here # block_state contains all your inputs # Access them like: block_state.image, block_state.processed_image # Update the pipeline state with your updated block_states self.set_block_state(state, block_state) return components, state ``` ## Putting it all together Here is the complete block with all the pieces connected. ```py from diffusers import ComponentSpec, AutoencoderKL from diffusers.modular_pipelines import InputParam, ModularPipelineBlocks, OutputParam class ImageEncodeStep(ModularPipelineBlocks): @property def description(self): return "Encode an image into latent space." 
@property def expected_components(self): return [ ComponentSpec(name="vae", type_hint=AutoencoderKL), ] @property def inputs(self): return [ InputParam(name="image", type_hint="PIL.Image", required=True, description="raw input image to process"), ] @property def intermediate_outputs(self): return [ OutputParam(name="image_latents", type_hint="torch.Tensor", description="latents representing the image"), ] def __call__(self, components, state): block_state = self.get_block_state(state) block_state.image_latents = components.vae.encode(block_state.image) self.set_block_state(state, block_state) return components, state ``` Every block has a `doc` property that is automatically generated from the properties you defined above. It provides a summary of the block's description, components, inputs, and outputs. ```py block = ImageEncodeStep() print(block.doc) class ImageEncodeStep Encode an image into latent space. Components: vae (`AutoencoderKL`) Inputs: image (`PIL.Image`): raw input image to process Outputs: image_latents (`torch.Tensor`): latents representing the image ``` ================================================ FILE: docs/source/en/modular_diffusers/quickstart.md ================================================ # Quickstart Modular Diffusers is a framework for quickly building flexible and customizable pipelines. These pipelines can go beyond what standard `DiffusionPipeline`s can do. At the core of Modular Diffusers are [`ModularPipelineBlocks`] that can be combined with other blocks to adapt to new workflows. The blocks are converted into a [`ModularPipeline`], a friendly user-facing interface for running generation tasks. This guide shows you how to run a modular pipeline, understand its structure, and customize it by modifying the blocks that compose it. ## Run a pipeline [`ModularPipeline`] is the main interface for loading, running, and managing modular pipelines.
```py import torch from diffusers import ModularPipeline, ComponentsManager # Use ComponentsManager to enable auto CPU offloading for memory efficiency manager = ComponentsManager() manager.enable_auto_cpu_offload(device="cuda:0") pipe = ModularPipeline.from_pretrained("Qwen/Qwen-Image", components_manager=manager) pipe.load_components(torch_dtype=torch.bfloat16) image = pipe( prompt="cat wizard with red hat, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney", ).images[0] image ``` [`~ModularPipeline.from_pretrained`] uses lazy loading - it reads the configuration to learn where to load each component from, but doesn't actually load the model weights until you call [`~ModularPipeline.load_components`]. This gives you control over when and how components are loaded. > [!TIP] > `ComponentsManager` with `enable_auto_cpu_offload` automatically moves models between CPU and GPU as needed, reducing memory usage for large models like Qwen-Image. Learn more in the [ComponentsManager](./components_manager) guide. > > If you don't need offloading, remove the `components_manager` argument and move the pipeline to your device manually with `to("cuda")`. Learn more about creating and loading pipelines in the [Creating a pipeline](https://huggingface.co/docs/diffusers/modular_diffusers/modular_pipeline#creating-a-pipeline) and [Loading components](https://huggingface.co/docs/diffusers/modular_diffusers/modular_pipeline#loading-components) guides. ## Understand the structure A [`ModularPipeline`] has two parts: a **definition** (the blocks) and a **state** (the loaded components and configs). Print the pipeline to see its state — the components and their loading status and configuration. 
```py print(pipe) ``` ``` QwenImageModularPipeline { "_blocks_class_name": "QwenImageAutoBlocks", "_class_name": "QwenImageModularPipeline", "_diffusers_version": "0.37.0.dev0", "transformer": [ "diffusers", "QwenImageTransformer2DModel", { "pretrained_model_name_or_path": "Qwen/Qwen-Image", "revision": null, "subfolder": "transformer", "type_hint": [ "diffusers", "QwenImageTransformer2DModel" ], "variant": null } ], ... } ``` Access the definition through `pipe.blocks` — this is the [`~modular_pipelines.ModularPipelineBlocks`] that defines the pipeline's workflows, inputs, outputs, and computation logic. ```py print(pipe.blocks) ``` ``` QwenImageAutoBlocks( Class: SequentialPipelineBlocks Description: Auto Modular pipeline for text-to-image, image-to-image, inpainting, and controlnet tasks using QwenImage. Supported workflows: - `text2image`: requires `prompt` - `image2image`: requires `prompt`, `image` - `inpainting`: requires `prompt`, `mask_image`, `image` - `controlnet_text2image`: requires `prompt`, `control_image` ... Components: text_encoder (`Qwen2_5_VLForConditionalGeneration`) vae (`AutoencoderKLQwenImage`) transformer (`QwenImageTransformer2DModel`) ... Sub-Blocks: [0] text_encoder (QwenImageAutoTextEncoderStep) [1] vae_encoder (QwenImageAutoVaeEncoderStep) [2] controlnet_vae_encoder (QwenImageOptionalControlNetVaeEncoderStep) [3] denoise (QwenImageAutoCoreDenoiseStep) [4] decode (QwenImageAutoDecodeStep) ) ``` The output returns: - The supported workflows (text2image, image2image, inpainting, etc.) - The Sub-Blocks it's composed of (text_encoder, vae_encoder, denoise, decode) ### Workflows This pipeline supports multiple workflows and adapts its behavior based on the inputs you provide. For example, if you pass `image` to the pipeline, it runs an image-to-image workflow instead of text-to-image. Learn more about how this works under the hood in the [AutoPipelineBlocks](https://huggingface.co/docs/diffusers/modular_diffusers/auto_pipeline_blocks) guide. 
```py from diffusers.utils import load_image input_image = load_image("https://github.com/Trgtuan10/Image_storage/blob/main/cute_cat.png?raw=true") image = pipe( prompt="cat wizard with red hat, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney", image=input_image, ).images[0] ``` Use `get_workflow()` to extract the blocks for a specific workflow. Pass the workflow name (e.g., `"image2image"`, `"inpainting"`, `"controlnet_text2image"`) to get only the blocks relevant to that workflow. This is useful when you want to customize or debug a specific workflow. You can check `pipe.blocks.available_workflows` to see all available workflows. ```py img2img_blocks = pipe.blocks.get_workflow("image2image") ``` ### Sub-blocks Blocks can contain other blocks. `pipe.blocks` gives you the top-level block definition (here, `QwenImageAutoBlocks`), while `sub_blocks` lets you access the smaller blocks inside it. `QwenImageAutoBlocks` is composed of: `text_encoder`, `vae_encoder`, `controlnet_vae_encoder`, `denoise`, and `decode`. These sub-blocks run one after another and data flows linearly from one block to the next — each block's `intermediate_outputs` become available as `inputs` to the next block. This is how [`SequentialPipelineBlocks`](./sequential_pipeline_blocks) work. You can access them through the `sub_blocks` property. The `doc` property is useful for seeing the full documentation of any block, including its inputs, outputs, and components. ```py vae_encoder_block = pipe.blocks.sub_blocks["vae_encoder"] print(vae_encoder_block.doc) ``` This block can be converted to a pipeline so that it can run on its own with [`~ModularPipelineBlocks.init_pipeline`]. 
```py vae_encoder_pipe = vae_encoder_block.init_pipeline() # Reuse the already-loaded VAE with update_components() vae_encoder_pipe.update_components(vae=pipe.vae) # Run just this block image_latents = vae_encoder_pipe(image=input_image).image_latents print(image_latents.shape) ``` It reuses the VAE from our original pipeline instead of reloading it, keeping memory usage efficient. Learn more in the [Loading components](https://huggingface.co/docs/diffusers/modular_diffusers/modular_pipeline#loading-components) guide. Since blocks are composable, you can modify the pipeline's definition by adding, removing, or swapping blocks to create new workflows. In the next section, we'll add a canny edge detection block to a ControlNet pipeline, so you can pass a regular image instead of a pre-processed canny edge map. ## Compose new workflows Let's add a canny edge detection block to a ControlNet pipeline. First, load a pre-built canny block from the Hub (see [Building Custom Blocks](https://huggingface.co/docs/diffusers/modular_diffusers/custom_blocks) to create your own). ```py from diffusers.modular_pipelines import ModularPipelineBlocks # Load a canny block from the Hub canny_block = ModularPipelineBlocks.from_pretrained( "diffusers-internal-dev/canny-filtering", trust_remote_code=True, ) print(canny_block.doc) ``` ``` class CannyBlock Inputs: image (`Union[Image, ndarray]`): Image to compute canny filter on low_threshold (`int`, *optional*, defaults to 50): Low threshold for the canny filter. high_threshold (`int`, *optional*, defaults to 200): High threshold for the canny filter. ... Outputs: control_image (`PIL.Image`): Canny map for input image ``` Use `get_workflow` to extract the ControlNet workflow from [`QwenImageAutoBlocks`].
```py # Get the controlnet workflow that we want to work with blocks = pipe.blocks.get_workflow("controlnet_text2image") print(blocks.doc) ``` ``` class SequentialPipelineBlocks Inputs: prompt (`str`): The prompt or prompts to guide image generation. control_image (`Image`): Control image for ControlNet conditioning. ... ``` The extracted workflow is a [`SequentialPipelineBlocks`](./sequential_pipeline_blocks) and it currently requires `control_image` as input. Insert the canny block at the beginning so the pipeline accepts a regular image instead. ```py # Insert canny at the beginning blocks.sub_blocks.insert("canny", canny_block, 0) # Check the updated structure: CannyBlock is now listed as first sub-block print(blocks) # Check the updated doc print(blocks.doc) ``` ``` class SequentialPipelineBlocks Inputs: image (`Union[Image, ndarray]`): Image to compute canny filter on low_threshold (`int`, *optional*, defaults to 50): Low threshold for the canny filter. high_threshold (`int`, *optional*, defaults to 200): High threshold for the canny filter. prompt (`str`): The prompt or prompts to guide image generation. ... ``` Now the pipeline takes `image` as input instead of `control_image`. Because blocks in a sequence share data automatically, the canny block's output (`control_image`) flows to the denoise block that needs it, and the canny block's input (`image`) becomes a pipeline input since no earlier block provides it. Create a pipeline from the modified blocks and load a ControlNet model. The ControlNet isn't part of the original model repository, so load it separately and add it with [`~ModularPipeline.update_components`]. 
```py pipeline = blocks.init_pipeline("Qwen/Qwen-Image", components_manager=manager) pipeline.load_components(torch_dtype=torch.bfloat16) # Load the ControlNet model controlnet_spec = pipeline.get_component_spec("controlnet") controlnet_spec.pretrained_model_name_or_path = "InstantX/Qwen-Image-ControlNet-Union" controlnet = controlnet_spec.load(torch_dtype=torch.bfloat16) pipeline.update_components(controlnet=controlnet) ``` Now run the pipeline - the canny block preprocesses the image for ControlNet. ```py from diffusers.utils import load_image prompt = "cat wizard with red hat, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney" image = load_image("https://github.com/Trgtuan10/Image_storage/blob/main/cute_cat.png?raw=true") output = pipeline( prompt=prompt, image=image, ).images[0] output ``` ## Next steps Understand the core building blocks of Modular Diffusers: - [ModularPipelineBlocks](./pipeline_block): The basic unit for defining a step in a pipeline. - [SequentialPipelineBlocks](./sequential_pipeline_blocks): Chain blocks to run in sequence. - [AutoPipelineBlocks](./auto_pipeline_blocks): Create pipelines that support multiple workflows. - [States](./modular_diffusers_states): How data is shared between blocks. Learn how to create your own blocks with custom logic in the [Building Custom Blocks](./custom_blocks) guide. Use [`ComponentsManager`](./components_manager) to share models across multiple pipelines and manage memory efficiently. Connect modular pipelines to [Mellon](https://github.com/cubiq/Mellon), a visual node-based interface for building workflows. Custom blocks built with Modular Diffusers work out of the box with Mellon - no UI code required. Read more in the Mellon guide. 
================================================ FILE: docs/source/en/modular_diffusers/sequential_pipeline_blocks.md ================================================ # SequentialPipelineBlocks [`~modular_pipelines.SequentialPipelineBlocks`] are a multi-block type that composes other [`~modular_pipelines.ModularPipelineBlocks`] together in a sequence. Data flows linearly from one block to the next using `inputs` and `intermediate_outputs`. Each block in [`~modular_pipelines.SequentialPipelineBlocks`] usually represents a step in the pipeline, and by combining them, you gradually build a pipeline. This guide shows you how to connect two blocks into a [`~modular_pipelines.SequentialPipelineBlocks`]. Create two [`~modular_pipelines.ModularPipelineBlocks`]. The first block, `InputBlock`, outputs a `batch_size` value and the second block, `ImageEncoderBlock` uses `batch_size` as `inputs`. ```py from diffusers.modular_pipelines import ModularPipelineBlocks, InputParam, OutputParam class InputBlock(ModularPipelineBlocks): @property def inputs(self): return [ InputParam(name="prompt", type_hint=list, description="list of text prompts"), InputParam(name="num_images_per_prompt", type_hint=int, description="number of images per prompt"), ] @property def intermediate_outputs(self): return [ OutputParam(name="batch_size", description="calculated batch size"), ] @property def description(self): return "A block that determines batch_size based on the number of prompts and num_images_per_prompt argument." 
def __call__(self, components, state): block_state = self.get_block_state(state) batch_size = len(block_state.prompt) block_state.batch_size = batch_size * block_state.num_images_per_prompt self.set_block_state(state, block_state) return components, state ``` ```py import torch from diffusers.modular_pipelines import ModularPipelineBlocks, InputParam, OutputParam class ImageEncoderBlock(ModularPipelineBlocks): @property def inputs(self): return [ InputParam(name="image", type_hint="PIL.Image", description="raw input image to process"), InputParam(name="batch_size", type_hint=int), ] @property def intermediate_outputs(self): return [ OutputParam(name="image_latents", description="latents representing the image"), ] @property def description(self): return "Encode raw image into its latent representation" def __call__(self, components, state): block_state = self.get_block_state(state) # Simulate processing the image # This will change the state of the image from a PIL image to a tensor for all blocks block_state.image = torch.randn(1, 3, 512, 512) block_state.batch_size = block_state.batch_size * 2 block_state.image_latents = torch.randn(1, 4, 64, 64) self.set_block_state(state, block_state) return components, state ``` Connect the two blocks by defining a [`~modular_pipelines.SequentialPipelineBlocks`]. List the block instances in `block_classes` and their corresponding names in `block_names`. The blocks are executed in the order they appear in `block_classes`, and data flows from one block to the next through [`~modular_pipelines.PipelineState`]. ```py from diffusers.modular_pipelines import SequentialPipelineBlocks class ImageProcessingStep(SequentialPipelineBlocks): """ # auto_docstring """ model_name = "my_model" block_classes = [InputBlock(), ImageEncoderBlock()] block_names = ["input", "image_encoder"] @property def description(self): return ( "Process text prompts and images for the pipeline. It:\n" " - Determines the batch size from the prompts.\n" " - Encodes the image into latent space."
) ``` When you create a [`~modular_pipelines.SequentialPipelineBlocks`], properties like `inputs`, `intermediate_outputs`, and `expected_components` are automatically aggregated from the sub-blocks, so there is no need to define them again. There are a few properties you should set: - `description`: We recommend adding a description for the assembled block to explain what the combined step does. - `model_name`: This is automatically derived from the sub-blocks but isn't always correct, so you may need to override it. - `outputs`: By default this is the same as `intermediate_outputs`, but you can manually set it to control which values appear in the doc. This is useful for showing only the final outputs instead of all intermediate values. These properties, together with the aggregated `inputs`, `intermediate_outputs`, and `expected_components`, are used to automatically generate the `doc` property. Print the `ImageProcessingStep` block to inspect its sub-blocks, and use `doc` for a full summary of the block's inputs, outputs, and components. ```py blocks = ImageProcessingStep() print(blocks) print(blocks.doc) ``` ================================================ FILE: docs/source/en/optimization/attention_backends.md ================================================ # Attention backends > [!NOTE] > The attention dispatcher is an experimental feature. Please open an issue if you have any feedback or encounter any problems. Diffusers provides several optimized attention algorithms that are more memory and computationally efficient through its *attention dispatcher*. The dispatcher acts as a router for managing and switching between different attention implementations and provides a unified interface for interacting with them. Refer to the table below for an overview of the available attention families and to the [Available backends](#available-backends) section for a more complete list.
| attention family | main feature | |---|---| | FlashAttention | minimizes memory reads/writes through tiling and recomputation | | AI Tensor Engine for ROCm | FlashAttention implementation optimized for AMD ROCm accelerators | | SageAttention | quantizes attention to int8 | | PyTorch native | built-in PyTorch implementation using [scaled_dot_product_attention](./fp16#scaled-dot-product-attention) | | xFormers | memory-efficient attention with support for various attention kernels | This guide will show you how to set and use the different attention backends. ## set_attention_backend The [`~ModelMixin.set_attention_backend`] method iterates through all the modules in the model and sets the appropriate attention backend to use. The attention backend setting persists until [`~ModelMixin.reset_attention_backend`] is called. The example below demonstrates how to enable the `_flash_3_hub` implementation for FlashAttention-3 from the [`kernels`](https://github.com/huggingface/kernels) library, which allows you to instantly use optimized compute kernels from the Hub without requiring any setup. > [!NOTE] > FlashAttention-3 is not supported for non-Hopper architectures, in which case, use FlashAttention with `set_attention_backend("flash")`. ```py import torch from diffusers import QwenImagePipeline pipeline = QwenImagePipeline.from_pretrained( "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda" ) pipeline.transformer.set_attention_backend("_flash_3_hub") prompt = """ cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain """ pipeline(prompt).images[0] ``` To restore the default attention backend, call [`~ModelMixin.reset_attention_backend`]. 
```py
pipeline.transformer.reset_attention_backend()
```

## attention_backend context manager

The [attention_backend](https://github.com/huggingface/diffusers/blob/5e181eddfe7e44c1444a2511b0d8e21d177850a0/src/diffusers/models/attention_dispatch.py#L225) context manager temporarily sets an attention backend for a model within the context. Outside the context, the default attention (PyTorch's native scaled dot product attention) is used. This is useful if you want to use different backends for different parts of a pipeline or if you want to test the different backends.

```py
import torch
from diffusers import QwenImagePipeline
from diffusers.models.attention_dispatch import attention_backend

pipeline = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda"
)
prompt = """
cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
"""

with attention_backend("_flash_3_hub"):
    image = pipeline(prompt).images[0]
```

> [!TIP]
> Most attention backends support `torch.compile` without graph breaks and can be used to further speed up inference.

## Checks

The attention dispatcher includes debugging checks that catch common errors before they cause problems.

1. Device checks verify that query, key, and value tensors live on the same device.
2. Data type checks confirm tensors have matching dtypes and use either bfloat16 or float16.
3. Shape checks validate tensor dimensions and prevent mixing attention masks with causal flags.

Enable these checks by setting the `DIFFUSERS_ATTN_CHECKS` environment variable. Checks add overhead to every attention operation, so they're disabled by default.

```bash
export DIFFUSERS_ATTN_CHECKS=yes
```

Once enabled, the checks run before every attention operation.
```py
import torch
from diffusers.models.attention_dispatch import attention_backend, dispatch_attention_fn

query = torch.randn(1, 10, 8, 64, dtype=torch.bfloat16, device="cuda")
key = torch.randn(1, 10, 8, 64, dtype=torch.bfloat16, device="cuda")
value = torch.randn(1, 10, 8, 64, dtype=torch.bfloat16, device="cuda")

try:
    with attention_backend("flash"):
        output = dispatch_attention_fn(query, key, value)
        print("✓ Flash Attention works with checks enabled")
except Exception as e:
    print(f"✗ Flash Attention failed: {e}")
```

You can also configure the registry directly.

```py
from diffusers.models.attention_dispatch import _AttentionBackendRegistry

_AttentionBackendRegistry._checks_enabled = True
```

## Available backends

Refer to the table below for a complete list of available attention backends and their variants.

| Backend Name | Family | Description |
|--------------|--------|-------------|
| `native` | [PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend) | Default backend using PyTorch's scaled_dot_product_attention |
| `flex` | [FlexAttention](https://docs.pytorch.org/docs/stable/nn.attention.flex_attention.html#module-torch.nn.attention.flex_attention) | PyTorch FlexAttention implementation |
| `_native_cudnn` | [PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend) | CuDNN-optimized attention |
| `_native_efficient` | [PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend) | Memory-efficient attention |
| `_native_flash` | [PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend) | PyTorch's FlashAttention |
| `_native_math` | [PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend) | Math-based attention (fallback) |
| `_native_npu` | [PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend) | NPU-optimized attention |
| `_native_xla` | [PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend) | XLA-optimized attention |
| `flash` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | FlashAttention-2 |
| `flash_hub` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | FlashAttention-2 from kernels |
| `flash_varlen` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | Variable length FlashAttention |
| `flash_varlen_hub` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | Variable length FlashAttention from kernels |
| `aiter` | [AI Tensor Engine for ROCm](https://github.com/ROCm/aiter) | FlashAttention for AMD ROCm |
| `_flash_3` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | FlashAttention-3 |
| `_flash_varlen_3` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | Variable length FlashAttention-3 |
| `_flash_3_hub` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | FlashAttention-3 from kernels |
| `_flash_3_varlen_hub` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | Variable length FlashAttention-3 from kernels |
| `sage` | [SageAttention](https://github.com/thu-ml/SageAttention) | Quantized attention (INT8 QK) |
| `sage_hub` | [SageAttention](https://github.com/thu-ml/SageAttention) | Quantized attention (INT8 QK) from kernels |
| `sage_varlen` | [SageAttention](https://github.com/thu-ml/SageAttention) | Variable length SageAttention |
| `_sage_qk_int8_pv_fp8_cuda` | [SageAttention](https://github.com/thu-ml/SageAttention) | INT8 QK + FP8 PV (CUDA) |
| `_sage_qk_int8_pv_fp8_cuda_sm90` | [SageAttention](https://github.com/thu-ml/SageAttention) | INT8 QK + FP8 PV (SM90) |
| `_sage_qk_int8_pv_fp16_cuda` | [SageAttention](https://github.com/thu-ml/SageAttention) | INT8 QK + FP16 PV (CUDA) |
| `_sage_qk_int8_pv_fp16_triton` | [SageAttention](https://github.com/thu-ml/SageAttention) | INT8 QK + FP16 PV (Triton) |
| `xformers` | [xFormers](https://github.com/facebookresearch/xformers) | Memory-efficient attention |
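
The difference between a persistent backend setting and a temporary one, as described in the `set_attention_backend` and context manager sections of this guide, can be illustrated with a toy registry in plain Python. This is only a sketch and not how Diffusers implements its dispatcher: `ToyBackendRegistry` and its methods are hypothetical names, and the choice to restore the previously active backend when the context exits is an assumption made for the illustration.

```python
from contextlib import contextmanager


class ToyBackendRegistry:
    """Hypothetical stand-in for an attention dispatcher (not Diffusers code)."""

    def __init__(self, default="native"):
        self.default = default
        self.active = default

    def set_backend(self, name):
        # Persistent: stays active until reset_backend() is called.
        self.active = name

    def reset_backend(self):
        self.active = self.default

    @contextmanager
    def backend(self, name):
        # Temporary: switch for the duration of the block, then restore
        # whatever was active before (an assumption of this sketch).
        previous = self.active
        self.active = name
        try:
            yield
        finally:
            self.active = previous


registry = ToyBackendRegistry()
registry.set_backend("flash")

with registry.backend("sage"):
    inside = registry.active

after_context = registry.active
registry.reset_backend()
after_reset = registry.active

print(inside, after_context, after_reset)
```

The `try`/`finally` in the context manager means the previous backend is restored even if the code inside the `with` block raises.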
================================================ FILE: docs/source/en/optimization/cache.md ================================================ # Caching Caching accelerates inference by storing and reusing intermediate outputs of different layers, such as attention and feedforward layers, instead of performing the entire computation at each inference step. It significantly improves generation speed at the expense of more memory and doesn't require additional training. This guide shows you how to use the caching methods supported in Diffusers. ## Pyramid Attention Broadcast [Pyramid Attention Broadcast (PAB)](https://huggingface.co/papers/2408.12588) is based on the observation that attention outputs aren't that different between successive timesteps of the generation process. The attention differences are smallest in the cross attention layers and are generally cached over a longer timestep range. This is followed by temporal attention and spatial attention layers. > [!TIP] > Not all video models have three types of attention (cross, temporal, and spatial)! PAB can be combined with other techniques like sequence parallelism and classifier-free guidance parallelism (data parallelism) for near real-time video generation. Set up and pass a [`PyramidAttentionBroadcastConfig`] to a pipeline's transformer to enable it. The `spatial_attention_block_skip_range` controls how often to skip attention calculations in the spatial attention blocks and the `spatial_attention_timestep_skip_range` is the range of timesteps to skip. Take care to choose an appropriate range because a smaller interval can lead to slower inference speeds and a larger interval can result in lower generation quality. 
```python
import torch
from diffusers import CogVideoXPipeline, PyramidAttentionBroadcastConfig

pipeline = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipeline.to("cuda")

config = PyramidAttentionBroadcastConfig(
    spatial_attention_block_skip_range=2,
    spatial_attention_timestep_skip_range=(100, 800),
    current_timestep_callback=lambda: pipeline.current_timestep,
)
pipeline.transformer.enable_cache(config)
```

## FasterCache

[FasterCache](https://huggingface.co/papers/2410.19355) caches and reuses attention features similar to [PAB](#pyramid-attention-broadcast) since output differences are small for each successive timestep.

When classifier-free guidance is used for sampling (common in most base models), this method may also skip the unconditional branch prediction and estimate it from the conditional branch prediction if there is significant redundancy in the predicted latent outputs between successive timesteps.

Set up and pass a [`FasterCacheConfig`] to a pipeline's transformer to enable it.

```python
import torch
from diffusers import CogVideoXPipeline, FasterCacheConfig

pipeline = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipeline.to("cuda")

config = FasterCacheConfig(
    spatial_attention_block_skip_range=2,
    spatial_attention_timestep_skip_range=(-1, 681),
    current_timestep_callback=lambda: pipeline.current_timestep,
    attention_weight_callback=lambda _: 0.3,
    unconditional_batch_skip_range=5,
    unconditional_batch_timestep_skip_range=(-1, 781),
    tensor_format="BFCHW",
)
pipeline.transformer.enable_cache(config)
```

## FirstBlockCache

[FirstBlock Cache](https://huggingface.co/docs/diffusers/main/en/api/cache#diffusers.FirstBlockCacheConfig) checks how much the output of the denoiser's early layers changes from one timestep to the next. If the change is small, the model skips the expensive later layers and reuses the previous output.
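
The reuse decision can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Diffusers implementation: the helper names are hypothetical, and the relative L1 change used as the metric is an assumption chosen for simplicity.

```python
def relative_change(current, previous):
    """Summed absolute difference, normalized by the previous magnitude."""
    diff = sum(abs(c - p) for c, p in zip(current, previous))
    norm = sum(abs(p) for p in previous)
    return diff / norm if norm else float("inf")


def should_reuse(first_block_out, cached_first_block_out, threshold=0.2):
    # Without a cached reference the full model has to run.
    if cached_first_block_out is None:
        return False
    return relative_change(first_block_out, cached_first_block_out) < threshold


previous = [1.0, 2.0, 3.0]
small_step = [1.05, 2.0, 2.95]  # relative change of about 0.017
large_step = [2.0, 0.5, 4.5]    # relative change of about 0.667

reuse_small = should_reuse(small_step, previous)
reuse_large = should_reuse(large_step, previous)
reuse_cold = should_reuse(small_step, None)
print(reuse_small, reuse_large, reuse_cold)
```

A larger threshold reuses cached outputs more often, which trades accuracy for speed.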
```py
import torch
from diffusers import DiffusionPipeline
from diffusers.hooks import apply_first_block_cache, FirstBlockCacheConfig

pipeline = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
)
apply_first_block_cache(pipeline.transformer, FirstBlockCacheConfig(threshold=0.2))
```

## TaylorSeer Cache

[TaylorSeer Cache](https://huggingface.co/papers/2403.06923) accelerates diffusion inference by using Taylor series expansions to approximate and cache intermediate activations across denoising steps. The method predicts future outputs based on past computations, reusing them at specified intervals to reduce redundant calculations.

This caching mechanism delivers strong results with minimal additional memory overhead. For detailed performance analysis, see [our findings here](https://github.com/huggingface/diffusers/pull/12648#issuecomment-3610615080).

To enable TaylorSeer Cache, create a [`TaylorSeerCacheConfig`] and pass it to your pipeline's transformer:

- `cache_interval`: Number of steps to reuse cached outputs before performing a full forward pass
- `disable_cache_before_step`: Initial steps that use full computations to gather data for approximations
- `max_order`: Order of the Taylor approximation (in theory, higher values improve quality at the cost of more memory, but we recommend setting it to `1`)

```python
import torch
from diffusers import FluxPipeline, TaylorSeerCacheConfig

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

config = TaylorSeerCacheConfig(
    cache_interval=5,
    max_order=1,
    disable_cache_before_step=10,
    taylor_factors_dtype=torch.bfloat16,
)
pipe.transformer.enable_cache(config)
```

## MagCache

[MagCache](https://github.com/Zehong-Ma/MagCache) accelerates inference by skipping transformer blocks based on the magnitude of the residual update. It observes that the magnitude of updates (Output - Input) decays predictably over the diffusion process.
By accumulating an "error budget" based on pre-computed magnitude ratios, it dynamically decides when to skip computation and reuse the previous residual. MagCache relies on **Magnitude Ratios** (`mag_ratios`), which describe this decay curve. These ratios are specific to the model checkpoint and scheduler. ### Usage To use MagCache, you typically follow a two-step process: **Calibration** and **Inference**. 1. **Calibration**: Run inference once with `calibrate=True`. The hook will measure the residual magnitudes and print the calculated ratios to the console. 2. **Inference**: Pass these ratios to `MagCacheConfig` to enable acceleration. ```python import torch from diffusers import FluxPipeline, MagCacheConfig pipe = FluxPipeline.from_pretrained( "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16 ).to("cuda") # 1. Calibration Step # Run full inference to measure model behavior. calib_config = MagCacheConfig(calibrate=True, num_inference_steps=4) pipe.transformer.enable_cache(calib_config) # Run a prompt to trigger calibration pipe("A cat playing chess", num_inference_steps=4) # Logs will print something like: "MagCache Calibration Results: [1.0, 1.37, 0.97, 0.87]" # 2. Inference Step # Apply the specific ratios obtained from calibration for optimized speed. # Note: For Flux models, you can also import defaults: # from diffusers.hooks.mag_cache import FLUX_MAG_RATIOS mag_config = MagCacheConfig( mag_ratios=[1.0, 1.37, 0.97, 0.87], num_inference_steps=4 ) pipe.transformer.enable_cache(mag_config) image = pipe("A cat playing chess", num_inference_steps=4).images[0] ``` > [!NOTE] > `mag_ratios` represent the model's intrinsic magnitude decay curve. Ratios calibrated for a high number of steps (e.g., 50) can be reused for lower step counts (e.g., 20). The implementation uses interpolation to map the curve to the current number of inference steps. 
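
To give a feel for what reusing a calibrated curve at a different step count involves, here is a small sketch that linearly resamples a list of ratios. It is an assumption-laden illustration: `resample_ratios` is a hypothetical helper, the input values are made up, and the interpolation scheme Diffusers actually uses may differ.

```python
def resample_ratios(ratios, num_steps):
    """Linearly interpolate a calibrated curve onto `num_steps` points."""
    if num_steps == len(ratios):
        return list(ratios)
    if num_steps == 1:
        return [ratios[0]]
    resampled = []
    for i in range(num_steps):
        # Position of step i on the index axis of the original curve.
        pos = i * (len(ratios) - 1) / (num_steps - 1)
        lo = int(pos)
        hi = min(lo + 1, len(ratios) - 1)
        frac = pos - lo
        resampled.append(ratios[lo] * (1 - frac) + ratios[hi] * frac)
    return resampled


# Made-up ratios for illustration only.
calibrated = [1.0, 1.2, 1.0, 0.8, 0.6]
fewer_steps = resample_ratios(calibrated, 3)
more_steps = resample_ratios(calibrated, 9)
print(fewer_steps)
print(len(more_steps), more_steps[0], more_steps[-1])
```

The endpoints of the curve are preserved, and intermediate points are placed proportionally along the original curve.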
> [!TIP] > For pipelines that run Classifier-Free Guidance sequentially (like Kandinsky 5.0), the calibration log might print two arrays: one for the Conditional pass and one for the Unconditional pass. In most cases, you should use the first array (Conditional). > [!TIP] > For pipelines that run Classifier-Free Guidance in a **batched** manner (like SDXL or Flux), the `hidden_states` processed by the model contain both conditional and unconditional branches concatenated together. The calibration process automatically accounts for this, producing a single array of ratios that represents the joint behavior. You can use this resulting array directly without modification. ================================================ FILE: docs/source/en/optimization/cache_dit.md ================================================ ## CacheDiT CacheDiT is a unified, flexible, and training-free cache acceleration framework designed to support nearly all Diffusers' DiT-based pipelines. It provides a unified cache API that supports automatic block adapter, DBCache, and more. To learn more, refer to the [CacheDiT](https://github.com/vipshop/cache-dit) repository. Install a stable release of CacheDiT from PyPI or you can install the latest version from GitHub. ```bash pip3 install -U cache-dit ``` ```bash pip3 install git+https://github.com/vipshop/cache-dit.git ``` Run the command below to view supported DiT pipelines. ```python >>> import cache_dit >>> cache_dit.supported_pipelines() (30, ['Flux*', 'Mochi*', 'CogVideoX*', 'Wan*', 'HunyuanVideo*', 'QwenImage*', 'LTX*', 'Allegro*', 'CogView3Plus*', 'CogView4*', 'Cosmos*', 'EasyAnimate*', 'SkyReelsV2*', 'StableDiffusion3*', 'ConsisID*', 'DiT*', 'Amused*', 'Bria*', 'Lumina*', 'OmniGen*', 'PixArt*', 'Sana*', 'StableAudio*', 'VisualCloze*', 'AuraFlow*', 'Chroma*', 'ShapE*', 'HiDream*', 'HunyuanDiT*', 'HunyuanDiTPAG*']) ``` For a complete benchmark, please refer to [Benchmarks](https://github.com/vipshop/cache-dit/blob/main/bench/). 
## Unified Cache API

CacheDiT works by matching specific input/output patterns as shown below.

![](https://github.com/vipshop/cache-dit/raw/main/assets/patterns-v1.png)

Call the `enable_cache()` function on a pipeline to enable cache acceleration. This function is the entry point to many of CacheDiT's features.

```python
import cache_dit
from diffusers import DiffusionPipeline

# Can be any diffusion pipeline
pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image")

# One-line code with default cache options.
cache_dit.enable_cache(pipe)

# Just call the pipe as normal.
output = pipe(...)

# Disable cache and run original pipe.
cache_dit.disable_cache(pipe)
```

## Automatic Block Adapter

For custom or modified pipelines or transformers not included in Diffusers, use the `BlockAdapter` in `auto` mode or via manual configuration. Please check the [BlockAdapter](https://github.com/vipshop/cache-dit/blob/main/docs/User_Guide.md#automatic-block-adapter) docs for more details. Refer to [Qwen-Image w/ BlockAdapter](https://github.com/vipshop/cache-dit/blob/main/examples/adapter/run_qwen_image_adapter.py) as an example.

```python
import cache_dit
from cache_dit import ForwardPattern, BlockAdapter

# Use 🔥BlockAdapter with `auto` mode.
cache_dit.enable_cache(
    BlockAdapter(
        # Any DiffusionPipeline, Qwen-Image, etc.
        pipe=pipe, auto=True,
        # Check the `📚Forward Pattern Matching` documentation and inspect the
        # Qwen-Image code; you will find that it satisfies `FORWARD_PATTERN_1`.
        forward_pattern=ForwardPattern.Pattern_1,
    ),
)

# Or, manually set up the transformer configuration.
cache_dit.enable_cache(
    BlockAdapter(
        pipe=pipe,  # Qwen-Image, etc.
        transformer=pipe.transformer,
        blocks=pipe.transformer.transformer_blocks,
        forward_pattern=ForwardPattern.Pattern_1,
    ),
)
```

Sometimes, a Transformer class contains more than one set of transformer `blocks`. For example, FLUX.1 (HiDream, Chroma, etc.) contains `transformer_blocks` and `single_transformer_blocks` (with different forward patterns).
The BlockAdapter is able to detect this hybrid pattern type as well. Refer to [FLUX.1](https://github.com/vipshop/cache-dit/blob/main/examples/adapter/run_flux_adapter.py) as an example. ```python # For diffusers <= 0.34.0, FLUX.1 transformer_blocks and # single_transformer_blocks have different forward patterns. cache_dit.enable_cache( BlockAdapter( pipe=pipe, # FLUX.1, etc. transformer=pipe.transformer, blocks=[ pipe.transformer.transformer_blocks, pipe.transformer.single_transformer_blocks, ], forward_pattern=[ ForwardPattern.Pattern_1, ForwardPattern.Pattern_3, ], ), ) ``` This also works if there is more than one transformer (namely `transformer` and `transformer_2`) in its structure. Refer to [Wan 2.2 MoE](https://github.com/vipshop/cache-dit/blob/main/examples/pipeline/run_wan_2.2.py) as an example. ## Patch Functor For any pattern not included in CacheDiT, use the Patch Functor to convert the pattern into a known pattern. You need to subclass the Patch Functor and may also need to fuse the operations within the blocks for loop into block `forward`. After implementing a Patch Functor, set the `patch_functor` property in `BlockAdapter`. ![](https://github.com/vipshop/cache-dit/raw/main/assets/patch-functor.png) Some Patch Functors are already provided in CacheDiT, [HiDreamPatchFunctor](https://github.com/vipshop/cache-dit/blob/main/src/cache_dit/cache_factory/patch_functors/functor_hidream.py), [ChromaPatchFunctor](https://github.com/vipshop/cache-dit/blob/main/src/cache_dit/cache_factory/patch_functors/functor_chroma.py), etc. 
```python
@BlockAdapterRegistry.register("HiDream")
def hidream_adapter(pipe, **kwargs) -> BlockAdapter:
    from diffusers import HiDreamImageTransformer2DModel
    from cache_dit.cache_factory.patch_functors import HiDreamPatchFunctor

    assert isinstance(pipe.transformer, HiDreamImageTransformer2DModel)
    return BlockAdapter(
        pipe=pipe,
        transformer=pipe.transformer,
        blocks=[
            pipe.transformer.double_stream_blocks,
            pipe.transformer.single_stream_blocks,
        ],
        forward_pattern=[
            ForwardPattern.Pattern_0,
            ForwardPattern.Pattern_3,
        ],
        # NOTE: Setup your custom patch functor here.
        patch_functor=HiDreamPatchFunctor(),
        **kwargs,
    )
```

Finally, you can call the `cache_dit.summary()` function on a pipeline after it has completed inference to get the cache acceleration details.

```python
stats = cache_dit.summary(pipe)
```

```python
⚡️Cache Steps and Residual Diffs Statistics: QwenImagePipeline

| Cache Steps | Diffs Min | Diffs P25 | Diffs P50 | Diffs P75 | Diffs P95 | Diffs Max |
|-------------|-----------|-----------|-----------|-----------|-----------|-----------|
| 23          | 0.045     | 0.084     | 0.114     | 0.147     | 0.241     | 0.297     |
```

## DBCache: Dual Block Cache

![](https://github.com/vipshop/cache-dit/raw/main/assets/dbcache-v1.png)

DBCache (Dual Block Caching) supports different configurations of compute blocks (F8B12, etc.) to enable a balanced trade-off between performance and precision.

- Fn_compute_blocks: Specifies that DBCache uses the **first n** Transformer blocks to fit the information at time step t, enabling the calculation of a more stable L1 diff and delivering more accurate information to subsequent blocks.
- Bn_compute_blocks: Further fuses approximate information in the **last n** Transformer blocks to enhance prediction accuracy. These blocks act as an auto-scaler for approximate hidden states that use residual cache.
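
The roles of the first-n and last-n blocks can be sketched with a toy model in plain Python, where each "block" is a function on a list of floats. This is a simplified illustration, not CacheDiT's code: `dbcache_step`, `l1_rel_diff`, the `state` dictionary, and the choice to keep the cached reference from the last fully computed step are all assumptions of the sketch.

```python
def l1_rel_diff(current, reference):
    """Relative L1 difference between two equally sized vectors."""
    diff = sum(abs(c - r) for c, r in zip(current, reference))
    return diff / max(sum(abs(r) for r in reference), 1e-12)


def dbcache_step(x, blocks, state, fn=1, bn=1, threshold=0.12):
    """Run one toy denoising step; returns (output, cache_hit)."""
    split = len(blocks) - bn
    h = x
    for block in blocks[:fn]:  # Fn blocks: always computed
        h = block(h)
    first = h

    hit = state.get("first") is not None and l1_rel_diff(first, state["first"]) < threshold
    if hit:
        # Reuse the cached residual instead of running the middle blocks.
        h = [f + r for f, r in zip(first, state["residual"])]
    else:
        for block in blocks[fn:split]:
            h = block(h)
        state["residual"] = [a - f for a, f in zip(h, first)]
        state["first"] = first

    for block in blocks[split:]:  # Bn blocks: always computed
        h = block(h)
    return h, hit


def scale(factor):
    return lambda v: [factor * t for t in v]


blocks = [scale(2.0), scale(1.5), scale(1.5), scale(0.5)]
state = {}
_, hit1 = dbcache_step([1.0, 2.0], blocks, state)   # nothing cached yet
_, hit2 = dbcache_step([1.01, 2.0], blocks, state)  # nearly identical input
_, hit3 = dbcache_step([3.0, 0.5], blocks, state)   # very different input
print(hit1, hit2, hit3)
```

In this toy, raising `fn` makes the similarity check more informative at the cost of more compute per step, while `bn` controls how many trailing blocks refine the approximated hidden state.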
```python
import torch
import cache_dit
from diffusers import FluxPipeline

pipe_or_adapter = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Default options, F8B0, 8 warmup steps, and unlimited cached
# steps for good balance between performance and precision
cache_dit.enable_cache(pipe_or_adapter)

# Custom options, F8B8, higher precision
from cache_dit import BasicCacheConfig

cache_dit.enable_cache(
    pipe_or_adapter,
    cache_config=BasicCacheConfig(
        max_warmup_steps=8,  # steps do not cache
        max_cached_steps=-1,  # -1 means no limit
        Fn_compute_blocks=8,  # Fn, F8, etc.
        Bn_compute_blocks=8,  # Bn, B8, etc.
        residual_diff_threshold=0.12,
    ),
)
```

Check the [DBCache](https://github.com/vipshop/cache-dit/blob/main/docs/DBCache.md) and [User Guide](https://github.com/vipshop/cache-dit/blob/main/docs/User_Guide.md#dbcache) docs for more design details.

## TaylorSeer Calibrator

The [TaylorSeers](https://huggingface.co/papers/2503.06923) algorithm further improves the precision of DBCache in cases where the cached steps are large (Hybrid TaylorSeer + DBCache). When cached timesteps are far apart, the feature similarity in diffusion models decreases substantially, which harms generation quality.

TaylorSeer employs a differential method to approximate the higher-order derivatives of features and predict features in future timesteps with Taylor series expansion. The TaylorSeer implemented in CacheDiT supports both hidden states and residual cache types. F_pred can be a residual cache or a hidden-state cache.

```python
from cache_dit import BasicCacheConfig, TaylorSeerCalibratorConfig

cache_dit.enable_cache(
    pipe_or_adapter,
    # Basic DBCache w/ FnBn configurations
    cache_config=BasicCacheConfig(
        max_warmup_steps=8,  # steps do not cache
        max_cached_steps=-1,  # -1 means no limit
        Fn_compute_blocks=8,  # Fn, F8, etc.
        Bn_compute_blocks=8,  # Bn, B8, etc.
        residual_diff_threshold=0.12,
    ),
    # Then, you can use the TaylorSeer Calibrator to approximate
    # the values in cached steps, taylorseer_order default is 1.
    calibrator_config=TaylorSeerCalibratorConfig(
        taylorseer_order=1,
    ),
)
```

> [!TIP]
> The `Bn_compute_blocks` parameter of DBCache can be set to `0` if you use TaylorSeer as the calibrator for approximate hidden states. DBCache's `Bn_compute_blocks` also acts as a calibrator, so you can choose either `Bn_compute_blocks` > 0 or TaylorSeer. We recommend using the configuration scheme of TaylorSeer + DBCache FnB0.

## Hybrid Cache CFG

CacheDiT supports caching for CFG (classifier-free guidance). For models that fuse CFG and non-CFG into a single forward step, or models that do not include CFG in the forward step, set the `enable_separate_cfg` parameter to `False` or leave it at its default of `None`. Otherwise, set it to `True`.

```python
from cache_dit import BasicCacheConfig

cache_dit.enable_cache(
    pipe_or_adapter,
    cache_config=BasicCacheConfig(
        ...,
        # For example, set it as True for Wan 2.1, Qwen-Image
        # and set it as False for FLUX.1, HunyuanVideo, etc.
        enable_separate_cfg=True,
    ),
)
```

## torch.compile

CacheDiT is designed to work with torch.compile for even better performance. Call `torch.compile` after enabling the cache.

```python
cache_dit.enable_cache(pipe)

# Compile the Transformer module
pipe.transformer = torch.compile(pipe.transformer)
```

If you're using CacheDiT with dynamic input shapes, consider increasing the `recompile_limit` of `torch._dynamo`. Otherwise, the `recompile_limit` error may be triggered, causing the module to fall back to eager mode.

```python
torch._dynamo.config.recompile_limit = 96  # default is 8
torch._dynamo.config.accumulated_recompile_limit = 2048  # default is 256
```

Please check [perf.py](https://github.com/vipshop/cache-dit/blob/main/bench/perf.py) for more details.

================================================
FILE: docs/source/en/optimization/coreml.md
================================================

# How to run Stable Diffusion with Core ML

[Core ML](https://developer.apple.com/documentation/coreml) is the model format and machine learning library supported by Apple frameworks. If you are interested in running Stable Diffusion models inside your macOS or iOS/iPadOS apps, this guide will show you how to convert existing PyTorch checkpoints into the Core ML format and use them for inference with Python or Swift.

Core ML models can leverage all the compute engines available in Apple devices: the CPU, the GPU, and the Apple Neural Engine (or ANE, a tensor-optimized accelerator available in Apple Silicon Macs and modern iPhones/iPads). Depending on the model and the device it's running on, Core ML can mix and match compute engines too, so some portions of the model may run on the CPU while others run on GPU, for example.

> [!TIP]
> You can also run the `diffusers` Python codebase on Apple Silicon Macs using the `mps` accelerator built into PyTorch. This approach is explained in depth in [the mps guide](mps), but it is not compatible with native apps.

## Stable Diffusion Core ML Checkpoints

Stable Diffusion weights (or checkpoints) are stored in the PyTorch format, so you need to convert them to the Core ML format before you can use them inside native apps.

Thankfully, Apple engineers developed [a conversion tool](https://github.com/apple/ml-stable-diffusion#-converting-models-to-core-ml) based on `diffusers` to convert the PyTorch checkpoints to Core ML.
Before you convert a model, though, take a moment to explore the Hugging Face Hub. Chances are the model you're interested in is already available in Core ML format:

- the [Apple](https://huggingface.co/apple) organization includes Stable Diffusion versions 1.4, 1.5, 2.0 base, and 2.1 base
- [coreml community](https://huggingface.co/coreml-community) includes custom finetuned models
- use this [filter](https://huggingface.co/models?pipeline_tag=text-to-image&library=coreml&p=2&sort=likes) to return all available Core ML checkpoints

If you can't find the model you're interested in, we recommend you follow the instructions for [Converting Models to Core ML](https://github.com/apple/ml-stable-diffusion#-converting-models-to-core-ml) by Apple.

## Selecting the Core ML Variant to Use

Stable Diffusion models can be converted to different Core ML variants intended for different purposes:

- The type of attention blocks used. The attention operation is used to "pay attention" to the relationship between different areas in the image representations and to understand how the image and text representations are related. Attention is compute- and memory-intensive, so different implementations exist that consider the hardware characteristics of different devices. For Core ML Stable Diffusion models, there are two attention variants:
    * `split_einsum` ([introduced by Apple](https://machinelearning.apple.com/research/neural-engine-transformers)) is optimized for the ANE, which is available in modern iPhones, iPads and M-series computers.
    * The "original" attention (the base implementation used in `diffusers`) is only compatible with CPU/GPU and not ANE. It can be *faster* to run your model on CPU + GPU using `original` attention than ANE. See [this performance benchmark](https://huggingface.co/blog/fast-mac-diffusers#performance-benchmarks) as well as some [additional measures provided by the community](https://github.com/huggingface/swift-coreml-diffusers/issues/31) for additional details.

- The supported inference framework.
    * `packages` are suitable for Python inference. This can be used to test converted Core ML models before attempting to integrate them inside native apps, or if you want to explore Core ML performance but don't need to support native apps. For example, an application with a web UI could use a Python Core ML backend.
    * `compiled` models are required for Swift code. The `compiled` models in the Hub split the large UNet model weights into several files for compatibility with iOS and iPadOS devices. This corresponds to the [`--chunk-unet` conversion option](https://github.com/apple/ml-stable-diffusion#-converting-models-to-core-ml). If you want to support native apps, then you need to select the `compiled` variant.

The official Core ML Stable Diffusion [models](https://huggingface.co/apple/coreml-stable-diffusion-v1-4/tree/main) include these variants, but the community ones may vary:

```
coreml-stable-diffusion-v1-4
├── README.md
├── original
│   ├── compiled
│   └── packages
└── split_einsum
    ├── compiled
    └── packages
```

You can download and use the variant you need as shown below.

## Core ML Inference in Python

Install the following libraries to run Core ML inference in Python:

```bash
pip install huggingface_hub
pip install git+https://github.com/apple/ml-stable-diffusion
```

### Download the Model Checkpoints

To run inference in Python, use one of the versions stored in the `packages` folders because the `compiled` ones are only compatible with Swift. You may choose whether you want to use `original` or `split_einsum` attention.
This is how you'd download the `original` attention variant from the Hub to a directory called `models`: ```Python from huggingface_hub import snapshot_download from pathlib import Path repo_id = "apple/coreml-stable-diffusion-v1-4" variant = "original/packages" model_path = Path("./models") / (repo_id.split("/")[-1] + "_" + variant.replace("/", "_")) snapshot_download(repo_id, allow_patterns=f"{variant}/*", local_dir=model_path, local_dir_use_symlinks=False) print(f"Model downloaded at {model_path}") ``` ### Inference[[python-inference]] Once you have downloaded a snapshot of the model, you can test it using Apple's Python script. ```shell python -m python_coreml_stable_diffusion.pipeline --prompt "a photo of an astronaut riding a horse on mars" -i ./models/coreml-stable-diffusion-v1-4_original_packages/original/packages -o --compute-unit CPU_AND_GPU --seed 93 ``` Pass the path of the downloaded checkpoint with `-i` flag to the script. `--compute-unit` indicates the hardware you want to allow for inference. It must be one of the following options: `ALL`, `CPU_AND_GPU`, `CPU_ONLY`, `CPU_AND_NE`. You may also provide an optional output path, and a seed for reproducibility. The inference script assumes you're using the original version of the Stable Diffusion model, `CompVis/stable-diffusion-v1-4`. If you use another model, you *have* to specify its Hub id in the inference command line, using the `--model-version` option. This works for models already supported and custom models you trained or fine-tuned yourself. 
For example, if you want to use [`stable-diffusion-v1-5/stable-diffusion-v1-5`](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5): ```shell python -m python_coreml_stable_diffusion.pipeline --prompt "a photo of an astronaut riding a horse on mars" --compute-unit ALL -o output --seed 93 -i models/coreml-stable-diffusion-v1-5_original_packages --model-version stable-diffusion-v1-5/stable-diffusion-v1-5 ``` ## Core ML inference in Swift Running inference in Swift is slightly faster than in Python because the models are already compiled in the `mlmodelc` format. This is noticeable on app startup when the model is loaded but shouldn’t be noticeable if you run several generations afterward. ### Download To run inference in Swift on your Mac, you need one of the `compiled` checkpoint versions. We recommend you download them locally using Python code similar to the previous example, but with one of the `compiled` variants: ```Python from huggingface_hub import snapshot_download from pathlib import Path repo_id = "apple/coreml-stable-diffusion-v1-4" variant = "original/compiled" model_path = Path("./models") / (repo_id.split("/")[-1] + "_" + variant.replace("/", "_")) snapshot_download(repo_id, allow_patterns=f"{variant}/*", local_dir=model_path, local_dir_use_symlinks=False) print(f"Model downloaded at {model_path}") ``` ### Inference[[swift-inference]] To run inference, please clone Apple's repo: ```bash git clone https://github.com/apple/ml-stable-diffusion cd ml-stable-diffusion ``` And then use Apple's command line tool, [Swift Package Manager](https://www.swift.org/package-manager/#): ```bash swift run StableDiffusionSample --resource-path models/coreml-stable-diffusion-v1-4_original_compiled --compute-units all "a photo of an astronaut riding a horse on mars" ``` You have to specify in `--resource-path` one of the checkpoints downloaded in the previous step, so please make sure it contains compiled Core ML bundles with the extension `.mlmodelc`. 
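
Because the download snippet keeps the repository's folder layout under the local directory, it can help to confirm which directory actually holds the `.mlmodelc` bundles before passing it to `--resource-path`. The sketch below uses only the standard library; `find_resource_dirs` is a hypothetical helper, and the bundle names in the demo are made up for illustration.

```python
import tempfile
from pathlib import Path


def find_resource_dirs(root):
    """Return directories at or below `root` that contain `.mlmodelc` bundles."""
    root = Path(root)
    if not root.is_dir():
        return []
    # Compiled bundles are directories, and rglob yields directories too.
    return sorted({bundle.parent for bundle in root.rglob("*.mlmodelc")})


# Demo on a throwaway directory that mimics a downloaded snapshot.
with tempfile.TemporaryDirectory() as tmp:
    compiled = Path(tmp) / "original" / "compiled"
    (compiled / "Unet.mlmodelc").mkdir(parents=True)  # hypothetical bundle name
    (compiled / "TextEncoder.mlmodelc").mkdir()       # hypothetical bundle name
    found = [d.relative_to(tmp).as_posix() for d in find_resource_dirs(tmp)]

print(found)
```

If the helper returns an empty list for your download directory, the path likely points at a `packages` variant or at the wrong folder level.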
The `--compute-units` has to be one of these values: `all`, `cpuOnly`, `cpuAndGPU`, `cpuAndNeuralEngine`. For more details, please refer to the [instructions in Apple's repo](https://github.com/apple/ml-stable-diffusion). ## Supported Diffusers Features The Core ML models and inference code don't support many of the features, options, and flexibility of 🧨 Diffusers. These are some of the limitations to keep in mind: - Core ML models are only suitable for inference. They can't be used for training or fine-tuning. - Only two schedulers have been ported to Swift, the default one used by Stable Diffusion and `DPMSolverMultistepScheduler`, which we ported to Swift from our `diffusers` implementation. We recommend you use `DPMSolverMultistepScheduler`, since it produces the same quality in about half the steps. - Negative prompts, classifier-free guidance scale, and image-to-image tasks are available in the inference code. Advanced features such as depth guidance, ControlNet, and latent upscalers are not available yet. Apple's [conversion and inference repo](https://github.com/apple/ml-stable-diffusion) and our own [swift-coreml-diffusers](https://github.com/huggingface/swift-coreml-diffusers) repos are intended as technology demonstrators to enable other developers to build upon. If you feel strongly about any missing features, please feel free to open a feature request or, better yet, a contribution PR 🙂. ## Native Diffusers Swift app One easy way to run Stable Diffusion on your own Apple hardware is to use [our open-source Swift repo](https://github.com/huggingface/swift-coreml-diffusers), based on `diffusers` and Apple's conversion and inference repo. You can study the code, compile it with [Xcode](https://developer.apple.com/xcode/) and adapt it for your own needs. For your convenience, there's also a [standalone Mac app in the App Store](https://apps.apple.com/app/diffusers/id1666309574), so you can play with it without having to deal with the code or IDE. 
If you are a developer and have determined that Core ML is the best solution to build your Stable Diffusion app, then you can use the rest of this guide to get started with your project. We can't wait to see what you'll build 🙂. ================================================ FILE: docs/source/en/optimization/deepcache.md ================================================ # DeepCache [DeepCache](https://huggingface.co/papers/2312.00858) accelerates [`StableDiffusionPipeline`] and [`StableDiffusionXLPipeline`] by strategically caching and reusing high-level features while efficiently updating low-level features by taking advantage of the U-Net architecture. Start by installing [DeepCache](https://github.com/horseee/DeepCache): ```bash pip install DeepCache ``` Then load and enable the [`DeepCacheSDHelper`](https://github.com/horseee/DeepCache#usage): ```diff import torch from diffusers import StableDiffusionPipeline pipe = StableDiffusionPipeline.from_pretrained('stable-diffusion-v1-5/stable-diffusion-v1-5', torch_dtype=torch.float16).to("cuda") + from DeepCache import DeepCacheSDHelper + helper = DeepCacheSDHelper(pipe=pipe) + helper.set_params( + cache_interval=3, + cache_branch_id=0, + ) + helper.enable() image = pipe("a photo of an astronaut on a moon").images[0] ``` The `set_params` method accepts two arguments: `cache_interval` and `cache_branch_id`. `cache_interval` means the frequency of feature caching, specified as the number of steps between each cache operation. `cache_branch_id` identifies which branch of the network (ordered from the shallowest to the deepest layer) is responsible for executing the caching processes. Opting for a lower `cache_branch_id` or a larger `cache_interval` can lead to faster inference speed at the expense of reduced image quality (ablation experiments of these two hyperparameters can be found in the [paper](https://huggingface.co/papers/2312.00858)). 
Once those arguments are set, use the `enable` or `disable` methods to activate or deactivate the `DeepCacheSDHelper`.
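To make the effect of `cache_interval` concrete, the sketch below models a uniform caching schedule in plain Python, with no dependency on DeepCache itself. It assumes the cache is refreshed at step 0 and then on every `cache_interval`-th step, with cached features reused in between. `cache_refresh_steps` is a hypothetical helper written for illustration and is not part of the DeepCache API.

```python
def cache_refresh_steps(num_inference_steps: int, cache_interval: int) -> list[int]:
    """Return the denoising steps that run the full U-Net and refresh the cache.

    Assumption: a uniform schedule that refreshes at step 0 and then every
    `cache_interval` steps; all other steps reuse the cached features.
    """
    if cache_interval < 1:
        raise ValueError("cache_interval must be at least 1")
    return list(range(0, num_inference_steps, cache_interval))


num_steps = 50
for interval in (1, 3, 5):
    full_steps = cache_refresh_steps(num_steps, interval)
    cached_steps = num_steps - len(full_steps)
    print(f"cache_interval={interval}: {len(full_steps)} full passes, {cached_steps} cached steps")
```

Under this simplified model, 50 steps with `cache_interval=3` leave 17 full passes and `cache_interval=5` leaves 10, which illustrates why a larger interval trades image quality for speed.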
You can find more generated samples (original pipeline vs DeepCache) and the corresponding inference latency in the [WandB report](https://wandb.ai/horseee/DeepCache/runs/jwlsqqgt?workspace=user-horseee). The prompts are randomly selected from the [MS-COCO 2017](https://cocodataset.org/#home) dataset.

## Benchmark

We measured how much DeepCache accelerates [Stable Diffusion v2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1) with 50 inference steps on an NVIDIA RTX A5000, using different configurations for resolution, batch size, cache interval (I), and cache branch (B).

| **Resolution** | **Batch size** | **Original** | **DeepCache(I=3, B=0)** | **DeepCache(I=5, B=0)** | **DeepCache(I=5, B=1)** |
|----------------|----------------|--------------|-------------------------|-------------------------|-------------------------|
| 512 | 8 | 15.96 | 6.88(2.32x) | 5.03(3.18x) | 7.27(2.20x) |
| | 4 | 8.39 | 3.60(2.33x) | 2.62(3.21x) | 3.75(2.24x) |
| | 1 | 2.61 | 1.12(2.33x) | 0.81(3.24x) | 1.11(2.35x) |
| 768 | 8 | 43.58 | 18.99(2.29x) | 13.96(3.12x) | 21.27(2.05x) |
| | 4 | 22.24 | 9.67(2.30x) | 7.10(3.13x) | 10.74(2.07x) |
| | 1 | 6.33 | 2.72(2.33x) | 1.97(3.21x) | 2.98(2.12x) |
| 1024 | 8 | 101.95 | 45.57(2.24x) | 33.72(3.02x) | 53.00(1.92x) |
| | 4 | 49.25 | 21.86(2.25x) | 16.19(3.04x) | 25.78(1.91x) |
| | 1 | 13.83 | 6.07(2.28x) | 4.43(3.12x) | 7.15(1.93x) |

================================================
FILE: docs/source/en/optimization/fp16.md
================================================

# Accelerate inference

Diffusion models are slow at inference because generation is an iterative process where noise is gradually refined into an image or video over a certain number of "steps". To speed up this process, you can experiment with different [schedulers](../api/schedulers/overview), reduce the precision of the model weights for faster computations, use more memory-efficient attention mechanisms, and more.
Combine and use these techniques together to make inference faster than using any single technique on its own. This guide will go over how to accelerate inference. ## Model data type The precision and data type of the model weights affect inference speed because a higher precision requires more memory to load and more time to perform the computations. PyTorch loads model weights in float32 or full precision by default, so changing the data type is a simple way to quickly get faster inference. bfloat16 is similar to float16 but it is more robust to numerical errors. Hardware support for bfloat16 varies, but most modern GPUs are capable of supporting bfloat16. ```py import torch from diffusers import StableDiffusionXLPipeline pipeline = StableDiffusionXLPipeline.from_pretrained( "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16 ).to("cuda") prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" pipeline(prompt, num_inference_steps=30).images[0] ``` float16 is similar to bfloat16 but may be more prone to numerical errors. ```py import torch from diffusers import StableDiffusionXLPipeline pipeline = StableDiffusionXLPipeline.from_pretrained( "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16 ).to("cuda") prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" pipeline(prompt, num_inference_steps=30).images[0] ``` [TensorFloat-32 (tf32)](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) mode is supported on NVIDIA Ampere GPUs and it computes the convolution and matrix multiplication operations in tf32. Storage and other operations are kept in float32. This enables significantly faster computations when combined with bfloat16 or float16. PyTorch only enables tf32 mode for convolutions by default and you'll need to explicitly enable it for matrix multiplications. 
```py import torch from diffusers import StableDiffusionXLPipeline torch.backends.cuda.matmul.allow_tf32 = True pipeline = StableDiffusionXLPipeline.from_pretrained( "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16 ).to("cuda") prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" pipeline(prompt, num_inference_steps=30).images[0] ``` Refer to the [mixed precision training](https://huggingface.co/docs/transformers/en/perf_train_gpu_one#mixed-precision) docs for more details. ## Scaled dot product attention > [!TIP] > Memory-efficient attention optimizes for inference speed *and* [memory usage](./memory#memory-efficient-attention)! [Scaled dot product attention (SDPA)](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) implements several attention backends, [FlashAttention](https://github.com/Dao-AILab/flash-attention), [xFormers](https://github.com/facebookresearch/xformers), and a native C++ implementation. It automatically selects the most optimal backend for your hardware. SDPA is enabled by default if you're using PyTorch >= 2.0 and no additional changes are required to your code. You could try experimenting with other attention backends though if you'd like to choose your own. The example below uses the [torch.nn.attention.sdpa_kernel](https://pytorch.org/docs/stable/generated/torch.nn.attention.sdpa_kernel.html) context manager to enable efficient attention. 
```py
from torch.nn.attention import SDPBackend, sdpa_kernel
import torch
from diffusers import StableDiffusionXLPipeline

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"

with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    image = pipeline(prompt, num_inference_steps=30).images[0]
```

## torch.compile

[torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) accelerates inference by compiling PyTorch code and operations into optimized kernels. Diffusers typically compiles the more compute-intensive models like the UNet, transformer, or VAE.

Enable the following compiler settings for maximum speed (refer to the [full list](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/config.py) for more options).

```py
import torch
from diffusers import StableDiffusionXLPipeline

torch._inductor.config.conv_1x1_as_mm = True
torch._inductor.config.coordinate_descent_tuning = True
torch._inductor.config.epilogue_fusion = False
torch._inductor.config.coordinate_descent_check_all_directions = True
```

Load and compile the UNet and VAE. There are several different modes you can choose from, but `"max-autotune"` optimizes for the fastest speed by compiling to a CUDA graph. CUDA graphs reduce overhead by launching multiple GPU operations through a single CPU operation.

> [!TIP]
> With PyTorch 2.3.1, you can control the caching behavior of torch.compile. This is particularly beneficial for compilation modes like `"max-autotune"`, which performs a grid search over several compilation flags to find the optimal configuration. Learn more in the [Compile Time Caching in torch.compile](https://pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html) tutorial.
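
As a rough sketch of the caching control mentioned in the tip, the snippet below sets two environment variables described in the PyTorch compile-time caching tutorial. The cache directory name is an arbitrary example chosen for illustration, and setting the variables before importing `torch` is the conservative choice assumed here. Whether this helps depends on your PyTorch version.

```python
import os
from pathlib import Path

# Assumed location for compiled artifacts; any writable directory works.
cache_dir = Path("torchinductor_cache").resolve()

# Enable Inductor's FX graph cache and point the compiler cache at cache_dir.
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"
os.environ["TORCHINDUCTOR_CACHE_DIR"] = str(cache_dir)

# Import torch and build the pipeline only after the variables are set.
print(f"Inductor cache directory: {os.environ['TORCHINDUCTOR_CACHE_DIR']}")
```

Reusing the same cache directory across runs lets a later process pick up artifacts compiled by an earlier one instead of starting from scratch.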
Changing the memory layout to [channels_last](./memory#torchchannels_last) also optimizes memory and inference speed.

```py
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipeline.unet.to(memory_format=torch.channels_last)
pipeline.vae.to(memory_format=torch.channels_last)
pipeline.unet = torch.compile(
    pipeline.unet, mode="max-autotune", fullgraph=True
)
pipeline.vae.decode = torch.compile(
    pipeline.vae.decode,
    mode="max-autotune",
    fullgraph=True
)

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
pipeline(prompt, num_inference_steps=30).images[0]
```

Compilation is slow the first time, but once compiled, inference is significantly faster. Try to only use the compiled pipeline on the same type of inference operations. Calling the compiled pipeline on a different image size retriggers compilation, which is slow and inefficient.

### Dynamic shape compilation

> [!TIP]
> Make sure to always use the nightly version of PyTorch for better support.

`torch.compile` keeps track of input shapes and conditions, and if these are different, it recompiles the model. For example, if a model is compiled on a 1024x1024 resolution image and used on an image with a different resolution, it triggers recompilation.

To avoid this, add `dynamic=True` so the compiler tries to generate a more dynamic kernel that can be reused when conditions change.

```diff
+ torch.fx.experimental._config.use_duck_shape = False
+ pipeline.unet = torch.compile(
    pipeline.unet, fullgraph=True, dynamic=True
)
```

Setting `use_duck_shape=False` tells the compiler not to reuse a single symbolic variable for input sizes that happen to be equal. For more details, check out this [comment](https://github.com/huggingface/diffusers/pull/11327#discussion_r2047659790).

Not all models benefit from dynamic compilation out of the box, and some may require changes.
Refer to this [PR](https://github.com/huggingface/diffusers/pull/11297/) that improved the [`AuraFlowPipeline`] implementation to benefit from dynamic compilation. Feel free to open an issue if dynamic compilation doesn't work as expected for a Diffusers model. ### Regional compilation [Regional compilation](https://docs.pytorch.org/tutorials/recipes/regional_compilation.html) trims cold-start latency by only compiling the *small and frequently-repeated block(s)* of a model - typically a transformer layer - and enables reusing compiled artifacts for every subsequent occurrence. For many diffusion architectures, this delivers the same runtime speedups as full-graph compilation and reduces compile time by 8–10x. Use the [`~ModelMixin.compile_repeated_blocks`] method, a helper that wraps `torch.compile`, on any component such as the transformer model as shown below. ```py # pip install -U diffusers import torch from diffusers import StableDiffusionXLPipeline pipeline = StableDiffusionXLPipeline.from_pretrained( "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, ).to("cuda") # compile only the repeated transformer layers inside the UNet pipeline.unet.compile_repeated_blocks(fullgraph=True) ``` To enable regional compilation for a new model, add a `_repeated_blocks` attribute to a model class containing the class names (as strings) of the blocks you want to compile. ```py class MyUNet(ModelMixin): _repeated_blocks = ("Transformer2DModel",) # ← compiled by default ``` > [!TIP] > For more regional compilation examples, see the reference [PR](https://github.com/huggingface/diffusers/pull/11705). There is also a [compile_regions](https://github.com/huggingface/accelerate/blob/273799c85d849a1954a4f2e65767216eb37fa089/src/accelerate/utils/other.py#L78) method in [Accelerate](https://huggingface.co/docs/accelerate/index) that automatically selects candidate blocks in a model to compile. The remaining graph is compiled separately. 
This is useful for quick experiments because there aren't as many options for you to set which blocks to compile or adjust compilation flags. ```py # pip install -U accelerate import torch from diffusers import StableDiffusionXLPipeline from accelerate.utils import compile_regions pipeline = StableDiffusionXLPipeline.from_pretrained( "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16 ).to("cuda") pipeline.unet = compile_regions(pipeline.unet, mode="reduce-overhead", fullgraph=True) ``` [`~ModelMixin.compile_repeated_blocks`] is intentionally explicit. List the blocks to repeat in `_repeated_blocks` and the helper only compiles those blocks. It offers predictable behavior and easy reasoning about cache reuse in one line of code. ### Graph breaks It is important to specify `fullgraph=True` in torch.compile to ensure there are no graph breaks in the underlying model. This allows you to take advantage of torch.compile without any performance degradation. For the UNet and VAE, this changes how you access the return variables. ```diff - latents = unet( - latents, timestep=timestep, encoder_hidden_states=prompt_embeds -).sample + latents = unet( + latents, timestep=timestep, encoder_hidden_states=prompt_embeds, return_dict=False +)[0] ``` ### GPU sync The `step()` function is [called](https://github.com/huggingface/diffusers/blob/1d686bac8146037e97f3fd8c56e4063230f71751/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py#L1228) on the scheduler each time after the denoiser makes a prediction, and the `sigmas` variable is [indexed](https://github.com/huggingface/diffusers/blob/1d686bac8146037e97f3fd8c56e4063230f71751/src/diffusers/schedulers/scheduling_euler_discrete.py#L476). When placed on the GPU, it introduces latency because of the communication sync between the CPU and GPU. It becomes more evident when the denoiser has already been compiled. 
In general, the `sigmas` should [stay on the CPU](https://github.com/huggingface/diffusers/blob/35a969d297cba69110d175ee79c59312b9f49e1e/src/diffusers/schedulers/scheduling_euler_discrete.py#L240) to avoid the communication sync and latency. > [!TIP] > Refer to the [torch.compile and Diffusers: A Hands-On Guide to Peak Performance](https://pytorch.org/blog/torch-compile-and-diffusers-a-hands-on-guide-to-peak-performance/) blog post for maximizing performance with `torch.compile` for diffusion models. ### Benchmarks Refer to the [diffusers/benchmarks](https://huggingface.co/datasets/diffusers/benchmarks) dataset to see inference latency and memory usage data for compiled pipelines. The [diffusers-torchao](https://github.com/sayakpaul/diffusers-torchao#benchmarking-results) repository also contains benchmarking results for compiled versions of Flux and CogVideoX. ## Dynamic quantization [Dynamic quantization](https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html) improves inference speed by reducing precision to enable faster math operations. This particular type of quantization determines how to scale the activations based on the data at runtime rather than using a fixed scaling factor. As a result, the scaling factor is more accurately aligned with the data. The example below applies [dynamic int8 quantization](https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html) to the UNet and VAE with the [torchao](../quantization/torchao) library. > [!TIP] > Refer to our [torchao](../quantization/torchao) docs to learn more about how to use the Diffusers torchao integration. Configure the compiler tags for maximum speed. 
```py import torch from torchao import apply_dynamic_quant from diffusers import StableDiffusionXLPipeline torch._inductor.config.conv_1x1_as_mm = True torch._inductor.config.coordinate_descent_tuning = True torch._inductor.config.epilogue_fusion = False torch._inductor.config.coordinate_descent_check_all_directions = True torch._inductor.config.force_fuse_int_mm_with_mul = True torch._inductor.config.use_mixed_mm = True ``` Filter out some linear layers in the UNet and VAE which don't benefit from dynamic quantization with the [dynamic_quant_filter_fn](https://github.com/huggingface/diffusion-fast/blob/0f169640b1db106fe6a479f78c1ed3bfaeba3386/utils/pipeline_utils.py#L16). ```py pipeline = StableDiffusionXLPipeline.from_pretrained( "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16 ).to("cuda") apply_dynamic_quant(pipeline.unet, dynamic_quant_filter_fn) apply_dynamic_quant(pipeline.vae, dynamic_quant_filter_fn) prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" pipeline(prompt, num_inference_steps=30).images[0] ``` ## Fused projection matrices > [!WARNING] > The [fuse_qkv_projections](https://github.com/huggingface/diffusers/blob/58431f102cf39c3c8a569f32d71b2ea8caa461e1/src/diffusers/pipelines/pipeline_utils.py#L2034) method is experimental and support is limited to mostly Stable Diffusion pipelines. Take a look at this [PR](https://github.com/huggingface/diffusers/pull/6179) to learn more about how to enable it for other pipelines An input is projected into three subspaces, represented by the projection matrices Q, K, and V, in an attention block. These projections are typically calculated separately, but you can horizontally combine these into a single matrix and perform the projection in a single step. It increases the size of the matrix multiplications of the input projections and also improves the impact of quantization. 
```py pipeline.fuse_qkv_projections() ``` ## Resources - Read the [Presenting Flux Fast: Making Flux go brrr on H100s](https://pytorch.org/blog/presenting-flux-fast-making-flux-go-brrr-on-h100s/) blog post to learn more about how you can combine all of these optimizations with [TorchInductor](https://docs.pytorch.org/docs/stable/torch.compiler.html) and [AOTInductor](https://docs.pytorch.org/docs/stable/torch.compiler_aot_inductor.html) for a ~2.5x speedup using recipes from [flux-fast](https://github.com/huggingface/flux-fast). These recipes support AMD hardware and [Flux.1 Kontext Dev](https://huggingface.co/black-forest-labs/FLUX.1-Kontext-dev). - Read the [torch.compile and Diffusers: A Hands-On Guide to Peak Performance](https://pytorch.org/blog/torch-compile-and-diffusers-a-hands-on-guide-to-peak-performance/) blog post to maximize performance when using `torch.compile`. ================================================ FILE: docs/source/en/optimization/habana.md ================================================ # Intel Gaudi The Intel Gaudi AI accelerator family includes [Intel Gaudi 1](https://habana.ai/products/gaudi/), [Intel Gaudi 2](https://habana.ai/products/gaudi2/), and [Intel Gaudi 3](https://habana.ai/products/gaudi3/). Each server is equipped with 8 devices, known as Habana Processing Units (HPUs), providing 128GB of memory on Gaudi 3, 96GB on Gaudi 2, and 32GB on the first-gen Gaudi. For more details on the underlying hardware architecture, check out the [Gaudi Architecture](https://docs.habana.ai/en/latest/Gaudi_Overview/Gaudi_Architecture.html) overview. Diffusers pipelines can take advantage of HPU acceleration, even if a pipeline hasn't been added to [Optimum for Intel Gaudi](https://huggingface.co/docs/optimum/main/en/habana/index) yet, with the [GPU Migration Toolkit](https://docs.habana.ai/en/latest/PyTorch/PyTorch_Model_Porting/GPU_Migration_Toolkit/GPU_Migration_Toolkit.html). 
Call `.to("hpu")` on your pipeline to move it to an HPU device as shown below for Flux:

```py
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16)
pipeline.to("hpu")

image = pipeline("An image of a squirrel in Picasso style").images[0]
```

> [!TIP]
> For Gaudi-optimized diffusion pipeline implementations, we recommend using [Optimum for Intel Gaudi](https://huggingface.co/docs/optimum/main/en/habana/index).

================================================
FILE: docs/source/en/optimization/memory.md
================================================

# Reduce memory usage

Modern diffusion models like [Flux](../api/pipelines/flux) and [Wan](../api/pipelines/wan) have billions of parameters that take up a lot of memory on your hardware for inference. This is challenging because common GPUs often don't have sufficient memory. To overcome the memory limitations, you can use more than one GPU (if available), offload some of the pipeline components to the CPU, and more.

This guide will show you how to reduce your memory usage.

> [!TIP]
> Keep in mind these techniques may need to be adjusted depending on the model. For example, a transformer-based diffusion model may not benefit equally from these memory optimizations as a UNet-based model.

## Multiple GPUs

If you have access to more than one GPU, there are a few options for efficiently loading and distributing a large model across your hardware. These features are supported by the [Accelerate](https://huggingface.co/docs/accelerate/index) library, so make sure it is installed first.

```bash
pip install -U accelerate
```

### Sharded checkpoints

Loading large checkpoints in several shards is useful because the shards are loaded one at a time. This keeps memory usage low, only requiring enough memory for the model size and the largest shard size. We recommend sharding when the fp32 checkpoint is greater than 5GB.
The default shard size is 5GB. Shard a checkpoint in [`~DiffusionPipeline.save_pretrained`] with the `max_shard_size` parameter. ```py from diffusers import AutoModel unet = AutoModel.from_pretrained( "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet" ) unet.save_pretrained("sdxl-unet-sharded", max_shard_size="5GB") ``` Now you can use the sharded checkpoint, instead of the regular checkpoint, to save memory. ```py import torch from diffusers import AutoModel, StableDiffusionXLPipeline unet = AutoModel.from_pretrained( "username/sdxl-unet-sharded", torch_dtype=torch.float16 ) pipeline = StableDiffusionXLPipeline.from_pretrained( "stabilityai/stable-diffusion-xl-base-1.0", unet=unet, torch_dtype=torch.float16 ).to("cuda") ``` ### Device placement > [!WARNING] > Device placement is an experimental feature and the API may change. Only the `balanced` strategy is supported at the moment. We plan to support additional mapping strategies in the future. The `device_map` parameter controls how the model components in a pipeline or the layers in an individual model are distributed across devices. The `balanced` device placement strategy evenly splits the pipeline across all available devices. ```py import torch from diffusers import AutoModel, StableDiffusionXLPipeline pipeline = StableDiffusionXLPipeline.from_pretrained( "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, device_map="balanced" ) ``` You can inspect a pipeline's device map with `hf_device_map`. ```py print(pipeline.hf_device_map) {'unet': 1, 'vae': 1, 'safety_checker': 0, 'text_encoder': 0} ``` The `device_map` is useful for loading large models, such as the Flux diffusion transformer which has 12.5B parameters. Set it to `"auto"` to automatically distribute a model across the fastest device first before moving to slower devices. Refer to the [Model sharding](../training/distributed_inference#model-sharding) docs for more details. 
```py import torch from diffusers import AutoModel transformer = AutoModel.from_pretrained( "black-forest-labs/FLUX.1-dev", subfolder="transformer", device_map="auto", torch_dtype=torch.bfloat16 ) ``` You can inspect a model's device map with `hf_device_map`. ```py print(transformer.hf_device_map) ``` When designing your own `device_map`, it should be a dictionary of a model's specific module name or layer and a device identifier (an integer for GPUs, `cpu` for CPUs, and `disk` for disk). Call `hf_device_map` on a model to see how model layers are distributed and then design your own. ```py print(transformer.hf_device_map) {'pos_embed': 0, 'time_text_embed': 0, 'context_embedder': 0, 'x_embedder': 0, 'transformer_blocks': 0, 'single_transformer_blocks.0': 0, 'single_transformer_blocks.1': 0, 'single_transformer_blocks.2': 0, 'single_transformer_blocks.3': 0, 'single_transformer_blocks.4': 0, 'single_transformer_blocks.5': 0, 'single_transformer_blocks.6': 0, 'single_transformer_blocks.7': 0, 'single_transformer_blocks.8': 0, 'single_transformer_blocks.9': 0, 'single_transformer_blocks.10': 'cpu', 'single_transformer_blocks.11': 'cpu', 'single_transformer_blocks.12': 'cpu', 'single_transformer_blocks.13': 'cpu', 'single_transformer_blocks.14': 'cpu', 'single_transformer_blocks.15': 'cpu', 'single_transformer_blocks.16': 'cpu', 'single_transformer_blocks.17': 'cpu', 'single_transformer_blocks.18': 'cpu', 'single_transformer_blocks.19': 'cpu', 'single_transformer_blocks.20': 'cpu', 'single_transformer_blocks.21': 'cpu', 'single_transformer_blocks.22': 'cpu', 'single_transformer_blocks.23': 'cpu', 'single_transformer_blocks.24': 'cpu', 'single_transformer_blocks.25': 'cpu', 'single_transformer_blocks.26': 'cpu', 'single_transformer_blocks.27': 'cpu', 'single_transformer_blocks.28': 'cpu', 'single_transformer_blocks.29': 'cpu', 'single_transformer_blocks.30': 'cpu', 'single_transformer_blocks.31': 'cpu', 'single_transformer_blocks.32': 'cpu', 
'single_transformer_blocks.33': 'cpu', 'single_transformer_blocks.34': 'cpu', 'single_transformer_blocks.35': 'cpu', 'single_transformer_blocks.36': 'cpu', 'single_transformer_blocks.37': 'cpu', 'norm_out': 'cpu', 'proj_out': 'cpu'} ``` For example, the `device_map` below places `single_transformer_blocks.10` through `single_transformer_blocks.20` on a second GPU (`1`). ```py import torch from diffusers import AutoModel device_map = { 'pos_embed': 0, 'time_text_embed': 0, 'context_embedder': 0, 'x_embedder': 0, 'transformer_blocks': 0, 'single_transformer_blocks.0': 0, 'single_transformer_blocks.1': 0, 'single_transformer_blocks.2': 0, 'single_transformer_blocks.3': 0, 'single_transformer_blocks.4': 0, 'single_transformer_blocks.5': 0, 'single_transformer_blocks.6': 0, 'single_transformer_blocks.7': 0, 'single_transformer_blocks.8': 0, 'single_transformer_blocks.9': 0, 'single_transformer_blocks.10': 1, 'single_transformer_blocks.11': 1, 'single_transformer_blocks.12': 1, 'single_transformer_blocks.13': 1, 'single_transformer_blocks.14': 1, 'single_transformer_blocks.15': 1, 'single_transformer_blocks.16': 1, 'single_transformer_blocks.17': 1, 'single_transformer_blocks.18': 1, 'single_transformer_blocks.19': 1, 'single_transformer_blocks.20': 1, 'single_transformer_blocks.21': 'cpu', 'single_transformer_blocks.22': 'cpu', 'single_transformer_blocks.23': 'cpu', 'single_transformer_blocks.24': 'cpu', 'single_transformer_blocks.25': 'cpu', 'single_transformer_blocks.26': 'cpu', 'single_transformer_blocks.27': 'cpu', 'single_transformer_blocks.28': 'cpu', 'single_transformer_blocks.29': 'cpu', 'single_transformer_blocks.30': 'cpu', 'single_transformer_blocks.31': 'cpu', 'single_transformer_blocks.32': 'cpu', 'single_transformer_blocks.33': 'cpu', 'single_transformer_blocks.34': 'cpu', 'single_transformer_blocks.35': 'cpu', 'single_transformer_blocks.36': 'cpu', 'single_transformer_blocks.37': 'cpu', 'norm_out': 'cpu', 'proj_out': 'cpu' } transformer = 
AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    device_map=device_map,
    torch_dtype=torch.bfloat16
)
```

Pass a dictionary mapping each device to its maximum memory usage to enforce a limit. If a device is not in `max_memory`, it is ignored and pipeline components won't be distributed to it.

```py
import torch
from diffusers import AutoModel, StableDiffusionXLPipeline

max_memory = {0:"1GB", 1:"1GB"}
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    device_map="balanced",
    max_memory=max_memory
)
```

Diffusers uses the maximum memory of all devices by default, but if the models don't fit on the GPUs, then you'll need to use a single GPU and offload to the CPU with the methods below.

- [`~DiffusionPipeline.enable_model_cpu_offload`] only works on a single GPU but a very large model may not fit on it
- [`~DiffusionPipeline.enable_sequential_cpu_offload`] may work but it is extremely slow and also limited to a single GPU

Use the [`~DiffusionPipeline.reset_device_map`] method to reset the `device_map`. This is necessary if you want to use methods like `.to()`, [`~DiffusionPipeline.enable_sequential_cpu_offload`], and [`~DiffusionPipeline.enable_model_cpu_offload`] on a pipeline that was device-mapped.

```py
pipeline.reset_device_map()
```

## VAE slicing

VAE slicing saves memory by splitting a large batch of inputs into single-item batches and processing them separately. This method works best when generating more than one image at a time.

For example, if you're generating 4 images at once, decoding would increase peak activation memory by 4x. VAE slicing reduces this by only decoding 1 image at a time instead of all 4 images at once.

Call [`~StableDiffusionPipeline.enable_vae_slicing`] to enable sliced VAE. You can expect a small increase in performance when decoding multi-image batches and no performance impact for single-image batches.
```py import torch from diffusers import AutoModel, StableDiffusionXLPipeline pipeline = StableDiffusionXLPipeline.from_pretrained( "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, ).to("cuda") pipeline.enable_vae_slicing() pipeline(["An astronaut riding a horse on Mars"]*32).images[0] print(f"Max memory reserved: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB") ``` > [!WARNING] > The [`AutoencoderKLWan`] and [`AsymmetricAutoencoderKL`] classes don't support slicing. ## VAE tiling VAE tiling saves memory by dividing an image into smaller overlapping tiles instead of processing the entire image at once. This also reduces peak memory usage because the GPU is only processing a tile at a time. Call [`~StableDiffusionPipeline.enable_vae_tiling`] to enable VAE tiling. The generated image may have some tone variation from tile-to-tile because they're decoded separately, but there shouldn't be any obvious seams between the tiles. Tiling is disabled for resolutions lower than a pre-specified (but configurable) limit. For example, this limit is 512x512 for the VAE in [`StableDiffusionPipeline`]. ```py import torch from diffusers import AutoPipelineForImage2Image from diffusers.utils import load_image pipeline = AutoPipelineForImage2Image.from_pretrained( "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16 ).to("cuda") pipeline.enable_vae_tiling() init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-sdxl-init.png") prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" pipeline(prompt, image=init_image, strength=0.5).images[0] print(f"Max memory reserved: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB") ``` > [!WARNING] > [`AutoencoderKLWan`] and [`AsymmetricAutoencoderKL`] don't support tiling. ## Offloading Offloading strategies move layers or models that aren't currently in use to the CPU to reduce GPU memory usage.
These strategies can be combined with quantization and torch.compile to balance inference speed and memory usage. Refer to the [Compile and offloading quantized models](./speed-memory-optims) guide for more details. ### CPU offloading CPU offloading selectively moves weights from the GPU to the CPU. When a component is required, it is transferred to the GPU and when it isn't required, it is moved to the CPU. This method works on submodules rather than whole models. It saves memory by avoiding storing the entire model on the GPU. CPU offloading dramatically reduces memory usage, but it is also **extremely slow** because submodules are passed back and forth multiple times between devices. It can often be impractical due to how slow it is. > [!WARNING] > Don't move the pipeline to CUDA before calling [`~DiffusionPipeline.enable_sequential_cpu_offload`], otherwise the amount of memory saved is only minimal (refer to this [issue](https://github.com/huggingface/diffusers/issues/1934) for more details). This is a stateful operation that installs hooks on the model. Call [`~DiffusionPipeline.enable_sequential_cpu_offload`] to enable it on a pipeline. ```py import torch from diffusers import DiffusionPipeline pipeline = DiffusionPipeline.from_pretrained( "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16 ) pipeline.enable_sequential_cpu_offload() pipeline( prompt="An astronaut riding a horse on Mars", guidance_scale=0., height=768, width=1360, num_inference_steps=4, max_sequence_length=256, ).images[0] print(f"Max memory reserved: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB") ``` ### Model offloading Model offloading moves entire models to the GPU instead of selectively moving *some* layers or model components. One of the main pipeline models, usually the text encoder, UNet, and VAE, is placed on the GPU while the other components are held on the CPU. 
Components like the UNet that run multiple times stay on the GPU until they're completely finished and no longer needed. This eliminates the communication overhead of [CPU offloading](#cpu-offloading) and makes model offloading a faster alternative. The tradeoff is that the memory savings won't be as large. > [!WARNING] > Keep in mind that if models are reused outside the pipeline after hooks have been installed (see [Removing Hooks](https://huggingface.co/docs/accelerate/en/package_reference/big_modeling#accelerate.hooks.remove_hook_from_module) for more details), you need to run the entire pipeline and models in the expected order to properly offload them. This is a stateful operation that installs hooks on the model. Call [`~DiffusionPipeline.enable_model_cpu_offload`] to enable it on a pipeline. ```py import torch from diffusers import DiffusionPipeline pipeline = DiffusionPipeline.from_pretrained( "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16 ) pipeline.enable_model_cpu_offload() pipeline( prompt="An astronaut riding a horse on Mars", guidance_scale=0., height=768, width=1360, num_inference_steps=4, max_sequence_length=256, ).images[0] print(f"Max memory reserved: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB") ``` [`~DiffusionPipeline.enable_model_cpu_offload`] also helps when you're using the [`~StableDiffusionXLPipeline.encode_prompt`] method on its own to generate the text encoders' hidden states. ### Group offloading Group offloading moves groups of internal layers ([torch.nn.ModuleList](https://pytorch.org/docs/stable/generated/torch.nn.ModuleList.html) or [torch.nn.Sequential](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html)) to the CPU. It uses less memory than [model offloading](#model-offloading) and is faster than [CPU offloading](#cpu-offloading) because it reduces communication overhead.
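
As a rough picture of what "groups of internal layers" means, the sketch below partitions an ordered list of layers into consecutive groups that would be moved on and off the GPU together. It is a hypothetical illustration only: `make_groups` is not a Diffusers function, the layer names are made up, and the real implementation works through hooks on the modules rather than on a list of strings.

```python
from typing import List, Sequence


def make_groups(layers: Sequence[str], num_blocks_per_group: int) -> List[List[str]]:
    """Split an ordered list of layers into consecutive groups."""
    if num_blocks_per_group < 1:
        raise ValueError("num_blocks_per_group must be at least 1")
    return [
        list(layers[i : i + num_blocks_per_group])
        for i in range(0, len(layers), num_blocks_per_group)
    ]


# A model with 40 layers, grouped two at a time.
layers = [f"block_{i}" for i in range(40)]
groups = make_groups(layers, num_blocks_per_group=2)
print(len(groups))  # 20
```

Larger groups mean fewer transfers but more GPU memory held at once; smaller groups do the opposite.
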
> [!WARNING] > Group offloading may not work with all models if the forward implementation contains weight-dependent device casting of inputs because it may clash with group offloading's device casting mechanism. Enable group offloading by configuring the `offload_type` parameter to `block_level` or `leaf_level`. - `block_level` offloads groups of layers based on the `num_blocks_per_group` parameter. For example, if `num_blocks_per_group=2` on a model with 40 layers, 2 layers are onloaded and offloaded at a time (20 total onloads/offloads). This drastically reduces memory requirements. - `leaf_level` offloads individual layers at the lowest level and is equivalent to [CPU offloading](#cpu-offloading). But it can be made faster if you use streams without giving up inference speed. Group offloading is supported for entire pipelines or individual models. Applying group offloading to the entire pipeline is the easiest option while selectively applying it to individual models gives users more flexibility to use different offloading techniques for different models. Call [`~DiffusionPipeline.enable_group_offload`] on a pipeline. ```py import torch from diffusers import CogVideoXPipeline from diffusers.hooks import apply_group_offloading from diffusers.utils import export_to_video onload_device = torch.device("cuda") offload_device = torch.device("cpu") pipeline = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16) pipeline.enable_group_offload( onload_device=onload_device, offload_device=offload_device, offload_type="leaf_level", use_stream=True ) prompt = ( "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. " "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other " "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, " "casting a gentle glow on the scene. 
The panda's face is expressive, showing concentration and joy as it plays. " "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical " "atmosphere of this unique musical performance." ) video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0] print(f"Max memory reserved: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB") export_to_video(video, "output.mp4", fps=8) ``` Call [`~ModelMixin.enable_group_offload`] on standard Diffusers model components that inherit from [`ModelMixin`]. For other model components that don't inherit from [`ModelMixin`], such as a generic [torch.nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html), use [`~hooks.apply_group_offloading`] instead. ```py import torch from diffusers import CogVideoXPipeline from diffusers.hooks import apply_group_offloading from diffusers.utils import export_to_video onload_device = torch.device("cuda") offload_device = torch.device("cpu") pipeline = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16) # Use the enable_group_offload method for Diffusers model implementations pipeline.transformer.enable_group_offload(onload_device=onload_device, offload_device=offload_device, offload_type="leaf_level") pipeline.vae.enable_group_offload(onload_device=onload_device, offload_type="leaf_level") # Use the apply_group_offloading method for other model components apply_group_offloading(pipeline.text_encoder, onload_device=onload_device, offload_type="block_level", num_blocks_per_group=2) prompt = ( "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. " "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other " "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, " "casting a gentle glow on the scene. 
The panda's face is expressive, showing concentration and joy as it plays. " "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical " "atmosphere of this unique musical performance." ) video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0] print(f"Max memory reserved: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB") export_to_video(video, "output.mp4", fps=8) ``` #### CUDA stream The `use_stream` parameter can be activated for CUDA devices that support asynchronous data transfer streams to reduce overall execution time compared to [CPU offloading](#cpu-offloading). It overlaps data transfer and computation by using layer prefetching. The next layer to be executed is loaded onto the GPU while the current layer is still being executed. It can increase CPU memory significantly so ensure you have 2x the amount of memory as the model size. Set `record_stream=True` for more of a speedup at the cost of slightly increased memory usage. Refer to the [torch.Tensor.record_stream](https://pytorch.org/docs/stable/generated/torch.Tensor.record_stream.html) docs to learn more. > [!TIP] > When `use_stream=True` on VAEs with tiling enabled, make sure to do a dummy forward pass (possible with dummy inputs as well) before inference to avoid device mismatch errors. This may not work on all implementations, so feel free to open an issue if you encounter any problems. If you're using `block_level` group offloading with `use_stream` enabled, the `num_blocks_per_group` parameter should be set to `1`, otherwise a warning will be raised. ```py pipeline.transformer.enable_group_offload(onload_device=onload_device, offload_device=offload_device, offload_type="leaf_level", use_stream=True, record_stream=True) ``` The `low_cpu_mem_usage` parameter can be set to `True` to reduce CPU memory usage when using streams during group offloading. 
It is best for `leaf_level` offloading and when CPU memory is bottlenecked. Memory is saved by creating pinned tensors on the fly instead of pre-pinning them. However, this may increase overall execution time. #### Offloading to disk Group offloading can consume significant system memory depending on the model size. On systems with limited memory, try group offloading onto the disk as a secondary memory. Set the `offload_to_disk_path` argument in either [`~ModelMixin.enable_group_offload`] or [`~hooks.apply_group_offloading`] to offload the model to the disk. ```py pipeline.transformer.enable_group_offload(onload_device=onload_device, offload_device=offload_device, offload_type="leaf_level", offload_to_disk_path="path/to/disk") apply_group_offloading(pipeline.text_encoder, onload_device=onload_device, offload_type="block_level", num_blocks_per_group=2, offload_to_disk_path="path/to/disk") ``` Refer to these [two](https://github.com/huggingface/diffusers/pull/11682#issue-3129365363) [tables](https://github.com/huggingface/diffusers/pull/11682#issuecomment-2955715126) to compare the speed and memory trade-offs. ## Layerwise casting > [!TIP] > Combine layerwise casting with [group offloading](#group-offloading) for even more memory savings. Layerwise casting stores weights in a smaller data format (for example, `torch.float8_e4m3fn` and `torch.float8_e5m2`) to use less memory and upcasts those weights to a higher precision like `torch.float16` or `torch.bfloat16` for computation. Certain layers (normalization and modulation related weights) are skipped because storing them in fp8 can degrade generation quality. > [!WARNING] > Layerwise casting may not work with all models if the forward implementation contains internal typecasting of weights. 
The current implementation of layerwise casting assumes the forward pass is independent of the weight precision and the input datatypes are always specified in `compute_dtype` (see [here](https://github.com/huggingface/transformers/blob/7f5077e53682ca855afc826162b204ebf809f1f9/src/transformers/models/t5/modeling_t5.py#L294-L299) for an incompatible implementation). > > Layerwise casting may also fail on custom modeling implementations with [PEFT](https://huggingface.co/docs/peft/index) layers. There are some checks available but they are not extensively tested or guaranteed to work in all cases. Call [`~ModelMixin.enable_layerwise_casting`] to set the storage and computation datatypes. ```py import torch from diffusers import CogVideoXPipeline, CogVideoXTransformer3DModel from diffusers.utils import export_to_video transformer = CogVideoXTransformer3DModel.from_pretrained( "THUDM/CogVideoX-5b", subfolder="transformer", torch_dtype=torch.bfloat16 ) transformer.enable_layerwise_casting(storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16) pipeline = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", transformer=transformer, torch_dtype=torch.bfloat16 ).to("cuda") prompt = ( "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. " "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other " "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, " "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. " "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical " "atmosphere of this unique musical performance." 
) video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0] print(f"Max memory reserved: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB") export_to_video(video, "output.mp4", fps=8) ``` The [`~hooks.apply_layerwise_casting`] method can also be used if you need more control and flexibility. It can be partially applied to model layers by calling it on specific internal modules. Use the `skip_modules_pattern` or `skip_modules_classes` parameters to specify modules to avoid, such as the normalization and modulation layers. ```python import torch from diffusers import CogVideoXTransformer3DModel from diffusers.hooks import apply_layerwise_casting transformer = CogVideoXTransformer3DModel.from_pretrained( "THUDM/CogVideoX-5b", subfolder="transformer", torch_dtype=torch.bfloat16 ) # skip modules whose names match "norm" (name patterns go in skip_modules_pattern; skip_modules_classes expects module classes) apply_layerwise_casting( transformer, storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16, skip_modules_pattern=["norm"], non_blocking=True, ) ``` ## torch.channels_last [torch.channels_last](https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html) flips how tensors are stored from `(batch size, channels, height, width)` to `(batch size, height, width, channels)`. This aligns the tensors with how the hardware sequentially accesses the tensors stored in memory and avoids skipping around in memory to access the pixel values. Not all operators currently support the channels-last format, and using it may result in worse performance, but it is still worth trying.
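
The stride change can be worked out by hand. The sketch below computes the element strides of a 4D tensor in both layouts using plain Python, so it runs without PyTorch. The helper names are hypothetical, and the shape `(4, 320, 3, 3)` is an assumption chosen to match the strides printed in the example that follows (the leading dimension doesn't affect the strides).

```python
from typing import Tuple


def contiguous_strides(n: int, c: int, h: int, w: int) -> Tuple[int, int, int, int]:
    """Strides (in elements) of an (N, C, H, W) tensor in the default layout."""
    return (c * h * w, h * w, w, 1)


def channels_last_strides(n: int, c: int, h: int, w: int) -> Tuple[int, int, int, int]:
    """Strides of the same logical (N, C, H, W) tensor stored in N, H, W, C order."""
    return (h * w * c, 1, w * c, c)


shape = (4, 320, 3, 3)  # assumed (out_channels, in_channels, kernel_h, kernel_w)
print(contiguous_strides(*shape))     # (2880, 9, 3, 1)
print(channels_last_strides(*shape))  # (2880, 1, 960, 320)
```

A stride of 1 in the channel dimension means neighboring channels of the same pixel sit next to each other in memory.
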
```py print(pipeline.unet.conv_out.state_dict()["weight"].stride()) # (2880, 9, 3, 1) pipeline.unet.to(memory_format=torch.channels_last) # in-place operation print( pipeline.unet.conv_out.state_dict()["weight"].stride() ) # (2880, 1, 960, 320) having a stride of 1 for the 2nd dimension proves that it works ``` ## Memory-efficient attention Diffusers supports multiple memory-efficient attention backends (FlashAttention, xFormers, SageAttention, and more) through [`~ModelMixin.set_attention_backend`]. Refer to the [Attention backends](./attention_backends) guide to learn how to switch between them. ================================================ FILE: docs/source/en/optimization/mps.md ================================================ # Metal Performance Shaders (MPS) > [!TIP] > Pipelines with an MPS badge indicate a model can take advantage of the MPS backend on Apple silicon devices for faster inference. Feel free to open a [Pull Request](https://github.com/huggingface/diffusers/compare) to add this badge to pipelines that are missing it. 🤗 Diffusers is compatible with Apple silicon (M1/M2 chips) using the PyTorch [`mps`](https://pytorch.org/docs/stable/notes/mps.html) device, which uses the Metal framework to leverage the GPU on macOS devices.
You'll need to have: - macOS computer with Apple silicon (M1/M2) hardware - macOS 12.6 or later (13.0 or later recommended) - arm64 version of Python - [PyTorch 2.0](https://pytorch.org/get-started/locally/) (recommended) or 1.13 (minimum version supported for `mps`) The `mps` backend uses PyTorch's `.to()` interface to move the Stable Diffusion pipeline on to your M1 or M2 device: ```python from diffusers import DiffusionPipeline pipe = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5") pipe = pipe.to("mps") # Recommended if your computer has < 64 GB of RAM pipe.enable_attention_slicing() prompt = "a photo of an astronaut riding a horse on mars" image = pipe(prompt).images[0] image ``` > [!WARNING] > The PyTorch [mps](https://pytorch.org/docs/stable/notes/mps.html) backend does not support NDArray sizes greater than `2**32`. Please open an [Issue](https://github.com/huggingface/diffusers/issues/new/choose) if you encounter this problem so we can investigate. If you're using **PyTorch 1.13**, you need to "prime" the pipeline with an additional one-time pass through it. This is a temporary workaround for an issue where the first inference pass produces slightly different results than subsequent ones. You only need to do this pass once, and after just one inference step you can discard the result. ```diff from diffusers import DiffusionPipeline pipe = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5").to("mps") pipe.enable_attention_slicing() prompt = "a photo of an astronaut riding a horse on mars" # First-time "warmup" pass if PyTorch version is 1.13 + _ = pipe(prompt, num_inference_steps=1) # Results match those from the CPU device after the warmup pass. image = pipe(prompt).images[0] ``` ## Troubleshoot This section lists some common issues with using the `mps` backend and how to solve them. ### Attention slicing M1/M2 performance is very sensitive to memory pressure. 
When this occurs, the system automatically swaps memory if it needs to, which significantly degrades performance. To prevent this from happening, we recommend *attention slicing* to reduce memory pressure during inference and prevent swapping. This is especially relevant if your computer has less than 64GB of system RAM, or if you generate images at non-standard resolutions larger than 512×512 pixels. Call the [`~DiffusionPipeline.enable_attention_slicing`] function on your pipeline: ```py from diffusers import DiffusionPipeline import torch pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True).to("mps") pipeline.enable_attention_slicing() ``` Attention slicing performs the costly attention operation in multiple steps instead of all at once. It usually improves performance by ~20% in computers without unified memory, but we've observed *better performance* in most Apple silicon computers unless you have 64GB of RAM or more. ### Batch inference Generating multiple prompts in a batch can crash or fail to work reliably. If this is the case, try iterating instead of batching. ================================================ FILE: docs/source/en/optimization/neuron.md ================================================ # AWS Neuron Diffusers functionalities are available on [AWS Inf2 instances](https://aws.amazon.com/ec2/instance-types/inf2/), which are EC2 instances powered by [Neuron machine learning accelerators](https://aws.amazon.com/machine-learning/inferentia/). These instances aim to provide better compute performance (higher throughput, lower latency) with good cost-efficiency, making them good candidates for AWS users to deploy diffusion models to production.
[Optimum Neuron](https://huggingface.co/docs/optimum-neuron/en/index) is the interface between Hugging Face libraries and AWS Accelerators, including AWS [Trainium](https://aws.amazon.com/machine-learning/trainium/) and AWS [Inferentia](https://aws.amazon.com/machine-learning/inferentia/). It supports many of the features in Diffusers with similar APIs, so it is easier to learn if you're already familiar with Diffusers. Once you have created an AWS Inf2 instance, install Optimum Neuron. ```bash python -m pip install --upgrade-strategy eager optimum[neuronx] ``` > [!TIP] > We provide a pre-built [Hugging Face Neuron Deep Learning AMI](https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2) (DLAMI) and Optimum Neuron containers for Amazon SageMaker. We recommend using them so your environment is set up correctly. The example below demonstrates how to generate images with the Stable Diffusion XL model on an inf2.8xlarge instance (you can switch to cheaper inf2.xlarge instances once the model is compiled). To generate some images, use the [`~optimum.neuron.NeuronStableDiffusionXLPipeline`] class, which is similar to the [`StableDiffusionXLPipeline`] class in Diffusers. Unlike Diffusers, you need to compile models in the pipeline to the Neuron format, `.neuron`. Launch the following command to export the model to the `.neuron` format. ```bash optimum-cli export neuron --model stabilityai/stable-diffusion-xl-base-1.0 \ --batch_size 1 \ --height 1024 `# height in pixels of generated image, e.g. 768, 1024` \ --width 1024 `# width in pixels of generated image, e.g. 768, 1024` \ --num_images_per_prompt 1 `# number of images to generate per prompt, defaults to 1` \ --auto_cast matmul `# cast only matrix multiplication operations` \ --auto_cast_type bf16 `# cast operations from FP32 to BF16` \ sd_neuron_xl/ ``` Now generate some images with the pre-compiled SDXL model.
```python >>> from optimum.neuron import NeuronStableDiffusionXLPipeline >>> stable_diffusion_xl = NeuronStableDiffusionXLPipeline.from_pretrained("sd_neuron_xl/") >>> prompt = "a pig with wings flying in floating US dollar banknotes in the air, skyscrapers behind, warm color palette, muted colors, detailed, 8k" >>> image = stable_diffusion_xl(prompt).images[0] ``` [Image: "peggy", generated by SDXL on Inf2] Feel free to check out more guides and examples on different use cases from the Optimum Neuron [documentation](https://huggingface.co/docs/optimum-neuron/en/inference_tutorials/stable_diffusion#generate-images-with-stable-diffusion-models-on-aws-inferentia)! ================================================ FILE: docs/source/en/optimization/onnx.md ================================================ # ONNX Runtime 🤗 [Optimum](https://github.com/huggingface/optimum) provides a Stable Diffusion pipeline compatible with ONNX Runtime. You'll need to install 🤗 Optimum with the following command for ONNX Runtime support: ```bash pip install -q optimum["onnxruntime"] ``` This guide will show you how to use the Stable Diffusion and Stable Diffusion XL (SDXL) pipelines with ONNX Runtime. ## Stable Diffusion To load and run inference, use the [`~optimum.onnxruntime.ORTStableDiffusionPipeline`]. If you want to load a PyTorch model and convert it to the ONNX format on-the-fly, set `export=True`: ```python from optimum.onnxruntime import ORTStableDiffusionPipeline model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5" pipeline = ORTStableDiffusionPipeline.from_pretrained(model_id, export=True) prompt = "sailing ship in storm by Leonardo da Vinci" image = pipeline(prompt).images[0] pipeline.save_pretrained("./onnx-stable-diffusion-v1-5") ``` > [!WARNING] > Generating multiple prompts in a batch seems to take too much memory. While we look into it, you may need to iterate instead of batching.
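
One way to iterate instead of batching is to call the pipeline once per prompt and collect the results. The helper below is a hypothetical sketch, not part of Optimum or Diffusers: `generate_one_by_one` only assumes that the pipeline is callable with a single prompt and returns an object with an `.images` list, as in the example above.

```python
from typing import Callable, List, Sequence


def generate_one_by_one(pipeline: Callable, prompts: Sequence[str]) -> List:
    """Run the pipeline once per prompt and collect the first image of each call."""
    images = []
    for prompt in prompts:
        # A batch of one keeps peak memory at the single-prompt level.
        result = pipeline(prompt)
        images.append(result.images[0])
    return images


# Usage with a loaded pipeline (assumed to exist):
# images = generate_one_by_one(pipeline, ["a red sailboat", "a blue sailboat"])
```

This takes roughly as long as the number of prompts times the single-prompt latency, but memory usage stays flat.
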
To export the pipeline in the ONNX format offline and use it later for inference, use the [`optimum-cli export`](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) command: ```bash optimum-cli export onnx --model stable-diffusion-v1-5/stable-diffusion-v1-5 sd_v15_onnx/ ``` Then to perform inference (you don't have to specify `export=True` again): ```python from optimum.onnxruntime import ORTStableDiffusionPipeline model_id = "sd_v15_onnx" pipeline = ORTStableDiffusionPipeline.from_pretrained(model_id) prompt = "sailing ship in storm by Leonardo da Vinci" image = pipeline(prompt).images[0] ```
You can find more examples in the 🤗 Optimum [documentation](https://huggingface.co/docs/optimum/), and Stable Diffusion is supported for text-to-image, image-to-image, and inpainting. ## Stable Diffusion XL To load and run inference with SDXL, use the [`~optimum.onnxruntime.ORTStableDiffusionXLPipeline`]: ```python from optimum.onnxruntime import ORTStableDiffusionXLPipeline model_id = "stabilityai/stable-diffusion-xl-base-1.0" pipeline = ORTStableDiffusionXLPipeline.from_pretrained(model_id) prompt = "sailing ship in storm by Leonardo da Vinci" image = pipeline(prompt).images[0] ``` To export the pipeline in the ONNX format and use it later for inference, use the [`optimum-cli export`](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) command: ```bash optimum-cli export onnx --model stabilityai/stable-diffusion-xl-base-1.0 --task stable-diffusion-xl sd_xl_onnx/ ``` SDXL in the ONNX format is supported for text-to-image and image-to-image. ================================================ FILE: docs/source/en/optimization/open_vino.md ================================================ # OpenVINO 🤗 [Optimum](https://github.com/huggingface/optimum-intel) provides Stable Diffusion pipelines compatible with OpenVINO to perform inference on a variety of Intel processors (see the [full list](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html) of supported devices). You'll need to install 🤗 Optimum Intel with the `--upgrade-strategy eager` option to ensure [`optimum-intel`](https://github.com/huggingface/optimum-intel) is using the latest version: ```bash pip install --upgrade-strategy eager optimum["openvino"] ``` This guide will show you how to use the Stable Diffusion and Stable Diffusion XL (SDXL) pipelines with OpenVINO. ## Stable Diffusion To load and run inference, use the [`~optimum.intel.OVStableDiffusionPipeline`].
If you want to load a PyTorch model and convert it to the OpenVINO format on-the-fly, set `export=True`: ```python from optimum.intel import OVStableDiffusionPipeline model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5" pipeline = OVStableDiffusionPipeline.from_pretrained(model_id, export=True) prompt = "sailing ship in storm by Rembrandt" image = pipeline(prompt).images[0] # Don't forget to save the exported model pipeline.save_pretrained("openvino-sd-v1-5") ``` To further speed up inference, statically reshape the model. If you change any parameters such as the output height or width, you'll need to statically reshape your model again. ```python # Define the shapes related to the inputs and desired outputs batch_size, num_images, height, width = 1, 1, 512, 512 # Statically reshape the model pipeline.reshape(batch_size, height, width, num_images) # Compile the model before inference pipeline.compile() image = pipeline( prompt, height=height, width=width, num_images_per_prompt=num_images, ).images[0] ```
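
Because a static reshape has to be repeated whenever the output size changes, it can help to track the current shape and only reshape and recompile when it differs. The wrapper below is a hypothetical sketch under that assumption: `StaticShapeRunner` is not an Optimum class, and it only relies on the `reshape`, `compile`, and call signatures used in the example above.

```python
class StaticShapeRunner:
    """Reshape and recompile a pipeline only when the requested static shape changes."""

    def __init__(self, pipeline):
        self.pipeline = pipeline
        self.current_shape = None  # no static shape applied yet

    def __call__(self, prompt, batch_size=1, num_images=1, height=512, width=512):
        shape = (batch_size, height, width, num_images)
        if shape != self.current_shape:
            # Same calls and argument order as the example above.
            self.pipeline.reshape(batch_size, height, width, num_images)
            self.pipeline.compile()
            self.current_shape = shape
        return self.pipeline(
            prompt, height=height, width=width, num_images_per_prompt=num_images
        )


# Usage with a loaded OpenVINO pipeline (assumed to exist):
# runner = StaticShapeRunner(pipeline)
# image = runner("sailing ship in storm by Rembrandt", height=512, width=512).images[0]
```

Repeated calls at the same resolution skip the reshape and compile steps, so the extra cost is only paid when the size actually changes.
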
You can find more examples in the 🤗 Optimum [documentation](https://huggingface.co/docs/optimum/intel/inference#stable-diffusion), and Stable Diffusion is supported for text-to-image, image-to-image, and inpainting. ## Stable Diffusion XL To load and run inference with SDXL, use the [`~optimum.intel.OVStableDiffusionXLPipeline`]: ```python from optimum.intel import OVStableDiffusionXLPipeline model_id = "stabilityai/stable-diffusion-xl-base-1.0" pipeline = OVStableDiffusionXLPipeline.from_pretrained(model_id) prompt = "sailing ship in storm by Rembrandt" image = pipeline(prompt).images[0] ``` To further speed up inference, [statically reshape](#stable-diffusion) the model as shown in the Stable Diffusion section. You can find more examples in the 🤗 Optimum [documentation](https://huggingface.co/docs/optimum/intel/inference#stable-diffusion-xl), and running SDXL in OpenVINO is supported for text-to-image and image-to-image. ================================================ FILE: docs/source/en/optimization/para_attn.md ================================================ # ParaAttention
Large image and video generation models, such as [FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) and [HunyuanVideo](https://huggingface.co/tencent/HunyuanVideo), can be challenging to run in real-time applications and deployments because of their size. [ParaAttention](https://github.com/chengzeyi/ParaAttention) is a library that implements **context parallelism** and **first block cache**, and can be combined with other techniques (torch.compile, fp8 dynamic quantization) to accelerate inference. This guide will show you how to apply ParaAttention to FLUX.1-dev and HunyuanVideo on NVIDIA L20 GPUs. No optimizations are applied for our baseline benchmark, except for HunyuanVideo to avoid out-of-memory errors. Our baseline benchmark shows that FLUX.1-dev is able to generate a 1024x1024 resolution image in 28 steps in 26.36 seconds, and HunyuanVideo is able to generate 129 frames at 720p resolution in 30 steps in 3675.71 seconds. > [!TIP] > For even faster inference with context parallelism, try using NVIDIA A100 or H100 GPUs (if available) with NVLink support, especially when there is a large number of GPUs. ## First Block Cache Caching the output of the transformer blocks in the model and reusing it in the next inference steps reduces the computation cost and makes inference faster. However, it is hard to decide when the cache can be reused without hurting the quality of the generated images or videos. ParaAttention directly uses the **residual difference of the first transformer block output** to approximate the difference among model outputs. When the difference is small enough, the residual difference of previous inference steps is reused. In other words, the denoising step is skipped. This achieves a 2x speedup on FLUX.1-dev and HunyuanVideo inference with very good quality.
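
The reuse decision can be pictured as a threshold test on how much the first block's residual changed between steps. The sketch below is a simplified pure-Python illustration, not ParaAttention's code: the function names are hypothetical, the relative mean-absolute-difference metric is an assumption about how "small enough" is measured, and plain lists stand in for tensors.

```python
from typing import Sequence


def relative_residual_diff(prev: Sequence[float], curr: Sequence[float]) -> float:
    """Mean absolute change between two residuals, relative to the previous one.

    Assumes `prev` is non-empty and not all zeros.
    """
    mean_abs_diff = sum(abs(c - p) for p, c in zip(prev, curr)) / len(prev)
    mean_abs_prev = sum(abs(p) for p in prev) / len(prev)
    return mean_abs_diff / mean_abs_prev


def can_reuse_cache(prev: Sequence[float], curr: Sequence[float], threshold: float) -> bool:
    """Reuse the cached result when the first-block residual barely changed."""
    return relative_residual_diff(prev, curr) < threshold


prev_residual = [1.0, -2.0, 3.0, -4.0]
small_change = [1.02, -2.02, 3.02, -4.02]  # about 0.8% relative change
large_change = [1.5, -2.5, 3.5, -4.5]      # about 20% relative change

reuse_small = can_reuse_cache(prev_residual, small_change, threshold=0.08)
reuse_large = can_reuse_cache(prev_residual, large_change, threshold=0.08)
```

With a threshold of 0.08, the first case would skip the remaining blocks and reuse the cache, while the second would run the full step. A higher threshold skips more steps at the cost of drifting further from the uncached output.
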
Figure: Cache in Diffusion Transformer. How AdaCache works; First Block Cache is a variant of it.
To apply First Block Cache on FLUX.1-dev, call `apply_cache_on_pipe` as shown below. 0.08 is the default residual difference threshold for FLUX models.

```python
import time
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe

apply_cache_on_pipe(pipe, residual_diff_threshold=0.08)

# Enable memory savings
# pipe.enable_model_cpu_offload()
# pipe.enable_sequential_cpu_offload()

begin = time.time()
image = pipe(
    "A cat holding a sign that says hello world",
    num_inference_steps=28,
).images[0]
end = time.time()
print(f"Time: {end - begin:.2f}s")

print("Saving image to flux.png")
image.save("flux.png")
```

| Optimizations | Original | FBCache rdt=0.06 | FBCache rdt=0.08 | FBCache rdt=0.10 | FBCache rdt=0.12 |
| - | - | - | - | - | - |
| Preview | ![Original](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/flux-original.png) | ![FBCache rdt=0.06](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/flux-fbc-0.06.png) | ![FBCache rdt=0.08](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/flux-fbc-0.08.png) | ![FBCache rdt=0.10](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/flux-fbc-0.10.png) | ![FBCache rdt=0.12](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/para-attn/flux-fbc-0.12.png) |
| Wall Time (s) | 26.36 | 21.83 | 17.01 | 16.00 | 13.78 |

First Block Cache reduced the inference time from the 26.36-second baseline to 17.01 seconds, a 1.55x speedup, while maintaining nearly zero quality loss.

To apply First Block Cache on HunyuanVideo, call `apply_cache_on_pipe` as shown below. 0.06 is the default residual difference threshold for HunyuanVideo models.
```python
import time
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

model_id = "tencent/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
    revision="refs/pr/18",
)
pipe = HunyuanVideoPipeline.from_pretrained(
    model_id,
    transformer=transformer,
    torch_dtype=torch.float16,
    revision="refs/pr/18",
).to("cuda")

from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe

# 0.06 is the default residual difference threshold for HunyuanVideo
apply_cache_on_pipe(pipe, residual_diff_threshold=0.06)

pipe.vae.enable_tiling()

begin = time.time()
output = pipe(
    prompt="A cat walks on the grass, realistic",
    height=720,
    width=1280,
    num_frames=129,
    num_inference_steps=30,
).frames[0]
end = time.time()
print(f"Time: {end - begin:.2f}s")

print("Saving video to hunyuan_video.mp4")
export_to_video(output, "hunyuan_video.mp4", fps=15)
```

Video comparison: HunyuanVideo without FBCache and HunyuanVideo with FBCache.

First Block Cache reduced the inference time from the 3675.71-second baseline to 2271.06 seconds, a 1.62x speedup, while maintaining nearly zero quality loss.

## fp8 quantization

fp8 with dynamic quantization further speeds up inference and reduces memory usage. Both the activations and weights must be quantized in order to use the 8-bit [NVIDIA Tensor Cores](https://www.nvidia.com/en-us/data-center/tensor-cores/).

Use `float8_weight_only` and `float8_dynamic_activation_float8_weight` to quantize the text encoder and transformer model, respectively.

The default quantization method is per-tensor quantization, but if your GPU supports row-wise quantization, you can also try it for better accuracy.

Install [torchao](https://github.com/pytorch/ao/tree/main) with the command below.
```bash
pip3 install -U torch torchao
```

[torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) with `mode="max-autotune-no-cudagraphs"` or `mode="max-autotune"` selects the best kernel for performance. The first call to the model triggers compilation, which can take a long time, but later calls reuse the compiled model and run faster.

This example quantizes the transformer with dynamic activation and weight quantization, and the text encoder with weight-only quantization to reduce memory usage even more.

> [!TIP]
> Dynamic quantization can significantly change the distribution of the model output, so you need to change the `residual_diff_threshold` to a larger value for it to take effect.

```python
import time
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe

apply_cache_on_pipe(
    pipe,
    residual_diff_threshold=0.12,  # Use a larger value to make the cache take effect
)

from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight, float8_weight_only

quantize_(pipe.text_encoder, float8_weight_only())
quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())
pipe.transformer = torch.compile(
    pipe.transformer, mode="max-autotune-no-cudagraphs",
)

# Enable memory savings
# pipe.enable_model_cpu_offload()
# pipe.enable_sequential_cpu_offload()

# The first iteration includes compilation and serves as a warm-up
for i in range(2):
    begin = time.time()
    image = pipe(
        "A cat holding a sign that says hello world",
        num_inference_steps=28,
    ).images[0]
    end = time.time()
    if i == 0:
        print(f"Warm up time: {end - begin:.2f}s")
    else:
        print(f"Time: {end - begin:.2f}s")

print("Saving image to flux.png")
image.save("flux.png")
```

fp8 dynamic quantization and torch.compile reduced the inference time from the 26.36-second baseline to 7.56 seconds, or 3.48x faster.
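The speedup figures in this guide are the baseline wall time divided by the optimized wall time. As a minimal sketch of that arithmetic, using the 26.36-second FLUX.1-dev baseline and the 7.56-second result quoted above (the helper name `speedup` is hypothetical and only used for illustration):

```python
def speedup(baseline_s: float, optimized_s: float) -> float:
    """Return how many times faster the optimized run is than the baseline."""
    return baseline_s / optimized_s


flux_baseline_s = 26.36  # FLUX.1-dev baseline wall time from this guide
flux_fp8_fbc_s = 7.56    # FBCache (rdt=0.12) + fp8 dynamic quantization + torch.compile

flux_speedup = speedup(flux_baseline_s, flux_fp8_fbc_s)
seconds_saved = flux_baseline_s - flux_fp8_fbc_s

print(f"Speedup: {flux_speedup:.4f}x, seconds saved per image: {seconds_saved:.2f}")
```

The ratio falls between 3.48 and 3.49, which the guide reports as 3.48x. The same helper applies to any pair of wall times you measure on your own hardware.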
```python import time import torch from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel from diffusers.utils import export_to_video model_id = "tencent/HunyuanVideo" transformer = HunyuanVideoTransformer3DModel.from_pretrained( model_id, subfolder="transformer", torch_dtype=torch.bfloat16, revision="refs/pr/18", ) pipe = HunyuanVideoPipeline.from_pretrained( model_id, transformer=transformer, torch_dtype=torch.float16, revision="refs/pr/18", ).to("cuda") from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe apply_cache_on_pipe(pipe) from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight, float8_weight_only quantize_(pipe.text_encoder, float8_weight_only()) quantize_(pipe.transformer, float8_dynamic_activation_float8_weight()) pipe.transformer = torch.compile( pipe.transformer, mode="max-autotune-no-cudagraphs", ) # Enable memory savings pipe.vae.enable_tiling() # pipe.enable_model_cpu_offload() # pipe.enable_sequential_cpu_offload() for i in range(2): begin = time.time() output = pipe( prompt="A cat walks on the grass, realistic", height=720, width=1280, num_frames=129, num_inference_steps=1 if i == 0 else 30, ).frames[0] end = time.time() if i == 0: print(f"Warm up time: {end - begin:.2f}s") else: print(f"Time: {end - begin:.2f}s") print("Saving video to hunyuan_video.mp4") export_to_video(output, "hunyuan_video.mp4", fps=15) ``` A NVIDIA L20 GPU only has 48GB memory and could face out-of-memory (OOM) errors after compilation and if `enable_model_cpu_offload` isn't called because HunyuanVideo has very large activation tensors when running with high resolution and large number of frames. For GPUs with less than 80GB of memory, you can try reducing the resolution and number of frames to avoid OOM errors. Large video generation models are usually bottlenecked by the attention computations rather than the fully connected layers. 
These models don't significantly benefit from quantization and torch.compile. ## Context Parallelism Context Parallelism parallelizes inference and scales with multiple GPUs. The ParaAttention compositional design allows you to combine Context Parallelism with First Block Cache and dynamic quantization. > [!TIP] > Refer to the [ParaAttention](https://github.com/chengzeyi/ParaAttention/tree/main) repository for detailed instructions and examples of how to scale inference with multiple GPUs. If the inference process needs to be persistent and serviceable, it is suggested to use [torch.multiprocessing](https://pytorch.org/docs/stable/multiprocessing.html) to write your own inference processor. This can eliminate the overhead of launching the process and loading and recompiling the model. The code sample below combines First Block Cache, fp8 dynamic quantization, torch.compile, and Context Parallelism for the fastest inference speed. ```python import time import torch import torch.distributed as dist from diffusers import FluxPipeline dist.init_process_group() torch.cuda.set_device(dist.get_rank()) pipe = FluxPipeline.from_pretrained( "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16, ).to("cuda") from para_attn.context_parallel import init_context_parallel_mesh from para_attn.context_parallel.diffusers_adapters import parallelize_pipe from para_attn.parallel_vae.diffusers_adapters import parallelize_vae mesh = init_context_parallel_mesh( pipe.device.type, max_ring_dim_size=2, ) parallelize_pipe( pipe, mesh=mesh, ) parallelize_vae(pipe.vae, mesh=mesh._flatten()) from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe apply_cache_on_pipe( pipe, residual_diff_threshold=0.12, # Use a larger value to make the cache take effect ) from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight, float8_weight_only quantize_(pipe.text_encoder, float8_weight_only()) quantize_(pipe.transformer, 
float8_dynamic_activation_float8_weight()) torch._inductor.config.reorder_for_compute_comm_overlap = True pipe.transformer = torch.compile( pipe.transformer, mode="max-autotune-no-cudagraphs", ) # Enable memory savings # pipe.enable_model_cpu_offload(gpu_id=dist.get_rank()) # pipe.enable_sequential_cpu_offload(gpu_id=dist.get_rank()) for i in range(2): begin = time.time() image = pipe( "A cat holding a sign that says hello world", num_inference_steps=28, output_type="pil" if dist.get_rank() == 0 else "pt", ).images[0] end = time.time() if dist.get_rank() == 0: if i == 0: print(f"Warm up time: {end - begin:.2f}s") else: print(f"Time: {end - begin:.2f}s") if dist.get_rank() == 0: print("Saving image to flux.png") image.save("flux.png") dist.destroy_process_group() ``` Save to `run_flux.py` and launch it with [torchrun](https://pytorch.org/docs/stable/elastic/run.html). ```bash # Use --nproc_per_node to specify the number of GPUs torchrun --nproc_per_node=2 run_flux.py ``` Inference speed is reduced to 8.20 seconds compared to the baseline, or 3.21x faster, with 2 NVIDIA L20 GPUs. On 4 L20s, inference speed is 3.90 seconds, or 6.75x faster. The code sample below combines First Block Cache and Context Parallelism for the fastest inference speed. 
```python
import time
import torch
import torch.distributed as dist
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

dist.init_process_group()

torch.cuda.set_device(dist.get_rank())

model_id = "tencent/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
    revision="refs/pr/18",
)
pipe = HunyuanVideoPipeline.from_pretrained(
    model_id,
    transformer=transformer,
    torch_dtype=torch.float16,
    revision="refs/pr/18",
).to("cuda")

from para_attn.context_parallel import init_context_parallel_mesh
from para_attn.context_parallel.diffusers_adapters import parallelize_pipe
from para_attn.parallel_vae.diffusers_adapters import parallelize_vae

mesh = init_context_parallel_mesh(
    pipe.device.type,
)
parallelize_pipe(
    pipe,
    mesh=mesh,
)
parallelize_vae(pipe.vae, mesh=mesh._flatten())

from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe

apply_cache_on_pipe(pipe)

# from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight, float8_weight_only
#
# torch._inductor.config.reorder_for_compute_comm_overlap = True
#
# quantize_(pipe.text_encoder, float8_weight_only())
# quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())
# pipe.transformer = torch.compile(
#    pipe.transformer, mode="max-autotune-no-cudagraphs",
# )

# Enable memory savings
pipe.vae.enable_tiling()
# pipe.enable_model_cpu_offload(gpu_id=dist.get_rank())
# pipe.enable_sequential_cpu_offload(gpu_id=dist.get_rank())

# The first iteration runs a single step as a warm-up
for i in range(2):
    begin = time.time()
    output = pipe(
        prompt="A cat walks on the grass, realistic",
        height=720,
        width=1280,
        num_frames=129,
        num_inference_steps=1 if i == 0 else 30,
        output_type="pil" if dist.get_rank() == 0 else "pt",
    ).frames[0]
    end = time.time()
    if dist.get_rank() == 0:
        if i == 0:
            print(f"Warm up time: {end - begin:.2f}s")
        else:
            print(f"Time: {end - begin:.2f}s")

# Only rank 0 holds the decoded frames and writes the video
if dist.get_rank() == 0:
    print("Saving video to hunyuan_video.mp4")
    export_to_video(output, "hunyuan_video.mp4", fps=15)

dist.destroy_process_group()
```

Save to `run_hunyuan_video.py` and launch it with [torchrun](https://pytorch.org/docs/stable/elastic/run.html).

```bash
# Use --nproc_per_node to specify the number of GPUs
torchrun --nproc_per_node=8 run_hunyuan_video.py
```

Inference time is reduced to 649.23 seconds, 5.66x faster than the baseline, with 8 NVIDIA L20 GPUs.

## Benchmarks

FLUX.1-dev

| GPU Type | Number of GPUs | Optimizations | Wall Time (s) | Speedup |
| - | - | - | - | - |
| NVIDIA L20 | 1 | Baseline | 26.36 | 1.00x |
| NVIDIA L20 | 1 | FBCache (rdt=0.08) | 17.01 | 1.55x |
| NVIDIA L20 | 1 | FP8 DQ | 13.40 | 1.96x |
| NVIDIA L20 | 1 | FBCache (rdt=0.12) + FP8 DQ | 7.56 | 3.48x |
| NVIDIA L20 | 2 | FBCache (rdt=0.12) + FP8 DQ + CP | 4.92 | 5.35x |
| NVIDIA L20 | 4 | FBCache (rdt=0.12) + FP8 DQ + CP | 3.90 | 6.75x |

HunyuanVideo

| GPU Type | Number of GPUs | Optimizations | Wall Time (s) | Speedup |
| - | - | - | - | - |
| NVIDIA L20 | 1 | Baseline | 3675.71 | 1.00x |
| NVIDIA L20 | 1 | FBCache | 2271.06 | 1.62x |
| NVIDIA L20 | 2 | FBCache + CP | 1132.90 | 3.24x |
| NVIDIA L20 | 4 | FBCache + CP | 718.15 | 5.12x |
| NVIDIA L20 | 8 | FBCache + CP | 649.23 | 5.66x |

================================================
FILE: docs/source/en/optimization/pruna.md
================================================

# Pruna

[Pruna](https://github.com/PrunaAI/pruna) is a model optimization framework that offers various optimization methods - quantization, pruning, caching, compilation - for accelerating inference and reducing memory usage. A general overview of the optimization methods is shown below.
| Technique | Description | Speed | Memory | Quality |
|--------------|-------------|:-----:|:------:|:-------:|
| `batcher` | Groups multiple inputs together to be processed simultaneously, improving computational efficiency and reducing processing time. | ✅ | ❌ | ➖ |
| `cacher` | Stores intermediate results of computations to speed up subsequent operations. | ✅ | ➖ | ➖ |
| `compiler` | Optimises the model with instructions for specific hardware. | ✅ | ➖ | ➖ |
| `distiller` | Trains a smaller, simpler model to mimic a larger, more complex model. | ✅ | ✅ | ❌ |
| `quantizer` | Reduces the precision of weights and activations, lowering memory requirements. | ✅ | ✅ | ❌ |
| `pruner` | Removes less important or redundant connections and neurons, resulting in a sparser, more efficient network. | ✅ | ✅ | ❌ |
| `recoverer` | Restores the performance of a model after compression. | ➖ | ➖ | ✅ |
| `factorizer` | Factorization batches several small matrix multiplications into one large fused operation. | ✅ | ➖ | ➖ |
| `enhancer` | Enhances the model output by applying post-processing algorithms such as denoising or upscaling. | ❌ | ➖ | ✅ |

✅ (improves), ➖ (approx. the same), ❌ (worsens)

Explore the full range of optimization methods in the [Pruna documentation](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/configure.html#configure-algorithms).

## Installation

Install Pruna with the following command.

```bash
pip install pruna
```

## Optimize Diffusers models

A broad range of optimization algorithms is supported for Diffusers models as shown below.
Overview of the supported optimization algorithms for diffusers models
The example below optimizes [black-forest-labs/FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) with a combination of factorizer, compiler, and cacher algorithms. This combination accelerates inference by up to 4.2x and cuts peak GPU memory usage from 34.7GB to 28.0GB, all while maintaining virtually the same output quality. > [!TIP] > Refer to the [Pruna optimization](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/configure.html) docs to learn more about the optimization techniques used in this example.
Optimization techniques used for FLUX.1-dev showing the combination of factorizer, compiler, and cacher algorithms
Start by defining a `SmashConfig` with the optimization algorithms to use. To optimize the model, wrap the pipeline and the `SmashConfig` with `smash` and then use the pipeline as normal for inference. ```python import torch from diffusers import FluxPipeline from pruna import PrunaModel, SmashConfig, smash # load the model # Try segmind/Segmind-Vega or black-forest-labs/FLUX.1-schnell with a small GPU memory pipe = FluxPipeline.from_pretrained( "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16 ).to("cuda") # define the configuration smash_config = SmashConfig() smash_config["factorizer"] = "qkv_diffusers" smash_config["compiler"] = "torch_compile" smash_config["torch_compile_target"] = "module_list" smash_config["cacher"] = "fora" smash_config["fora_interval"] = 2 # for the best results in terms of speed you can add these configs # however they will increase your warmup time from 1.5 min to 10 min # smash_config["torch_compile_mode"] = "max-autotune-no-cudagraphs" # smash_config["quantizer"] = "torchao" # smash_config["torchao_quant_type"] = "fp8dq" # smash_config["torchao_excluded_modules"] = "norm+embedding" # optimize the model smashed_pipe = smash(pipe, smash_config) # run the model smashed_pipe("a knitted purple prune").images[0] ```
After optimization, we can share and load the optimized model using the Hugging Face Hub.

```python
# save the model
smashed_pipe.save_to_hub("/FLUX.1-dev-smashed")

# load the model
smashed_pipe = PrunaModel.from_hub("/FLUX.1-dev-smashed")
```

## Evaluate and benchmark Diffusers models

Pruna provides the [EvaluationAgent](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/evaluate.html) to evaluate the quality of your optimized models.

We can define the metrics we care about, such as total time and throughput, and the dataset to evaluate on. We can then define a model and pass it to the `EvaluationAgent`.

To evaluate an optimized model, load it, create a `Task` from the metrics and the dataset, pass the `Task` to the `EvaluationAgent`, and call `evaluate` on the model.

```python
import torch
from diffusers import FluxPipeline

from pruna import PrunaModel
from pruna.data.pruna_datamodule import PrunaDataModule
from pruna.evaluation.evaluation_agent import EvaluationAgent
from pruna.evaluation.metrics import (
    ThroughputMetric,
    TorchMetricWrapper,
    TotalTimeMetric,
)
from pruna.evaluation.task import Task

# define the device
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"

# load the model
# Try PrunaAI/Segmind-Vega-smashed or PrunaAI/FLUX.1-dev-smashed with a small GPU memory
smashed_pipe = PrunaModel.from_hub("PrunaAI/FLUX.1-dev-smashed")

# Define the metrics
metrics = [
    TotalTimeMetric(n_iterations=20, n_warmup_iterations=5),
    ThroughputMetric(n_iterations=20, n_warmup_iterations=5),
    TorchMetricWrapper("clip"),
]

# Define the datamodule
datamodule = PrunaDataModule.from_string("LAION256")
datamodule.limit_datasets(10)

# Define the task and evaluation agent
task = Task(metrics, datamodule=datamodule, device=device)
eval_agent = EvaluationAgent(task)

# Evaluate smashed model and offload it to CPU
smashed_pipe.move_to_device(device)
smashed_pipe_results = eval_agent.evaluate(smashed_pipe)
smashed_pipe.move_to_device("cpu")
```

Instead of evaluating only the optimized model, you can also evaluate the standalone `diffusers` model. This is useful if you want to measure the performance of the model without the optimization and use it as a point of comparison.

We can do so by wrapping the pipeline in a `PrunaModel` and running the `EvaluationAgent` on it.

```python
import torch
from diffusers import FluxPipeline

from pruna import PrunaModel

# load the model
# Try segmind/Segmind-Vega or black-forest-labs/FLUX.1-schnell with a small GPU memory
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16
).to("cpu")
wrapped_pipe = PrunaModel(model=pipe)
```

Now that you have seen how to optimize and evaluate your models, you can start using Pruna to optimize your own models. Luckily, we have many examples to help you get started.

> [!TIP]
> For more details about benchmarking Flux, check out the [Announcing FLUX-Juiced: The Fastest Image Generation Endpoint (2.6 times faster)!](https://huggingface.co/blog/PrunaAI/flux-fastest-image-generation-endpoint) blog post and the [InferBench](https://huggingface.co/spaces/PrunaAI/InferBench) Space.

## Reference

- [Pruna](https://github.com/pruna-ai/pruna)
- [Pruna optimization](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/configure.html#configure-algorithms)
- [Pruna evaluation](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/evaluate.html)
- [Pruna tutorials](https://docs.pruna.ai/en/stable/docs_pruna/tutorials/index.html)

================================================
FILE: docs/source/en/optimization/speed-memory-optims.md
================================================

# Compiling and offloading quantized models

Optimizing models often involves trade-offs between [inference speed](./fp16) and [memory usage](./memory). For instance, while [caching](./cache) can boost inference speed, it also increases memory consumption since it needs to store the outputs of intermediate attention layers.
A more balanced optimization strategy combines quantizing a model, [torch.compile](./fp16#torchcompile) and various [offloading methods](./memory#offloading).

> [!TIP]
> Check the [torch.compile](./fp16#torchcompile) guide to learn more about compilation and how it can be applied here. For example, regional compilation can significantly reduce compilation time without giving up any speedups.

For image generation, combining quantization and [model offloading](./memory#model-offloading) can often give the best trade-off between quality, speed, and memory. Group offloading is not as effective for image generation because it is usually not possible to *fully* overlap data transfer with computation when the compute kernels finish faster than the transfers. This results in some communication overhead between the CPU and GPU.

For video generation, combining quantization and [group-offloading](./memory#group-offloading) tends to be better because video models are more compute-bound.

The table below compares combinations of optimization strategies and their impact on latency and memory usage for Flux.

| combination | latency (s) | memory usage (GB) |
|---|---|---|
| quantization | 32.602 | 14.9453 |
| quantization, torch.compile | 25.847 | 14.9448 |
| quantization, torch.compile, model CPU offloading | 32.312 | 12.2369 |

These results are benchmarked on Flux with an RTX 4090. The transformer and text_encoder components are quantized. Refer to the benchmarking script if you're interested in evaluating your own model.

This guide will show you how to compile and offload a quantized model with [bitsandbytes](../quantization/bitsandbytes#torchcompile). Make sure you are using [PyTorch nightly](https://pytorch.org/get-started/locally/) and the latest version of bitsandbytes.
```bash pip install -U bitsandbytes ``` ## Quantization and torch.compile Start by [quantizing](../quantization/overview) a model to reduce the memory required for storage and [compiling](./fp16#torchcompile) it to accelerate inference. Configure the [Dynamo](https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html) `capture_dynamic_output_shape_ops = True` to handle dynamic outputs when compiling bitsandbytes models. ```py import torch from diffusers import DiffusionPipeline from diffusers.quantizers import PipelineQuantizationConfig torch._dynamo.config.capture_dynamic_output_shape_ops = True # quantize pipeline_quant_config = PipelineQuantizationConfig( quant_backend="bitsandbytes_4bit", quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16}, components_to_quantize=["transformer", "text_encoder_2"], ) pipeline = DiffusionPipeline.from_pretrained( "black-forest-labs/FLUX.1-dev", quantization_config=pipeline_quant_config, torch_dtype=torch.bfloat16, ).to("cuda") # compile pipeline.transformer.to(memory_format=torch.channels_last) pipeline.transformer.compile(mode="max-autotune", fullgraph=True) pipeline(""" cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain """ ).images[0] ``` ## Quantization, torch.compile, and offloading In addition to quantization and torch.compile, try offloading if you need to reduce memory-usage further. Offloading moves various layers or model components from the CPU to the GPU as needed for computations. Configure the [Dynamo](https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html) `cache_size_limit` during offloading to avoid excessive recompilation and set `capture_dynamic_output_shape_ops = True` to handle dynamic outputs when compiling bitsandbytes models. 
[Model CPU offloading](./memory#model-offloading) moves an individual pipeline component, like the transformer model, to the GPU when it is needed for computation. Otherwise, it is offloaded to the CPU. ```py import torch from diffusers import DiffusionPipeline from diffusers.quantizers import PipelineQuantizationConfig torch._dynamo.config.cache_size_limit = 1000 torch._dynamo.config.capture_dynamic_output_shape_ops = True # quantize pipeline_quant_config = PipelineQuantizationConfig( quant_backend="bitsandbytes_4bit", quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16}, components_to_quantize=["transformer", "text_encoder_2"], ) pipeline = DiffusionPipeline.from_pretrained( "black-forest-labs/FLUX.1-dev", quantization_config=pipeline_quant_config, torch_dtype=torch.bfloat16, ).to("cuda") # model CPU offloading pipeline.enable_model_cpu_offload() # compile pipeline.transformer.compile() pipeline( "cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain" ).images[0] ``` [Group offloading](./memory#group-offloading) moves the internal layers of an individual pipeline component, like the transformer model, to the GPU for computation and offloads it when it's not required. At the same time, it uses the [CUDA stream](./memory#cuda-stream) feature to prefetch the next layer for execution. By overlapping computation and data transfer, it is faster than model CPU offloading while also saving memory. 
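To see why overlapping computation and data transfer helps more for compute-bound models, consider the toy timing model below. It is only a sketch under simplifying assumptions (one layer prefetched at a time, fixed per-layer times); the function names and all numbers are hypothetical and are not measurements of any real model.

```python
def sequential_time(transfer, compute):
    """Each layer is copied to the GPU and then computed, with no overlap."""
    return sum(t + c for t, c in zip(transfer, compute))


def prefetch_time(transfer, compute):
    """While layer i computes, layer i + 1 is copied on a separate stream."""
    ready = transfer[0]  # time at which layer 0 has arrived on the GPU
    compute_end = 0.0
    for i, c in enumerate(compute):
        # A layer can start only once it has arrived and the previous layer is done.
        compute_start = max(ready, compute_end)
        compute_end = compute_start + c
        if i + 1 < len(transfer):
            # The prefetch of the next layer starts together with this compute.
            ready = compute_start + transfer[i + 1]
    return compute_end


layers = 4

# Compute-bound case (closer to video models): compute takes longer than transfer.
compute_bound_seq = sequential_time([2.0] * layers, [3.0] * layers)
compute_bound_pre = prefetch_time([2.0] * layers, [3.0] * layers)

# Transfer-bound case (closer to image models): compute finishes before the next layer arrives.
transfer_bound_seq = sequential_time([3.0] * layers, [1.0] * layers)
transfer_bound_pre = prefetch_time([3.0] * layers, [1.0] * layers)

print(compute_bound_seq, compute_bound_pre)
print(transfer_bound_seq, transfer_bound_pre)
```

In the compute-bound case every transfer after the first is hidden behind computation, so the total is close to pure compute time. In the transfer-bound case the GPU repeatedly waits for the next layer to arrive, which is the communication overhead mentioned earlier in this guide.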
```py # pip install ftfy import torch from diffusers import AutoModel, DiffusionPipeline from diffusers.hooks import apply_group_offloading from diffusers.utils import export_to_video from diffusers.quantizers import PipelineQuantizationConfig from transformers import UMT5EncoderModel torch._dynamo.config.cache_size_limit = 1000 torch._dynamo.config.capture_dynamic_output_shape_ops = True # quantize pipeline_quant_config = PipelineQuantizationConfig( quant_backend="bitsandbytes_4bit", quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16}, components_to_quantize=["transformer", "text_encoder"], ) text_encoder = UMT5EncoderModel.from_pretrained( "Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="text_encoder", torch_dtype=torch.bfloat16 ) pipeline = DiffusionPipeline.from_pretrained( "Wan-AI/Wan2.1-T2V-14B-Diffusers", quantization_config=pipeline_quant_config, torch_dtype=torch.bfloat16, ).to("cuda") # group offloading onload_device = torch.device("cuda") offload_device = torch.device("cpu") pipeline.transformer.enable_group_offload( onload_device=onload_device, offload_device=offload_device, offload_type="leaf_level", use_stream=True, non_blocking=True ) pipeline.vae.enable_group_offload( onload_device=onload_device, offload_device=offload_device, offload_type="leaf_level", use_stream=True, non_blocking=True ) apply_group_offloading( pipeline.text_encoder, onload_device=onload_device, offload_type="leaf_level", use_stream=True, non_blocking=True ) # compile pipeline.transformer.compile() prompt = """ The camera rushes from far to near in a low-angle shot, revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground. Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts dynamic shadows and warm highlights. 
Medium composition, front view, low angle, with depth of field.
"""
negative_prompt = """
Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards
"""

output = pipeline(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(output, "output.mp4", fps=16)
```

================================================
FILE: docs/source/en/optimization/tgate.md
================================================

# T-GATE

[T-GATE](https://github.com/HaozheLiu-ST/T-GATE/tree/main) accelerates inference for [Stable Diffusion](../api/pipelines/stable_diffusion/overview), [PixArt](../api/pipelines/pixart), and [Latent Consistency Model](../api/pipelines/latent_consistency_models.md) pipelines by skipping the cross-attention calculation once it converges. This method doesn't require any additional training and it can speed up inference by 10-50%. T-GATE is also compatible with other optimization methods like [DeepCache](./deepcache).

Before you begin, make sure you install T-GATE.

```bash
pip install tgate
pip install -U torch diffusers transformers accelerate DeepCache
```

To use T-GATE with a pipeline, you need to use its corresponding loader.

| Pipeline | T-GATE Loader |
|---|---|
| PixArt | TgatePixArtLoader |
| Stable Diffusion XL | TgateSDXLLoader |
| Stable Diffusion XL + DeepCache | TgateSDXLDeepCacheLoader |
| Stable Diffusion | TgateSDLoader |
| Stable Diffusion + DeepCache | TgateSDDeepCacheLoader |

Next, create a `TgateLoader` with a pipeline, the gate step (the timestep after which cross-attention is no longer calculated), and the number of inference steps.
Then call the `tgate` method on the pipeline with a prompt, gate step, and the number of inference steps. Let's see how to enable this for several different pipelines. Accelerate `PixArtAlphaPipeline` with T-GATE: ```py import torch from diffusers import PixArtAlphaPipeline from tgate import TgatePixArtLoader pipe = PixArtAlphaPipeline.from_pretrained("PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16) gate_step = 8 inference_step = 25 pipe = TgatePixArtLoader( pipe, gate_step=gate_step, num_inference_steps=inference_step, ).to("cuda") image = pipe.tgate( "An alpaca made of colorful building blocks, cyberpunk.", gate_step=gate_step, num_inference_steps=inference_step, ).images[0] ``` Accelerate `StableDiffusionXLPipeline` with T-GATE: ```py import torch from diffusers import StableDiffusionXLPipeline from diffusers import DPMSolverMultistepScheduler from tgate import TgateSDXLLoader pipe = StableDiffusionXLPipeline.from_pretrained( "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True, ) pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) gate_step = 10 inference_step = 25 pipe = TgateSDXLLoader( pipe, gate_step=gate_step, num_inference_steps=inference_step, ).to("cuda") image = pipe.tgate( "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k.", gate_step=gate_step, num_inference_steps=inference_step ).images[0] ``` Accelerate `StableDiffusionXLPipeline` with [DeepCache](https://github.com/horseee/DeepCache) and T-GATE: ```py import torch from diffusers import StableDiffusionXLPipeline from diffusers import DPMSolverMultistepScheduler from tgate import TgateSDXLDeepCacheLoader pipe = StableDiffusionXLPipeline.from_pretrained( "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True, ) pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) gate_step = 10 inference_step = 25 pipe = 
TgateSDXLDeepCacheLoader( pipe, cache_interval=3, cache_branch_id=0, ).to("cuda") image = pipe.tgate( "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k.", gate_step=gate_step, num_inference_steps=inference_step ).images[0] ``` Accelerate `latent-consistency/lcm-sdxl` with T-GATE: ```py import torch from diffusers import StableDiffusionXLPipeline from diffusers import UNet2DConditionModel, LCMScheduler from diffusers import DPMSolverMultistepScheduler from tgate import TgateSDXLLoader unet = UNet2DConditionModel.from_pretrained( "latent-consistency/lcm-sdxl", torch_dtype=torch.float16, variant="fp16", ) pipe = StableDiffusionXLPipeline.from_pretrained( "stabilityai/stable-diffusion-xl-base-1.0", unet=unet, torch_dtype=torch.float16, variant="fp16", ) pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) gate_step = 1 inference_step = 4 pipe = TgateSDXLLoader( pipe, gate_step=gate_step, num_inference_steps=inference_step, lcm=True ).to("cuda") image = pipe.tgate( "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k.", gate_step=gate_step, num_inference_steps=inference_step ).images[0] ``` T-GATE also supports [`StableDiffusionPipeline`] and [PixArt-alpha/PixArt-LCM-XL-2-1024-MS](https://hf.co/PixArt-alpha/PixArt-LCM-XL-2-1024-MS). 
## Benchmarks

| Model | MACs | Param | Latency | Zero-shot 10K-FID on MS-COCO |
|-----------------------|----------|-----------|---------|---------------------------|
| SD-1.5 | 16.938T | 859.520M | 7.032s | 23.927 |
| SD-1.5 w/ T-GATE | 9.875T | 815.557M | 4.313s | 20.789 |
| SD-2.1 | 38.041T | 865.785M | 16.121s | 22.609 |
| SD-2.1 w/ T-GATE | 22.208T | 815.433M | 9.878s | 19.940 |
| SD-XL | 149.438T | 2.570B | 53.187s | 24.628 |
| SD-XL w/ T-GATE | 84.438T | 2.024B | 27.932s | 22.738 |
| Pixart-Alpha | 107.031T | 611.350M | 61.502s | 38.669 |
| Pixart-Alpha w/ T-GATE | 65.318T | 462.585M | 37.867s | 35.825 |
| DeepCache (SD-XL) | 57.888T | - | 19.931s | 23.755 |
| DeepCache w/ T-GATE | 43.868T | - | 14.666s | 23.999 |
| LCM (SD-XL) | 11.955T | 2.570B | 3.805s | 25.044 |
| LCM w/ T-GATE | 11.171T | 2.024B | 3.533s | 25.028 |
| LCM (Pixart-Alpha) | 8.563T | 611.350M | 4.733s | 36.086 |
| LCM w/ T-GATE | 7.623T | 462.585M | 4.543s | 37.048 |

The latency is tested on an NVIDIA 1080TI, MACs and Params are calculated with [calflops](https://github.com/MrYxJ/calculate-flops.pytorch), and the FID is calculated with [PytorchFID](https://github.com/mseitzer/pytorch-fid).

================================================
FILE: docs/source/en/optimization/tome.md
================================================
# Token merging

[Token merging](https://huggingface.co/papers/2303.17604) (ToMe) progressively merges redundant tokens/patches in the forward pass of a Transformer-based network, which can reduce the inference latency of [`StableDiffusionPipeline`].
Install ToMe from `pip`: ```bash pip install tomesd ``` You can use ToMe from the [`tomesd`](https://github.com/dbolya/tomesd) library with the [`apply_patch`](https://github.com/dbolya/tomesd?tab=readme-ov-file#usage) function: ```diff from diffusers import StableDiffusionPipeline import torch import tomesd pipeline = StableDiffusionPipeline.from_pretrained( "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True, ).to("cuda") + tomesd.apply_patch(pipeline, ratio=0.5) image = pipeline("a photo of an astronaut riding a horse on mars").images[0] ``` The `apply_patch` function exposes a number of [arguments](https://github.com/dbolya/tomesd#usage) to help strike a balance between pipeline inference speed and the quality of the generated tokens. The most important argument is `ratio` which controls the number of tokens that are merged during the forward pass. As reported in the [paper](https://huggingface.co/papers/2303.17604), ToMe can greatly preserve the quality of the generated images while boosting inference speed. By increasing the `ratio`, you can speed-up inference even further, but at the cost of some degraded image quality. To test the quality of the generated images, we sampled a few prompts from [Parti Prompts](https://parti.research.google/) and performed inference with the [`StableDiffusionPipeline`] with the following settings:
We didn’t notice any significant decrease in the quality of the generated samples, and you can check out the generated samples in this [WandB report](https://wandb.ai/sayakpaul/tomesd-results/runs/23j4bj3i?workspace=). If you're interested in reproducing this experiment, use this [script](https://gist.github.com/sayakpaul/8cac98d7f22399085a060992f411ecbd). ## Benchmarks We also benchmarked the impact of `tomesd` on the [`StableDiffusionPipeline`] with [xFormers](https://huggingface.co/docs/diffusers/optimization/xformers) enabled across several image resolutions. The results are obtained from A100 and V100 GPUs in the following development environment: ```bash - `diffusers` version: 0.15.1 - Python version: 3.8.16 - PyTorch version (GPU?): 1.13.1+cu116 (True) - Huggingface_hub version: 0.13.2 - Transformers version: 4.27.2 - Accelerate version: 0.18.0 - xFormers version: 0.0.16 - tomesd version: 0.1.2 ``` To reproduce this benchmark, feel free to use this [script](https://gist.github.com/sayakpaul/27aec6bca7eb7b0e0aa4112205850335). The results are reported in seconds, and where applicable we report the speed-up percentage over the vanilla pipeline when using ToMe and ToMe + xFormers. 
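The percentages reported in the table that follows are consistent with the relative reduction in latency, `(vanilla - optimized) / vanilla * 100`. A minimal sketch that reproduces two of the reported figures (the helper name `speedup_percent` is ours for illustration, not part of `tomesd` or the benchmark script):

```python
def speedup_percent(vanilla_s: float, optimized_s: float) -> float:
    """Relative latency reduction over the vanilla pipeline, in percent."""
    return round((vanilla_s - optimized_s) / vanilla_s * 100, 2)


# A100, 512x512, batch size 10 (timings in seconds, taken from the table)
tome = speedup_percent(6.88, 5.26)
tome_xformers = speedup_percent(6.88, 4.69)
print(f"ToMe: +{tome}%, ToMe + xFormers: +{tome_xformers}%")
```

Rows where the vanilla pipeline runs out of memory have no baseline, which is why they carry no percentage.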
| **GPU** | **Resolution** | **Batch size** | **Vanilla** | **ToMe** | **ToMe + xFormers** | |----------|----------------|----------------|-------------|----------------|---------------------| | **A100** | 512 | 10 | 6.88 | 5.26 (+23.55%) | 4.69 (+31.83%) | | | 768 | 10 | OOM | 14.71 | 11 | | | | 8 | OOM | 11.56 | 8.84 | | | | 4 | OOM | 5.98 | 4.66 | | | | 2 | 4.99 | 3.24 (+35.07%) | 2.1 (+37.88%) | | | | 1 | 3.29 | 2.24 (+31.91%) | 2.03 (+38.3%) | | | 1024 | 10 | OOM | OOM | OOM | | | | 8 | OOM | OOM | OOM | | | | 4 | OOM | 12.51 | 9.09 | | | | 2 | OOM | 6.52 | 4.96 | | | | 1 | 6.4 | 3.61 (+43.59%) | 2.81 (+56.09%) | | **V100** | 512 | 10 | OOM | 10.03 | 9.29 | | | | 8 | OOM | 8.05 | 7.47 | | | | 4 | 5.7 | 4.3 (+24.56%) | 3.98 (+30.18%) | | | | 2 | 3.14 | 2.43 (+22.61%) | 2.27 (+27.71%) | | | | 1 | 1.88 | 1.57 (+16.49%) | 1.57 (+16.49%) | | | 768 | 10 | OOM | OOM | 23.67 | | | | 8 | OOM | OOM | 18.81 | | | | 4 | OOM | 11.81 | 9.7 | | | | 2 | OOM | 6.27 | 5.2 | | | | 1 | 5.43 | 3.38 (+37.75%) | 2.82 (+48.07%) | | | 1024 | 10 | OOM | OOM | OOM | | | | 8 | OOM | OOM | OOM | | | | 4 | OOM | OOM | 19.35 | | | | 2 | OOM | 13 | 10.78 | | | | 1 | OOM | 6.66 | 5.54 | As seen in the tables above, the speed-up from `tomesd` becomes more pronounced for larger image resolutions. It is also interesting to note that with `tomesd`, it is possible to run the pipeline on a higher resolution like 1024x1024. You may be able to speed-up inference even more with [`torch.compile`](fp16#torchcompile). ================================================ FILE: docs/source/en/optimization/xdit.md ================================================ # xDiT [xDiT](https://github.com/xdit-project/xDiT) is an inference engine designed for the large scale parallel deployment of Diffusion Transformers (DiTs). xDiT provides a suite of efficient parallel approaches for Diffusion Models, as well as GPU kernel accelerations. 
There are four parallel methods supported in xDiT, including [Unified Sequence Parallelism](https://huggingface.co/papers/2405.07719), [PipeFusion](https://huggingface.co/papers/2405.14430), CFG parallelism and data parallelism. The four parallel methods in xDiT can be configured in a hybrid manner, optimizing communication patterns to best suit the underlying network hardware. Optimization orthogonal to parallelization focuses on accelerating single GPU performance. In addition to utilizing well-known Attention optimization libraries, we leverage compilation acceleration technologies such as torch.compile and onediff. The overview of xDiT is shown as follows.
You can install xDiT using the following command:

```bash
pip install xfuser
```

Here's an example of using xDiT to accelerate inference of a Diffusers model.

```diff
  import torch
  from diffusers import StableDiffusion3Pipeline

  from xfuser import xFuserArgs, xDiTParallel
  from xfuser.config import FlexibleArgumentParser
  from xfuser.core.distributed import get_world_group

  def main():
+     parser = FlexibleArgumentParser(description="xFuser Arguments")
+     args = xFuserArgs.add_cli_args(parser).parse_args()
+     engine_args = xFuserArgs.from_cli_args(args)
+     engine_config, input_config = engine_args.create_config()

      local_rank = get_world_group().local_rank
      pipe = StableDiffusion3Pipeline.from_pretrained(
          pretrained_model_name_or_path=engine_config.model_config.model,
          torch_dtype=torch.float16,
      ).to(f"cuda:{local_rank}")

      # do anything you want with pipeline here

+     pipe = xDiTParallel(pipe, engine_config, input_config)

      pipe(
          height=input_config.height,
          width=input_config.width,
          prompt=input_config.prompt,
          num_inference_steps=input_config.num_inference_steps,
          output_type=input_config.output_type,
          generator=torch.Generator(device="cuda").manual_seed(input_config.seed),
      )

+     if input_config.output_type == "pil":
+         pipe.save("results", "stable_diffusion_3")

  if __name__ == "__main__":
      main()
```

As you can see, we only need to use `xFuserArgs` from xDiT to get the configuration parameters, and pass these parameters along with the pipeline object from the Diffusers library into `xDiTParallel` to complete the parallelization of a specific pipeline in Diffusers.

xDiT runtime parameters can be viewed in the command line using `-h`, and you can refer to this [usage](https://github.com/xdit-project/xDiT?tab=readme-ov-file#2-usage) example for more details.

xDiT needs to be launched using torchrun to support its multi-node, multi-GPU parallel capabilities.
For example, the following command can be used for 8-GPU parallel inference: ```bash torchrun --nproc_per_node=8 ./inference.py --model models/FLUX.1-dev --data_parallel_degree 2 --ulysses_degree 2 --ring_degree 2 --prompt "A snowy mountain" "A small dog" --num_inference_steps 50 ``` ## Supported models A subset of Diffusers models are supported in xDiT, such as Flux.1, Stable Diffusion 3, etc. The latest supported models can be found [here](https://github.com/xdit-project/xDiT?tab=readme-ov-file#-supported-dits). ## Benchmark We tested different models on various machines, and here is some of the benchmark data. ### Flux.1-schnell
### Stable Diffusion 3
### HunyuanDiT
More detailed performance metrics can be found on our [GitHub page](https://github.com/xdit-project/xDiT?tab=readme-ov-file#perf).

## Reference

[xDiT-project](https://github.com/xdit-project/xDiT)

[USP: A Unified Sequence Parallelism Approach for Long Context Generative AI](https://huggingface.co/papers/2405.07719)

[PipeFusion: Displaced Patch Pipeline Parallelism for Inference of Diffusion Transformer Models](https://huggingface.co/papers/2405.14430)

================================================
FILE: docs/source/en/optimization/xformers.md
================================================
# xFormers

We recommend [xFormers](https://github.com/facebookresearch/xformers) for both inference and training. In our tests, the optimizations performed in the attention blocks allow for both faster speed and reduced memory consumption.

Install xFormers from `pip`:

```bash
pip install xformers
```

> [!TIP]
> The xFormers `pip` package requires the latest version of PyTorch. If you need to use a previous version of PyTorch, then we recommend [installing xFormers from the source](https://github.com/facebookresearch/xformers#installing-xformers).

After xFormers is installed, you can use it with [`~ModelMixin.set_attention_backend`] as shown in the [Attention backends](./attention_backends) guide.

> [!WARNING]
> According to this [issue](https://github.com/huggingface/diffusers/issues/2234#issuecomment-1416931212), xFormers `v0.0.16` cannot be used for training (fine-tune or DreamBooth) on some GPUs. If you observe this problem, please install a development version as indicated in the issue comments.

================================================
FILE: docs/source/en/quantization/bitsandbytes.md
================================================
# bitsandbytes

[bitsandbytes](https://huggingface.co/docs/bitsandbytes/index) is the easiest option for quantizing a model to 8-bit and 4-bit.
8-bit quantization multiplies outliers in fp16 with non-outliers in int8, converts the non-outlier values back to fp16, and then adds them together to return the weights in fp16. This reduces the degradative effect outlier values have on a model's performance.

4-bit quantization compresses a model even further, and it is commonly used with [QLoRA](https://hf.co/papers/2305.14314) to finetune quantized LLMs.

This guide demonstrates how quantization can enable running [FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) on less than 16GB of VRAM and even on a free Google Colab instance.

![comparison image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/quant-bnb/comparison.png)

To use bitsandbytes, make sure you have the following libraries installed:

```bash
pip install diffusers transformers accelerate bitsandbytes -U
```

Now you can quantize a model by passing a [`BitsAndBytesConfig`] to [`~ModelMixin.from_pretrained`]. This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.

Quantizing a model in 8-bit halves the memory usage.

bitsandbytes is supported in both Transformers and Diffusers, so you can quantize both the [`FluxTransformer2DModel`] and [`~transformers.T5EncoderModel`].

For Ada and higher-series GPUs, we recommend changing `torch_dtype` to `torch.bfloat16`.

> [!TIP]
> The [`CLIPTextModel`] and [`AutoencoderKL`] aren't quantized because they're already small in size and because [`AutoencoderKL`] only has a few `torch.nn.Linear` layers.
```py from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig import torch from diffusers import AutoModel from transformers import T5EncoderModel quant_config = TransformersBitsAndBytesConfig(load_in_8bit=True,) text_encoder_2_8bit = T5EncoderModel.from_pretrained( "black-forest-labs/FLUX.1-dev", subfolder="text_encoder_2", quantization_config=quant_config, torch_dtype=torch.float16, ) quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True,) transformer_8bit = AutoModel.from_pretrained( "black-forest-labs/FLUX.1-dev", subfolder="transformer", quantization_config=quant_config, torch_dtype=torch.float16, ) ``` By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter. ```diff transformer_8bit = AutoModel.from_pretrained( "black-forest-labs/FLUX.1-dev", subfolder="transformer", quantization_config=quant_config, + torch_dtype=torch.float32, ) ``` Let's generate an image using our quantized models. Setting `device_map="auto"` automatically fills all available space on the GPU(s) first, then the CPU, and finally, the hard drive (the absolute slowest option) if there is still not enough memory. ```py from diffusers import FluxPipeline pipe = FluxPipeline.from_pretrained( "black-forest-labs/FLUX.1-dev", transformer=transformer_8bit, text_encoder_2=text_encoder_2_8bit, torch_dtype=torch.float16, device_map="auto", ) pipe_kwargs = { "prompt": "A cat holding a sign that says hello world", "height": 1024, "width": 1024, "guidance_scale": 3.5, "num_inference_steps": 50, "max_sequence_length": 512, } image = pipe(**pipe_kwargs, generator=torch.manual_seed(0),).images[0] ```
When there is enough memory, you can also directly move the pipeline to the GPU with `.to("cuda")` and apply [`~DiffusionPipeline.enable_model_cpu_offload`] to optimize GPU memory usage. Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights. You can also save the serialized 8-bit models locally with [`~ModelMixin.save_pretrained`].
Quantizing a model in 4-bit reduces your memory usage by 4x.

bitsandbytes is supported in both Transformers and Diffusers, so you can quantize both the [`FluxTransformer2DModel`] and [`~transformers.T5EncoderModel`].

For Ada and higher-series GPUs, we recommend changing `torch_dtype` to `torch.bfloat16`.

> [!TIP]
> The [`CLIPTextModel`] and [`AutoencoderKL`] aren't quantized because they're already small in size and because [`AutoencoderKL`] only has a few `torch.nn.Linear` layers.

```py
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
import torch
from diffusers import AutoModel
from transformers import T5EncoderModel

quant_config = TransformersBitsAndBytesConfig(load_in_4bit=True,)

text_encoder_2_4bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_4bit=True,)

transformer_4bit = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)
```

By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter.

```diff
transformer_4bit = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
+   torch_dtype=torch.float32,
)
```

Let's generate an image using our quantized models.

Setting `device_map="auto"` automatically fills all available space on the GPU(s) first, then the CPU, and finally, the hard drive (the absolute slowest option) if there is still not enough memory.
```py from diffusers import FluxPipeline pipe = FluxPipeline.from_pretrained( "black-forest-labs/FLUX.1-dev", transformer=transformer_4bit, text_encoder_2=text_encoder_2_4bit, torch_dtype=torch.float16, device_map="auto", ) pipe_kwargs = { "prompt": "A cat holding a sign that says hello world", "height": 1024, "width": 1024, "guidance_scale": 3.5, "num_inference_steps": 50, "max_sequence_length": 512, } image = pipe(**pipe_kwargs, generator=torch.manual_seed(0),).images[0] ```
When there is enough memory, you can also directly move the pipeline to the GPU with `.to("cuda")` and apply [`~DiffusionPipeline.enable_model_cpu_offload`] to optimize GPU memory usage. Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights. You can also save the serialized 4-bit models locally with [`~ModelMixin.save_pretrained`].
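As a rough back-of-the-envelope check of the 2x and 4x figures, the sketch below estimates the memory needed for the weights alone from a parameter count and the number of bits per parameter. The 12-billion parameter count is an illustrative assumption, and the estimate ignores modules that stay in higher precision, quantization constants, and activation memory.

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


num_params = 12_000_000_000  # assumed parameter count, for illustration only

fp16 = weight_memory_gib(num_params, 16)
int8 = weight_memory_gib(num_params, 8)
four_bit = weight_memory_gib(num_params, 4)

print(f"fp16: {fp16:.1f} GiB, 8-bit: {int8:.1f} GiB, 4-bit: {four_bit:.1f} GiB")
```

In practice the measured savings are somewhat smaller, which is why checking the real footprint of a loaded model is still worthwhile.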
> [!WARNING]
> Training with 8-bit and 4-bit weights is only supported for training *extra* parameters.

Check your memory footprint with the `get_memory_footprint` method:

```py
# `model` is any quantized model loaded as shown above
print(model.get_memory_footprint())
```

Note that this only tells you the memory footprint of the model params and does _not_ estimate the inference memory requirements.

Quantized models can be loaded from the [`~ModelMixin.from_pretrained`] method without needing to specify the `quantization_config` parameters:

```py
from diffusers import AutoModel

model_4bit = AutoModel.from_pretrained(
    "hf-internal-testing/flux.1-dev-nf4-pkg", subfolder="transformer"
)
```

## 8-bit (LLM.int8() algorithm)

> [!TIP]
> Learn more about the details of 8-bit quantization in this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration)!

This section explores some of the specific features of 8-bit models, such as outlier thresholds and skipping module conversion.

### Outlier threshold

An "outlier" is a hidden state value greater than a certain threshold, and these values are computed in fp16. While the values are usually normally distributed ([-3.5, 3.5]), this distribution can be very different for large models ([-60, -6] or [6, 60]). 8-bit quantization works well for values ~5, but beyond that, there is a significant performance penalty. A good default threshold value is 6, but a lower threshold may be needed for more unstable models (small models or finetuning).
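To make the role of the threshold concrete, here is a minimal pure-Python sketch of the idea behind mixed-precision decomposition: feature columns of the hidden states that contain a value above the threshold are multiplied in full precision, while the remaining columns go through a simulated int8 absmax quantization. This is a toy illustration under simplifying assumptions (one scale per tensor, plain Python lists), not the bitsandbytes implementation, and all function names are hypothetical.

```python
def absmax_quantize(matrix):
    """Quantize a matrix to int8 range with a single absmax scale (toy version)."""
    absmax = max(abs(v) for row in matrix for v in row) or 1.0
    scale = 127.0 / absmax
    return [[round(v * scale) for v in row] for row in matrix], scale


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def mixed_precision_matmul(hidden, weight, threshold=6.0):
    """Multiply outlier feature columns in float and the rest in simulated int8."""
    n_features = len(weight)
    outlier_cols = [
        j for j in range(n_features) if any(abs(row[j]) > threshold for row in hidden)
    ]
    regular_cols = [j for j in range(n_features) if j not in outlier_cols]
    out = [[0.0] * len(weight[0]) for _ in hidden]

    if outlier_cols:
        # Outlier part: kept in full precision.
        h_out = [[row[j] for j in outlier_cols] for row in hidden]
        w_out = [weight[j] for j in outlier_cols]
        for i, row in enumerate(matmul(h_out, w_out)):
            out[i] = [o + r for o, r in zip(out[i], row)]

    if regular_cols:
        # Regular part: integer product, then rescaled back to float.
        h_q, h_scale = absmax_quantize([[row[j] for j in regular_cols] for row in hidden])
        w_q, w_scale = absmax_quantize([weight[j] for j in regular_cols])
        for i, row in enumerate(matmul(h_q, w_q)):
            out[i] = [o + r / (h_scale * w_scale) for o, r in zip(out[i], row)]

    return out, outlier_cols


hidden = [[0.5, -1.2, 40.0], [1.0, 0.3, -0.7]]  # feature column 2 holds an outlier
weight = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.6]]

result, outliers = mixed_precision_matmul(hidden, weight)
exact = matmul(hidden, weight)
max_err = max(abs(r - e) for rr, er in zip(result, exact) for r, e in zip(rr, er))
print("outlier columns:", outliers)
print("max abs error:", max_err)
```

Without the split, the single large value would dominate the absmax scale and the small values in the same tensor would lose most of their precision, which is the effect the threshold is meant to avoid.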
To find the best threshold for your model, we recommend experimenting with the `llm_int8_threshold` parameter in [`BitsAndBytesConfig`]:

```py
from diffusers import AutoModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True, llm_int8_threshold=10,
)

model_8bit = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config,
)
```

### Skip module conversion

For some models, you don't need to quantize every module to 8-bit, which can actually cause instability. For example, for diffusion models like [Stable Diffusion 3](../api/pipelines/stable_diffusion/stable_diffusion_3), the `proj_out` module can be skipped using the `llm_int8_skip_modules` parameter in [`BitsAndBytesConfig`]:

```py
from diffusers import SD3Transformer2DModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True, llm_int8_skip_modules=["proj_out"],
)

model_8bit = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    subfolder="transformer",
    quantization_config=quantization_config,
)
```

## 4-bit (QLoRA algorithm)

> [!TIP]
> Learn more about its details in this [blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes).

This section explores some of the specific features of 4-bit models, such as changing the compute data type, using the Normal Float 4 (NF4) data type, and using nested quantization.

### Compute data type

To speed up computation, you can change the data type from float32 (the default value) to bf16 using the `bnb_4bit_compute_dtype` parameter in [`BitsAndBytesConfig`]:

```py
import torch
from diffusers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
```

### Normal Float 4 (NF4)

NF4 is a 4-bit data type from the [QLoRA](https://hf.co/papers/2305.14314) paper, adapted for weights initialized from a normal distribution.
You should use NF4 for training 4-bit base models. This can be configured with the `bnb_4bit_quant_type` parameter in the [`BitsAndBytesConfig`]:

```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

from diffusers import AutoModel
from transformers import T5EncoderModel

quant_config = TransformersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

text_encoder_2_4bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

transformer_4bit = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)
```

For inference, the `bnb_4bit_quant_type` does not have a huge impact on performance. However, to remain consistent with the model weights, you should use the same `bnb_4bit_compute_dtype` and `torch_dtype` values.

### Nested quantization

Nested quantization is a technique that can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter.
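The roughly 0.4 bits/parameter figure can be sanity-checked with the block sizes described in the QLoRA paper: 64 weights share one 32-bit quantization constant, and with nested quantization those constants are stored in 8 bits and grouped in blocks of 256 that share one 32-bit constant. These numbers come from the paper rather than from this guide, so treat the sketch as an approximation that assumes those settings are the ones in use.

```python
BLOCK_SIZE = 64          # weights sharing one quantization constant (per the QLoRA paper)
NESTED_BLOCK_SIZE = 256  # first-level constants sharing one second-level constant

# Without nested quantization: one fp32 constant per block of weights.
overhead_plain = 32 / BLOCK_SIZE

# With nested quantization: one 8-bit constant per block of weights, plus one
# fp32 constant per NESTED_BLOCK_SIZE first-level constants.
overhead_nested = 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * NESTED_BLOCK_SIZE)

saved = overhead_plain - overhead_nested
print(f"plain: {overhead_plain:.3f}, nested: {overhead_nested:.3f}, saved: {saved:.3f} bits/param")
```

Under these assumptions the overhead drops from 0.5 to about 0.127 bits per parameter, a saving of about 0.373 bits, which rounds to the 0.4 quoted above.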
```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

from diffusers import AutoModel
from transformers import T5EncoderModel

quant_config = TransformersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

text_encoder_2_4bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

transformer_4bit = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)
```

## Dequantizing `bitsandbytes` models

Once quantized, you can dequantize a model to its original precision, but this might result in a small loss of quality. Make sure you have enough GPU RAM to fit the dequantized model.

```python
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

from diffusers import AutoModel
from transformers import T5EncoderModel

quant_config = TransformersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

text_encoder_2_4bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

transformer_4bit = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

# Convert the quantized weights back to the original precision
text_encoder_2_4bit.dequantize()
transformer_4bit.dequantize()
```

## torch.compile

Speed up inference with `torch.compile`.
Make sure you have the latest `bitsandbytes` installed. We also recommend installing [PyTorch nightly](https://pytorch.org/get-started/locally/).

```py
import torch
from diffusers import AutoModel, BitsAndBytesConfig as DiffusersBitsAndBytesConfig

# 8-bit
torch._dynamo.config.capture_dynamic_output_shape_ops = True

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)
transformer_8bit.compile(fullgraph=True)
```

```py
import torch
from diffusers import AutoModel, BitsAndBytesConfig as DiffusersBitsAndBytesConfig

# 4-bit
quant_config = DiffusersBitsAndBytesConfig(load_in_4bit=True)
transformer_4bit = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)
transformer_4bit.compile(fullgraph=True)
```

On an RTX 4090 with compilation, 4-bit Flux generation completed in 25.809 seconds versus 32.570 seconds without.

Check out the [benchmarking script](https://gist.github.com/sayakpaul/0db9d8eeeb3d2a0e5ed7cf0d9ca19b7d) for more details.

## Resources

* [End-to-end notebook showing Flux.1 Dev inference in a free-tier Colab](https://gist.github.com/sayakpaul/c76bd845b48759e11687ac550b99d8b4)
* [Training](https://github.com/huggingface/diffusers/blob/8c661ea586bf11cb2440da740dd3c4cf84679b85/examples/dreambooth/README_hidream.md#using-quantization)

================================================
FILE: docs/source/en/quantization/gguf.md
================================================
# GGUF

The GGUF file format is typically used to store models for inference with [GGML](https://github.com/ggerganov/ggml) and supports a variety of block-wise quantization options. Diffusers supports loading checkpoints prequantized and saved in the GGUF format via `from_single_file` loading with Model classes. Loading GGUF checkpoints via Pipelines is currently not supported.

The following example will load the [FLUX.1 DEV](https://huggingface.co/black-forest-labs/FLUX.1-dev) transformer model using the GGUF Q2_K quantization variant.
Before starting, install gguf in your environment:

```shell
pip install -U gguf
```

Since GGUF is a single file format, use [`~FromSingleFileMixin.from_single_file`] to load the model and pass in the [`GGUFQuantizationConfig`].

When using GGUF checkpoints, the quantized weights remain in a low memory `dtype` (typically `torch.uint8`) and are dynamically dequantized and cast to the configured `compute_dtype` during each module's forward pass through the model. The `GGUFQuantizationConfig` allows you to set the `compute_dtype`.

The functions used for dynamic dequantization are based on the great work done by [city96](https://github.com/city96/ComfyUI-GGUF), who created the PyTorch ports of the original [`numpy`](https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/gguf/quants.py) implementation by [compilade](https://github.com/compilade).

```python
import torch

from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

ckpt_path = (
    "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf"
)
transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
prompt = "A cat holding a sign that says hello world"
image = pipe(prompt, generator=torch.manual_seed(0)).images[0]
image.save("flux-gguf.png")
```

## Using Optimized CUDA Kernels with GGUF

Optimized CUDA kernels can accelerate GGUF quantized model inference by approximately 10%. This functionality requires a compatible GPU with `torch.cuda.get_device_capability` greater than 7 and the kernels library:

```shell
pip install -U kernels
```

Once installed, set `DIFFUSERS_GGUF_CUDA_KERNELS=true` to use optimized kernels when available.
Note that CUDA kernels may introduce minor numerical differences compared to the original GGUF implementation, potentially causing subtle visual variations in generated images. To disable CUDA kernel usage, set the environment variable `DIFFUSERS_GGUF_CUDA_KERNELS=false`.

## Supported Quantization Types

- BF16
- Q4_0
- Q4_1
- Q5_0
- Q5_1
- Q8_0
- Q2_K
- Q3_K
- Q4_K
- Q5_K
- Q6_K

## Convert to GGUF

Use the Space below to convert a Diffusers checkpoint into the GGUF format for inference. After conversion, load the Diffusers-format GGUF checkpoint with `from_single_file`:

```py
import torch

from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

ckpt_path = (
    "https://huggingface.co/sayakpaul/different-lora-from-civitai/blob/main/flux_dev_diffusers-q4_0.gguf"
)
transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    config="black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
prompt = "A cat holding a sign that says hello world"
image = pipe(prompt, generator=torch.manual_seed(0)).images[0]
image.save("flux-gguf.png")
```

When using Diffusers format GGUF checkpoints, you must provide the model `config` path. If the model config resides in a `subfolder`, that needs to be specified, too.

================================================
FILE: docs/source/en/quantization/modelopt.md
================================================
# NVIDIA ModelOpt

[NVIDIA-ModelOpt](https://github.com/NVIDIA/Model-Optimizer) is a unified library of state-of-the-art model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed.

Before you begin, make sure you have nvidia_modelopt installed.
```bash
pip install -U "nvidia_modelopt[hf]"
```

Quantize a model by passing [`NVIDIAModelOptConfig`] to [`~ModelMixin.from_pretrained`] (you can also load pre-quantized models). This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.

The example below only quantizes the weights to FP8.

```python
import torch
from diffusers import AutoModel, SanaPipeline, NVIDIAModelOptConfig

model_id = "Efficient-Large-Model/Sana_600M_1024px_diffusers"
dtype = torch.bfloat16

quantization_config = NVIDIAModelOptConfig(quant_type="FP8", quant_method="modelopt")
transformer = AutoModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=dtype,
)
pipe = SanaPipeline.from_pretrained(
    model_id,
    transformer=transformer,
    torch_dtype=dtype,
)
pipe.to("cuda")

print(f"Pipeline memory usage: {torch.cuda.max_memory_reserved() / 1024**3:.3f} GB")

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt, num_inference_steps=50, guidance_scale=4.5, max_sequence_length=512
).images[0]
image.save("output.png")
```

> **Note:**
>
> The quantization methods in NVIDIA-ModelOpt are designed to reduce the memory footprint of model weights using various QAT (Quantization-Aware Training) and PTQ (Post-Training Quantization) techniques while maintaining model performance. However, the actual performance gain during inference depends on the deployment framework (e.g., TRT-LLM, TensorRT) and the specific hardware configuration.
>
> More details can be found [here](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples).

## NVIDIAModelOptConfig

The `NVIDIAModelOptConfig` class accepts the following parameters:

- `quant_type`: A string value mentioning one of the quantization types below.
- `modules_to_not_convert`: A list of full or partial module names for which quantization should not be performed.
For example, to not perform any quantization of the [`SD3Transformer2DModel`]'s pos_embed projection blocks, one would specify: `modules_to_not_convert=["pos_embed.proj.weight"]`. - `disable_conv_quantization`: A boolean value which when set to `True` disables quantization for all convolutional layers in the model. This is useful as channel and block quantization generally don't work well with convolutional layers (used with INT4, NF4, NVFP4). If you want to disable quantization for specific convolutional layers, use `modules_to_not_convert` instead. - `algorithm`: The algorithm to use for determining scale, defaults to `"max"`. You can check modelopt documentation for more algorithms and details. - `forward_loop`: The forward loop function to use for calibrating activation during quantization. If not provided, it relies on static scale values computed using the weights only. - `kwargs`: A dict of keyword arguments to pass to the underlying quantization method which will be invoked based on `quant_type`. ## Supported quantization types ModelOpt supports weight-only, channel and block quantization int8, fp8, int4, nf4, and nvfp4. The quantization methods are designed to reduce the memory footprint of the model weights while maintaining the performance of the model during inference. Weight-only quantization stores the model weights in a specific low-bit data type but performs computation with a higher-precision data type, like `bfloat16`. This lowers the memory requirements from model weights but retains the memory peaks for activation computation. 
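To make the weight-only idea concrete, the sketch below quantizes a short list of weights to INT8 with a single absmax scale, then dequantizes them before a dot product. It is a minimal pure-Python illustration under simplifying assumptions (one scale per tensor, symmetric rounding), not ModelOpt's implementation. The helper names `quantize_int8`, `dequantize`, and `dot`, as well as the sample values, are hypothetical and chosen only for this example.

```python
# Illustrative sketch of weight-only quantization (not ModelOpt's actual code).
# Weights are stored as int8 codes plus one float scale, and are converted back
# to floating point right before they are used in a computation.


def quantize_int8(weights):
    """Return int8 codes and a per-tensor scale using absmax scaling."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127 if absmax > 0 else 1.0
    # Round to the nearest code and clamp to the symmetric int8 range.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the stored codes."""
    return [c * scale for c in codes]


def dot(x, w):
    """Higher-precision computation: a plain float dot product."""
    return sum(a * b for a, b in zip(x, w))


weights = [0.12, -0.5, 0.33, 0.9, -0.07]  # hypothetical weight row
x = [1.0, 2.0, -1.0, 0.5, 3.0]            # hypothetical activations

codes, scale = quantize_int8(weights)      # what gets stored (1 byte per weight)
restored = dequantize(codes, scale)        # what the computation actually uses

y_ref = dot(x, weights)
y_quant = dot(x, restored)
max_weight_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"scale={scale:.6f}, max weight error={max_weight_error:.6f}")
print(f"reference output={y_ref:.4f}, quantized output={y_quant:.4f}")
```

Only the stored representation shrinks in this sketch. The activations `x` and the dot product stay in floating point, which is why weight-only schemes do not reduce activation memory peaks.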
The quantization methods supported are as follows:

| **Quantization Type** | **Supported Schemes** | **Required Kwargs** | **Additional Notes** |
|-----------------------|-----------------------|---------------------|----------------------|
| **INT8** | `int8 weight only`, `int8 channel quantization`, `int8 block quantization` | `quant_type`, `quant_type + channel_quantize`, `quant_type + channel_quantize + block_quantize` | |
| **FP8** | `fp8 weight only`, `fp8 channel quantization`, `fp8 block quantization` | `quant_type`, `quant_type + channel_quantize`, `quant_type + channel_quantize + block_quantize` | |
| **INT4** | `int4 weight only`, `int4 block quantization` | `quant_type`, `quant_type + channel_quantize + block_quantize` | Only `channel_quantize = -1` is supported for now |
| **NF4** | `nf4 weight only`, `nf4 double block quantization` | `quant_type`, `quant_type + channel_quantize + block_quantize + scale_channel_quantize + scale_block_quantize` | Only `channel_quantize = -1` and `scale_channel_quantize = -1` are supported for now |
| **NVFP4** | `nvfp4 weight only`, `nvfp4 block quantization` | `quant_type`, `quant_type + channel_quantize + block_quantize` | Only `channel_quantize = -1` is supported for now |

Refer to the [official modelopt documentation](https://nvidia.github.io/Model-Optimizer/) for a better understanding of the available quantization methods and the exhaustive list of configuration options available.

## Serializing and Deserializing quantized models

To serialize a quantized model in a given dtype, first load the model with the desired quantization dtype and then save it using the [`~ModelMixin.save_pretrained`] method.
```python
import torch
from diffusers import AutoModel, NVIDIAModelOptConfig
from modelopt.torch.opt import enable_huggingface_checkpointing

enable_huggingface_checkpointing()

model_id = "Efficient-Large-Model/Sana_600M_1024px_diffusers"
quant_config_fp8 = {"quant_type": "FP8", "quant_method": "modelopt"}
quant_config_fp8 = NVIDIAModelOptConfig(**quant_config_fp8)
model = AutoModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quant_config_fp8,
    torch_dtype=torch.bfloat16,
)
model.save_pretrained('path/to/sana_fp8', safe_serialization=False)
```

To load a serialized quantized model, use the [`~ModelMixin.from_pretrained`] method.

```python
import torch
from diffusers import AutoModel, NVIDIAModelOptConfig, SanaPipeline
from modelopt.torch.opt import enable_huggingface_checkpointing

enable_huggingface_checkpointing()

quantization_config = NVIDIAModelOptConfig(quant_type="FP8", quant_method="modelopt")
# The model was saved directly into "path/to/sana_fp8" above, so no subfolder is needed here.
transformer = AutoModel.from_pretrained(
    "path/to/sana_fp8",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)
pipe = SanaPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_600M_1024px_diffusers",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt, num_inference_steps=50, guidance_scale=4.5, max_sequence_length=512
).images[0]
image.save("output.png")
```

================================================
FILE: docs/source/en/quantization/overview.md
================================================

# Getting started

Quantization focuses on representing data with fewer bits while also trying to preserve the precision of the original data. This often means converting a data type to represent the same information with fewer bits.
For example, if your model weights are stored as 32-bit floating point values and they're quantized to 16-bit floating point values, the model size is halved, which makes it easier to store and reduces memory usage. Lower precision can also speed up inference because it takes less time to perform calculations with fewer bits.

Diffusers supports multiple quantization backends to make large diffusion models like [Flux](../api/pipelines/flux) more accessible. This guide shows how to use the [`~quantizers.PipelineQuantizationConfig`] class to quantize a pipeline during its initialization from a pretrained or non-quantized checkpoint.

## Pipeline-level quantization

There are two ways to use [`~quantizers.PipelineQuantizationConfig`] depending on how much customization you want to apply to the quantization configuration.

- for basic use cases, define the `quant_backend`, `quant_kwargs`, and `components_to_quantize` arguments
- for granular quantization control, define a `quant_mapping` that provides the quantization configuration for individual model components

### Basic quantization

Initialize [`~quantizers.PipelineQuantizationConfig`] with the following parameters.

- `quant_backend` specifies which quantization backend to use. Currently supported backends include: `bitsandbytes_4bit`, `bitsandbytes_8bit`, `gguf`, `quanto`, and `torchao`.
- `quant_kwargs` specifies the quantization arguments to use.

> [!TIP]
> These `quant_kwargs` arguments are different for each backend. Refer to the [Quantization API](../api/quantization) docs to view the arguments for each backend.

- `components_to_quantize` specifies which component(s) of the pipeline to quantize. Typically, you should quantize the most compute-intensive components like the transformer. If a pipeline has more than one text encoder, such as [`FluxPipeline`], a text encoder is another component to consider quantizing. The example below quantizes the T5 text encoder in [`FluxPipeline`] while keeping the CLIP model intact.
`components_to_quantize` accepts either a list for multiple models or a string for a single model. The example below loads the bitsandbytes backend with the following arguments from [`~quantizers.quantization_config.BitsAndBytesConfig`], `load_in_4bit`, `bnb_4bit_quant_type`, and `bnb_4bit_compute_dtype`. ```py import torch from diffusers import DiffusionPipeline from diffusers.quantizers import PipelineQuantizationConfig pipeline_quant_config = PipelineQuantizationConfig( quant_backend="bitsandbytes_4bit", quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16}, components_to_quantize=["transformer", "text_encoder_2"], ) ``` Pass the `pipeline_quant_config` to [`~DiffusionPipeline.from_pretrained`] to quantize the pipeline. ```py pipe = DiffusionPipeline.from_pretrained( "black-forest-labs/FLUX.1-dev", quantization_config=pipeline_quant_config, torch_dtype=torch.bfloat16, ).to("cuda") image = pipe("photo of a cute dog").images[0] ``` ### Advanced quantization The `quant_mapping` argument provides more options for how to quantize each individual component in a pipeline, like combining different quantization backends. Initialize [`~quantizers.PipelineQuantizationConfig`] and pass a `quant_mapping` to it. The `quant_mapping` allows you to specify the quantization options for each component in the pipeline such as the transformer and text encoder. The example below uses two quantization backends, [`~quantizers.quantization_config.QuantoConfig`] and [`transformers.BitsAndBytesConfig`], for the transformer and text encoder. 
```py
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers.quantization_config import QuantoConfig
from diffusers.quantizers import PipelineQuantizationConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

pipeline_quant_config = PipelineQuantizationConfig(
    quant_mapping={
        "transformer": QuantoConfig(weights_dtype="int8"),
        # text_encoder_2 is a Transformers model, so it takes the Transformers config class
        "text_encoder_2": TransformersBitsAndBytesConfig(
            load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16
        ),
    }
)
```

There is a separate bitsandbytes backend in [Transformers](https://huggingface.co/docs/transformers/main_classes/quantization#transformers.BitsAndBytesConfig). You need to import and use [`transformers.BitsAndBytesConfig`] for components that come from Transformers. For example, `text_encoder_2` in [`FluxPipeline`] is a [`~transformers.T5EncoderModel`] from Transformers so you need to use [`transformers.BitsAndBytesConfig`] instead of [`diffusers.BitsAndBytesConfig`].

> [!TIP]
> Use the [basic quantization](#basic-quantization) method above if you don't want to manage these distinct imports or aren't sure where each pipeline component comes from.

```py
import torch
from diffusers import DiffusionPipeline
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from diffusers.quantizers import PipelineQuantizationConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

pipeline_quant_config = PipelineQuantizationConfig(
    quant_mapping={
        "transformer": DiffusersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16),
        "text_encoder_2": TransformersBitsAndBytesConfig(
            load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16
        ),
    }
)
```

Pass the `pipeline_quant_config` to [`~DiffusionPipeline.from_pretrained`] to quantize the pipeline.
```py pipe = DiffusionPipeline.from_pretrained( "black-forest-labs/FLUX.1-dev", quantization_config=pipeline_quant_config, torch_dtype=torch.bfloat16, ).to("cuda") image = pipe("photo of a cute dog").images[0] ``` ## Resources Check out the resources below to learn more about quantization. - If you are new to quantization, we recommend checking out the following beginner-friendly courses in collaboration with DeepLearning.AI. - [Quantization Fundamentals with Hugging Face](https://www.deeplearning.ai/short-courses/quantization-fundamentals-with-hugging-face/) - [Quantization in Depth](https://www.deeplearning.ai/short-courses/quantization-in-depth/) - Refer to the [Contribute new quantization method guide](https://huggingface.co/docs/transformers/main/en/quantization/contribute) if you're interested in adding a new quantization method. - The Transformers quantization [Overview](https://huggingface.co/docs/transformers/quantization/overview#when-to-use-what) provides an overview of the pros and cons of different quantization backends. - Read the [Exploring Quantization Backends in Diffusers](https://huggingface.co/blog/diffusers-quantization) blog post for a brief introduction to each quantization backend, how to choose a backend, and combining quantization with other memory optimizations. ================================================ FILE: docs/source/en/quantization/quanto.md ================================================ # Quanto [Quanto](https://github.com/huggingface/optimum-quanto) is a PyTorch quantization backend for [Optimum](https://huggingface.co/docs/optimum/en/index). 
It has been designed with versatility and simplicity in mind:

- All features are available in eager mode (works with non-traceable models)
- Supports quantization aware training
- Quantized models are compatible with `torch.compile`
- Quantized models are device agnostic (e.g., CUDA, XPU, MPS, CPU)

In order to use the Quanto backend, you will first need to install `optimum-quanto>=0.2.6` and `accelerate`.

```shell
pip install optimum-quanto accelerate
```

Now you can quantize a model by passing the `QuantoConfig` object to the `from_pretrained()` method. Although the Quanto library does allow quantizing `nn.Conv2d` and `nn.LayerNorm` modules, currently, Diffusers only supports quantizing the weights in the `nn.Linear` layers of a model. The following snippet demonstrates how to apply `float8` quantization with Quanto.

```python
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, QuantoConfig

model_id = "black-forest-labs/FLUX.1-dev"
torch_dtype = torch.bfloat16

quantization_config = QuantoConfig(weights_dtype="float8")
transformer = FluxTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch_dtype,
)

pipe = FluxPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch_dtype)
pipe.to("cuda")

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt, num_inference_steps=50, guidance_scale=4.5, max_sequence_length=512
).images[0]
image.save("output.png")
```

## Skipping Quantization on specific modules

It is possible to skip applying quantization on certain modules using the `modules_to_not_convert` argument in the `QuantoConfig`.
Please ensure that the modules passed in to this argument match the keys of the modules in the `state_dict` ```python import torch from diffusers import FluxTransformer2DModel, QuantoConfig model_id = "black-forest-labs/FLUX.1-dev" quantization_config = QuantoConfig(weights_dtype="float8", modules_to_not_convert=["proj_out"]) transformer = FluxTransformer2DModel.from_pretrained( model_id, subfolder="transformer", quantization_config=quantization_config, torch_dtype=torch.bfloat16, ) ``` ## Using `from_single_file` with the Quanto Backend `QuantoConfig` is compatible with `~FromOriginalModelMixin.from_single_file`. ```python import torch from diffusers import FluxTransformer2DModel, QuantoConfig ckpt_path = "https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/flux1-dev.safetensors" quantization_config = QuantoConfig(weights_dtype="float8") transformer = FluxTransformer2DModel.from_single_file(ckpt_path, quantization_config=quantization_config, torch_dtype=torch.bfloat16) ``` ## Saving Quantized models Diffusers supports serializing Quanto models using the `~ModelMixin.save_pretrained` method. The serialization and loading requirements are different for models quantized directly with the Quanto library and models quantized with Diffusers using Quanto as the backend. 
It is currently not possible to load models quantized directly with Quanto into Diffusers using `~ModelMixin.from_pretrained`.

```python
import torch
from diffusers import FluxTransformer2DModel, QuantoConfig

model_id = "black-forest-labs/FLUX.1-dev"
# placeholder directory; replace it with your own save location
save_path = "path/to/quantized_model"

quantization_config = QuantoConfig(weights_dtype="float8")
transformer = FluxTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)
# save quantized model to reuse
transformer.save_pretrained(save_path)

# you can reload your quantized model with
model = FluxTransformer2DModel.from_pretrained(save_path)
```

## Using `torch.compile` with Quanto

Currently the Quanto backend supports `torch.compile` for the following quantization types:

- `int8` weights

```python
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, QuantoConfig

model_id = "black-forest-labs/FLUX.1-dev"
torch_dtype = torch.bfloat16

quantization_config = QuantoConfig(weights_dtype="int8")
transformer = FluxTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch_dtype,
)
transformer = torch.compile(transformer, mode="max-autotune", fullgraph=True)

pipe = FluxPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch_dtype
)
pipe.to("cuda")
image = pipe("A cat holding a sign that says hello").images[0]
image.save("flux-quanto-compile.png")
```

## Supported Quantization Types

### Weights

- float8
- int8
- int4
- int2

================================================
FILE: docs/source/en/quantization/torchao.md
================================================

# torchao

[torchao](https://github.com/pytorch/ao) provides high-performance dtypes and optimizations based on quantization and sparsity for inference and training PyTorch models. It is supported for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.
Make sure PyTorch 2.5+ and torchao are installed with the command below.

```bash
uv pip install -U torch torchao
```

Each quantization dtype is available as a separate instance of an [AOBaseConfig](https://docs.pytorch.org/ao/main/api_ref_quantization.html#inference-apis-for-quantize) class. This provides more flexible configuration options by exposing more available arguments. Pass the `AOBaseConfig` of a quantization dtype, like [Int4WeightOnlyConfig](https://docs.pytorch.org/ao/main/generated/torchao.quantization.Int4WeightOnlyConfig), to [`TorchAoConfig`] in [`~ModelMixin.from_pretrained`].

```py
import torch
from diffusers import DiffusionPipeline, PipelineQuantizationConfig, TorchAoConfig
from torchao.quantization import Int8WeightOnlyConfig

pipeline_quant_config = PipelineQuantizationConfig(
    quant_mapping={"transformer": TorchAoConfig(Int8WeightOnlyConfig(group_size=128))}
)
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
    device_map="cuda"
)
```

For simple use cases, you could also provide a string identifier in [`TorchAoConfig`] as shown below.

```py
import torch
from diffusers import DiffusionPipeline, PipelineQuantizationConfig, TorchAoConfig

pipeline_quant_config = PipelineQuantizationConfig(
    quant_mapping={"transformer": TorchAoConfig("int8wo")}
)
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
    device_map="cuda"
)
```

## torch.compile

torchao supports [torch.compile](../optimization/fp16#torchcompile) which can speed up inference with one line of code.
```python
import torch
from diffusers import DiffusionPipeline, PipelineQuantizationConfig, TorchAoConfig
from torchao.quantization import Int4WeightOnlyConfig

pipeline_quant_config = PipelineQuantizationConfig(
    quant_mapping={"transformer": TorchAoConfig(Int4WeightOnlyConfig(group_size=128))}
)
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
    device_map="cuda"
)

# compile the transformer in place
pipeline.transformer.compile(mode="max-autotune", fullgraph=True)
```

Refer to this [table](https://github.com/huggingface/diffusers/pull/10009#issue-2688781450) for inference speed and memory usage benchmarks with Flux and CogVideoX. More benchmarks on various hardware are also available in the torchao [repository](https://github.com/pytorch/ao/tree/main/torchao/quantization#benchmarks).

> [!TIP]
> The FP8 post-training quantization schemes in torchao are effective for GPUs with compute capability of at least 8.9 (RTX-4090, Hopper, etc.). FP8 often provides the best speed, memory, and quality trade-off when generating images and videos. We recommend combining FP8 and torch.compile if your GPU is compatible.

## Supported quantization types

torchao supports weight-only quantization and weight and dynamic-activation quantization for int8, float3-float8, and uint1-uint7.

Weight-only quantization stores the model weights in a specific low-bit data type but performs computation with a higher-precision data type, like `bfloat16`. This lowers the memory requirements from model weights but retains the memory peaks for activation computation.

Dynamic activation quantization stores the model weights in a low-bit dtype, while also quantizing the activations on-the-fly to save additional memory. This lowers the memory requirements from model weights, while also lowering the memory overhead from activation computations.
However, this may come at a quality tradeoff at times, so it is recommended to test different models thoroughly. The quantization methods supported are as follows: | **Category** | **Full Function Names** | **Shorthands** | |--------------|-------------------------|----------------| | **Integer quantization** | `int4_weight_only`, `int8_dynamic_activation_int4_weight`, `int8_weight_only`, `int8_dynamic_activation_int8_weight` | `int4wo`, `int4dq`, `int8wo`, `int8dq` | | **Floating point 8-bit quantization** | `float8_weight_only`, `float8_dynamic_activation_float8_weight`, `float8_static_activation_float8_weight` | `float8wo`, `float8wo_e5m2`, `float8wo_e4m3`, `float8dq`, `float8dq_e4m3`, `float8dq_e4m3_tensor`, `float8dq_e4m3_row` | | **Floating point X-bit quantization** | `fpx_weight_only` | `fpX_eAwB` where `X` is the number of bits (1-7), `A` is exponent bits, and `B` is mantissa bits. Constraint: `X == A + B + 1` | | **Unsigned Integer quantization** | `uintx_weight_only` | `uint1wo`, `uint2wo`, `uint3wo`, `uint4wo`, `uint5wo`, `uint6wo`, `uint7wo` | Some quantization methods are aliases (for example, `int8wo` is the commonly used shorthand for `int8_weight_only`). This allows using the quantization methods described in the torchao docs as-is, while also making it convenient to remember their shorthand notations. Refer to the [official torchao documentation](https://docs.pytorch.org/ao/stable/index.html) for a better understanding of the available quantization methods and the exhaustive list of configuration options available. ## Serializing and Deserializing quantized models To serialize a quantized model in a given dtype, first load the model with the desired quantization dtype and then save it using the [`~ModelMixin.save_pretrained`] method. 
```python import torch from diffusers import AutoModel, TorchAoConfig quantization_config = TorchAoConfig("int8wo") transformer = AutoModel.from_pretrained( "black-forest-labs/Flux.1-Dev", subfolder="transformer", quantization_config=quantization_config, torch_dtype=torch.bfloat16, ) transformer.save_pretrained("/path/to/flux_int8wo", safe_serialization=False) ``` To load a serialized quantized model, use the [`~ModelMixin.from_pretrained`] method. ```python import torch from diffusers import FluxPipeline, AutoModel transformer = AutoModel.from_pretrained("/path/to/flux_int8wo", torch_dtype=torch.bfloat16, use_safetensors=False) pipe = FluxPipeline.from_pretrained("black-forest-labs/Flux.1-Dev", transformer=transformer, torch_dtype=torch.bfloat16) pipe.to("cuda") prompt = "A cat holding a sign that says hello world" image = pipe(prompt, num_inference_steps=30, guidance_scale=7.0).images[0] image.save("output.png") ``` If you are using `torch<=2.6.0`, some quantization methods, such as `uint4wo`, cannot be loaded directly and may result in an `UnpicklingError` when trying to load the models, but work as expected when saving them. In order to work around this, one can load the state dict manually into the model. Note, however, that this requires using `weights_only=False` in `torch.load`, so it should be run only if the weights were obtained from a trustable source. ```python import torch from accelerate import init_empty_weights from diffusers import FluxPipeline, AutoModel, TorchAoConfig # Serialize the model transformer = AutoModel.from_pretrained( "black-forest-labs/Flux.1-Dev", subfolder="transformer", quantization_config=TorchAoConfig("uint4wo"), torch_dtype=torch.bfloat16, ) transformer.save_pretrained("/path/to/flux_uint4wo", safe_serialization=False, max_shard_size="50GB") # ... 
# Load the model state_dict = torch.load("/path/to/flux_uint4wo/diffusion_pytorch_model.bin", weights_only=False, map_location="cpu") with init_empty_weights(): transformer = AutoModel.from_config("/path/to/flux_uint4wo/config.json") transformer.load_state_dict(state_dict, strict=True, assign=True) ``` > [!TIP] > The [`AutoModel`] API is supported for PyTorch >= 2.6 as shown in the examples below. ## Resources - [TorchAO Quantization API](https://docs.pytorch.org/ao/stable/index.html) - [Diffusers-TorchAO examples](https://github.com/sayakpaul/diffusers-torchao) ================================================ FILE: docs/source/en/quicktour.md ================================================ # Quickstart Diffusers is a library for developers and researchers that provides an easy inference API for generating images, videos and audio, as well as the building blocks for implementing new workflows. Diffusers provides many optimizations out-of-the-box that makes it possible to load and run large models on setups with limited memory or to accelerate inference. This Quickstart will give you an overview of Diffusers and get you up and generating quickly. > [!TIP] > Before you begin, make sure you have a Hugging Face [account](https://huggingface.co/join) in order to use gated models like [Flux](https://huggingface.co/black-forest-labs/FLUX.1-dev). Follow the [Installation](./installation) guide to install Diffusers if it's not already installed. ## DiffusionPipeline A diffusion model combines multiple components to generate outputs in any modality based on an input, such as a text description, image or both. For a standard text-to-image model: 1. A text encoder turns a prompt into embeddings that guide the denoising process. Some models have more than one text encoder. 2. A scheduler contains the algorithmic specifics for gradually denoising initial random noise into clean outputs. Different schedulers affect generation speed and quality. 3. 
A UNet or diffusion transformer (DiT) is the workhorse of a diffusion model. At each step, it performs the denoising predictions, such as how much noise to remove or the general direction in which to steer the noise to generate better quality outputs. The UNet or DiT repeats this loop for a set amount of steps to generate the final output. 4. A variational autoencoder (VAE) encodes and decodes pixels to a spatially compressed latent-space. *Latents* are compressed representations of an image and are more efficient to work with. The UNet or DiT operates on latents, and the clean latents at the end are decoded back into images. The [`DiffusionPipeline`] packages all these components into a single class for inference. There are several arguments in [`~DiffusionPipeline.__call__`] you can change, such as `num_inference_steps`, that affect the diffusion process. Try different values and arguments to see how they change generation quality or speed. Load a model with [`~DiffusionPipeline.from_pretrained`] and describe what you'd like to generate. The example below uses the default argument values. Use `.images[0]` to access the generated image output. ```py import torch from diffusers import DiffusionPipeline pipeline = DiffusionPipeline.from_pretrained( "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda" ) prompt = """ cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain """ pipeline(prompt).images[0] ``` Use `.frames[0]` to access the generated video output and [`~utils.export_to_video`] to save the video. 
```py
import torch
from diffusers import AutoencoderKLWan, DiffusionPipeline
from diffusers.utils import export_to_video

vae = AutoencoderKLWan.from_pretrained(
  "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
  subfolder="vae",
  torch_dtype=torch.float32
)
pipeline = DiffusionPipeline.from_pretrained(
  "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
  vae=vae,
  torch_dtype=torch.bfloat16,
  device_map="cuda"
)

prompt = """
Cinematic video of a sleek cat lounging on a colorful inflatable in a crystal-clear turquoise pool in Palm Springs,
sipping a salt-rimmed margarita through a straw. Golden-hour sunlight glows over mid-century modern homes and swaying palms.
Shot in rich Sony a7S III: with moody, glamorous color grading, subtle lens flares, and soft vintage film grain.
Ripples shimmer as a warm desert breeze stirs the water, blending luxury and playful charm in an epic, gorgeously composed frame.
"""
video = pipeline(prompt=prompt, num_frames=81, num_inference_steps=40).frames[0]
export_to_video(video, "output.mp4", fps=16)
```

## LoRA

Adapters insert a small number of trainable parameters into the original base model. Only the inserted parameters are fine-tuned while the rest of the model weights remain frozen. This makes it fast and cheap to fine-tune a model on a new style. Among adapters, [LoRAs](./tutorials/using_peft_for_inference) are the most popular.

Add a LoRA to a pipeline with the [`~loaders.QwenImageLoraLoaderMixin.load_lora_weights`] method. Some LoRAs require a special trigger word in the prompt, such as `Realism` in the example below. Check a LoRA's model card to see if it requires a trigger word.
```py import torch from diffusers import DiffusionPipeline pipeline = DiffusionPipeline.from_pretrained( "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda" ) pipeline.load_lora_weights( "flymy-ai/qwen-image-realism-lora", ) prompt = """ super Realism cinematic film still of a cat sipping a margarita in a pool in Palm Springs in the style of umempart, California highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain """ pipeline(prompt).images[0] ``` Check out the [LoRA](./tutorials/using_peft_for_inference) docs or Adapters section to learn more. ## Quantization [Quantization](./quantization/overview) stores data in fewer bits to reduce memory usage. It may also speed up inference because it takes less time to perform calculations with fewer bits. Diffusers provides several quantization backends and picking one depends on your use case. For example, [bitsandbytes](./quantization/bitsandbytes) and [torchao](./quantization/torchao) are both simple and easy to use for inference, but torchao supports more [quantization types](./quantization/torchao#supported-quantization-types) like fp8. Configure [`PipelineQuantizationConfig`] with the backend to use, the specific arguments (refer to the [API](./api/quantization) reference for available arguments) for that backend, and which components to quantize. The example below quantizes the model to 4-bits and only uses 14.93GB of memory. 
```py
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder"],
)
pipeline = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image",
    torch_dtype=torch.bfloat16,
    quantization_config=quant_config,
    device_map="cuda"
)

prompt = """
cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
"""
pipeline(prompt).images[0]
print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
```

Take a look at the [Quantization](./quantization/overview) section for more details.

## Optimizations

> [!TIP]
> Optimization is dependent on hardware specs such as memory. Use this [Space](https://huggingface.co/spaces/diffusers/optimized-diffusers-code) to generate code examples that include all of Diffusers' available memory and speed optimization techniques for any model you're using.

Modern diffusion models are very large and have billions of parameters. The iterative denoising process is also computationally intensive and slow. Diffusers provides techniques for reducing memory usage and boosting inference speed. These techniques can be combined with quantization to optimize for both memory usage and inference speed.

### Memory usage

The text encoders and UNet or DiT can use up as much as ~30GB of memory, exceeding the amount available on many free-tier or consumer GPUs.

Offloading stores weights that aren't currently used on the CPU and only moves them to the GPU when they're needed. There are a few offloading types and the example below uses [model offloading](./optimization/memory#model-offloading).
This moves an entire model, like a text encoder or transformer, to the CPU when it isn't actively being used.

Call [`~DiffusionPipeline.enable_model_cpu_offload`] to activate it. By combining quantization and offloading, the following example only requires ~12.54GB of memory.

```py
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder"],
)
pipeline = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image",
    torch_dtype=torch.bfloat16,
    quantization_config=quant_config,
    device_map="cuda"
)
# Keep each model on the CPU until it is needed on the GPU
pipeline.enable_model_cpu_offload()

prompt = """
cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
"""
pipeline(prompt).images[0]
# Report the peak memory allocated by tensors during generation
print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
```

Refer to the [Reduce memory usage](./optimization/memory) docs to learn more about other memory reducing techniques.

### Inference speed

The denoising loop performs a lot of computations and can be slow. Methods like [torch.compile](./optimization/fp16#torchcompile) increase inference speed by compiling the computations into an optimized kernel. Compilation is slow for the first generation but successive generations should be much faster.

The example below uses [regional compilation](./optimization/fp16#regional-compilation) to only compile small regions of a model. It reduces cold-start latency while also providing a runtime speedup.

Call [`~ModelMixin.compile_repeated_blocks`] on the model to activate it.
```py
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image",
    torch_dtype=torch.bfloat16,
    device_map="cuda"
)

# Compile only the repeated blocks of the transformer
pipeline.transformer.compile_repeated_blocks(
    fullgraph=True,
)
prompt = """
cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
"""
pipeline(prompt).images[0]
```

Check out the [Accelerate inference](./optimization/fp16) or [Caching](./optimization/cache) docs for more methods that speed up inference.

================================================
FILE: docs/source/en/stable_diffusion.md
================================================
[[open-in-colab]]

# Basic performance

Diffusion is a random process that is computationally demanding. You may need to run the [`DiffusionPipeline`] several times before getting a desired output. That's why it's important to carefully balance generation speed and memory usage in order to iterate faster.

This guide recommends some basic performance tips for using the [`DiffusionPipeline`]. Refer to the Inference Optimization section docs such as [Accelerate inference](./optimization/fp16) or [Reduce memory usage](./optimization/memory) for more detailed performance guides.

## Memory usage

Reducing the amount of memory used indirectly speeds up generation and can help a model fit on device.

The [`~DiffusionPipeline.enable_model_cpu_offload`] method moves a model to the CPU when it is not in use to save GPU memory.
```py
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.bfloat16,
    device_map="cuda"
)
pipeline.enable_model_cpu_offload()

prompt = """
cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
"""
pipeline(prompt).images[0]
# Report the peak memory allocated by tensors during generation
print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
```

## Inference speed

Denoising is the most computationally demanding process during diffusion. Methods that optimize this process accelerate inference. Try the following methods for a speedup.

- Add `device_map="cuda"` to place the pipeline on a GPU. Placing a model on an accelerator, like a GPU, increases speed because it performs computations in parallel.
- Set `torch_dtype=torch.bfloat16` to execute the pipeline in half-precision. Reducing the data type precision increases speed because it takes less time to perform computations in a lower precision.

```py
import torch
import time
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.bfloat16,
    device_map="cuda"
)
```

- Use a faster scheduler, such as [`DPMSolverMultistepScheduler`], which only requires ~20-25 steps.
- Set `num_inference_steps` to a lower value. Reducing the number of inference steps reduces the overall number of computations. However, this can result in lower generation quality.
```py
# Swap in the faster scheduler, reusing the existing scheduler's config
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)

prompt = """
cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
"""

# Time a single generation
start_time = time.perf_counter()
image = pipeline(prompt).images[0]
end_time = time.perf_counter()

print(f"Image generation took {end_time - start_time:.3f} seconds")
```

## Generation quality

Many modern diffusion models deliver high-quality images out-of-the-box. However, you can still improve generation quality by trying the following.

- Try a more detailed and descriptive prompt. Include details such as the image medium, subject, style, and aesthetic. A negative prompt may also help: words like "low quality" or "blurry" guide the model away from undesirable features.

```py
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.bfloat16,
    device_map="cuda"
)

prompt = """
cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
"""
negative_prompt = "low quality, blurry, ugly, poor details"
pipeline(prompt, negative_prompt=negative_prompt).images[0]
```

For more details about creating better prompts, take a look at the [Prompt techniques](./using-diffusers/weighted_prompts) doc.

- Try a different scheduler, like [`HeunDiscreteScheduler`] or [`LMSDiscreteScheduler`], that trades generation speed for quality.
```py
import torch
from diffusers import DiffusionPipeline, HeunDiscreteScheduler

pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.bfloat16,
    device_map="cuda"
)
pipeline.scheduler = HeunDiscreteScheduler.from_config(pipeline.scheduler.config)

prompt = """
cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
"""
negative_prompt = "low quality, blurry, ugly, poor details"
pipeline(prompt, negative_prompt=negative_prompt).images[0]
```

## Next steps

Diffusers offers more advanced and powerful optimizations such as [group-offloading](./optimization/memory#group-offloading) and [regional compilation](./optimization/fp16#regional-compilation). To learn more about how to maximize performance, take a look at the Inference Optimization section.

================================================
FILE: docs/source/en/training/adapt_a_model.md
================================================
# Adapt a model to a new task

Many diffusion systems share the same components, allowing you to adapt a pretrained model for one task to an entirely different task.

This guide will show you how to adapt a pretrained text-to-image model for inpainting by initializing and modifying the architecture of a pretrained [`UNet2DConditionModel`].

## Configure UNet2DConditionModel parameters

A [`UNet2DConditionModel`] by default accepts 4 channels in the [input sample](https://huggingface.co/docs/diffusers/v0.16.0/en/api/models#diffusers.UNet2DConditionModel.in_channels).
For example, load a pretrained text-to-image model like [`stable-diffusion-v1-5/stable-diffusion-v1-5`](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) and take a look at the number of `in_channels`:

```py
from diffusers import StableDiffusionPipeline

pipeline = StableDiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True)
pipeline.unet.config["in_channels"]
4
```

Inpainting requires 9 channels in the input sample. You can check this value in a pretrained inpainting model like [`stable-diffusion-v1-5/stable-diffusion-inpainting`](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-inpainting):

```py
from diffusers import StableDiffusionPipeline

pipeline = StableDiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-inpainting", use_safetensors=True)
pipeline.unet.config["in_channels"]
9
```

To adapt your text-to-image model for inpainting, you'll need to change the number of `in_channels` from 4 to 9.

Initialize a [`UNet2DConditionModel`] with the pretrained text-to-image model weights, and change `in_channels` to 9. Changing the number of `in_channels` means you need to set `ignore_mismatched_sizes=True` and `low_cpu_mem_usage=False` to avoid a size mismatch error because the shape is different now.

```py
from diffusers import AutoModel

model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"
unet = AutoModel.from_pretrained(
    model_id,
    subfolder="unet",
    in_channels=9,
    low_cpu_mem_usage=False,
    ignore_mismatched_sizes=True,
    use_safetensors=True,
)
```

The pretrained weights of the other components from the text-to-image model are initialized from their checkpoints, but the input channel weights (`conv_in.weight`) of the `unet` are randomly initialized. It is important to finetune the model for inpainting because otherwise the model returns noise.
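
To make the size mismatch concrete, the sketch below compares the parameter shapes of the first convolution for 4 and 9 input channels. It uses plain Python tuples rather than real weights, so it runs without downloading a checkpoint. The helper names (`conv_in_shapes`, `mismatched_keys`) are hypothetical, and the 320 output channels and 3x3 kernel are assumptions made for illustration rather than values read from a model.

```py
# Illustrative values only: 320 output channels and a 3x3 kernel are
# assumptions about the first convolution, not read from a checkpoint.
OUT_CHANNELS = 320
KERNEL_SIZE = 3


def conv_in_shapes(in_channels):
    """Return the parameter shapes of a 2D convolution with `in_channels` inputs."""
    return {
        "conv_in.weight": (OUT_CHANNELS, in_channels, KERNEL_SIZE, KERNEL_SIZE),
        "conv_in.bias": (OUT_CHANNELS,),
    }


def mismatched_keys(checkpoint_shapes, model_shapes):
    """List parameters whose checkpoint shape differs from the model's shape."""
    return [
        name
        for name, shape in model_shapes.items()
        if checkpoint_shapes.get(name) != shape
    ]


checkpoint = conv_in_shapes(in_channels=4)  # pretrained text-to-image UNet
model = conv_in_shapes(in_channels=9)       # UNet reconfigured for inpainting

print(mismatched_keys(checkpoint, model))  # ['conv_in.weight']
```

Under these assumptions only `conv_in.weight` differs, which matches the parameter that ends up randomly initialized. The bias keeps its shape because it depends only on the number of output channels.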
================================================
FILE: docs/source/en/training/cogvideox.md
================================================
# CogVideoX

CogVideoX is a text-to-video generation model focused on creating more coherent videos aligned with a prompt. It achieves this using several methods.

- a 3D variational autoencoder that compresses videos spatially and temporally, improving compression rate and video accuracy.
- an expert transformer block to help align text and video, and a 3D full attention module for capturing and creating spatially and temporally accurate videos.

Testing across video instruction dimensions found that CogVideoX performs well on consistent theme, dynamic information, consistent background, object information, smooth motion, color, scene, appearance style, and temporal style, but performs poorly on human action, spatial relationship, and multiple objects.

Finetuning with Diffusers can help make up for these poor results.

## Data Preparation

The training script accepts data in two formats. The first format is suited for small-scale training, and the second format uses a CSV format, which is more appropriate for streaming data for large-scale training. In the future, Diffusers will support the `