
Image Generation at Different Detail Level: Scaling Skip Connections in ViT-based Diffusion Models

Abstract

Denoising diffusion probabilistic models (DDPMs) excel at image generation, but users have limited control over the level of detail and semantic richness of the generated images. Although prompt-based diffusion models can produce more detailed images from descriptive prompts and use spatial masks to preserve unedited regions, they frequently overlook these constraints, leading to inconsistent image regions. Inspired by the observation that each feature level in a transformer encodes different semantic information, we propose an inference-time feature scaling method for a ViT-based diffusion model, U-ViT. Our experiments on CIFAR-10 indicate that this scaling approach effectively adjusts the level of detail in generated images.
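The abstract describes scaling U-ViT's long skip connections at inference time to control detail. The sketch below illustrates the general idea with a toy stand-in: the first half of a block stack stores its outputs as long skips, and the second half fuses each skip back in, multiplied by a `skip_scale` factor. The block internals, weights, and the `skip_scale` name are illustrative placeholders, not the paper's actual architecture or parameters.

```python
import numpy as np

def toy_uvit_forward(x, depth=4, skip_scale=1.0, seed=0):
    """Toy U-ViT-style forward pass with scalable long skip connections.

    The first depth//2 blocks stash their outputs as long skips; the
    remaining blocks fuse each skip back in, scaled by `skip_scale`
    (a hypothetical inference-time knob mirroring the feature scaling
    the abstract describes). Weights are random placeholders.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[-1]
    skips = []
    h = x
    # "Encoder" half: transformer blocks stood in for by random linear maps.
    for _ in range(depth // 2):
        W = rng.standard_normal((d, d)) / np.sqrt(d)
        h = np.tanh(h @ W)
        skips.append(h)
    # "Decoder" half: concatenate each scaled long skip (deepest first),
    # then project back to width d, as U-ViT does with a linear layer.
    for _ in range(depth // 2):
        skip = skips.pop()
        fused = np.concatenate([h, skip_scale * skip], axis=-1)
        W = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
        h = np.tanh(fused @ W)
    return h

tokens = np.ones((16, 8))  # 16 tokens of width 8
full = toy_uvit_forward(tokens, skip_scale=1.0)
damped = toy_uvit_forward(tokens, skip_scale=0.5)
```

Because the same seed fixes the weights across both calls, the only difference between `full` and `damped` is the skip scale, isolating the lever the paper adjusts to change the detail level of samples.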