Denoising diffusion probabilistic models (DDPMs)
excel in image generation, but users have limited
control over the level of detail and semantic richness
in generated images. Although prompt-based
diffusion models can create more detailed images
with descriptive prompts and utilize spatial masks
to preserve unedited regions, they frequently
overlook these constraints, producing inconsistent
image regions. Inspired by transformers, in which
each feature level encodes different semantic
information, we propose a feature-scaling method
at inference for a ViT-based diffusion model, U-ViT.
Our experiments on CIFAR-10 indicate that this scaling
approach effectively adjusts the level of detail in
generated images.
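As a rough illustration of the idea (not the paper's actual implementation, whose details are not given here), the sketch below mimics U-ViT-style long skip connections and multiplies each skipped feature by a per-level scale at inference; all block and parameter names are hypothetical, and scales of 1.0 recover the unmodified forward pass.

```python
def uvit_like_forward(x, enc_blocks, dec_blocks, skip_scales):
    """Toy stand-in for inference-time feature scaling in a
    U-ViT-like architecture (hypothetical simplification).

    Each encoder block's output is stored as a long skip; each
    decoder block receives its input plus the matching skip
    multiplied by a per-level scale. Varying `skip_scales` varies
    how strongly shallow (detail-rich) features reach the output.
    """
    skips = []
    for block in enc_blocks:
        x = block(x)
        skips.append(x)  # store shallow feature for the long skip
    for block, s in zip(dec_blocks, skip_scales):
        # scale the stored shallow feature before merging it back in
        x = block(x + s * skips.pop())
    return x
```

With all scales set to 1.0 this reduces to a plain skip-connected pass, so the scaling acts purely as an inference-time knob and requires no retraining.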