Large-scale training of latent diffusion models (LDMs) has enabled
unprecedented quality in image generation. However, the key components of the
best-performing LDM training recipes are often unavailable to the
research community, preventing apples-to-apples comparisons and hindering the
validation of progress in the field. In this work, we perform an in-depth study
of LDM training recipes focusing on the performance of models and their
training efficiency. To ensure apples-to-apples comparisons, we re-implement five
previously published models with their corresponding recipes. We study
(i) how the mechanisms used to condition the generative model on semantic
information (e.g., text prompt) and control metadata (e.g., crop size, random
flip flag) affect model performance, and (ii) how transferring representations
learned on smaller, lower-resolution datasets to larger ones affects training
efficiency and model performance. We then propose a novel conditioning mechanism that disentangles
semantic and control metadata conditioning and sets a new state of the art in
class-conditional generation on the ImageNet-1k dataset, with FID
improvements of 7% at 256 and 8% at 512 resolution, as well as in text-to-image
generation on the CC12M dataset, with FID improvements of 8% at 256 and 23%
at 512 resolution.
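
For readers comparing the percentages above, the following is a minimal sketch of how such an improvement could be computed. It assumes the figures are relative reductions in FID with respect to a baseline (lower FID is better), which the text does not state explicitly. The function name `relative_fid_improvement` and the numbers in the usage line are hypothetical and are not results from this work.

```python
def relative_fid_improvement(baseline_fid: float, new_fid: float) -> float:
    """Return the percentage reduction in FID relative to a baseline.

    Assumption: "improvement" means (baseline - new) / baseline, since
    lower FID indicates better sample quality.
    """
    if baseline_fid <= 0:
        raise ValueError("baseline_fid must be positive")
    return 100.0 * (baseline_fid - new_fid) / baseline_fid


if __name__ == "__main__":
    # Hypothetical FID values, chosen only to illustrate the arithmetic.
    print(relative_fid_improvement(4.0, 3.5))
```

Under this reading, a negative return value would indicate that the new model has a worse (higher) FID than the baseline.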