Google’s MobileDiffusion generates AI images on mobile devices in less than a second


Google’s MobileDiffusion is a fast and efficient way to create images from text on smartphones.

MobileDiffusion is Google’s latest development in text-to-image generation. Designed specifically for smartphones, the diffusion model generates high-quality images from text input in less than a second.

With a model size of only 520 million parameters, it is significantly smaller than models with billions of parameters such as Stable Diffusion and SDXL, making it more suitable for use on mobile devices.
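To put the parameter count in perspective, here is a rough back-of-the-envelope calculation of how much storage the weights alone would need. The numeric formats are assumptions for illustration; the article does not say which precision is used on the phone, and the helper name `weight_gigabytes` is made up for this sketch.

```python
# Back-of-the-envelope weight storage for a model of a given size.
# The byte widths are standard numeric formats; which one an on-device
# deployment actually uses is an assumption, not something stated here.

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_gigabytes(num_params: int, dtype: str) -> float:
    """Return the storage needed for the weights alone, in decimal GB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


mobile_diffusion_params = 520_000_000  # figure quoted in the article

for dtype in BYTES_PER_PARAM:
    size = weight_gigabytes(mobile_diffusion_params, dtype)
    print(f"{dtype}: {size:.2f} GB")
```

At 16-bit precision this comes to roughly 1 GB of weights, before counting activations or any runtime overhead. A model with several billion parameters scales the same figure up proportionally.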

The researchers’ tests show that MobileDiffusion can generate images with a resolution of 512 x 512 pixels in about half a second on both Android smartphones and iPhones. The output is continuously updated as you type, as Google’s demo video shows.



Video: Google

MobileDiffusion consists of three main components: a text encoder, a diffusion network, and an image decoder.

The diffusion network is a UNet built from transformer blocks. Each block contains a self-attention layer, a cross-attention layer, and a feed-forward layer, which are crucial for text comprehension in diffusion models.

However, these transformer blocks are computationally expensive and resource intensive. Google therefore uses a so-called UViT architecture, which places more of the transformer blocks at the low-dimensional bottleneck of the UNet, where attention is cheaper to compute.
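Why the position of the transformer blocks matters can be seen with a rough cost estimate. The sketch below counts only the two large matrix products inside self-attention and ignores projections and softmax. The feature-map sizes (64 x 64 and 16 x 16) and the channel width are illustrative assumptions, not figures from Google, and the channel width is held constant purely to isolate the effect of the token count.

```python
# Rough cost of the two matrix products inside self-attention
# (Q @ K^T and attention @ V), ignoring projections and softmax.
# Feature-map sizes and channel width are illustrative assumptions.


def self_attention_flops(height: int, width: int, channels: int) -> int:
    """Approximate FLOPs for one self-attention layer on a feature map."""
    tokens = height * width  # one token per spatial position
    scores = 2 * tokens * tokens * channels  # (tokens x c) @ (c x tokens)
    output = 2 * tokens * tokens * channels  # (tokens x tokens) @ (tokens x c)
    return scores + output


high_res = self_attention_flops(64, 64, channels=320)
bottleneck = self_attention_flops(16, 16, channels=320)

# Cost ratio between the high-resolution map and the bottleneck.
print(high_res // bottleneck)
```

Under these assumptions the script prints 256: the cost grows with the square of the number of tokens, so a feature map that is four times smaller on each side makes attention 256 times cheaper.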

In addition, distillation and a hybrid of diffusion and a Generative Adversarial Network (GAN) are used to cut sampling down to between one and eight steps.
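The data flow through the three components, with a configurable number of sampling steps, can be sketched in a few lines. This is a toy with stand-in functions, not Google's implementation: all names (`encode_text`, `denoise_step`, `decode_image`, `generate`) and their bodies are hypothetical and only mirror the order in which a latent diffusion pipeline calls its parts.

```python
import random

# Toy stand-ins for the three components. Every function here is
# hypothetical and mirrors only the data flow, not the real networks.


def encode_text(prompt: str) -> list[float]:
    """Text encoder stand-in: map a prompt to a fixed-length embedding."""
    rng = random.Random(prompt)  # deterministic per prompt
    return [rng.uniform(-1.0, 1.0) for _ in range(8)]


def denoise_step(latent: list[float], text_emb: list[float],
                 step: int, num_steps: int) -> list[float]:
    """Diffusion-network stand-in: pull the latent toward the embedding."""
    weight = 1.0 / (num_steps - step)  # the last step uses weight 1.0
    return [x + weight * (t - x) for x, t in zip(latent, text_emb)]


def decode_image(latent: list[float]) -> list[int]:
    """Image decoder stand-in: map latent values in [-1, 1] to 0..255."""
    return [round((max(-1.0, min(1.0, x)) + 1.0) * 127.5) for x in latent]


def generate(prompt: str, num_steps: int = 8, seed: int = 0) -> list[int]:
    """Run text encoder, then num_steps denoising passes, then the decoder."""
    if not 1 <= num_steps <= 8:
        raise ValueError("this sketch only covers one to eight steps")
    text_emb = encode_text(prompt)
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in text_emb]  # start from noise
    for step in range(num_steps):  # one network evaluation per step
        latent = denoise_step(latent, text_emb, step, num_steps)
    return decode_image(latent)


print(generate("a cat wearing a hat", num_steps=1))
```

The point of the sketch is the loop: the diffusion network is the only component that runs once per sampling step, so going from dozens of steps to between one and eight removes most of the work per image.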

