Apple Research is generating images with a forgotten AI technique - 9to5Mac

Today, most generative image models fall into two main categories: diffusion models, like Stable Diffusion, or autoregressive models, like OpenAI’s GPT-4o. But Apple just released two papers that show there might be room for a third, forgotten technique: Normalizing Flows. And with a dash of Transformers on top, they might be more capable than previously thought.

First things first: What are Normalizing Flows?

Normalizing Flows (NFs) are a type of AI model that learns to mathematically transform real-world data (like images) into structured noise, and then reverses that process to generate new samples. The big advantage is that they can calculate the exact likelihood of each image they generate, something diffusion models can’t do. This makes flows especially appealing for tasks where understanding the probability of an outcome really matters.
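That exact-likelihood property comes from the change-of-variables formula: because the transform to noise is invertible, the model can account precisely for how it stretches or squeezes probability. Here is a minimal sketch in Python using a hypothetical 1-D affine flow; the parameters `MU` and `SIGMA` are made up for illustration and have nothing to do with Apple's actual models:

```python
import math

# Toy 1-D affine flow: z = (x - MU) / SIGMA maps data to standard normal noise.
# MU and SIGMA are hand-picked illustrative values, not learned parameters.
MU, SIGMA = 0.5, 2.0

def log_prob(x):
    """Exact log-likelihood of x via the change-of-variables formula."""
    z = (x - MU) / SIGMA                              # forward: data -> noise
    log_pz = -0.5 * (z * z + math.log(2 * math.pi))   # standard normal log-density
    log_det = -math.log(SIGMA)                        # log |dz/dx| correction term
    return log_pz + log_det

def sample(z):
    """Inverse transform: turn a noise value back into a data value."""
    return z * SIGMA + MU

print(log_prob(1.0))
print(sample(0.25))
```

Real flows stack many learned, invertible layers, but the bookkeeping is the same: the exact log-likelihood is the base density at the noise value plus the log-Jacobian term, something diffusion models can only approximate.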

But there’s a reason most people haven’t heard much about them lately: early flow-based models produced images that looked blurry or lacked the detail and diversity offered by diffusion and transformer-based systems.

Study #1: TarFlow

In the paper “Normalizing Flows are Capable Generative Models”, Apple introduces a new model called TarFlow, short for Transformer AutoRegressive Flow. At its core, TarFlow replaces the old, handcrafted layers used in previous flow models with Transformer blocks.

Basically, it splits images into small patches and generates them in blocks, with each block predicted based on all the ones that came before. That’s what’s called autoregressive, which is the same underlying method that OpenAI currently uses for image generation. The key difference is that while OpenAI generates discrete tokens, treating images like long sequences of text-like symbols, Apple’s TarFlow generates pixel values directly, without tokenizing the image first.
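The patch-by-patch scheme can be sketched as follows. This is a toy stand-in, not TarFlow itself: the `mean_of_past` predictor and the tiny 2×2 patch grid are invented for illustration, and the real model's blocks are Transformer-based flow layers:

```python
import random

# Toy sketch of block-autoregressive patch generation (illustrative only;
# TarFlow's real predictor is a stack of Transformer flow blocks).
PATCH, GRID = 4, 2  # hypothetical image: a 2x2 grid of 4x4-pixel patches

def mean_of_past(patches):
    """Stand-in predictor: pixel-wise mean of all previously generated patches."""
    if not patches:
        return [[0.0] * PATCH for _ in range(PATCH)]
    n = len(patches)
    return [[sum(p[i][j] for p in patches) / n for j in range(PATCH)]
            for i in range(PATCH)]

def generate(rng):
    """Generate patches one block at a time, each conditioned on all earlier ones."""
    patches = []
    for _ in range(GRID * GRID):
        base = mean_of_past(patches)
        # continuous pixel values plus noise -- no discrete token vocabulary
        patches.append([[base[i][j] + rng.gauss(0, 1) for j in range(PATCH)]
                        for i in range(PATCH)])
    return patches

patches = generate(random.Random(0))
print(len(patches), len(patches[0]), len(patches[0][0]))
```

The shape of the loop is the important part: each new block sees everything generated before it, but the values stay continuous rather than being snapped to a fixed codebook.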

That difference between tokens and raw pixels is small but significant, because it lets Apple avoid the quality loss and rigidity that often come with compressing images into a fixed vocabulary of tokens. Still, there were limitations, especially when it came to scaling up to larger, high-res images. And that’s where the second study comes in.

Study #2: STARFlow

In the paper “STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis”, Apple builds directly on TarFlow and presents STARFlow (Scalable Transformer AutoRegressive Flow), with key upgrades. The biggest change: STARFlow no longer generates images directly in pixel space. Instead, it works on a compressed version of the image, then hands things off to a decoder that upsamples everything back to full resolution at the final step.
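That hand-off can be sketched in miniature. Everything here is a made-up stand-in: STARFlow's decoder is a learned network, whereas this sketch just repeats each latent value with nearest-neighbour upsampling, and the 8×8 latent size is invented for illustration:

```python
import random

# Toy sketch of latent-space generation (illustrative only; STARFlow's
# decoder is learned, not nearest-neighbour repetition).
LATENT, SCALE = 8, 4  # hypothetical 8x8 latent decoded to a 32x32 image

def decode(latent):
    """Upsample each latent value into a SCALE x SCALE block of pixels."""
    image = []
    for row in latent:
        pixel_row = [v for v in row for _ in range(SCALE)]
        image.extend([pixel_row[:] for _ in range(SCALE)])
    return image

rng = random.Random(1)
# what the flow model would generate: a small grid of continuous latent values
latent = [[rng.gauss(0, 1) for _ in range(LATENT)] for _ in range(LATENT)]
image = decode(latent)
print(len(latent) * len(latent[0]), len(image), len(image[0]))
```

The point is the ratio: the generator only has to produce 64 latent values, while the decoder fills in the remaining detail for all 1,024 pixels.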

This shift to what is called latent space means STARFlow doesn’t need to predict millions of pixels directly. It can focus on the broader image structure first, leaving fine texture detail to the decoder. Apple also reworked how the model handles text prompts.

Instead of building a separate text encoder, STARFlow can plug in existing language models (like Google’s small language model Gemma, which in theory could run on-device) to handle language understanding when a user prompts for an image. This keeps the image-generation side of the model focused on refining visual details.

How STARFlow compares with OpenAI’s 4o image generator

While Apple is rethinking flows, OpenAI has also recently moved beyond diffusion with its GPT-4o model.

But their approach is fundamentally different. GPT-4o treats images as sequences of discrete tokens, much like words in a sentence. When you ask ChatGPT to generate an image, the model predicts one image token at a time, building the picture piece by piece.
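A toy sketch of that sequential process, with everything invented for illustration (the 256-entry codebook and the uniform sampling are stand-ins; a real model conditions each token on everything generated so far):

```python
import random

# Toy sketch of discrete token-by-token image generation (illustrative only;
# the codebook size and grid are made-up stand-ins, not GPT-4o's actual setup).
VOCAB_SIZE, GRID = 256, 16  # hypothetical codebook; 16x16 grid of image tokens

def generate_tokens(rng):
    tokens = []
    for _ in range(GRID * GRID):
        # a real model would condition on `tokens` so far; we sample uniformly
        tokens.append(rng.randrange(VOCAB_SIZE))
    return tokens

tokens = generate_tokens(random.Random(0))
print(len(tokens))
```

Even this tiny 16×16 grid takes 256 strictly sequential steps, which hints at why token-by-token generation gets slow at high resolutions.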

This gives OpenAI enormous flexibility: the same model can generate text, images, and audio within a single, unified token stream. The tradeoff? Token-by-token generation can be slow, especially for large or high-resolution images. And it’s extremely computationally expensive.

But since GPT-4o runs entirely in the cloud, OpenAI isn’t as constrained by latency or power use. In short: both Apple and OpenAI are moving beyond diffusion, but while OpenAI is building for its data centers, Apple is clearly building for our pockets.

