
Why AI language models choke on too much text​


Compute costs scale with the square of the input size. That's not great.

Timothy B. Lee – Dec 20, 2024 8:00 AM



Credit: Aurich Lawson | Getty Images

Large language models represent text using tokens, each of which is a few characters. Short words are represented by a single token (like "the" or "it"), whereas larger words may be represented by several tokens (GPT-4o represents "indivisible" with "ind," "iv," and "isible").
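To see this in practice, here is a small sketch using OpenAI's open-source tiktoken tokenizer (my choice of tool, not something the article relies on); the exact splits you see for a given word may differ from the example above.

```python
# Minimal tokenization sketch using tiktoken. "o200k_base" is the encoding
# associated with GPT-4o; the specific token boundaries depend on the encoding.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

for word in ["the", "it", "indivisible"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]
    print(word, "->", pieces)

# Short, common words typically map to a single token;
# rarer or longer words get split into several pieces.
```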

When OpenAI released ChatGPT two years ago, it had a memory—known as a context window—of just 8,192 tokens. That works out to roughly 6,000 words of text. This meant that if you fed it more than about 15 pages of text, it would “forget” information from the beginning of its context. This limited the size and complexity of tasks ChatGPT could handle.

Today’s LLMs are far more capable:

  • OpenAI’s GPT-4o can handle 128,000 tokens (about 200 pages of text).
  • Anthropic’s Claude 3.5 Sonnet can accept 200,000 tokens (about 300 pages of text).
  • Google’s Gemini 1.5 Pro allows 2 million tokens (about 2,000 pages of text).

Still, it’s going to take a lot more progress if we want AI systems with human-level cognitive abilities.

Many people envision a future where AI systems are able to do many—perhaps most—of the jobs performed by humans. Yet human workers read and hear hundreds of millions of words over the course of their careers—and they absorb even more information from the sights, sounds, and smells of the world around them. To achieve human-level intelligence, AI systems will need the capacity to absorb similar quantities of information.

Right now the most popular way to build an LLM-based system to handle large amounts of information is called retrieval-augmented generation (RAG). These systems try to find documents relevant to a user’s query and then insert the most relevant documents into an LLM’s context window.

This sometimes works better than a conventional search engine, but today’s RAG systems leave a lot to be desired. They only produce good results if the system puts the most relevant documents into the LLM’s context. But the mechanism used to find those documents—often, searching in a vector database—is not very sophisticated. If the user asks a complicated or confusing question, there’s a good chance the RAG system will retrieve the wrong documents and the chatbot will return the wrong answer.

And RAG doesn’t enable an LLM to reason in more sophisticated ways over large numbers of documents:

  • A lawyer might want an AI system to review and summarize hundreds of thousands of emails.
  • An engineer might want an AI system to analyze thousands of hours of camera footage from a factory floor.
  • A medical researcher might want an AI system to identify trends in tens of thousands of patient records.

Each of these tasks could easily require more than 2 million tokens of context. Moreover, we’re not going to want our AI systems to start with a clean slate after doing one of these jobs. We will want them to gain experience over time, just like human workers do.

Superhuman memory and stamina have long been key selling points for computers. We’re not going to want to give them up in the AI age. Yet today’s LLMs are distinctly subhuman in their ability to absorb and understand large quantities of information.

It’s true, of course, that LLMs absorb superhuman quantities of information at training time. The latest AI models have been trained on trillions of tokens—far more than any human will read or hear. But a lot of valuable information is proprietary, time-sensitive, or otherwise not available for training.

So we’re going to want AI models to read and remember far more than 2 million tokens at inference time. And that won’t be easy.

The key innovation behind transformer-based LLMs is attention, a mathematical operation that allows a model to “think about” previous tokens. (Check out our LLM explainer if you want a detailed explanation of how this works.) Before an LLM generates a new token, it performs an attention operation that compares the latest token to every previous token. This means that conventional LLMs get less and less efficient as the context grows.
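Stripped of every optimization, that comparison looks something like this toy single-head sketch in NumPy. It is illustrative only, not any production model's implementation.

```python
# Toy single-head attention: score the latest token's query against the key of
# every previous token, then return a weighted mix of the previous values.
import numpy as np

def attend(query, keys, values):
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)      # one score per previous token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over previous tokens
    return weights @ values                 # weighted mix of previous values

# Generating token n+1 means scoring against all n previous tokens,
# which is why the per-token cost grows with context length.
context_len, d_model = 1000, 64
keys = np.random.randn(context_len, d_model)
values = np.random.randn(context_len, d_model)
query = np.random.randn(d_model)
print(attend(query, keys, values).shape)    # (64,)
```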

Lots of people are working on ways to solve this problem—I’ll discuss some of them later in this article. But first I should explain how we ended up with such an unwieldy architecture.

GPUs made deep learning possible​


The “brains” of personal computers are central processing units (CPUs). Traditionally, chipmakers made CPUs faster by increasing the frequency of the clock that acts as a CPU’s heartbeat. But in the early 2000s, overheating forced chipmakers to mostly abandon this technique.

Chipmakers started making CPUs that could execute more than one instruction at a time. But they were held back by a programming paradigm that requires instructions to mostly be executed in order.

A new architecture was needed to take full advantage of Moore’s Law. Enter Nvidia.

In 1999, Nvidia started selling graphics processing units (GPUs) to speed up the rendering of three-dimensional games like Quake III Arena. The job of these PC add-on cards was to rapidly draw thousands of triangles that made up walls, weapons, monsters, and other objects in a game.

This is not a sequential programming task: triangles in different areas of the screen can be drawn in any order. So rather than having a single processor that executed instructions one at a time, Nvidia’s first GPU had a dozen specialized cores—effectively tiny CPUs—that worked in parallel to paint a scene.

Over time, Moore’s Law enabled Nvidia to make GPUs with tens, hundreds, and eventually thousands of computing cores. People started to realize that the massive parallel computing power of GPUs could be used for applications unrelated to video games.

In 2012, three University of Toronto computer scientists—Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton—used a pair of Nvidia GTX 580 GPUs to train a neural network for recognizing images. The massive computing power of those GPUs, which had 512 cores each, allowed them to train a network with a then-impressive 60 million parameters. They entered ImageNet, an academic competition to classify images into one of 1,000 categories, and set a new record for accuracy in image recognition.

Before long, researchers were applying similar techniques to a wide variety of domains, including natural language.

Transformers removed a bottleneck for natural language​


In the early 2010s, recurrent neural networks (RNNs) were a popular architecture for understanding natural language. RNNs process language one word at a time. After each word, the network updates its hidden state, a list of numbers that reflects its understanding of the sentence so far.
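In code, the core loop is tiny. This toy NumPy sketch (random weights, purely illustrative) shows the single fixed-size hidden state being updated one word at a time.

```python
# Minimal recurrent step: one fixed-size hidden state, updated once per word.
import numpy as np

hidden_size, embed_size = 128, 64
W_h = np.random.randn(hidden_size, hidden_size) * 0.1
W_x = np.random.randn(hidden_size, embed_size) * 0.1

def rnn_step(hidden, word_vector):
    return np.tanh(W_h @ hidden + W_x @ word_vector)

hidden = np.zeros(hidden_size)
for word_vector in np.random.randn(6, embed_size):   # a six-word "sentence"
    hidden = rnn_step(hidden, word_vector)            # same cost for every word

# The hidden state never grows, which is both the appeal (constant per-word cost)
# and the weakness (early words can get squeezed out of the summary).
print(hidden.shape)  # (128,)
```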

RNNs worked fairly well on short sentences, but they struggled with longer ones—to say nothing of paragraphs or longer passages. When reasoning about a long sentence, an RNN would sometimes “forget about” an important word early in the sentence. In 2014, computer scientists Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio discovered they could improve the performance of a recurrent neural network by adding an attention mechanism that allowed the network to “look back” at earlier words in a sentence.

In 2017, Google published “Attention Is All You Need,” one of the most important papers in the history of machine learning. Building on the work of Bahdanau and his colleagues, Google researchers dispensed with the RNN and its hidden states. Instead, Google’s model used an attention mechanism to scan previous words for relevant context.

This new architecture, which Google called the transformer, proved hugely consequential because it eliminated a serious bottleneck to scaling language models.

Here’s an animation illustrating why RNNs didn’t scale well:

https://substack-post-media.s3.amazonaws.com/public/images/34b7514d-a8f6-450f-8adb-b9cea42c692e_960x635.gif


This hypothetical RNN tries to predict the next word in a sentence, with the prediction shown in the top row of the diagram. This network has three layers, each represented by a rectangle. It is inherently linear: it has to complete its analysis of the first word, “How,” before passing the hidden state back to the bottom layer so the network can start to analyze the second word, “are.”

This constraint wasn’t a big deal when machine learning algorithms ran on CPUs. But when people started leveraging the parallel computing power of GPUs, the linear architecture of RNNs became a serious obstacle.

The transformer removed this bottleneck by allowing the network to “think about” all the words in its input at the same time:

https://substack-post-media.s3.amazonaws.com/public/images/ecedba87-ada6-49a1-8652-42f18e41fd55_1152x762.gif
 

The transformer-based model shown here does roughly as many computations as the RNN in the previous diagram. So it might not run any faster on a (single-core) CPU. But because the model doesn’t need to finish with “How” before starting on “are,” “you,” or “doing,” it can work on all of these words simultaneously. So it can run a lot faster on a GPU with many parallel execution units.

How much faster? The potential speed-up is proportional to the number of input words. My animations depict a four-word input that makes the transformer model about four times faster than the RNN. Real LLMs can have inputs thousands of words long. So, with a sufficiently beefy GPU, transformer-based models can be orders of magnitude faster than otherwise similar RNNs.

In short, the transformer unlocked the full processing power of GPUs and catalyzed rapid increases in the scale of language models. Leading LLMs grew from hundreds of millions of parameters in 2018 to hundreds of billions of parameters by 2020. Classic RNN-based models could not have grown that large because their linear architecture prevented them from being trained efficiently on a GPU.

Transformers have a scaling problem​


Earlier I said that the recurrent neural network in my animations did “roughly the same amount of work” as the transformer-based network. But they don’t do exactly the same amount of work. Let’s look again at the diagram for the transformer-based model:

https://substack-post-media.s3.amazonaws.com/public/images/aa1b2741-2309-4ff0-bc36-9c1f8be33c3a_1106x742.png


See all those diagonal arrows between the layers? They represent the operation of the attention mechanism. Before a transformer-based language model generates a new token, it “thinks about” every previous token to find the ones that are most relevant.

Each of these comparisons is cheap, computationally speaking. For small contexts—10, 100, or even 1,000 tokens—they are not a big deal. But the computational cost of attention grows relentlessly with the number of preceding tokens. The longer the context gets, the more attention operations (and therefore computing power) are needed to generate the next token.

This means that the total computing power required for attention grows quadratically with the total number of tokens. Suppose a 10-token prompt requires 414,720 attention operations. Then:

  • Processing a 100-token prompt will require 45.6 million attention operations.
  • Processing a 1,000-token prompt will require 4.6 billion attention operations.
  • Processing a 10,000-token prompt will require 460 billion attention operations.
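Here is the back-of-the-envelope arithmetic behind those figures; the per-pair cost of roughly 4,608 operations is simply reverse-engineered from the 10-token example above, not an official constant.

```python
# Quadratic growth of attention cost: each new token is compared with every
# previous token, so total work scales roughly with n * (n - 1).
ops_per_pair = 414_720 / (10 * 9)        # ~4,608 ops per token pair (assumed)

for n in (10, 100, 1_000, 10_000):
    total = ops_per_pair * n * (n - 1)
    print(f"{n:>6} tokens -> {total:,.0f} attention operations")

# 10x more tokens -> roughly 100x more attention work.
```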

This is probably why Google charges twice as much, per token, for Gemini 1.5 Pro once the context gets longer than 128,000 tokens. Generating token number 128,001 requires comparisons with all 128,000 previous tokens, making it significantly more expensive than producing the first or 10th or 100th token.

Making attention more efficient and scalable​


A lot of effort has been put into optimizing attention. One line of research has tried to squeeze maximum efficiency out of individual GPUs.

As we saw earlier, a modern GPU contains thousands of execution units. Before a GPU can start doing math, it must move data from slow shared memory (called high-bandwidth memory) to much faster memory inside a particular execution unit (called SRAM). Sometimes GPUs spend more time moving data around than performing calculations.

In a series of papers, Princeton computer scientist Tri Dao and several collaborators have developed FlashAttention, which calculates attention in a way that minimizes the number of these slow memory operations. Work like Dao’s has dramatically improved the performance of transformers on modern GPUs.
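As a hedged example of tapping such kernels from user code, PyTorch's built-in scaled_dot_product_attention can dispatch to a fused, FlashAttention-style implementation on supported GPUs. This shows the call site only, not Dao's library itself.

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

batch, heads, seq_len, head_dim = 1, 8, 4096, 64
q = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)
k = torch.randn_like(q)
v = torch.randn_like(q)

# On supported GPUs this call can use a fused kernel that computes attention
# block by block in fast on-chip SRAM, instead of materializing the full
# seq_len x seq_len score matrix in high-bandwidth memory.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 4096, 64])
```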

Another line of research has focused on efficiently scaling attention across multiple GPUs. One widely cited paper describes ring attention, which divides input tokens into blocks and assigns each block to a different GPU. It’s called ring attention because GPUs are organized into a conceptual ring, with each GPU passing data to its neighbor.

I once attended a ballroom dancing class where couples stood in a ring around the edge of the room. After each dance, women would stay where they were while men would rotate to the next woman. Over time, every man got a chance to dance with every woman. Ring attention works on the same principle. The “women” are query vectors (describing what each token is “looking for”) and the “men” are key vectors (describing the characteristics each token has). As the key vectors rotate through a sequence of GPUs, they get multiplied by every query vector in turn.

In short, ring attention distributes attention calculations across multiple GPUs, making it possible for LLMs to have larger context windows. But it doesn’t make individual attention calculations any cheaper.
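Here is a toy, single-process sketch of that rotation idea. The "devices" are just Python lists, and it ignores the softmax renormalization bookkeeping a real ring-attention kernel has to do.

```python
# Each "device" holds one block of queries; key/value blocks rotate around the
# ring until every query block has seen every key block.
import numpy as np

n_devices, block, d = 4, 8, 16
q_blocks = [np.random.randn(block, d) for _ in range(n_devices)]
k_blocks = [np.random.randn(block, d) for _ in range(n_devices)]
v_blocks = [np.random.randn(block, d) for _ in range(n_devices)]

outputs = [np.zeros((block, d)) for _ in range(n_devices)]
for step in range(n_devices):
    for dev in range(n_devices):
        k = k_blocks[(dev + step) % n_devices]   # block currently "visiting" this device
        v = v_blocks[(dev + step) % n_devices]
        scores = q_blocks[dev] @ k.T / np.sqrt(d)
        outputs[dev] += np.exp(scores) @ v       # unnormalized contribution

print(outputs[0].shape)  # (8, 16)
```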

Could RNNs make a comeback?​


The fixed-size hidden state of an RNN means that it doesn’t have the same scaling problems as a transformer. An RNN requires about the same amount of computing power to produce its first, hundredth and millionth token. That’s a big advantage over attention-based models.

Although RNNs have fallen out of favor since the invention of the transformer, people have continued trying to develop RNNs suitable for training on modern GPUs.

In April, Google announced a new model called Infini-attention. It’s kind of a hybrid between a transformer and an RNN. Infini-attention handles recent tokens like a normal transformer, remembering them and recalling them using an attention mechanism.

However, Infini-attention doesn’t try to remember every token in a model’s context. Instead, it stores older tokens in a “compressive memory” that works something like the hidden state of an RNN. This data structure can perfectly store and recall a few tokens, but as the number of tokens grows, its recall becomes lossier.

Machine learning YouTuber Yannic Kilcher wasn’t too impressed by Google’s approach.

“I’m super open to believing that this actually does work and this is the way to go for infinite attention, but I’m very skeptical,” Kilcher said. “It uses this compressive memory approach where you just store as you go along, you don’t really learn how to store, you just store in a deterministic fashion, which also means you have very little control over what you store and how you store it.”

Could Mamba be the future?​


Perhaps the most notable effort to resurrect RNNs is Mamba, an architecture that was announced in a December 2023 paper. It was developed by computer scientists Tri Dao (who also did the FlashAttention work I mentioned earlier) and Albert Gu.

Mamba does not use attention. Like other RNNs, it has a hidden state that acts as the model’s “memory.” Because the hidden state has a fixed size, longer prompts do not increase Mamba’s per-token cost.

When I started writing this article in March, my goal was to explain Mamba’s architecture in some detail. But then in May, the researchers released Mamba-2, which significantly changed the architecture from the original Mamba paper. I’ll be frank: I struggled to understand the original Mamba and have not figured out how Mamba-2 works.

But the key thing to understand is that Mamba has the potential to combine transformer-like performance with the efficiency of conventional RNNs.

In June, Dao and Gu co-authored a paper with Nvidia researchers that evaluated a Mamba model with 8 billion parameters. They found that models like Mamba were competitive with comparably sized transformers in a number of tasks, but they “lag behind Transformer models when it comes to in-context learning and recalling information from the context.”

Transformers are good at information recall because they “remember” every token of their context—this is also why they become less efficient as the context grows. In contrast, Mamba tries to compress the context into a fixed-size state, which necessarily means discarding some information from long contexts.

The Nvidia team found they got the best performance from a hybrid architecture that interleaved 24 Mamba layers with four attention layers. This worked better than either a pure transformer model or a pure Mamba model.

A model needs some attention layers so it can remember important details from early in its context. But a few attention layers seem to be sufficient; the rest of the attention layers can be replaced by cheaper Mamba layers with little impact on the model’s overall performance.
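As a toy illustration of that kind of interleaving, here is a sketch of a 28-layer stack with 24 Mamba-style layers and four attention layers. The even spacing of the attention layers is my assumption for illustration, not the paper's recipe.

```python
# Build a layer pattern: mostly cheap fixed-state layers, a few attention layers.
n_mamba, n_attention = 24, 4
pattern = []
for i in range(n_mamba + n_attention):
    # place an attention layer every 7th position (4 of them across 28 layers)
    pattern.append("attention" if i % 7 == 6 else "mamba")

print(pattern.count("mamba"), pattern.count("attention"))  # 24 4
```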

In August, an Israeli startup called AI21 announced its Jamba 1.5 family of models. The largest version had 398 billion parameters, making it comparable in size to Meta’s Llama 405B model. Jamba 1.5 Large has seven times more Mamba layers than attention layers. As a result, Jamba 1.5 Large requires far less memory than comparable models from Meta and others. For example, AI21 estimates that Llama 3.1 70B needs 80GB of memory to keep track of 256,000 tokens of context. Jamba 1.5 Large only needs 9GB, allowing the model to run on much less powerful hardware.

The Jamba 1.5 Large model gets an MMLU score of 80, significantly below the Llama 3.1 70B’s score of 86. So by this measure, Mamba doesn’t blow transformers out of the water. However, this may not be an apples-to-apples comparison. Frontier labs like Meta have invested heavily in training data and post-training infrastructure to squeeze a few more percentage points of performance out of benchmarks like MMLU. It’s possible that the same kind of intense optimization could close the gap between Jamba and frontier models.

So while the benefits of longer context windows are obvious, the best strategy for getting there is not. In the short term, AI companies may continue using clever efficiency and scaling hacks (like FlashAttention and Ring Attention) to scale up vanilla LLMs. Longer term, we may see growing interest in Mamba and perhaps other attention-free architectures. Or maybe someone will come up with a totally new architecture that renders transformers obsolete.


But I am pretty confident that scaling up transformer-based frontier models isn’t going to be a solution on its own. If we want models that can handle billions of tokens—and many people do—we’re going to need to think outside the box.

Tim Lee was on staff at Ars from 2017 to 2021. Last year, he launched a newsletter, Understanding AI, that explores how AI works and how it's changing our world. You can subscribe here.
 

The Dark Matter of AI - Welch Labs explains Mechanistic Interpretability

 


Understanding Test Time Compute: How AI Systems Learn to Think in Real-Time






Understanding Test Time Compute: How AI Systems Learn to Think in Real-Time​

Artificial Intelligence, Featured, Interesting Papers

15 minutes

December 3, 2024


In the race for AI advancement, we’ve been asking the wrong question. While the world obsesses over “How do we make AI more powerful?”, a quiet revolution is taking place that asks: “How do we make AI think better?” Enter OpenAI’s o1 model, which demonstrates how rethinking computation itself might be more valuable than simply scaling it up.

The o1 model from OpenAI stole the spotlight by showcasing an unprecedented capability: it meticulously lays out a step-by-step chain of reasoning before delivering its answers. Imagine having a problem-solving partner who not only provides the solution but also walks you through every logical step they took to get there. This advancement isn’t just about transparency; it’s about enhancing the quality of AI reasoning itself, pushing the boundaries in fields from scientific research to complex system design.

This shift highlights the significance of Test Time Compute (TTC), a set of strategies designed to boost how AI systems process information during inference. It’s not just about processing power – it’s about processing intelligence. This approach enables AI systems to dynamically refine their computational strategies during inference, leading to more nuanced and contextually appropriate responses.

In this blog, we’ll dive deeper into TTC, exploring its transformative techniques and why it’s crucial for shaping AI into adaptable, reasoning-driven systems for the future.

What is Test Time Compute?​



Definition and Key Principles


Test-time Compute (TTC) is a game-changer in AI, dynamically allocating computational resources during the inference phase to supercharge model performance. Unlike traditional methods, TTC enables models to perform additional computations for deeper reasoning and adapt to novel inputs in real-time. Think of it as giving your AI the ability to think on its feet, making it more versatile and effective.

Comparison with Traditional Inference


Traditional inference relies solely on pre-trained knowledge, delivering outputs instantaneously and prioritizing speed above all else. In contrast, TTC allows models to “think” during inference by allocating extra compute resources as needed for more challenging tasks. This adaptability bridges the gap between static, pre-trained models and dynamic, problem-solving systems, making AI more responsive and intelligent.

Evolution of TTC


Initially, inference techniques were all about delivering quick responses. However, as tasks became more complex, these methods struggled to provide accurate or nuanced outputs. TTC emerged as a solution, incorporating iterative processes and advanced strategies that enable models to evaluate multiple solutions systematically and select the best one.

The Rise of TTC​


Test-time Compute (TTC) is rapidly gaining traction in the development of advanced reasoning models. Cutting-edge AI models like OpenAI’s o1 and Alibaba’s QwQ-32B-Preview are leveraging TTC to enhance their reasoning capabilities. By allocating additional computational resources during inference, these models can tackle complex tasks more effectively. This approach allows them to “think” before responding, significantly improving performance in areas like mathematics, coding, and scientific problem-solving.

Bridging Training-Time and Inference-Time Optimization​


While TTC addresses many challenges, its true value bridges a fundamental gap in AI systems: the disconnect between training and inference.

The Gap Between Training and Inference

AI models learn from vast datasets during training, recognizing patterns and generalizing knowledge. However, they may encounter novel inputs outside their training distribution during inference. This mismatch can lead to errors or suboptimal performance, particularly in tasks requiring deep reasoning or context-specific adaptations.

TTC as the Bridge

TTC addresses this gap by enabling models to adapt dynamically during inference. For example, a speech recognition model encountering an unfamiliar accent can use TTC to refine its understanding in real time, producing more accurate transcriptions.

Research shows that applying compute-optimal scaling strategies—a core principle of TTC—can improve test-time efficiency by over fourfold compared to traditional methods, making TTC both a practical and scalable solution.

Techniques of Test Time Compute​


Think of Test Time Compute (TTC) as giving your AI a powerful set of thinking tools that it can use while solving problems, rather than just relying on what it learned during training. Let’s explore these fascinating techniques that are revolutionizing how AI systems reason and adapt in real-time.

1. Chain of Thought (CoT) Reasoning​



How it Works: Chain of Thought transforms AI reasoning by enabling models to decompose complex problems into explicit, verifiable steps—like a master logician revealing each link in their analytical chain. This transparency allows us to follow the model’s cognitive journey, turning the traditional black-box approach into a clear, methodical reasoning process.

Technical Mechanism:
  • Recursive prompt decomposition
  • Intermediate state tracking
  • Step validation mechanisms
  • Sequential reasoning path construction

Example:

Question: “If Alice has twice as many books as Bob, and Bob has 15 books, how many do they have together?”

Step 1: Calculate Alice’s books = 2 × Bob’s books = 2 × 15 = 30

Step 2: Total books = Alice’s books + Bob’s books = 30 + 15 = 45

Answer: They have 45 books together

Pros:
  • Transparent reasoning process
  • Better error detection
  • Improved accuracy on complex tasks

Cons:
  • Increased processing time
  • Higher computational overhead

Chain of Thought reasoning represents the simplest and most intuitive way of implementing Test Time Compute. It acts as a foundational framework for many advanced methods by enabling dynamic thinking and iterative refinement during inference. While CoT remains highly effective, ongoing research is exploring innovative approaches to enhance TTC further. We will explore some of the other approaches next in this section.
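To make this concrete, here is a minimal prompting sketch. The call_llm function is a stand-in for whatever chat-completion client you use, not a real API, and the prompt wording is just one reasonable way to elicit step-by-step reasoning.

```python
# Minimal chain-of-thought prompting sketch. `call_llm` is a placeholder.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your own model client")

question = ("If Alice has twice as many books as Bob, and Bob has 15 books, "
            "how many do they have together?")

cot_prompt = (
    "Solve the problem step by step, showing each intermediate calculation, "
    "then state the final answer on its own line.\n\n"
    f"Problem: {question}"
)

# The extra tokens spent on the intermediate steps are exactly the
# "test-time compute" this section is about.
answer = call_llm(cot_prompt)
```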

2. Filler Token Computation​



Image courtesy of the paper “Let’s Think Dot by Dot”

How it Works: Filler Token Computation acts as a neural scaffold, strategically inserting computational markers during inference that help models navigate complex relationships. These temporary tokens create critical connection points in the model’s processing pipeline, enabling deeper understanding of dependencies and context—similar to how temporary supports enable the construction of complex architectural structures.

Technical Mechanism:
  • Dynamic token insertion
  • Context-aware placement
  • Relationship modeling
  • Temporary state management

Example:

Input: “The cat sat on the mat”

Result: Deeper understanding of relationships and roles
Pros:
  • Enhanced relationship processing
  • Improved context understanding
  • Better handling of complex structures

Cons:
  • Memory overhead
  • Processing complexity
  • Token management challenges
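For a rough sense of the mechanics, here is a hedged sketch in the spirit of the “Let’s Think Dot by Dot” setup referenced above. call_llm is again a placeholder client, and the choice of dots and their count are illustrative assumptions, not the paper’s exact recipe.

```python
# Append meaningless filler tokens so the model gets extra forward passes of
# computation before it must commit to an answer. `call_llm` is a placeholder.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your own model client")

question = ("The cat sat on the mat. "
            "What is the relationship between the cat and the mat?")
filler = " ." * 32          # 32 filler tokens of "silent" compute (count is arbitrary)
answer = call_llm(question + filler)
```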

3. Adaptive Decoding Strategies​



How it Works: Adaptive Decoding Strategies function as an AI system’s dynamic control center, intelligently adjusting token generation probabilities in real-time to balance between creative exploration and precise outputs. Like a precision instrument auto-calibrating its settings, it continuously modulates the sampling temperature based on task requirements—tightening the distribution for factual accuracy or broadening it for creative tasks.

Technical Mechanism:
  • Dynamic sampling temperature
  • Probability distribution shaping
  • Token selection optimization
  • Context-aware generation

Example:

Task: Generating product descriptions

Conservative: “A durable leather wallet with multiple card slots.”

Creative: “An artisanal leather masterpiece, thoughtfully crafted with precision-cut card sanctuaries.”
Pros:
  • Controllable output style
  • Better context matching
  • Flexible creativity levels

Cons:
  • Parameter tuning complexity
  • Performance overhead
  • Balance maintenance
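A toy sketch of the underlying knob: temperature scaling of the next-token distribution. The logit values are made up, and real systems adjust more than just temperature, but it shows why low values give conservative outputs and high values give more varied, creative ones.

```python
# Temperature-controlled sampling over next-token logits.
import numpy as np

def sample(logits, temperature):
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return np.random.choice(len(logits), p=probs)

logits = np.array([3.0, 2.5, 1.0, 0.2])   # scores for four candidate tokens

conservative = [sample(logits, 0.3) for _ in range(1000)]
creative = [sample(logits, 1.5) for _ in range(1000)]
print("low temperature picks the top token: ", conservative.count(0) / 1000)
print("high temperature picks the top token:", creative.count(0) / 1000)
```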

4. Search Against Verifiers​

 



Stanford CS229 I Machine Learning I Building Large Language Models (LLMs)

682,191 views Aug 27, 2024

For more information about Stanford's Artificial Intelligence programs visit: https://stanford.io/ai

This lecture provides a concise overview of building a ChatGPT-like model, covering both pretraining (language modeling) and post-training (SFT/RLHF). For each component, it explores common practices in data collection, algorithms, and evaluation methods. This guest lecture was delivered by Yann Dubois in Stanford’s CS229: Machine Learning course, in Summer 2024.

Yann Dubois PhD Student at Stanford https://yanndubs.github.io/

About the speaker: Yann Dubois is a fourth-year CS PhD student advised by Percy Liang and Tatsu Hashimoto. His research focuses on improving the effectiveness of AI when resources are scarce. Most recently, he has been part of the Alpaca team, working on training and evaluating language models more efficiently using other LLMs.

To view all online courses and programs offered by Stanford, visit: http://online.stanford.edu
 



AI Agents Explained Like You're 5 (Seriously, Easiest Explanation Ever!)​


I’m breaking down AI agents in the simplest, most real-world way possible. No jargon, just a clear explanation that you will understand!
 

We Finally Figured Out How AI Actually Works… (not what we thought!)



Channel: Matthew Berman (443K subscribers)

Description

TimeStamps:
0:00 Intro
1:19 Paper Overview
5:46 How AI is Multilingual
8:07 How AI Plans
10:58 Mental Math
14:00 AI Makes Things Up
17:39 Multi-Step Reasoning
19:37 Hallucinations
22:44 Jailbreaks
25:30 Outro


How are AI Image Generation Models Built?​


Learn about AI Image Generation Models, how they work, and how are they built from scratch.

Rajat Dangi · March 28, 2025 · 9 min read

How are AI Image Generation Models Built?





Key Takeaways​


  • AI image generation models, like those behind ChatGPT 4o and DALL-E, Google Gemini, Grok, and Midjourney, are built using advanced machine learning techniques, primarily diffusion models, with Grok using a unique autoregressive approach.
  • These models require vast datasets of images and text, powerful computing resources like GPUs, and expertise in machine learning and computer vision.
  • Building one from scratch involves collecting data, designing model architectures, and training them, which is resource-intensive and complex.



A candid paparazzi-style photo of Karl Marx hurriedly walking through the parking lot of the Mall of America, glancing over his shoulder with a startled expression as he tries to avoid being photographed. He’s clutching multiple glossy shopping bags filled with luxury goods. His coat flutters behind him in the wind, and one of the bags is swinging as if he’s mid-stride. Blurred background with cars and a glowing mall entrance to emphasize motion. Flash glare from the camera partially overexposes the image, giving it a chaotic, tabloid feel. - ChatGPT 4o


Understanding AI Image Generation​


AI image generation has transformed how we create visual content, enabling tools like ChatGPT 4o, OpenAI DALL-E, Imagen by Google, Aurora by xAI, and Midjourney to produce photorealistic or artistic images from text descriptions. These models are at the heart of popular platforms, making their construction worth understanding, whether out of technical interest or simple curiosity.



Technologies Behind Popular Tools​



What It Takes To Build Image Generation Models from Scratch​


Creating an AI image generator involves:

  • Data Needs: Millions of image-text pairs, like those used for DALL-E, ensuring diversity for broad concept coverage.
  • Compute Power: Requires GPUs or TPUs for training, with costs in thousands of GPU hours.
  • Expertise: Knowledge in machine learning, computer vision, and natural language processing is crucial, alongside stable training techniques.
  • Challenges: Includes ethical concerns like bias prevention and high computational costs, with diffusion models offering stability over older GANs.

This process is complex, but understanding it highlights the innovation behind these tools, opening doors for future advancements.

Exploring Different AI Image Generation Models​


AI image generation has revolutionized creative industries, enabling the production of photorealistic and artistic images from textual prompts. Tools like DALL-E, Imagen, Aurora, and Midjourney have become household names, integrated into platforms like ChatGPT, Google Gemini, Grok, and Midjourney. This section delves into the technologies behind these models and the intricate process of building them from scratch, catering to both technical and non-technical audiences.



Popular AI Image Generators​


Several prominent AI image generators have emerged, each with distinct technological underpinnings:

  • DALL-E (OpenAI): Likely the backbone of ChatGPT's image generation, especially versions like ChatGPT 4o, DALL-E uses diffusion models. The research paper "Hierarchical Text-Conditional Image Generation with CLIP Latents" (Hierarchical Text-Conditional Image Generation with CLIP Latents) details DALL-E 2's architecture, which involves a prior generating CLIP image embeddings from text and a decoder using diffusion to create images. This model, with 3.5 billion parameters, enhances realism and resolution, integrated into ChatGPT for seamless user interaction.
  • Google Gemini (Imagen): Google Gemini leverages Imagen 3 for image generation, as noted in recent updates (Google Gemini updates: Custom Gems and improved image generation with Imagen 3). Imagen uses diffusion models, with the research paper "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding" (Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding) describing its architecture. It employs a large frozen T5-XXL encoder for text and conditional diffusion models for image generation, achieving a COCO FID of 7.27, indicating high image-text alignment.
  • Grok (Aurora by xAI): Grok, developed by xAI, uses Aurora for image generation, as announced in the blog post "Grok Image Generation Release" (Grok Image Generation Release). Unlike others, Aurora is an autoregressive mixture-of-experts network, trained on interleaved text and image data to predict the next token, offering photorealistic rendering and multimodal input support. This approach, detailed in the post, contrasts with diffusion models, focusing on sequential prediction.
  • Midjourney: Midjourney, a generative AI program, uses diffusion models, as inferred from comparisons with Stable Diffusion and DALL-E (Midjourney - Wikipedia). While proprietary, industry analyses suggest it leverages diffusion for real-time image generation, known for artistic outputs and accessed via Discord or its website, entering open beta in July 2022.

These tools illustrate the diversity in approaches, with diffusion models dominating due to their quality, except for Grok's unique autoregressive method.

Breakdown of Technologies Behind AI Image Generation Models​


The core technologies driving these models include diffusion models, autoregressive models, and historical approaches like GANs and VAEs. Here's a deeper dive:



Diffusion Models: The State-of-the-Art.​


Diffusion models, as used in DALL-E, Imagen, and Midjourney, operate through a two-stage process:​


  • Forward Process: Gradually adds noise to an image over many steps, creating a sequence that runs from a clear image to pure noise.
  • Reverse Process: Trains a neural network, often a U-Net, to predict and remove noise at each step, starting from pure noise and working back to a coherent image. This is akin to sculpting, chiseling away the noise to reveal the form. For text-to-image, text embeddings guide this process, ensuring the image aligns with the prompt.
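As a worked illustration of the forward step, here is a minimal DDPM-style noising sketch. The noise schedule values are illustrative assumptions, not taken from the Imagen or DALL-E papers.

```python
# Forward (noising) process: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)          # noise added at each step (illustrative schedule)
alpha_bars = np.cumprod(1.0 - betas)        # cumulative signal retained

def add_noise(x0, t, rng=np.random.default_rng()):
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return x_t, noise   # the network is trained to predict `noise` from (x_t, t)

x0 = np.zeros((64, 64, 3))                  # stand-in for a training image
x_mid, _ = add_noise(x0, t=250)             # partially noised
x_end, _ = add_noise(x0, t=999)             # nearly pure noise
```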
 

The architecture, as seen in Imagen, involves a text encoder (e.g., T5-XXL) and conditional diffusion models, with upsampling stages (64×64 to 1024×1024) using super-resolution diffusion models. DALL-E 2's decoder modifies Nichol et al.'s (2021) diffusion model, adding CLIP embeddings for guidance, with training details in Table 3 from the paper:

[Table 3 from the DALL-E 2 paper, condensed: the AR prior and diffusion prior are 1B-parameter models trained with a batch size of 4096 (1M and 600K iterations, learning rates 1.6e-4 and 1.1e-4, respectively); the diffusion prior uses 1,000 diffusion steps with a cosine noise schedule. The 64→256 upsampler is a 700M-parameter model (1,000 diffusion steps, cosine schedule, DDIM sampling, batch size 1024, 1M iterations), and the 256→1024 upsampler is a 300M-parameter model (1,000 diffusion steps, linear schedule, DDIM sampling, batch size 512, 1M iterations).]

This table highlights hyperparameters, showing the computational intensity, with batch sizes up to 4096 and iterations in the millions.



Autoregressive Models: Sequential Prediction​


Grok's Aurora uses an autoregressive approach, predicting image tokens sequentially, akin to writing a story word by word. The xAI blog post describes it as a mixture-of-experts network, trained on billions of internet examples, excelling in photorealistic rendering. This method, detailed in the release, contrasts with diffusion by generating images part by part, potentially slower but offering unique capabilities like editing user-provided images.
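Here is a conceptual sketch of that sequential idea. The model function, vocabulary size, and token-grid size are all placeholders for illustration, not Aurora's actual architecture or API.

```python
# Autoregressive image generation: the image is a sequence of discrete tokens,
# generated one token at a time and later decoded back into pixels.
import numpy as np

VOCAB_SIZE, IMAGE_TOKENS = 8192, 1024       # e.g. a 32x32 grid of codebook indices

def model(prompt_tokens, image_tokens_so_far):
    """Stand-in for a trained network returning next-token logits."""
    return np.random.randn(VOCAB_SIZE)

def generate_image(prompt_tokens):
    tokens = []
    for _ in range(IMAGE_TOKENS):
        logits = model(prompt_tokens, tokens)
        tokens.append(int(np.argmax(logits)))   # greedy choice of the next image token
    return tokens                                # would then be decoded into pixels

image_tokens = generate_image(prompt_tokens=[1, 2, 3])
```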



Historical Approaches: GANs and VAEs​


GANs, with a generator and discriminator competing, and VAEs, encoding images into latent spaces for decoding, were early methods. However, diffusion models, as noted in Imagen's research, outperform them in fidelity and diversity, making them less common in current state-of-the-art systems.

How to Build an AI Image Generator from Scratch?​


Constructing an AI image generator from scratch is a monumental task, requiring:

  1. Data Requirements:
    • Millions of diverse image-text pairs, on the scale used for DALL-E and Imagen, covering a broad range of concepts and styles.
  2. Computational Resources:
    • Training demands powerful GPUs or TPUs, with costs in thousands of GPU hours, reflecting the scale seen in DALL-E and Imagen. Infrastructure for distributed training, as implied in the papers, is crucial for handling large-scale data.
  3. Model Architecture:
    • For diffusion models, implement U-Net architectures, as in Imagen, with text conditioning via large language models. For autoregressive, use transformers, as in Aurora, handling sequential token prediction. The choice depends on desired output quality and speed.
  4. Training Process:
    • Data Preprocessing: Clean datasets, tokenize text, and resize images for uniformity, ensuring compatibility with model inputs.
    • Model Initialization: Leverage pre-trained models, like T5 for text encoding, to reduce training time, as seen in Imagen.
    • Optimization: Use advanced techniques, with learning rates and batch sizes from Table 3, ensuring stable convergence, especially for diffusion models.
  5. Challenges and Considerations:
    • Training Stability: Diffusion models, while stable, require careful tuning, unlike GANs prone to mode collapse. Ethical concerns, as noted in DALL-E's safety mitigations (DALL·E 2), include filtering harmful content and monitoring bias.
    • Compute Costs: High energy and hardware costs, with environmental impacts, are significant, necessitating efficient architectures like Imagen's Efficient U-Net.
    • Expertise Needed: Requires deep knowledge in machine learning, computer vision, and natural language processing, with skills in handling large-scale training pipelines.

This process, while feasible with resources, underscores the complexity, with open-source alternatives like Stable Diffusion offering starting points for enthusiasts.

Conclusion​


AI image generation is dominated by diffusion models, with Grok's autoregressive approach adding diversity to the landscape. Building such a model from scratch demands significant data, compute, and expertise, highlighting the barriers to entry. As research progresses, expect advancements in efficiency, ethics, and multimodal capabilities, further blurring the boundary between human and machine creativity.
 













1/13
@buildthatidea
Anthropic recently dropped fascinating research on how models like Claude actually think and work.

It's one of the most important research papers of 2025

Here are my 7 favorite insights 🧵





2/13
@buildthatidea
1/ Claude plans ahead when writing poetry!

It identifies potential rhyming words before writing a line, then constructs sentences to reach those words.

It's not just predicting one word at a time.





3/13
@buildthatidea
2/ Claude has a universal language of thought

When processing questions in English, French, or Chinese, it activates the same internal features for concepts

It then translates these concepts into the specific language requested.





4/13
@buildthatidea
3/ Claude does mental math using parallel computational paths

One path is for approximation, another for precise digit calculation.

But when asked how it solved the problem, it describes using standard algorithms humans use





5/13
@buildthatidea
4/ Claude can “fake” its reasoning

When solving a math problem, it might give a full chain of thought that sounds correct

But inside, it never actually did the math

It just made up the steps to sound helpful





6/13
@buildthatidea
5/ It performs multi-step real reasoning

When solving complex questions like "What's the capital of the state where Dallas is located?", Claude actually follows distinct reasoning steps: first activating "Dallas is in Texas" and then "capital of Texas is Austin."

It's not just memorization





7/13
@buildthatidea
6/ Claude defaults to not answering when unsure

Claude’s instinct is to say “I don’t know.”

For known entities (like Michael Jordan), a "known entity" feature inhibits this refusal. But sometimes this feature misfires, causing hallucinations.





8/13
@buildthatidea
7/ Jailbreaks work by exploiting conflicts inside Claude

In one example Claude was tricked into writing something dangerous like BOMB

It continued only because it wanted to finish a grammatically correct sentence

Once that was done it reverted to safety and refused to continue





9/13
@buildthatidea
8/ Why this matters

- Better understanding of AI leads to better safety
- We can catch fake logic
- Prevent hallucinations
- Understand when and how reasoning happens
- And make more trustworthy systems



10/13
@buildthatidea
If you’re building, researching, or just curious about how language models work, this is a must-read.

Read it here: https://www.anthropic.com/research/tracing-thoughts-language-model



11/13
@buildthatidea
Want to build AI apps?

With BuildThatIdea - Launch GPT Wrappers in 60 Seconds!, you can build and monetize AI apps in 60 seconds

Sign up here:



12/13
@buildthatidea
That's a wrap ✨ Hope you enjoyed it

If you found this thread valuable:

1. Follow @0xmetaschool for more
2. Retweet the first tweet so more people can see it



13/13
@KatsDojo
Gonna deff be sharing this with the team ty always! 🫂




[Discussion] New Study shows Reasoning Models are more than just Pattern-Matchers



Posted on Thu Apr 10 16:55:06 2025 UTC

/r/ArtificialInteligence/comments/1jw2qnv/new_study_shows_reasoning_models_are_more_than/

A new study (Evaluating the Generalization Capabilities of Large Language Models on Code Reasoning) conducted experiments on coding tasks to see if reasoning models performed better on out-of-distribution tasks compared to non-reasoning models. They found that reasoning models showed no drop in performance going from in-distribution to out-of-distribution (OOD) coding tasks, while non-reasoning models do. Essentially, they showed that reasoning models, unlike non-reasoning models, are more than just pattern-matchers as they can generalize beyond their training distribution.

We might have to rethink the way we look at LLMs: not as models overfit to the whole web, but as models with actually useful and generalizable concepts of the world.
 