How LLMs Work

November 26, 2024

27 min read

Large Language Models (LLMs) like GPT are advanced AI systems designed to process and generate human-like text. Here's a comprehensive breakdown of their inner workings:

1. Tokenization: Splitting Text into Manageable Units

Tokenization is the first step: the input text is split into manageable units called tokens.

Modern LLMs typically use subword schemes such as Byte Pair Encoding (BPE) or WordPiece. These break rare words into frequent subwords, balancing vocabulary size against coverage.

Input: "Why is the sky blue?"
Tokens: ["Why", "is", "the", "sky", "blue", "?"]

Code Example:

Using Python's transformers library:

from transformers import AutoTokenizer
import traceback

try:
    # Load the GPT-2 byte-pair encoding (BPE) tokenizer
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    text = "Understanding tokenization is fascinating!"

    # Split the text into subword tokens, then map each token to its vocabulary ID
    tokens = tokenizer.tokenize(text)
    token_ids = tokenizer.convert_tokens_to_ids(tokens)

    print("Tokens:", tokens)
    print("Token IDs:", token_ids)

    # Decoding the IDs reconstructs the original text
    print("Decoded back:", tokenizer.decode(token_ids))
except Exception as e:
    print(f"Error in tokenization: {e}")
    traceback.print_exc()

Key Tokenization Insights:

  • Handles out-of-vocabulary words through subword splitting (see the sketch after this list)
  • Vocabulary sizes typically range from about 30,000 to over 100,000 tokens (GPT-2 uses 50,257)
  • Preserves semantic meaning while managing computational complexity
  • Breaks down complex words into meaningful subunits
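
The subword behaviour is easy to see on an uncommon word. A minimal sketch (the exact split depends on the tokenizer's learned merges, so the pieces noted in the comments may differ):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Vocabulary size of the GPT-2 BPE tokenizer
print("Vocabulary size:", tokenizer.vocab_size)

# A rare word is split into frequent subword pieces, e.g. ['token', 'ization']
print(tokenizer.tokenize("tokenization"))

# A common word remains a single token
print(tokenizer.tokenize("the"))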

2. Embeddings: Transforming Tokens into Numerical Representations

Tokens are transformed into embeddings — dense vectors of numbers that represent their meaning. This numerical representation enables the model to process linguistic information mathematically.

Key Facts:

  • Dimensionality: Typically 768 dimensions in smaller models such as GPT-2 base, and several thousand in the largest models.
  • Semantic Understanding: Embeddings capture relationships between words (e.g., "king" and "queen").
  • Continuous Space: Words with similar meanings have closer embeddings (see the sketch below).
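
To make "closer embeddings" concrete, we can compare token embeddings directly. The sketch below uses GPT-2's static input embedding matrix; this is only a rough proxy for word similarity (contextual models refine these vectors layer by layer), and the exact scores will vary:

import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# The input embedding matrix maps each vocabulary ID to a 768-dimensional vector
embedding_matrix = model.get_input_embeddings().weight
print("Embedding matrix shape:", embedding_matrix.shape)  # (vocab_size, hidden_size)

def word_vector(word):
    # Average the embeddings of the word's tokens (common words are usually a single token)
    ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    return embedding_matrix[ids].mean(dim=0)

king, queen, banana = word_vector("king"), word_vector("queen"), word_vector("banana")

cos = torch.nn.CosineSimilarity(dim=0)
print("king vs queen :", cos(king, queen).item())
print("king vs banana:", cos(king, banana).item())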

Example Embedding Visualization:

from transformers import AutoModel, AutoTokenizer
import torch
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

import sys
import traceback

try:
    # Temporarily redirect stdout to a file so intermediate output does not clutter the console
    old_stdout = sys.stdout
    sys.stdout = open('debug_output.txt', 'w')

    # GPT-2 has no pad token by default, so reuse the end-of-sequence token for padding
    model = AutoModel.from_pretrained("gpt2")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token

    words = [
        "king", "queen", "prince", "princess",
        "man", "woman", "boy", "girl",
        "royal", "noble", "crown", "throne",
        "leader", "ruler", "monarch", "sovereign"
    ]

    inputs = tokenizer(words, return_tensors="pt", padding=True, truncation=True)

    # Run the model once and mean-pool each word's contextual token embeddings
    with torch.no_grad():
        outputs = model(**inputs)

        word_embeddings = []
        for i in range(len(words)):
            # Drop padding tokens before averaging
            word_tokens = inputs.input_ids[i]
            word_tokens = word_tokens[word_tokens != tokenizer.pad_token_id]

            word_embedding = outputs.last_hidden_state[i, :len(word_tokens), :].mean(dim=0)
            word_embeddings.append(word_embedding)

        word_embeddings = torch.stack(word_embeddings).numpy()

    # Standardize the features, then project the 768-dimensional vectors down to 2D with t-SNE
    scaler = StandardScaler()
    word_embeddings_scaled = scaler.fit_transform(word_embeddings)

    tsne = TSNE(
        n_components=2,
        random_state=42,
        perplexity=min(5, len(words) - 1)  # perplexity must be smaller than the number of samples
    )
    embeddings_2d = tsne.fit_transform(word_embeddings_scaled)

    # Restore normal console output before plotting
    sys.stdout.close()
    sys.stdout = old_stdout

    plt.figure(figsize=(12, 10))
    plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], alpha=0.7)

    for i, word in enumerate(words):
        plt.annotate(
            word,
            (embeddings_2d[i, 0], embeddings_2d[i, 1]),
            xytext=(5, 5),
            textcoords='offset points',
            fontsize=9,
            alpha=0.7
        )

    plt.title("Word Embeddings Visualization")
    plt.xlabel("t-SNE Dimension 1")
    plt.ylabel("t-SNE Dimension 2")
    plt.tight_layout()
    plt.grid(True, linestyle='--', linewidth=0.5)

    plt.savefig("word_embeddings.png", dpi=300)
    plt.show()

except Exception as e:
    sys.stdout = old_stdout
    print(f"Unexpected error: {e}")
    traceback.print_exc()

try:
    with open('debug_output.txt', 'r') as f:
        debug_content = f.read()
        if debug_content:
            print("Debug Output:")
            print(debug_content)
except FileNotFoundError:
    pass

Token | Embedding (Simplified)
Why   | [0.0821, -0.3456, 0.8532, ..., -0.2451] (768 values)
is    | [0.4123, -0.1523, 0.9012, ..., 0.3245] (768 values)
sky   | [0.2534, -0.1289, 0.6545, ..., -0.1234] (768 values)
blue  | [0.3378, -0.2134, 0.7289, ..., 0.4567] (768 values)
?     | [0.0534, -0.0378, 0.8045, ..., -0.1890] (768 values)

3. Attention Mechanism: Contextual Understanding

The transformer architecture's core innovation is the self-attention mechanism, which lets each token weigh its relationship to the other tokens in the sequence and so build contextual understanding (the core computation is sketched after the list below).

Attention Types:

  • Self-Attention: Analyzes relationships within the same sequence
  • Cross-Attention: Compares tokens across different sequences
  • Multi-Head Attention: Allows simultaneous attention from different representation subspaces
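
All of these variants share the same core computation, scaled dot-product attention: queries are compared against keys, the scores are normalized with a softmax, and the result weights the values. A minimal single-head sketch with arbitrary toy dimensions:

import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, d_model = 6, 16                 # toy dimensions chosen for illustration
x = torch.randn(seq_len, d_model)        # embeddings for one sequence of tokens

# Learned projections produce queries, keys, and values from the same input (self-attention)
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V
scores = Q @ K.T / (d_model ** 0.5)
weights = F.softmax(scores, dim=-1)      # each row sums to 1: how much a token attends to the others
output = weights @ V

print(weights.shape)  # (seq_len, seq_len) attention matrix
print(output.shape)   # (seq_len, d_model) contextualized representations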

Attention Visualization:

from transformers import AutoModel, AutoTokenizer
import torch
import matplotlib.pyplot as plt
import numpy as np

try:
    model = AutoModel.from_pretrained("gpt2", output_attentions=True, attn_implementation="eager")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    text = "The cat sat on the comfortable mat"
    inputs = tokenizer(text, return_tensors="pt")

    outputs = model(**inputs, output_attentions=True)

    # Attention weights from the last layer: (num_heads, seq_len, seq_len) after squeezing the batch dim
    attention_weights = outputs.attentions[-1].squeeze()

    # Average over the heads to get a single (seq_len, seq_len) matrix
    averaged_attention = attention_weights.mean(dim=0)

    plt.figure(figsize=(10, 8))
    im = plt.imshow(averaged_attention.detach().numpy(), cmap='viridis')
    plt.colorbar(im)

    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    plt.xticks(range(len(tokens)), tokens, rotation=45)
    plt.yticks(range(len(tokens)), tokens)

    plt.title("Multi-Head Attention Heatmap")
    plt.tight_layout()

    plt.show()

    plt.savefig("attention_heatmap.png")
    plt.close()

except Exception as e:
    print(f"Attention visualization error: {e}")

4. Model Layers: Refining Understanding

Transformer models stack many identically structured layers, each refining the token representations (a simplified block is sketched after the list below):

  • GPT-2 variants have 12 to 48 transformer layers; larger models use considerably more
  • Each layer includes:
    • Multi-head self-attention mechanism
    • Feedforward neural networks
    • Layer normalization
    • Residual connections
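
Putting those pieces together, here is a minimal pre-norm transformer block in PyTorch. It is a simplified sketch rather than GPT-2's exact implementation (the causal mask, dropout, and GPT-2's fused attention are omitted):

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, hidden_size=768, num_heads=12, ff_mult=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(hidden_size)
        self.ff = nn.Sequential(
            nn.Linear(hidden_size, ff_mult * hidden_size),
            nn.GELU(),
            nn.Linear(ff_mult * hidden_size, hidden_size),
        )

    def forward(self, x):
        # Multi-head self-attention with a residual connection (pre-norm)
        normed = self.ln1(x)
        attn_out, _ = self.attn(normed, normed, normed, need_weights=False)
        x = x + attn_out
        # Position-wise feedforward network with a residual connection
        return x + self.ff(self.ln2(x))

block = TransformerBlock()
hidden_states = torch.randn(1, 8, 768)   # (batch, sequence length, hidden size)
print(block(hidden_states).shape)        # torch.Size([1, 8, 768])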

Layer Processing Example:

from transformers import AutoModelForCausalLM

try:
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    print("Number of Layers:", model.config.num_hidden_layers)
    print("Hidden Size:", model.config.hidden_size)
except Exception as e:
    print(f"Model loading error: {e}")

5. Decoding: Generating Intelligent Responses

LLMs use advanced decoding strategies to generate coherent text:

Decoding Strategies:

  • Greedy Search: Selects most probable token
  • Beam Search: Explores multiple token sequences
  • Contrastive Search: Encourages diverse generations
  • Top-k Sampling: Randomly selects from top k probable tokens
  • Top-p (Nucleus) Sampling: Selects from smallest set of tokens with cumulative probability > p
  • Temperature Sampling: Controls randomness of token selection

These strategies can be combined and tuned (for example, temperature together with top-k and top-p), and many more variants exist. The sketch below unpacks nucleus sampling; the generation example that follows uses transformers' built-in implementations.
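
A minimal sketch of the nucleus (top-p) selection step over a made-up next-token distribution (the vocabulary and logits here are invented purely for illustration):

import torch

torch.manual_seed(0)

# Hypothetical next-token logits over a tiny made-up vocabulary
vocab = ["blue", "clear", "falling", "green", "loud"]
logits = torch.tensor([2.0, 1.5, 0.2, -0.5, -1.0])

temperature = 0.7
top_p = 0.9

# Temperature scaling: values below 1 sharpen the distribution, values above 1 flatten it
probs = torch.softmax(logits / temperature, dim=-1)

# Keep the smallest set of most-probable tokens whose cumulative probability exceeds top_p
sorted_probs, sorted_idx = torch.sort(probs, descending=True)
cumulative = torch.cumsum(sorted_probs, dim=-1)
keep = (cumulative - sorted_probs) < top_p   # include the token that crosses the threshold
kept_idx = sorted_idx[keep]

# Renormalize over the kept tokens and sample one of them
kept_probs = probs[kept_idx] / probs[kept_idx].sum()
choice = kept_idx[torch.multinomial(kept_probs, num_samples=1)].item()
print("Sampled token:", vocab[choice])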

Generation Example:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

try:
    # Fix the random seed so sampled outputs are reproducible
    torch.manual_seed(42)

    model = AutoModelForCausalLM.from_pretrained("gpt2")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    prompt = "The future of artificial intelligence"
    inputs = tokenizer(prompt, return_tensors="pt")

    # GPT-2 has no pad token, so use the end-of-sequence token for padding
    pad_token_id = tokenizer.eos_token_id

    # Greedy decoding: always pick the single most probable next token
    outputs_greedy = model.generate(
        inputs.input_ids,
        max_length=50,
        num_return_sequences=1,
        do_sample=False,
        pad_token_id=pad_token_id
    )

    # Sampling with temperature, top-k, and top-p for more varied continuations
    outputs_diverse = model.generate(
        inputs.input_ids,
        max_length=50,
        num_return_sequences=3,
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.95,
        pad_token_id=pad_token_id
    )

    print("Greedy Search:", tokenizer.decode(outputs_greedy[0], skip_special_tokens=True))
    for i, output in enumerate(outputs_diverse):
        print(f"Diverse Sampling {i + 1}:", tokenizer.decode(output, skip_special_tokens=True))

except Exception as e:
    print(f"Generation error: {e}")

Ethical Considerations and Limitations

Training large-scale LLMs can result in significant energy consumption. Efforts to optimize training pipelines and use renewable energy sources can mitigate this impact.

While LLMs are powerful, they have important limitations:

  • Can generate factually incorrect information
  • Potential for bias from training data
  • Limited true understanding of context
  • Computational and environmental costs
  • Challenges with long-term coherence and reasoning

Summary: LLM Processing Pipeline

Step                | Action                        | Key Output
1. Tokenization     | Split input into tokens       | Discrete linguistic units
2. Embedding        | Convert tokens to vectors     | Numerical representations
3. Attention        | Analyze token relationships   | Contextual understanding
4. Layer Processing | Refine token representations  | Enhanced semantic meaning
5. Decoding         | Generate response tokens      | Human-like text output
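
In practice, high-level APIs wrap this entire pipeline in a single call. For example, the transformers pipeline helper runs tokenization, embedding, attention, layer processing, and decoding end to end (the generated text will vary from run to run):

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Why is the sky blue?", max_length=40, num_return_sequences=1)
print(result[0]["generated_text"])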

Conclusion

LLMs process text through a well-defined pipeline: tokenization, embedding, attention, layer-by-layer refinement, and decoding. Understanding these steps provides real insight into how these models generate their responses, and into the limitations discussed above.