= Large Language Models (LLMs)
:author: The AI Research Community
:revnumber: 2025
:revdate: 2026-06-14
:doctype: article
:lang: en

== Overview

A Large Language Model (LLM) is a type of nostr:naddr1qvzqqqrcvgpzpdwnfmku0k8crn42tmfh0ddz656wa9yl4cx94va6uzjhd2n5wh9lqqtkzun5d9nxjcmfv9kz66tww3jkcmrfvajkucm9xjg9sw[artificial intelligence] model trained on vast amounts of text data to understand, generate, and manipulate human language. LLMs are deep learning neural networks with billions or even trillions of parameters, enabling them to perform a wide range of natural language processing tasks including text generation, translation, summarization, question answering, and code writing.

LLMs are still trained on the simple objective of predicting the next token, but unlike earlier, smaller language models they exhibit emergent abilities (capabilities not explicitly trained for but arising from scale) such as reasoning, in-context learning, and instruction following. Prominent examples include GPT-4 (OpenAI), Claude (Anthropic), Gemini (Google), Llama (Meta), and DeepSeek (DeepSeek AI).

== History

=== Pre-Transformer Era (Before 2017)

Language modeling predates modern deep learning. Early approaches included:

  • N-gram models – Statistical models predicting the next word based on previous N-1 words. Simple but suffered from data sparsity and limited context.
  • Recurrent Neural Networks (RNNs) – Processed sequences one token at a time, maintaining a hidden state. Struggled with long-range dependencies due to vanishing gradients.
  • LSTMs and GRUs – Improved RNNs with gating mechanisms, enabling longer context but still fundamentally sequential.
  • Word embeddings – Techniques like Word2Vec (2013) and GloVe (2014) learned dense vector representations for words, capturing semantic relationships.

The dominant approach for years was to train task-specific models from scratch: one model for sentiment analysis, another for translation, another for question answering.

=== The Transformer Revolution (2017)

In June 2017, Google researchers published the seminal paper “Attention Is All You Need”, introducing the Transformer architecture. The key innovation was the self-attention mechanism, which allows the model to weigh the importance of all tokens in a sequence simultaneously, rather than processing sequentially.

The Transformer paper demonstrated that:

  • Parallel processing (not sequential) dramatically accelerated training
  • Long-range dependencies could be captured effectively
  • Scaling up parameters and data led to consistent improvements

This architecture became the foundation for virtually all subsequent LLMs.

=== The “Pre-train, Fine-tune” Paradigm (2018-2019)

GPT-1 (OpenAI, June 2018) demonstrated that a Transformer decoder could be pre-trained on unlabeled text and then fine-tuned on specific tasks with minimal task-specific architecture changes.

BERT (Google, October 2018) used a Transformer encoder and introduced masked language modeling (predicting randomly masked tokens), achieving state-of-the-art results on 11 NLP tasks.

GPT-2 (OpenAI, February 2019) scaled up to 1.5 billion parameters and demonstrated impressive zero-shot capabilities—the ability to perform tasks without explicit fine-tuning.

=== Scaling Laws and Emergent Abilities (2020-2022)

Scaling Laws (Kaplan et al., 2020) established that model performance follows predictable power-law relationships with compute budget, dataset size, and model size. This led to a “bigger is better” race.

GPT-3 (OpenAI, June 2020) scaled to 175 billion parameters and demonstrated in-context learning – the ability to learn new tasks from examples provided within the prompt without any weight updates.

Key emergent abilities observed as models scaled:

  • Instruction following – Responding to natural language commands
  • Chain-of-thought reasoning – Solving multi-step problems by showing work
  • Code generation – Writing and explaining computer programs
  • Translation – Between languages not explicitly paired in training
  • Basic arithmetic – Without being explicitly trained on math

=== The Instruction Tuning Era (2022-2023)

GPT-3 would often generate helpful content but could also produce harmful, misleading, or unhelpful outputs. Instruction tuning – fine-tuning on human-written demonstrations of desired behavior – emerged as a solution.

InstructGPT (OpenAI, early 2022) and ChatGPT (OpenAI, November 2022) popularized Reinforcement Learning from Human Feedback (RLHF), where human raters rank model outputs to train a reward model, and the LLM is then fine-tuned via reinforcement learning against that reward model.

ChatGPT’s release marked a watershed moment, bringing LLMs to mainstream public awareness. It reached 100 million users within two months.

=== The Open-Source and Efficiency Era (2023-Present)

Llama 1 & 2 (Meta, 2023) made capable LLMs available for research and commercial use, democratizing access.

Mistral (2023) demonstrated that smaller models (7B-13B parameters) could rival much larger models with better training techniques.

DeepSeek-V3 (December 2024) achieved GPT-4-level performance for approximately $5.6 million in training costs, using Mixture-of-Experts (MoE) and algorithmic innovations to dramatically reduce compute requirements.

DeepSeek-R1 (January 2025) pioneered open models with explicit reasoning transparency, showing chain-of-thought to users.

The “Inference Time Compute” Shift (2025-2026) – A growing realization that scaling inference-time compute (allowing models to “think longer”) could be as valuable as scaling training compute. OpenAI’s o1 series and Anthropic’s Claude Opus with configurable thinking budgets exemplify this trend.

=== The Current Landscape (2026)

LLMs have become a core technology stack, integrated into search engines, productivity software, developer tools, and countless applications. Key trends include:

  • Mixture-of-Experts (MoE) architectures for efficient scaling
  • Reasoning models with visible chain-of-thought
  • Multimodality (text, image, audio, video)
  • Long context windows (up to 1-2 million tokens)
  • Agentic capabilities (autonomous planning and tool use)
  • Open-weight models competing with closed-source leaders

== Architecture and Inner Workings

=== The Transformer Core

All modern LLMs use variants of the Transformer architecture:

[source,text]
----
Input Text → Tokenization → Embedding + Positional Encoding
        ↓
[Self-Attention Layer]
        ↓
[Feed-Forward Network]
        ↓
[Normalization & Residual Connections]
        ↓
(Repeat 32-120 times)
        ↓
[Output Probabilities] → Generated Text
----

Tokenization – Text is converted to tokens (subword units, typically ~30,000-100,000 unique tokens in the vocabulary).
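As a rough illustration of how text becomes token IDs, the sketch below splits a string by greedy longest-match against a tiny vocabulary. The vocabulary and the `tokenize` helper are invented for this example; production tokenizers learn their subword units from data (for instance with byte-pair encoding) and handle unknown characters more gracefully.

```python
# Toy vocabulary: hypothetical subword pieces mapped to integer IDs.
VOCAB = {"un": 0, "believ": 1, "able": 2, "token": 3, "ization": 4, " ": 5}


def tokenize(text, vocab):
    """Greedy longest-match split of `text` into vocabulary IDs."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece matches {text[i]!r}")
    return ids


token_ids = tokenize("unbelievable tokenization", VOCAB)
print(token_ids)
```

Two words become six tokens here, which is why token counts, not word counts, determine context usage and API cost.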

Embedding layer – Each token ID maps to a dense vector (e.g., 4,096 to 16,384 dimensions).

Positional encoding – Self-attention on its own has no notion of token order (permuting the input tokens simply permutes the outputs), so positional information is added to the embeddings.
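One way to inject order is the fixed sinusoidal scheme from the original Transformer paper, sketched below in plain Python. The function names are chosen for this example, and many current models use learned or rotary position schemes instead, so this is only one possible instance of the idea.

```python
import math


def sinusoidal_position(pos, d_model):
    """Return the d_model-dimensional sinusoidal encoding for position `pos`."""
    encoding = []
    for i in range(d_model):
        # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
        angle = pos / (10000 ** ((i - i % 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


def add_positions(embeddings):
    """Add a position vector to each token embedding (a list of lists)."""
    d_model = len(embeddings[0])
    return [
        [e + p for e, p in zip(vec, sinusoidal_position(pos, d_model))]
        for pos, vec in enumerate(embeddings)
    ]


# Two identical token embeddings become distinguishable after the addition.
same_token = [0.5, -0.2, 0.1, 0.3]
encoded = add_positions([same_token, same_token])
```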

Self-attention – For each token, the model computes attention scores to all tokens in the context, weighting contributions of different tokens. This allows the model to refer to relevant context regardless of distance.
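The computation can be sketched as scaled dot-product attention over small Python lists. For brevity this example feeds the same vectors in as queries, keys, and values; a real layer would first apply separate learned projections, and the `causal` flag shown here is an assumption of the sketch that mimics the masking used by decoder-only models.

```python
import math


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for i, q in enumerate(queries):
        # With causal masking, token i may only look at tokens 0..i.
        visible = range(i + 1) if causal else range(len(keys))
        weights = softmax([dot(q, keys[j]) / scale for j in visible])
        out = [
            sum(w * values[j][d] for w, j in zip(weights, visible))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(x, x, x, causal=True)
```

With the causal mask, the first token can only attend to itself, so its output equals its own value vector.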

Feed-forward networks (FFNs) – Per-position neural networks that process the attention output.

Residual connections – Allow gradients to flow directly through the network, enabling effective training of very deep models.

Layer normalization – Stabilizes training and improves convergence.

=== Key Architectural Variations

Decoder-only (GPT family, Llama, Claude, DeepSeek) – Autoregressive models that generate text token-by-token, each new token attending to previous tokens. Dominant for generation tasks.

Encoder-only (BERT, RoBERTa) – Use bidirectional attention, producing contextualized representations. Best for classification and understanding tasks.

Encoder-decoder (T5, BART) – Combine both, useful for sequence-to-sequence tasks like translation and summarization.

Mixture-of-Experts (MoE) – Only a subset of parameters (“experts”) are activated per token, reducing compute costs while maintaining large total parameter counts. DeepSeek-V3 (671B total, 37B active per token) exemplifies this approach.
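A minimal sketch of the routing idea follows, assuming top-k selection with a softmax over the selected experts (one common design; individual models differ in the details). The experts are stand-in scaling functions and the router logits are supplied by hand, whereas a real router computes them from the token with a learned layer.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    chosen = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)[:k]
    peak = max(router_logits[e] for e in chosen)
    exps = {e: math.exp(router_logits[e] - peak) for e in chosen}
    total = sum(exps.values())
    return {e: w / total for e, w in exps.items()}


def moe_layer(token, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = top_k_route(router_logits, k)
    output = [0.0] * len(token)
    for expert_id, gate in gates.items():
        expert_out = experts[expert_id](token)
        output = [o + gate * v for o, v in zip(output, expert_out)]
    return output, sorted(gates)


# Four hypothetical experts, each a simple scaling function.
experts = [lambda t, s=s: [s * v for v in t] for s in (1.0, 2.0, 3.0, 4.0)]
output, used = moe_layer([1.0, 1.0], experts, router_logits=[0.1, 2.0, 0.3, 2.0])
```

Only two of the four experts run for this token, which is the source of the compute savings.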

Attention variants – Multi-Query Attention, Grouped-Query Attention, and Multi-Head Latent Attention (MLA, developed by DeepSeek) reduce memory overhead.

=== Training Stages

Stage 1: Pre-training

The model processes massive text corpora (trillions of tokens) with a simple objective: predict the next token given previous tokens (causal language modeling) or predict masked tokens (masked language modeling).
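The causal objective amounts to a cross-entropy loss in which the prediction at each position is scored against the token that actually follows. The sketch below assumes the model's logits are already available as plain lists; `next_token_loss` is a name chosen for this example.

```python
import math


def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy of predicting token t+1 from positions 0..t."""
    total = 0.0
    # The prediction at position t is scored against the token at t + 1.
    for logits, target in zip(logits_per_position, token_ids[1:]):
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_norm - logits[target]  # equals -log softmax(logits)[target]
    return total / (len(token_ids) - 1)


tokens = [2, 0, 1]
# A model that knows nothing assigns equal logits to a 3-token vocabulary.
uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
loss = next_token_loss(uniform, tokens)
```

An uninformed model over a vocabulary of size V scores log(V); training drives the loss below that baseline.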

Data sources typically include: Common Crawl (web pages), books, academic papers, code repositories, Wikipedia, social media, and news articles.

Training cost: A 70B parameter model may require 1,000-3,000 GPUs running for months, costing $5-50 million. Smaller open models can be trained for $100,000-$500,000.

Stage 2: Fine-tuning

Supervised Fine-Tuning (SFT) – Training on human-written demonstrations of desired behavior (e.g., instruction-response pairs).

Reinforcement Learning from Human Feedback (RLHF) – Human raters rank multiple model outputs; a reward model learns to predict human preferences; the LLM is fine-tuned via reinforcement learning (typically PPO) to maximize reward.

Constitutional AI (Anthropic) – Uses AI feedback rather than human feedback, with a “constitution” defining ethical principles.

Direct Preference Optimization (DPO) – A simpler alternative to RLHF that optimizes directly from preference data without a separate reward model.
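For a single preference pair, the DPO objective can be sketched as below. The inputs are assumed to be summed log-probabilities of the chosen and rejected responses under the model being trained and under a frozen reference model; the value of `beta` and the example numbers are illustrative only.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen - ref_chosen
    rejected_margin = policy_rejected - ref_rejected
    logit = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logit)); the guard avoids overflow for very negative logits.
    return math.log1p(math.exp(-logit)) if logit > -30 else -logit


# Policy identical to the reference: the loss is log(2).
unchanged = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has shifted probability toward the chosen response: the loss drops.
improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)
```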

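Quantization – Reducing the numerical precision of the weights (e.g., from 16-bit to 8-bit or 4-bit) to enable deployment on consumer hardware. Strictly speaking, this is a post-training deployment step rather than a form of fine-tuning.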
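The simplest form of the idea is symmetric per-tensor quantization to 8-bit integers, sketched here with plain lists. Practical schemes usually quantize per channel or per small block and use more careful rounding, so treat this as an illustration of the principle rather than a deployment recipe.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    # Fall back to a scale of 1.0 if every weight is zero.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Map the stored integers back to approximate float weights."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored in one byte plus a shared scale, at the cost of a rounding error of at most half the scale.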

== Emergent Abilities

As models scale beyond certain parameter thresholds, previously unseen abilities “emerge” – a phenomenon not predicted by simple scaling laws but observed empirically:

  • In-context learning – Learning new tasks from examples in the prompt without parameter updates (emerges around 10B-100B parameters)

  • Chain-of-thought reasoning – Solving multi-step problems by generating intermediate reasoning steps (emerges around 100B parameters)

  • Instruction following – Understanding and executing natural language commands (enabled by instruction tuning, not raw scale alone)

  • Code understanding and generation – Writing syntactically correct code across multiple languages

  • Theory of mind – Inferring others’ mental states, beliefs, and intentions

  • Basic math and symbolic reasoning – Arithmetic, algebra, logic puzzles

  • Translation – Between languages, including low-resource languages

  • Summarization – Distilling long documents to key points

  • Question answering – Open-domain factual recall

== Evaluation and Benchmarks

LLMs are evaluated on hundreds of benchmarks measuring different capabilities:

=== Knowledge and Understanding

  • MMLU (Massive Multitask Language Understanding) – Multiple-choice questions across 57 subjects (STEM, humanities, social sciences)
  • MMLU-Pro – Harder version with more challenging questions
  • ARC (AI2 Reasoning Challenge) – Grade-school science questions requiring reasoning
  • GPQA (Graduate-Level Google-Proof Q&A) – Expert-level questions in biology, physics, chemistry

=== Reasoning

  • GSM8K – Grade-school math word problems (about 8,500 examples)
  • MATH – Competition-level math problems across 5 difficulty levels
  • HumanEval – Python code generation correctness
  • SWE-bench – Real-world software engineering tasks
  • BIG-bench – 200+ diverse reasoning tasks

=== Safety and Alignment

  • TruthfulQA – Measuring truthfulness and avoiding falsehoods
  • Safety benchmarks – Measuring refusal of harmful requests
  • Bias and fairness – Stereotyping and discrimination metrics
  • Hallucination detection – Measuring factuality

=== Capability Aggregates

  • MMLU benchmark – One of the most widely cited scores
  • Artificial Analysis Intelligence Index – Aggregated capability measure
  • LMSys Chatbot Arena – Elo-based human preference ranking
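An Elo-style ranking turns pairwise human votes into ratings. The sketch below shows the textbook Elo update; the starting rating of 1000 and the K-factor of 32 are assumptions for illustration, and the leaderboard's actual methodology may differ.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a, rating_b, a_won, k=32):
    """Return both ratings after one pairwise human vote."""
    expected_a = expected_score(rating_a, rating_b)
    delta = k * ((1.0 if a_won else 0.0) - expected_a)
    return rating_a + delta, rating_b - delta


# Two equally rated models; a rater prefers model A's answer.
model_a, model_b = update(1000.0, 1000.0, a_won=True)
```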

== Notable LLMs

|===
| Model | Organization | Parameters | Key Innovations

| GPT-3 (2020) | OpenAI | 175B | In-context learning at scale

| BERT (2018) | Google | 340M | Bidirectional, masked LM

| T5 (2019) | Google | 11B | Unified text-to-text framework

| Gopher (2021) | DeepMind | 280B | Scaling study

| Chinchilla (2022) | DeepMind | 70B | Optimal scaling laws (more data, not just size)

| PaLM (2022) | Google | 540B | Pathways architecture

| Llama 1/2/3 (2023-2024) | Meta | 7B-405B | Open weights, strong efficiency

| GPT-4 (2023) | OpenAI | ~1.8T (reportedly) | Multimodal (images), RLHF

| Claude 3 Opus (2024) | Anthropic | Undisclosed | Constitutional AI, long context

| Gemini 1.5/2.0 (2024-2025) | Google | Undisclosed | Up to 2M context, native multimodality

| Mistral 7B/8x7B (2023-2024) | Mistral AI | 7B; 47B total (about 13B active) for 8x7B | Small, efficient, open

| DeepSeek-V3 (2024) | DeepSeek AI | 671B (37B active) | MoE efficiency, $5.6M training

| DeepSeek-R1 (2025) | DeepSeek AI | 671B (37B active) | Visible reasoning, reinforcement learning

| GPT-5.x (2025-2026) | OpenAI | Undisclosed | “Reasoning” models with extended thinking

| Claude Opus 4.8 (2026) | Anthropic | Undisclosed | Agentic coding, configurable thinking
|===

== Training Infrastructure and Costs

=== Hardware Requirements

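Training a 70B parameter model typically requires:

  • 1,000-3,000 GPUs (Nvidia H100 or comparable)
  • 30-90 days of continuous training
  • On the order of 10^24 floating-point operations in total (roughly 6 × parameters × training tokens), delivered by a cluster sustaining around 1 ExaFLOP/s (10^18 floating-point operations per second)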

Cost breakdown (estimate for a 70B model in 2024):

  • GPU cluster rental: $5-15 million
  • Data storage and networking: $500,000-$1 million
  • Engineering and research staff: $2-5 million
  • Total: $7-20 million
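These figures can be sanity-checked with the common approximation that training costs about 6 floating-point operations per parameter per token. Every input in the sketch below other than the 70B parameter count is an assumption chosen for illustration (token count, per-GPU throughput, utilization, and rental price), so the result is an order-of-magnitude estimate only.

```python
def training_estimate(params, tokens, gpus, flops_per_gpu, utilization,
                      usd_per_gpu_hour):
    """Rough training time and cost from the 6 * N * D compute approximation."""
    total_flops = 6 * params * tokens
    cluster_flops_per_second = gpus * flops_per_gpu * utilization
    seconds = total_flops / cluster_flops_per_second
    gpu_hours = gpus * seconds / 3600
    return {
        "total_flops": total_flops,
        "days": seconds / 86400,
        "gpu_hours": gpu_hours,
        "cost_usd": gpu_hours * usd_per_gpu_hour,
    }


estimate = training_estimate(
    params=70e9,            # 70B parameters
    tokens=10e12,           # assumed 10 trillion training tokens
    gpus=2000,              # within the 1,000-3,000 range above
    flops_per_gpu=1e15,     # assumed ~1 PFLOP/s peak per accelerator
    utilization=0.4,        # assumed 40% of peak actually achieved
    usd_per_gpu_hour=2.5,   # assumed rental price
)
print(f"{estimate['total_flops']:.1e} FLOPs, {estimate['days']:.0f} days, "
      f"${estimate['cost_usd'] / 1e6:.1f}M")
```

Under these assumptions the run lands at roughly two months and several million dollars of GPU rental, in line with the ranges quoted above.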

DeepSeek-V3 reportedly achieved performance comparable to GPT-4 for an estimated $5.6 million (a figure that covers the final training run rather than all research and development), suggesting that algorithmic innovation can dramatically reduce costs. Key factors in its efficiency:

  • Mixture-of-Experts (only 37B of 671B parameters active per token)
  • FP8 mixed precision training
  • Optimized pipeline parallelism
  • Elimination of auxiliary losses for load balancing

=== Inference Costs

Running LLMs at scale (e.g., ChatGPT serving millions of users) requires substantial infrastructure. Inference costs are measured per token, with typical API pricing:

|===
| Model | Input Price (per 1M tokens) | Output Price (per 1M tokens)

| GPT-4o mini | $0.15 | $0.60

| DeepSeek-V3 | $0.27 | $1.10

| Claude Sonnet | $3.00 | $15.00

| GPT-4o | $2.50 | $10.00

| Claude Opus | $5.00 | $25.00

| OpenAI o1 | $15.00 | $60.00
|===
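Per-token pricing makes the cost of a request simple arithmetic. The sketch below uses three rows from the table above; the workload of 2,000 prompt tokens and 500 generated tokens is hypothetical.

```python
# Prices in USD per 1M tokens, copied from the table above: (input, output).
PRICES = {
    "GPT-4o mini": (0.15, 0.60),
    "DeepSeek-V3": (0.27, 1.10),
    "GPT-4o": (2.50, 10.00),
}


def request_cost(model, input_tokens, output_tokens):
    """Cost in USD of one request at the listed per-million-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 2,000 prompt tokens and 500 generated tokens.
cost = request_cost("GPT-4o", 2_000, 500)
print(f"${cost:.4f} per request")
```

Because output tokens are priced several times higher than input tokens, long generations tend to dominate the bill even when prompts are large.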

Techniques to reduce inference costs include:

  • Quantization (4-bit, 8-bit)
  • Speculative decoding – Using a small draft model to propose multiple tokens, which the large model then verifies
  • KV caching – Reusing the attention keys and values already computed for earlier tokens
  • Batch processing – Combining multiple requests
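The saving from KV caching can be illustrated by counting how often each token's keys and values are computed during generation. The `project` function below is a stand-in for learned key and value projections, and the attention step itself is omitted; this is a sketch of the bookkeeping, not an inference engine.

```python
def project(token_vec, scale):
    """Stand-in for a learned key or value projection (hypothetical)."""
    return [scale * v for v in token_vec]


class KVCache:
    """Stores keys and values of past tokens so each is projected only once."""

    def __init__(self):
        self.keys, self.values = [], []
        self.projections = 0

    def append(self, token_vec):
        self.keys.append(project(token_vec, 0.5))
        self.values.append(project(token_vec, 2.0))
        self.projections += 1


def decode_with_cache(token_vecs):
    cache = KVCache()
    for vec in token_vecs:
        cache.append(vec)  # only the newest token is projected
        # Attention over cache.keys and cache.values would run here.
    return cache.projections


def decode_without_cache(token_vecs):
    projections = 0
    for step in range(1, len(token_vecs) + 1):
        projections += step  # every token seen so far is projected again
    return projections


steps = [[1.0, 0.0]] * 8
print(decode_with_cache(steps), decode_without_cache(steps))
```

Without a cache the projection work grows quadratically with sequence length; with one it grows linearly, at the price of memory that scales with context size.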

== Limitations and Challenges

=== Hallucination

LLMs frequently generate information that is factually incorrect, invented, or inconsistent with their training data. Unlike humans, they cannot reliably distinguish what they “know” from what they “invent.” Measured hallucination rates depend heavily on the benchmark and task; reported figures for leading models range from 26% (Claude Haiku Thinking) to 58% (Claude Opus).

=== Reasoning Limitations

Despite chain-of-thought capabilities, LLMs struggle with:

  • Genuine mathematical reasoning (vs. pattern matching)
  • Long-horizon planning (many interdependent steps)
  • Counterfactual reasoning
  • Physical common sense

=== Context Management

While context windows now reach millions of tokens, models often lose track of relevant information in the middle of long contexts (“lost in the middle” phenomenon) and may not effectively utilize all available context.

=== Catastrophic Forgetting

When fine-tuned for new capabilities, models may lose previously learned abilities. Mitigation strategies include replaying original training data and using LoRA (Low-Rank Adaptation) for targeted modifications.
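LoRA limits forgetting by freezing the original weight matrix W and training only a low-rank update, so the layer computes W x + (alpha / r) · B (A x). The sketch below uses tiny hand-written matrices; the dimensions, `alpha`, and the values in A and B are illustrative assumptions.

```python
def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def lora_forward(x, w_frozen, lora_a, lora_b, alpha):
    """Compute W x + (alpha / r) * B (A x) with W kept frozen."""
    rank = len(lora_a)                          # A has shape (r, d_in)
    base = matvec(w_frozen, x)                  # original behavior, never updated
    update = matvec(lora_b, matvec(lora_a, x))  # B has shape (d_out, r)
    return [b + (alpha / rank) * u for b, u in zip(base, update)]


w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
a = [[0.1, 0.2, 0.3]]            # rank r = 1
b_zero = [[0.0], [0.0], [0.0]]   # B starts at zero, so the adapter is a no-op
x = [1.0, 2.0, 3.0]

before = lora_forward(x, w, a, b_zero, alpha=8)
after = lora_forward(x, w, a, [[1.0], [0.0], [0.0]], alpha=8)
```

Here the adapter holds 6 trainable numbers against 9 in the full matrix; at realistic layer sizes the ratio is far smaller, and removing the adapter restores the original model exactly.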

=== Computational Demands

Training and running LLMs requires enormous energy. Training GPT-3 consumed approximately 1,300 MWh (equivalent to 130 US households for a year). Inference at ChatGPT scale adds substantial ongoing energy consumption.

=== Bias and Fairness

LLMs can perpetuate and amplify biases present in training data, producing stereotyped or prejudiced outputs. Mitigation requires careful data curation, debiasing techniques, and ongoing monitoring.

=== Security Vulnerabilities

  • Prompt injection – Users craft inputs that override system instructions
  • Jailbreaks – Bypass safety guardrails with creative prompting
  • Data poisoning – Malicious training data corrupts model behavior
  • Model extraction – Attackers infer model parameters through queries

== Example Snippets

=== Basic Text Generation (Conceptual)

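[source,python]
----
# Conceptual example - actual implementations vary by model.
# `model`, `tokenize`, `detokenize`, and EOS_TOKEN are placeholders for
# whatever the chosen framework provides.
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def generate_text(model, prompt, max_tokens=100):
    """Simple autoregressive generation loop."""
    tokens = tokenize(prompt)

    for _ in range(max_tokens):
        # Model predicts logits for the next token
        logits = model.forward(tokens)

        # Convert to probabilities (temperature-controlled)
        probs = softmax(logits, temperature=0.8)

        # Sample the next token ID from the distribution
        next_token = random.choices(range(len(probs)), weights=probs)[0]

        tokens.append(next_token)

        # Stop if the model predicts end-of-sequence
        if next_token == EOS_TOKEN:
            break

    return detokenize(tokens)
----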

=== Prompting Techniques

Zero-shot prompting

[source,text]
----
Classify the sentiment of this review: “I absolutely loved this movie!”
Output: Positive
----

Few-shot prompting (in-context learning)

[source,text]
----
Review: “This product broke after one use.”
Sentiment: Negative
Review: “Works perfectly, very satisfied.”
Sentiment: Positive
Review: “It’s okay, nothing special.”
Sentiment: Neutral
Review: “Best purchase I’ve ever made!”
Sentiment:
----

Chain-of-thought (CoT) prompting

[source,text]
----
Question: If I have 3 apples and buy 2 more, but then give away half, how many do I have?
Let’s think step by step:
1. Start with 3 apples
2. Buy 2 more: 3 + 2 = 5 apples
3. Give away half: 5 ÷ 2 = 2.5 apples
Answer: 2.5 apples
----

Self-consistency CoT – Sample multiple reasoning chains and take the majority answer.
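The voting step can be sketched as follows, assuming the final answers have already been extracted from several sampled reasoning chains. The sample answers are made up for illustration, and `self_consistent_answer` is a name chosen for this example.

```python
from collections import Counter


def self_consistent_answer(sampled_answers):
    """Return the most common final answer and its share of the votes."""
    votes = Counter(sampled_answers)
    answer, count = votes.most_common(1)[0]
    return answer, count / len(sampled_answers)


# Hypothetical final answers extracted from five sampled reasoning chains.
answers = ["2.5", "2.5", "3", "2.5", "2"]
answer, agreement = self_consistent_answer(answers)
print(answer, agreement)
```

The agreement share doubles as a rough confidence signal: a narrow majority suggests the question deserves more samples or a different approach.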

Tree-of-Thoughts (ToT) – Explore multiple reasoning branches simultaneously, backtracking when branches fail.

== Ethical and Societal Implications

=== Risks

  • Misinformation generation – Creating convincing fake news, reviews, or social media content
  • Automated disinformation campaigns – Scaling propaganda and manipulation
  • Job displacement – Affecting writing, customer service, translation, programming, design
  • Educational integrity – Students submitting AI-generated work as their own
  • Privacy violations – Models may memorize and regurgitate personal information
  • Copyright infringement – Training on copyrighted works without permission
  • Manipulation and persuasion – Personalized influence at scale
  • Dual-use risks – Assisting with biological, chemical, or cybersecurity attacks

=== Benefits

  • Democratized knowledge – Making expertise accessible through natural conversation
  • Productivity gains – Automating routine cognitive work
  • Accessibility – Enabling communication for those with disabilities
  • Education – Personalized tutoring available 24/7
  • Scientific acceleration – Literature review, hypothesis generation, code writing
  • Creative assistance – Brainstorming, drafting, iteration
  • Language preservation – Documenting and translating endangered languages
  • Healthcare support – Medical literature analysis, clinical decision support

=== Mitigation Approaches

  • RLHF and Constitutional AI – Aligning with human preferences and principles
  • Red teaming – Adversarial testing before release
  • External audits – Independent safety evaluations
  • Watermarking – Identifying AI-generated content
  • Rate limiting and monitoring – Detecting misuse patterns
  • Open-weight scrutiny – Public inspection enabling security research (but also enabling misuse)
  • Regulation – Government frameworks (e.g., EU AI Act)

== Future Directions

  • Test-time compute scaling – Trading inference time for improved reasoning (already emerging with OpenAI o1 and Claude Opus 4)

  • Agentic systems – LLMs that plan, use tools, execute code, and self-correct over multiple steps

  • Multimodality – Seamless integration of text, image, audio, video, possibly other modalities (sensor data, scientific data)

  • Longer contexts – Moving from 1 million to 10 million+ tokens, enabling processing of entire codebases, libraries, or personal histories

  • Efficiency – Continued algorithmic improvements reducing training and inference costs (DeepSeek’s $5.6M training is a preview)

  • Neuromorphic and specialized hardware – AI chips designed specifically for Transformer inference

  • Retrieval-augmented generation (RAG) – Better integration with external knowledge bases to reduce hallucination

  • Memory and continual learning – Models that retain information across sessions without retraining

  • Verification and fact-checking – Self-consistency checks, external tool use (search, calculators, code execution)

  • Specialized LLMs – Models optimized for specific domains (medicine, law, science, engineering)

  • Local LLMs – Powerful models running on consumer devices (laptops, phones)

  • Quantization advances – Sub-4-bit inference with minimal quality loss

== Further Resources

  • “Attention Is All You Need” (Vaswani et al., 2017) – The original Transformer paper
  • “Scaling Laws for Neural Language Models” (Kaplan et al., 2020)
  • “Training Compute-Optimal Large Language Models” (Chinchilla, 2022)
  • DeepSeek-V3 Technical Report (December 2024)
  • Anthropic Constitutional AI paper
  • Hugging Face Transformers library documentation

== License

This article is informational. LLMs themselves are released under various licenses depending on the developer. Open-weight models (Llama, Mistral, DeepSeek) are distributed under licenses that range from permissive (MIT, Apache 2.0) to custom community licenses that carry usage restrictions. Closed models (GPT-4, Claude, Gemini) are proprietary commercial products with usage-based pricing and terms of service restrictions.
