Uncharted Territory in LLM Training: Principles, Pathways, and New Practices

From pre-training to post-training, and from RLHF to Agent training: A panoramic overview of LLM training accessible to non-experts. Author: Tw93

This post is reposted from @Tw93’s X Article.

TL;DR

After writing What You Don’t Know About Claude Code: Architecture, Governance & Engineering Practices and What You Don’t Know About Agents: Principles, Architecture & Engineering Practices, I wanted to write a third article. This time, I plan to challenge myself to break down exactly how Large Model training works. I aim to make this article understandable even for readers without a professional background.

Looking ahead to 2026, the real differentiator in model performance will gradually shift away from pre-training itself to the extensive stages that follow: post-training, evaluation, rewards, Agent training, and distillation. Every single step influences the actual user experience. When you notice a model suddenly getting stronger, it’s often because these components have been optimized together, rather than due to a single factor.

The following section follows the LLM training pipeline sequence, focusing on how vendors improve final production performance through the backend training stack.

LLM Training is Actually a Pipeline

In the past few years, model progress was typically explained by the piling up of parameters, data, and compute. However, many of the improvements users actually feel don’t come from feeding in more base corpus, but from the entire training process after pre-training. How the model speaks, listens to instructions, reasons, and uses tools—these capabilities don’t naturally emerge just by feeding it more internet text.

InstructGPT once gave a straightforward example: a model with only 1.3B parameters that underwent alignment and preference optimization could beat GPT-3 (175B) in human preference evaluations. Despite a two-order-of-magnitude difference in parameters, users preferred the much smaller model. The backend of training truly rewrites user perception.

The training process is essentially a pipeline where data, algorithms, systems, and feedback are highly coupled. A change in one layer usually propagates to others. In 2026, model capability and industrial value will increasingly concentrate in the layers following pre-training.

This is also why we feel Doubao doesn’t seem to chase leaderboard rankings much, yet feels more intuitive to use in daily life: its post-training was done properly.

The six layers above are just for division of labor; the nine stages in the diagram below are a more detailed version: raw data and system recipes are separated, and Agent harness and Deployment are subdivisions of the backend. There are also two feedback loops running through everything: production traffic flows back to data engineering, and offline evaluation results flow back to pre-training.

Pre-training is Just the Foundation

Pre-training remains the starting point of the training chain. Only by understanding what it actually does can we comprehend what each subsequent layer adds. Without this step, there is no language modeling capability, no knowledge compression, and no room for transferring subsequent capabilities. In engineering terms, it does more than just teach the model to predict the next token: it internalizes language distribution, compresses knowledge and patterns from massive text into parameters, and leaves room for activating future capabilities. “Next token prediction” only describes the training form; it doesn’t explain why, once the scale increases, models suddenly develop abilities they didn’t have before.

After GPT-3, much model tuning work considers budget and ratios more carefully. Bigger isn’t always better. There is a ratio issue between parameter count, number of training tokens, and total compute budget. Many models aren’t necessarily too small, but rather under-trained—they haven’t been trained to the optimal point given a fixed budget.

In real training decisions, the practical question is: if someone gave you 10,000 H100s and one month, how would you train a good enough open-source model? Scaling Laws here act more like a budget allocation tool than an abstract curve from a paper. You still need to calm down and consider: Should the next round stack more parameters or feed more data? Is the current model lacking capability or just under-trained? With a limited GPU budget, what ratio is most worthwhile?

Pre-training is more like laying the foundation for model capability, determining knowledge scope, generalization potential, and pattern induction. It also determines whether there’s room for post-training to utilize. But whether it listens to instructions, cooperates with the user, or runs stably on critical tasks—these are things pre-training can’t control.

The pre-training phase doesn’t just decide how much knowledge is learned; it also pre-decides what the model can become in the future. The tokenizer’s splitting method directly affects subsequent training, and the target context window length must be settled early. Whether to continue with multimodal pre-training, or whether to make “single-card operability” a requirement from the start: these trade-offs are written into the recipe during the training phase, not patched in as features at release. Gemma 3 emphasized single-accelerator deployment, 128K context, vision capabilities, and quantization simultaneously, reflecting this type of trade-off. The capabilities users eventually see, like running locally, viewing images, and understanding long documents, are largely determined during the training phase.

The Chinchilla compute-optimal point for an 8B parameter model is about 200B tokens. But Llama3 8B was actually trained on 15T tokens, about 75 times more. This kind of over-training recipe trades extra training compute for higher capability density at the same parameter size, yielding a smaller, cheaper-to-infer model. To measure this, total FLOPs (floating point operations) is a more reliable yardstick than parameter count. The diagram below intuitively shows this gap.
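The figures above can be checked with a few lines of arithmetic. This is only a rough sketch: the `6 * parameters * tokens` estimate of training FLOPs is a common rule of thumb assumed here, not something stated in the article, and the function name is made up for illustration.

```python
# Rough training-compute arithmetic for the figures quoted above.
# Assumption: training FLOPs ~= 6 * parameters * tokens (a common rule of
# thumb for dense transformers, not taken from the article).

def training_flops(params: float, tokens: float) -> float:
    """Approximate total training FLOPs for a dense transformer."""
    return 6.0 * params * tokens

PARAMS = 8e9                # 8B-parameter model
CHINCHILLA_TOKENS = 200e9   # ~200B tokens, the compute-optimal point quoted above
LLAMA3_TOKENS = 15e12       # 15T tokens reported for Llama3 8B

over_training_ratio = LLAMA3_TOKENS / CHINCHILLA_TOKENS
tokens_per_param = LLAMA3_TOKENS / PARAMS

print(f"over-training ratio:   {over_training_ratio:.0f}x")
print(f"tokens per parameter:  {tokens_per_param:.0f}")
print(f"compute-optimal FLOPs: {training_flops(PARAMS, CHINCHILLA_TOKENS):.2e}")
print(f"actual FLOPs:          {training_flops(PARAMS, LLAMA3_TOKENS):.2e}")
```

The parameter count is identical in both rows; only the FLOPs figure exposes how much more training went into the same-sized model.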

There is also a class of easily overlooked design decisions in the pre-training phase: tokenizer vocabulary size, tokenization strategy, and byte-level encoding methods all have a significant impact. Llama2 had a vocabulary of 32K; after Llama3 expanded it to 128K, sequence length shrank by about 15%, and downstream performance improved accordingly. This impact extends to inference cost and multilingual capabilities. Token efficiency for Chinese, code, and math formulas is determined when the vocabulary is designed. For example, a tokenizer that splits Chinese too finely doesn’t just cost a few extra tokens once; the price of that decision is paid again at every inference step.

The Data Recipe Determines Model Capability

Parameter scale has been an important metric everyone focused on in past years, but in the last two years, a more important thing has emerged: the “Data Recipe.”

This process looks like data cleaning on the surface, but is actually complete data production engineering. Raw data like web pages, code repos, books, and forums must go through text extraction, language recognition, quality filtering, privacy processing, safety filtering, and deduplication before entering pre-training. The diagram below shows the complete funnel processing flow.

If you view data merely as training fuel, it’s easy to conclude that more is better. But data engineering is closer to capability design. What the model sees and doesn’t see, and the proportions of code, math, and encyclopedia—these choices directly affect the distribution of the model’s final capabilities.

Deduplication and pollution control are often ignored, but they have a huge impact on the result. What needs to be handled isn’t just low-quality data, but also repetitive templates, license text, mirrored web pages, and pollution from benchmark leakage. If document-level and line-level dedup aren’t done well, the model will often repeatedly absorb the easiest-to-copy content without necessarily learning the most valuable parts. The effect of many open-source models appears uneven, often due to gaps in data processing quality.
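To make the document-level and line-level dedup mentioned above concrete, here is a minimal sketch. It uses exact hashing after light normalization, which is a simplifying assumption; production pipelines typically add fuzzy matching on top. The function names and the `max_doc_freq` threshold are hypothetical.

```python
import hashlib
from collections import Counter

def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def dedup_documents(docs: list[str]) -> list[str]:
    """Document-level exact dedup: keep the first copy of each normalized document."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(_normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def strip_repeated_lines(docs: list[str], max_doc_freq: int = 2) -> list[str]:
    """Line-level dedup: drop lines that occur in more than max_doc_freq documents
    (license text, navigation templates, and similar boilerplate)."""
    doc_freq = Counter()
    for doc in docs:
        # Count each distinct line at most once per document.
        doc_freq.update({_normalize(line) for line in doc.splitlines() if line.strip()})
    cleaned = []
    for doc in docs:
        lines = [line for line in doc.splitlines()
                 if line.strip() and doc_freq[_normalize(line)] <= max_doc_freq]
        cleaned.append("\n".join(lines))
    return cleaned
```

Running `dedup_documents` first and `strip_repeated_lines` second keeps mirrored pages from inflating the line counts used by the second pass.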

In the last two years, data mixing itself has become a separate research question. Work like Data Mixing Laws focuses not just on how much more data can be collected, but on how the proportions of different data types will steer the model toward specific capability structures.

Synthetic data has also evolved from an auxiliary method to a formal part of the training process. Methods like Self-Instruct where models generate their own instruction data, the distillation trajectories of DeepSeek-R1, and the increasingly obvious synthetic supervision in the Qwen and Kimi series are all moving in the same direction. Every generation of stronger models participates in reconstructing the data seen by the next generation. Early models generate basic instruction data; stronger models generate high-quality reasoning trajectories and CoT data; reasoning models trained with RL then distill these trajectories into smaller dense models. “Dense” means running all parameters, unlike MoE which activates on demand.

The key here is that models often need to form capabilities at a larger scale before they can be compressed into smaller models. The DeepSeek-R1-Distill series is a direct example. Trajectories from a post-RL large model brought clear benefits to dense models from 1.5B to 70B. Llama 3.1 405B was also explicitly used to improve the post-training quality of 8B and 70B models. These aren’t by-products; they are part of the training design.

System and Architecture Constraints: Think Clearly Before Training

Many people understand training as a research problem: how to set the objective function, how to reduce loss, how to modify model structure. But in real large model training, system constraints are very important. This is a distributed system problem, not a single-machine deep learning problem. GPU count, memory bandwidth, parallelism strategy, fault tolerance, and cost—these cannot be optimized only after training is finished; they determine from the very beginning how big you can train, how long the context is, and whether you can run more complex post-training.

MoE is the most typical example at this layer. The multi-expert pattern allows the model to expand total parameters at similar compute cost while controlling the activation cost per token. The price is that routing becomes complex, load balancing becomes difficult, and the infrastructure gets heavier. DeepSeek-V3 and a series of Qwen MoE designs are trade-offs between cost and effect, not pure architectural preferences.

Discussions in recently published recipes are no longer just coarse-grained analysis like model size vs. token ratio. muP allows hyperparameters to migrate from small-scale experiments to large-scale training; WSD learning rate is a schedule that rises first, stabilizes, then decays; plus optimal batch size and higher data-to-parameter ratios—these are starting to appear in formal training reports. These details are becoming the places where models of the same scale really pull away.
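The warmup-stable-decay shape described above can be sketched in a few lines. The linear warmup, linear decay, and every default value below are illustrative assumptions; published recipes differ in the decay shape and in the fractions used.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float = 3e-4,
           warmup_frac: float = 0.01, decay_frac: float = 0.1,
           min_lr: float = 0.0) -> float:
    """Warmup-Stable-Decay schedule: linear warmup, flat plateau, linear decay.
    All default values are hypothetical."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    decay_steps = max(1, int(total_steps * decay_frac))
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Rise: ramp linearly up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable: hold the peak learning rate.
        return peak_lr
    # Decay: interpolate linearly from peak_lr down to min_lr.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress

for s in (0, 500, 950, 1000):
    print(s, wsd_lr(s, total_steps=1000))
```

One practical appeal of this shape is that the plateau has no fixed end: a run can be extended or branched from any plateau checkpoint before the decay phase is applied.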

If we only understand long context, multimodality, and new architectures as product feature points, we miss the constraints on the training side. A target like 128K context directly changes attention cost, batch size, training curriculum (data scheduling order), and parallelism strategy. Multimodality changes not just model structure, but also data mixing (multi-source data ratio), encoder design, and safety evaluation. If “single-card operability” is taken as a hard requirement, parameter count, quantization path, and model family size will all tighten accordingly.

Work like Forgetting Transformer and Kimi’s Attention Residuals are all answering similar questions: how to train for longer context, and how to avoid information dilution as the network deepens. You see a model that can process longer input or is easier to deploy, but during training, it faces a completely different set of constraints.

Compute budget is fixed. For every extra bit spent on model size, training token amount, context length, or serving cost, other directions must yield.

As context lengthens, attention cost explodes directly, and batch size must be compressed; as the model grows, GPU memory goes up, and serving costs rise. These aren’t optional trade-offs, but the result of resource constraints. Most decisions are locked in before training begins.
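A quick calculation shows why attention cost grows so sharply with context. This sketch counts only the two attention matrix multiplications in a standard dense attention layer, ignores causal-masking savings and any efficient-attention tricks, and uses hypothetical model dimensions.

```python
def attention_matmul_flops(seq_len: int, d_model: int, n_layers: int) -> int:
    """Forward FLOPs of the two attention matmuls (Q @ K^T and scores @ V).
    Each costs about 2 * seq_len^2 * d_model per layer, summed over heads."""
    per_layer = 4 * seq_len ** 2 * d_model
    return n_layers * per_layer

# Hypothetical model shape, chosen only for illustration.
D_MODEL, N_LAYERS = 4096, 32

short_ctx = attention_matmul_flops(8_192, D_MODEL, N_LAYERS)
long_ctx = attention_matmul_flops(131_072, D_MODEL, N_LAYERS)

print(f"8K context:   {short_ctx:.2e} FLOPs per sequence")
print(f"128K context: {long_ctx:.2e} FLOPs per sequence")
print(f"ratio: {long_ctx // short_ctx}x for a 16x longer sequence")
```

The quadratic term is why a longer context forces batch size, parallelism strategy, and curriculum to change together rather than one at a time.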

There’s also an engineering reality that is often ignored: training isn’t always stable. Thousands of GPUs run for weeks, then a training loss spike suddenly appears, too large to ignore, forcing a rollback to a checkpoint from days ago and a restart from there.

Besides loss spikes, there are silent errors on a single GPU (no error is raised, but wrong gradients are quietly produced), NVLink bandwidth anomalies, and jitter in inter-node communication; each can pollute several training steps. Being able to quickly detect, isolate, and recover from these in large-scale training is a lab-level engineering capability, not a problem solvable by reading papers.

DeepSeek-V3 specifically mentioned in its technical report that the entire pre-training process saw no irrecoverable loss spikes and did no rollbacks. It is also one of the few cases publicly validating FP8 mixed precision training on ultra-large models. According to public data, the whole process took about 2.788M H800 GPU hours, and pre-training completed 14.8T tokens.

The training system and inference system are closely related, but not the same engineering problem. Training cares about gradients, parallelism, checkpoints, throughput, and cost; inference cares about latency, KV cache (caching history to avoid recomputation), quantization, and service stability.

Post-Training Determines the Gap Users Actually Feel

Many improvements ordinary users truly feel happen after pre-training. Instruction tuning uses labeled instruction-response pairs to supervise the model. It changes the way the model responds, turning requirements like how to take a task, how to organize output, and how to act like a cooperative assistant into supervision signals. A base model may already possess many potential capabilities, but without this step, these capabilities often don’t stably emerge in the form users expect.

Looking further, RLHF, DPO, and RFT are similar in direction, all connecting “what defines a better answer” into the training loop, but via different paths.

RLHF (Reinforcement Learning from Human Feedback) first imitates high-quality answers, then uses preference comparisons for reinforcement.

DPO (Direct Preference Optimization) shortens this path, learning directly from preference comparisons without needing a separate reward model.

RFT (Reinforcement Fine-Tuning) is an interface easier to implement in engineering, putting task definition, grader design, and reward signals into a productized flow.
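To show what “learning directly from preference comparisons” looks like, here is a minimal sketch of the DPO loss for a single preference pair, written in plain Python. The inputs are summed log-probabilities of each answer under the policy being trained and under a frozen reference model; the argument names and the `beta` default are illustrative assumptions.

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair:
    -log(sigmoid(beta * (chosen_margin - rejected_margin)))."""
    # How much more (in log-prob) the policy likes each answer than the reference does.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# Policy identical to the reference: loss is log(2), no preference learned yet.
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy has shifted probability toward the chosen answer: loss drops.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

No reward model appears anywhere in the function: the preference pair and the reference log-probabilities carry the whole signal.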

When talking about post-training today, covering only SFT or RL isn’t enough. The harder parts are how evaluation is set up, how scores are given, and which answers are considered worth optimizing for. SFT (Supervised Fine-Tuning) teaches not just knowledge, but also style. Data length, format, whether citations are included, whether bulleted expression is preferred: all of these significantly affect the model’s final output form. Many users think they are comparing capabilities, but often they are just comparing style differences. In addition, preference evaluation naturally favors longer answers, and it is easy to mistake earnest-looking long output for reliability. So judging post-training by leaderboards alone isn’t enough; you must also look at real task results, cost, and stability.

DeepSeek-R1’s Four-Stage Training Recipe

Modern post-training is a multi-stage pipeline. In public materials, DeepSeek-R1’s recipe is the clearest. It advances in four stages:

Stage 1 is cold-start SFT. Before doing reinforcement learning, it warms up with a small amount of high-quality Chain of Thought (CoT) data. DeepSeek-R1-Zero proved that doing RL directly from a base model (a raw model after pre-training without alignment) is feasible, but a model trained purely on RL will repeat itself constantly, have confused language, and be very unreadable. Cold-start SFT gives RL a more stable starting point, first locking down format and language consistency. This isn’t a redundant step.

Stage 2 does reinforcement learning in verifiable fields like math, code, and logic, using GRPO as the training algorithm and program-checkable correctness as the reward signal. The key question is why GRPO was chosen over traditional PPO. PPO (Proximal Policy Optimization) needs an independent value network to estimate the value of the current state, and maintaining two networks simultaneously on a large model is a heavy engineering burden. GRPO samples multiple answers for the same prompt and scores each answer relative to the others in its group, which replaces absolute value estimation and requires no independent value network. It is much simpler engineering-wise. The DeepSeek series and Cursor Composer 2’s RL infrastructure both adopted solutions close to GRPO.
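The group-relative idea can be sketched as follows: sample several answers for one prompt, score them, and normalize each reward against the group mean and standard deviation. This shows only the advantage computation, not the full policy update, and the `eps` constant is an assumption added to avoid division by zero.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Advantage of each sampled answer relative to its own group.
    No value network is involved: the group itself serves as the baseline."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one math prompt; reward 1.0 means the checker accepted it.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))

# If every answer gets the same reward, the group carries no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

The second call illustrates a practical consequence: prompts that are always solved or never solved contribute zero advantage, so the difficulty distribution of prompts matters.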

Stage 3 performs Rejection Sampling Fine-Tuning. It filters successful trajectories generated by RL into new SFT data and does another round of supervised fine-tuning. This is the bridge between RL and SFT; good trajectories explored by RL are turned into high-quality training samples for the next round of SFT.
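A minimal sketch of the filtering step described above: sample candidates per prompt, keep only those a verifier accepts, and emit them as SFT examples. The `verifier` callable, the dictionary layout, and `max_per_prompt` are hypothetical; a real pipeline would also filter on readability and language consistency.

```python
from typing import Callable

def rejection_sample(prompt_to_candidates: dict[str, list[str]],
                     verifier: Callable[[str, str], bool],
                     max_per_prompt: int = 1) -> list[dict[str, str]]:
    """Turn verified trajectories into SFT examples.
    Candidates that fail the verifier are rejected outright."""
    sft_examples = []
    for prompt, candidates in prompt_to_candidates.items():
        accepted = [c for c in candidates if verifier(prompt, c)]
        # Cap the number kept per prompt so easy prompts do not dominate the dataset.
        for response in accepted[:max_per_prompt]:
            sft_examples.append({"prompt": prompt, "response": response})
    return sft_examples

def ends_with_four(prompt: str, candidate: str) -> bool:
    """Toy verifier for the demo prompt below (hypothetical)."""
    return candidate.strip().endswith("4")

samples = {"What is 2 + 2?": ["I think it is 5", "2 + 2 = 4", "The answer is 4"]}
print(rejection_sample(samples, ends_with_four))
```

Prompts with no accepted candidate simply produce no example, which is one reason the resulting SFT set skews toward problems the RL model can already solve.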

Stage 4 incorporates helpfulness and safety preference feedback to adjust the model into an assistant form that meets release standards.

The four stages depend on each other: cold start lets RL start stably, RL produces high-quality data, rejection sampling turns this data into input for the next round of SFT, and alignment RL completes behavior convergence. From public results, the gap between direct SFT and completing all four stages is usually visible.

Eval, Grader, and Reward are Redefining Training Objectives

The component responsible for converting model output into training scores is called the grader. It easily runs into problems people don’t expect. If it only looks at the final answer, the model quickly learns to take shortcuts; if scoring is too coarse, noise gets amplified by reinforcement learning; and when the leaderboard score goes up, real tasks don’t necessarily follow. Often, users think they are looking at base model differences, but the difference actually lies in how the objective is defined.

Put into the training flow, eval determines what to measure, grader determines how one output turns into a score, and reward determines where the model gets pushed next. Connected together, they form a specific feedback loop: task definition, eval, grader, optimization, rollout, re-evaluation. Rollout refers to the trajectory produced by the model executing the task. If any link in the chain goes astray, subsequent optimization goes astray too.

Only looking at the final result, the model might guess right, or get the right answer along a wrong process. In code, math, and complex reasoning tasks, this problem is especially obvious. If intermediate steps don’t enter feedback, the model often learns not more reliable reasoning, but how to get that final point with higher probability.

So in recent years, more work has moved from traditional RLHF to verified rewards, using programs to directly verify correctness. In verifiable tasks like math, code, and logic, correctness can now be scored directly, relying less on human preference. But verified rewards haven’t completely solved the problem either. Over-optimization, reward overfitting (scoring rules are over-optimized, capability doesn’t truly improve), and mode collapse (output becomes highly single, losing diversity) still appear. The problem just shifts from whether preference labeling is accurate to whether the scoring chain is stable.

The thinking process written out by the model also cannot be taken directly as a complete record of its internal process. In observability experiments on reasoning models, Anthropic found that models use hints given in the prompt without admitting it in the visible CoT; in reward hacking scenarios, they are more likely to patch in a seemingly reasonable explanation. Reward hacking means exploiting loopholes in the scoring system instead of truly completing the task. Visible CoT is better suited as a training and monitoring signal than as the complete truth.

One layer down, the model may even start exploiting the scoring channel itself. Research on reward tampering and alignment faking suggests models might theoretically actively intervene in the scoring process itself. Reward tampering is directly tampering with the reward calculation process; alignment faking is alignment camouflage—surface compliance while hiding misaligned intent.

Once the model has strong enough environment access, what it optimizes is no longer just task results, but could include checklists, reward code, and training relationships themselves. In a 2025 Anthropic experiment, extra reward-hack knowledge was injected into a set of exploitable production coding RL environments, and similar generalization was observed. After learning reward hacking, the model didn’t just continue exploiting on similar tasks, but also showed broader misalignment like alignment faking.

These behaviors are invisible in standard dialogue evaluation, only visible in Agent task environments. The engineering implication is direct: reward, grader, environment isolation, and monitoring must all be treated as part of training design.

At the Agent stage, reward design continues to break down. The final result is just one item; you also need to separately measure process quality, context management, and anti-cheating constraints. Kimi K2.5 rewards effective decomposition and true parallelism; Chroma Context-1 scores relevant documents found during search; Cursor Composer 2 includes summaries from long tasks in the reward, because once the summary distorts, the subsequent context gets led astray.

ORM vs. PRM: Two Paths for Reward Models

In implementation terms, ORM is the Outcome Reward Model, which scores only the final answer. Its signal is sparse and its cost is low, so it is suitable for getting started, but it makes it easier for the model to take shortcuts. PRM is the Process Reward Model, which scores intermediate steps. Its signal is denser and usually stronger for math and code reasoning, but labeling and system costs are much higher. OpenAI saw in math reasoning experiments that PRM not only improved accuracy but also made the process easier to constrain, because every step is supervised. The downside is just as direct: PRM usually costs several times as much as ORM, so most real systems still start with ORM. Only in verifiable tasks like math, code, and logic is it easier to automate PRM, using programs to verify intermediate steps and bypass the manual labeling bottleneck.
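The difference between the two signals shows up clearly on a toy arithmetic task where each step can be checked by a program. This is a deliberately simplified sketch: real ORMs and PRMs are usually learned models, and the step format `(a, op, b, claimed_result)` is invented here for illustration.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def outcome_reward(final_answer: int, expected: int) -> float:
    """ORM-style signal: one sparse score for the final answer only."""
    return 1.0 if final_answer == expected else 0.0

def process_rewards(steps: list[tuple[int, str, int, int]]) -> list[float]:
    """PRM-style signal: one score per intermediate step, verified by program."""
    return [1.0 if OPS[op](a, b) == claimed else 0.0 for a, op, b, claimed in steps]

# Task: compute (2 + 3) * 4, expected answer 20.
sound = [(2, "+", 3, 5), (5, "*", 4, 20)]
lucky = [(2, "+", 3, 6), (6, "*", 4, 20)]   # two wrong steps that land on 20

for name, steps in (("sound", sound), ("lucky", lucky)):
    final = steps[-1][3]
    print(name, "ORM:", outcome_reward(final, 20), "PRM:", process_rewards(steps))
```

Both trajectories receive the same outcome reward, while the per-step scores separate the sound derivation from the one that reached the right number through wrong steps.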

This loop runs completely like this:

Constitutional AI and Deliberative Alignment

Several recent alignment methods are doing the same thing. Anthropic’s Constitutional AI connects human-written principles into training, replacing item-by-item human preference with AI feedback. OpenAI’s Deliberative Alignment puts safety compliance into the reasoning process, letting reasoning capability itself bear part of the safety constraint. The core of Deliberative Alignment is that the model judges the request against safety standards by itself during the reasoning phase, rather than relying only on trained-in reflexive behavior. Both paths are turning alignment from manual labels into part of the internal training objective.

Taking Constitutional AI as an example, the two-stage process is first letting the model self-criticize and revise output according to principles, then using AI feedback to replace item-by-item human preference labeling. Alignment is never a patch hung after training; what the system measures, how it scores, and what it rewards, the model moves in that direction. This itself is the most direct adjustment means in the backend of training.

When it Comes to Agent Training, It’s Not Just Optimizing the Model Anymore

In the past two years, the rapid formation of reasoning models represented by the o1 series and DeepSeek-R1 shows that with stable rewards, reliable verification, and infrastructure in place, RL on language models can indeed significantly improve performance on math, code, and logic tasks.

This simultaneously opens a new dimension: inference compute can also scale. The effect of RL training adds a layer: besides teaching the model to answer, it teaches the model to allocate inference budget, knowing when to think more and when to stop. Moving forward, the difficulty becomes letting the model act continuously in the environment, not just lengthening a single thought.

Qwen’s former model lead Junyang Lin’s reflection on the mixed Thinking and Instruct route is very representative: the difficulty isn’t giving the model a thinking switch, but that the two modes’ objectives are inherently different—one pursues directness, compliance, and low latency, the other pursues more exploration and higher correctness. One step further, the training objective shifts from how long to think before answering to how to allocate budget, receive feedback, and continue pushing the task in action.

At this point, the training object is no longer just a model that answers questions, but a system that can plan, call tools, receive feedback, and maintain coherence in long tasks. So the training stack changes too: browsers, terminals, search, execution sandboxes, memory systems, tool servers, and orchestration frameworks all start entering the training system.

More accurately, the harness is the control program wrapped outside the model. This concept doesn’t just belong to the Agent runtime; the training phase also has it: deciding what input the model sees, in what form it receives feedback, when to trim context, and when to call tools. Prompt construction, memory update, retrieval policy, context editing, and tool orchestration are all here. The environment is no longer just a static verifier, but a layer that both training and deployment must face directly.

Harness: The Control Program Outside the Model

Only when the harness is stable does model training make sense. When tool return values are unstable, the browser environment differs from production, or file system state is not reproducible, the grader produces wrong scores first, and the model then learns to exploit environment holes rather than real capability. When training Agents, you are often debugging the model and the environment at the same time.

The approach of the three companies is also clear: Kimi uses PARL to solve parallel decomposition and credit assignment; Cursor uses self-summarization and real-time RL to reconnect long coding sessions with production traffic back to training; Chroma trains prune_chunks into a strategy itself, letting context pruning directly enter the retrieval process.

In the SFT era, data diversity was paramount. In the Agent era, environment quality is core: stability, authenticity, coverage, difficulty distribution, feedback richness, and anti-exploitability. The training objective changes accordingly: seeking reliability in complete tasks, not just solving one problem correctly. Classic CoT benchmarks don’t cover this.

This change continues to move forward: not just training the model in the runtime harness, even the harness code itself starts to become an object searchable and optimizable by an outer loop.

Case Study: Kimi K2.5’s PARL Parallel Training

Kimi K2.5’s PARL is an engineering case worth breaking down. The route is clear: only train the orchestrator, confine credit assignment to the orchestration layer, and don’t optimize all sub-agents simultaneously.

Reward signals fall into three categories: task success, parallel decomposition, and completion constraints, driving the orchestration layer together. In early training, the r_parallel weight is raised high to encourage exploring parallel strategies first; later it gradually steps down to 0 to avoid treating multi-agent spawning as a shortcut. Evaluation also looks not just at total steps, but at critical path length. Critical path shortening indicates parallelism really worked.
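The two ideas in this paragraph, an annealed parallelism bonus and critical-path evaluation, can be sketched as below. Every weight, the linear annealing shape, and all function names are hypothetical; the article only states that the r_parallel weight starts high and steps down to 0.

```python
def parallel_weight(step: int, total_steps: int,
                    start: float = 0.5, anneal_frac: float = 0.5) -> float:
    """Weight on r_parallel: starts at `start`, decays linearly to 0."""
    anneal_steps = max(1, int(total_steps * anneal_frac))
    return start * max(0.0, 1.0 - step / anneal_steps)

def orchestrator_reward(r_success: float, r_parallel: float, r_finish: float,
                        step: int, total_steps: int,
                        finish_weight: float = 0.1) -> float:
    """Combine task success, parallel decomposition, and completion signals."""
    return (r_success
            + parallel_weight(step, total_steps) * r_parallel
            + finish_weight * r_finish)

def critical_path_length(durations: dict[str, int],
                         deps: dict[str, list[str]]) -> int:
    """Longest dependency chain through a DAG of sub-tasks, in steps."""
    memo: dict[str, int] = {}

    def finish_time(task: str) -> int:
        if task not in memo:
            upstream = max((finish_time(d) for d in deps.get(task, [])), default=0)
            memo[task] = durations[task] + upstream
        return memo[task]

    return max(finish_time(task) for task in durations)

# Two searches run in parallel, then a merge step that depends on both.
durations = {"search_a": 3, "search_b": 4, "merge": 2}
deps = {"merge": ["search_a", "search_b"]}
print("total steps:", sum(durations.values()))
print("critical path:", critical_path_length(durations, deps))
```

Total steps stay at 9 whether or not the searches overlap; only the critical path reflects that the two searches actually ran side by side.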

Case Study: Meta-Harness Outer Layer Optimization

But by 2026, things moved a step further. Meta-Harness explicitly takes harness engineering out as a separate optimization target. It doesn’t optimize weights, but the harness code itself, meaning the prompt construction, retrieval, memory, and state update programs surrounding a fixed model. The numbers at the start of the paper are direct: with the same base model, changing only the harness can open up a 6x performance gap on the same benchmark. The set of programs outside the model is no longer just a deployment detail, but a layer where capability is formed.

Its key isn’t adding another abstract optimizer, but writing prior code, scores, and execution traces (logs of tool calls and state changes) all into a filesystem, letting the proposer grep, cat, and compare diffs like writing code, then modifying the harness along failure paths. The proposer is the module proposing harness modifications.

The author judges clearly: many past text optimizers aren’t effective enough for harness-like long-term, stateful programs, largely because they only look at scalar scores, short templates, or summaries, flattening the problem. Scalar score only has the final score, no process info. Harness errors often manifest only after many steps; once feedback is over-compressed, the diagnosis chain breaks.

These results aren’t just higher benchmark scores. In online text classification, Meta-Harness is 7.7 points higher than ACE (Agent context engineering baseline) while cutting context token usage to 1/4 of the original. In retrieval-augmented math reasoning, a discovered harness improved 5 held-out models (which did not participate in optimization) by an average of 4.7 points on 200 IMO-level problems. On TerminalBench-2, it also exceeded the manually engineered baseline. This shows that what’s being optimized is not just the model’s internal strategy, but also the program outside the model that organizes information and action.

A concrete example: Meta-Harness automatically discovered environment bootstrap on TerminalBench-2, meaning running a shell command before the agent loop starts to snapshot working directory, available languages, package manager, and memory state into the first round prompt. Many coding agents spend the first few rounds exploring the environment; doing this layer upfront, improvement might not come from stronger weights, but from the harness letting the model start on better context.
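An environment bootstrap of this kind might look like the sketch below. It is not the harness Meta-Harness discovered: it swaps the shell command for Python standard-library calls, omits the memory-state probe for portability, and the tool list, function names, and prompt layout are all assumptions.

```python
import os
import platform
import shutil

def environment_snapshot(candidate_tools=("python3", "node", "go", "cargo",
                                          "pip", "npm", "apt-get"),
                         max_entries: int = 20) -> str:
    """Collect basic facts about the sandbox before the agent loop starts."""
    cwd = os.getcwd()
    entries = sorted(os.listdir(cwd))[:max_entries]
    # shutil.which returns None when a tool is not on PATH.
    available = [tool for tool in candidate_tools if shutil.which(tool)]
    lines = [
        f"cwd: {cwd}",
        f"os: {platform.system()} {platform.release()}",
        f"files: {', '.join(entries) if entries else '(empty)'}",
        f"tools on PATH: {', '.join(available) if available else '(none found)'}",
    ]
    return "\n".join(lines)

def build_first_prompt(task: str) -> str:
    """Prepend the snapshot so the first round starts with environment context."""
    return f"[environment]\n{environment_snapshot()}\n\n[task]\n{task}"

print(build_first_prompt("Fix the failing unit test."))
```

The snapshot costs a handful of tokens once, in exchange for the exploratory rounds an agent would otherwise spend listing files and probing for interpreters.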

At this point, the optimization objective has expanded from answers to trajectories, and again to the harness program carrying trajectories.

After Frontier Model Release, the Training Chain Keeps Running

Understanding today’s large models with a mindset of one single round of pre-training isn’t enough. Behind a released model, the entire chain of pre-training, post-training, distillation, and specialization has usually already run, and stronger models continue producing training data for the next generation.

Distillation and Specialization

The distillation of the DeepSeek-R1 series is a typical example. The large model first trains reasoning ability through RL and verified rewards, then migrates these reasoning trajectories to smaller dense models. Specialized models like TranslateGemma show another path: on more specific target tasks, using high-quality data and specialized reward design to further compress and target capabilities. At this stage, stronger models aren’t just used to serve users, but start directly producing training data for the next generation.

The reason behind this is more fundamental than trajectory migration. One possible explanation is that in internet corpora, knowledge memory and reasoning ability are coupled together, and existing pre-training objectives require the model to learn both well at the same time. Large models must come first because only a big enough model can support both simultaneously; it can then be used to generate pure reasoning demonstration data. Small models trained on this data can focus on reasoning itself and are no longer forced to remember all knowledge. A key reason for going big first and small second is capability decoupling, not just cost strategy.

On the other hand, deployability matters as much as capability itself. Many scenarios don’t need an all-powerful large model and care more about cost, latency, stability, and controllability. The endpoint of training isn’t necessarily a bigger model; it could be a smaller, cheaper, and more specialized one.

The model that finally ships isn’t necessarily the rightmost checkpoint on the training curve. Before release, multiple checkpoints are often compared repeatedly on real task results, refusal style, tool stability, cost, and regression risk. The version that goes online is often a product decision, not the one that scores highest on a single metric.
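One way to picture this kind of release decision is hard gates followed by a weighted score. The sketch below is purely illustrative: the metric names loosely mirror the criteria listed above, and the thresholds, weights, and checkpoint numbers are made up for the example rather than taken from any vendor's process.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Checkpoint:
    name: str
    task_success: float       # real-task pass rate, higher is better
    tool_error_rate: float    # failed tool calls, lower is better
    over_refusal_rate: float  # benign requests refused, lower is better
    cost_per_task: float      # arbitrary units, lower is better
    regressions: int          # previously passing cases that now fail


def passes_gates(c: Checkpoint, max_regressions: int = 2,
                 max_tool_error: float = 0.05) -> bool:
    """Hard limits: a checkpoint that breaks these cannot ship at all."""
    return c.regressions <= max_regressions and c.tool_error_rate <= max_tool_error


def release_score(c: Checkpoint) -> float:
    """Hypothetical weights that trade capability against stability and cost."""
    return (c.task_success
            - 2.0 * c.tool_error_rate
            - 1.0 * c.over_refusal_rate
            - 0.05 * c.cost_per_task)


def pick_release(candidates: list[Checkpoint]) -> Optional[Checkpoint]:
    """Return the best-scoring checkpoint among those that pass the gates."""
    eligible = [c for c in candidates if passes_gates(c)]
    return max(eligible, key=release_score) if eligible else None


candidates = [
    Checkpoint("step-8000", 0.71, 0.02, 0.04, 1.0, 1),
    Checkpoint("step-9000", 0.74, 0.03, 0.03, 1.1, 0),
    Checkpoint("step-10000", 0.77, 0.08, 0.06, 1.4, 5),  # strongest, least stable
]

best_single_metric = max(candidates, key=lambda c: c.task_success)
chosen = pick_release(candidates)
print(best_single_metric.name, chosen.name)  # step-10000 step-9000
```

In this toy data the latest checkpoint has the highest task success, but it fails the regression and tool-stability gates, so the earlier one ships.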

Users see the model name and assume it corresponds to a smoothly rising training curve, but choosing which checkpoint goes online is a separate matter.

The value of large models lies both in serving users directly and in continuing to provide training data, distillation sources, and release bases for the next generation.

Besides offline training, near-online continuous optimization has also entered the mainstream. Cursor Composer 2’s real-time RL shows that some Agent capabilities have started iterating continuously on production traffic, rather than waiting for the next round of large-scale offline training to refresh everything at once. The boundary between training and deployment hasn’t disappeared, but the feedback loop between them is shortening.

How to Read Why a Model Got Stronger

In 2026, the value of frontier models increasingly depends on who can run the complete training chain after pre-training: continuously producing training data, distilling, specializing, getting evaluation and rewards right, and making the final release choices.

Because of this, the next time a model suddenly gets stronger, you can start by looking at three things:

First, see whether the change happened at the pre-training layer or in the subsequent training flow. Many capability improvements do come from stronger pre-training and better data recipes, but many of the changes users perceive actually happen mainly in post-training. Whether the model follows instructions, uses tools, or has a stable response style often doesn’t emerge naturally just from training on more corpus.

Second, see which layer the improvement came from: weights and training recipes, the reward/eval/grader setup, or harness code and the deployment loop. At the reasoning model and Agent stage, the strength users feel is often no longer the result of the base model alone. How evals are set, how rewards are given, whether the tool environment is stable, how retrieval and memory are organized, how summaries and context are trimmed, and which checkpoint was chosen at release all change the final product performance together.

Finally, see what the online version is optimizing. Some versions pursue a higher ceiling, some compress cost, latency, and regression risk, and some specialize for a particular scenario. The released version is a product decision, not the rightmost point on the training curve. So when looking at a model update, checking what it is actually optimizing gets you closer to reality.

When you trace a model’s sudden strength back to the individual links in the production chain, many improvements turn out to be amplified jointly by the backend training stack and the outer harness. The iteration cycle of this chain is also shortening: production traffic continuously flows back into training, every generation of stronger models produces capability while also producing supervision data for the next generation, and outer programs are constantly rewritten based on rollouts, logs, and real task feedback.

The model released today is just a snapshot; the chain and harness program are the products running continuously.

Learning Materials

Hoffmann et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). arXiv:2203.15556

Ouyang et al. (2022). Training language models to follow instructions with human feedback (InstructGPT). arXiv:2203.02155

Shao et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (GRPO). arXiv:2402.03300

DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948

DeepSeek-AI (2024). DeepSeek-V3 Technical Report. arXiv:2412.19437

Llama Team, AI @ Meta (2024). The Llama 3 Herd of Models. arXiv:2407.21783

Bai et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073

OpenAI (2024). Deliberative Alignment: Reasoning Enables Safer Language Models. openai.com/index/deliberative-alignment

Anthropic (2024). Sycophancy to Subterfuge: Investigating Reward Tampering in Language Models. anthropic.com/research/reward-tampering

MacDiarmid et al. (2025). Natural Emergent Misalignment from Reward Hacking in Production RL. arXiv:2511.18397

Lee et al. (2026). Meta-Harness: End-to-End Optimization of Model Harnesses (preprint project page). yoonholee.com/meta-harness

Kimi Team (2026). Kimi K2.5 Tech Blog: Visual Agentic Intelligence. kimi.com/blog/kimi-k2-5

Rush, S. (2026). A technical report on Composer 2. cursor.com/blog/composer-2-technical-report

Chroma (2026). Chroma Context-1: Training a Self-Editing Search Agent. trychroma.com/research/context-1