
Posts with tag personal

I Watched Eric Jang Rebuild AlphaGo So You Don't Have To

2026-05-25
22 min read · 4,255 words · Tags: artificial intelligence, computer science, personal, writing

In April, Eric Jang spent two weeks rebuilding AlphaGo Zero from scratch and put the result online as autogo, along with a long, interactive write-up. Last week he sat down with Dwarkesh Patel for a two-and-a-half-hour blackboard lecture on what he learned. The thread that holds the conversation together is a question: what does the AlphaGo recipe still have to teach us in a world where the dominant paradigm has shifted to large language models trained with policy-gradient RL?

This post walks through the ten conceptual blocks of that conversation, in the order they were laid out, with one important correction Eric issued as an erratum after recording. The aim is to understand what makes AlphaGo's training loop so elegant, why nobody has been able to port it cleanly to LLMs, and what the limits of current RL look like once you take the AlphaGo perspective seriously.

Contrary to what the title suggests, if you're interested in this type of work, I advise you to watch the podcast episode.

1. Why brute-force search dies on Go

Go is decided by territory under Tromp-Taylor scoring, which is the rule set computer scientists actually use because it has no ambiguity about when a game ends. This matters more than it sounds: it means the AI has a clean reward signal at terminal states, which the human pro game does not have (humans resign).

On a 19×19 board there are at most 361 legal moves at the opening, and the branching factor decreases by roughly one each ply because stones don't move once placed. Games run roughly 300 plies under Tromp-Taylor. The naive game tree therefore has on the order of $361^{300}$ leaves, which is vastly more than there are atoms in the observable universe.

[Figure: Naive Go tree explosion]

This is the situation any tabula-rasa Go program walks into. You cannot enumerate. You have to decide which branches are worth exploring.
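As a rough check on that count, here is a minimal Python sketch. It assumes the figures quoted above (361 points, about 300 plies) and uses the commonly cited order of magnitude of $10^{80}$ atoms in the observable universe, which is not a number from the lecture. The variable names are illustrative.

```python
import math

BOARD_POINTS = 361  # 19 x 19 intersections
GAME_PLIES = 300    # rough game length quoted in the text
LOG10_ATOMS = 80    # assumption: commonly cited order of magnitude, not from the lecture

# Constant-branching upper bound: 361 choices at each of 300 plies.
log10_constant = GAME_PLIES * math.log10(BOARD_POINTS)

# Shrinking-branching estimate: 361 * 360 * ... * 62, one fewer free point per ply.
# This ignores captures, which free points up again.
log10_shrinking = sum(math.log10(BOARD_POINTS - ply) for ply in range(GAME_PLIES))

print(f"361^300 is about 10^{log10_constant:.0f}")
print(f"361*360*...*62 is about 10^{log10_shrinking:.0f}")
print(f"atoms in the observable universe: about 10^{LOG10_ATOMS}")
```

Either way of counting leaves the exponent in the hundreds, so the conclusion does not depend on whether you let the branching factor shrink.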
Eric frames the breakthrough of AlphaGo as a tractable answer to two coupled questions: how do you prune the breadth of the tree (which moves to even consider), and how do you prune the depth (when to stop simulating and just estimate the value of the resulting position)?

Classical Monte Carlo Tree Search before AlphaGo handled the first question with UCB1 (Upper Confidence Bound 1), a multi-armed-bandit heuristic that selects the child node maximizing

$$ a^* = \arg\max_{a}\,\Bigl[\,Q(s,a) + c\sqrt{\frac{\ln N(s)}{N(s,a)}}\,\Bigr]. $$

A few definitions I had to nail down to make sense of the data structures:

- root node: the current state of the board.
- children: the states reachable from the root by one legal move.
- $Q(s,a)$: the mean value of leaves reached through this edge.
- $P(s,a)$: the probability of taking action $a$ from $s$ (this only enters once we have a policy network).

The first term is the exploit term, an online running mean of leaf values reached through that edge. The second is the explore bonus, which grows slowly (logarithmically) with the parent's visit count $N(s)$ and shrinks with this child's visit count $N(s,a)$. UCB1 has regret bounds, and it gives you a principled way to allocate simulations across branches.

The trouble is that UCB1 has no opinion about which children are a priori worth visiting. With 361 candidate moves, you waste an enormous number of simulations early, sampling stupid moves the same way you sample promising ones. Eric's framing here is sharp: classical MCTS puts a uniform prior over every arm of a very wide bandit. That is fine for a 10-armed bandit. It is hopeless for a 361-armed one with 300 levels of nesting.

2. PUCT, and why the prior is the whole game

AlphaGo's first major change is to swap UCB1 for PUCT (Predictor + Upper Confidence applied to Trees):

$$ a^* = \arg\max_a\,\Bigl[\, Q(s,a) + c_{\text{puct}}\, P(s,a)\,\frac{\sqrt{N(s)}}{1+N(s,a)}\,\Bigr]. $$

[Figure: PUCT formula]

The new term is $P(s,a)$, the policy prior.
This is the probability assigned to action $a$ by a neural network evaluated at $s$, and it is written into the tree node exactly once, at the moment that node is expanded. Everything else in the formula remains a search-time statistic: $Q$ is the running mean of leaf values, $N(s)$ is the parent's visit count, $N(s,a)$ is this edge's visit count.

For a child that has not been visited yet, $N(s,a)=0$, so the explore term reduces to $c_{\text{puct}}\,P(s,a)\sqrt{N(s)}$. Children with high prior $P(s,a)$ get visited first. As you keep visiting a child, $1+N(s,a)$ in the denominator grows linearly while $\sqrt{N(s)}$ in the numerator grows only as a square root, so the explore term decays and $Q(s,a)$ takes over. The prior thus determines the order in which children are tried; visit counts determine when you stop trying them; and $Q$ takes over as soon as the search has enough evidence.

In a Go bot trained from scratch, the prior carries roughly all the search-time information about which moves are not obviously stupid. The value head tells you when to stop searching. Without a good prior, MCTS still spreads itself across all 361 moves and the search depth never becomes tractable. This chapter handles breadth; depth is handled by a second output of the network, the value head.

3. The two-headed network

As Eric puts it: humans look at the board and instinctively calculate the probability of winning 100 moves before the game ends. A neural network can amortize that calculation, replacing a 100-move rollout with a single forward pass. Once you have a value head, you do not have to simulate to the leaf; you just stop at any non-terminal state and trust the value head's prediction.

AlphaGo's neural network takes a board state and outputs two things:

- Policy head, $\pi_\theta(a\mid s)$: probability distribution over good actions. Prunes breadth.
- Value head, $V_\theta(s)$: probability of winning from this state. Prunes depth.

[Figure: AlphaGo two-headed policy and value network]

A natural question is whether the policy head is even necessary. If $V_\theta$ tells you the probability of winning from any state, why not just enumerate the next states $s'$ that follow from each legal move $a$, evaluate $V_\theta(s')$ on each, and play the move with the highest value? Two reasons come up in the lecture. First, this requires running the network forward up to 361 times per move, whereas one forward pass of the policy head gives you all the move probabilities at once. Second, an argmax over values is a single point estimate; the whole point of training is to distill what MCTS computed (which is more than a single argmax) back into the policy network. A single forward pass cannot encode a search.

A second, more architectural question also comes up: why convolutional ResNets, when the rest of the field has moved to Transformers? Eric tried Transformers at his scale and could not beat ResNets. His read is that Go fighting (captures, ladders, life-and-death problems) is intensely local. Convolutional receptive fields encode "what is near this stone matters most," and a useful local pattern is reused across the board. At larger scales he expects Transformers can learn this inductive bias from data, but at the budget he was working at, the CNN prior won.

And finally, the network sees only the current board, not the history. Go is a perfect-information game and there is a Nash equilibrium strategy that depends only on $s$. In hidden-information games like poker or Diplomacy, the value of your hand depends on opponents' earlier bluffs or alliances, and you need an architecture that carries state across time. For Go, you do not.

4. Self-play and what gets distilled

Once you have the network and PUCT, the training loop is simple. For each move in each self-play game, the agent does $N$ simulations of MCTS (Eric used numbers from 200 to 2048, the upper range matching KataGo).
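The PUCT rule from section 2 can be sketched as a small selection routine. This is an illustrative sketch, not autogo's code: the `Edge` class, `puct_select`, `backup`, the default `c_puct = 1.5`, the choice to count the expanding visit in $N(s)$, and treating an unvisited edge's $Q$ as 0 are all assumptions made here.

```python
import math
from dataclasses import dataclass


@dataclass
class Edge:
    """Search statistics for one (state, action) edge."""
    prior: float            # P(s, a): written once, when the parent is expanded
    visits: int = 0         # N(s, a)
    value_sum: float = 0.0  # sum of backed-up leaf values

    @property
    def q(self) -> float:
        # Running mean of leaf values; 0.0 for an unvisited edge (assumption).
        return self.value_sum / self.visits if self.visits else 0.0


def puct_select(edges: dict, c_puct: float = 1.5):
    """Return the action maximizing Q + c_puct * P * sqrt(N(s)) / (1 + N(s, a))."""
    # Assumption: N(s) counts the visit that expanded the node, so it is >= 1.
    parent_visits = 1 + sum(e.visits for e in edges.values())

    def score(edge: Edge) -> float:
        explore = c_puct * edge.prior * math.sqrt(parent_visits) / (1 + edge.visits)
        return edge.q + explore

    return max(edges, key=lambda action: score(edges[action]))


def backup(edge: Edge, leaf_value: float) -> None:
    """Fold one simulation's leaf value into the edge statistics."""
    edge.visits += 1
    edge.value_sum += leaf_value


# Hypothetical three-move position with a confident prior on "a".
edges = {"a": Edge(prior=0.7), "b": Edge(prior=0.2), "c": Edge(prior=0.1)}
first_choice = puct_select(edges)   # the highest prior is tried first
for _ in range(10):
    backup(edges["a"], -1.0)        # pretend "a" keeps evaluating badly
later_choice = puct_select(edges)   # evidence in Q now outweighs the prior
print(first_choice, later_choice)
```

The toy run shows the behaviour described in section 2: the prior decides what gets tried first, and accumulated $Q$ evidence overrides it later. A real implementation would also flip the sign of the value at each ply during backup, which is left out here.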
The four steps of one simulation are: select a path down the tree via argmax PUCT until you reach an unexpanded leaf; expand by adding the leaf as a node; evaluate by running the network on the leaf to get $P$ and $V$; back up by walking the value back to the root, incrementing visit counts and updating $Q$'s running mean at each edge.

[Figure: Four steps of Monte Carlo Tree Search]

After all $N$ simulations from a given root, the agent picks a move (sampled in proportion to root visit counts during training, argmax at inference) and records two things for that state: the MCTS visit distribution $\pi_{\text{MCTS}}(\cdot\mid s)$, and the final game outcome $z \in \{-1, +1\}$ relative to the side to move. These tuples $(s, \pi_{\text{MCTS}}, z)$ go into a replay buffer.

Training, then, is just supervised learning on the buffer:

$$ \mathcal{L}(\theta) = \underbrace{\bigl(V_\theta(s) - z\bigr)^2}_{\text{value: MSE}} \;+\; \underbrace{-\,\pi_{\text{MCTS}}(\cdot\mid s)^\top \log \pi_\theta(\cdot\mid s)}_{\text{policy: cross-entropy}}. $$

That is it. No advantage estimation, no TD (Temporal Difference) learning, no PPO (Proximal Policy Optimization), no off-policy importance weights. Just a cross-entropy loss against the MCTS visit distribution and an MSE loss against the game result, scaled to taste.

[Figure: AlphaGo self-play training loop]

After many rounds, the policy network has internalized what MCTS would have computed at every state it has seen. Forward passes through the network alone get progressively closer to the search distribution, which means the next round of MCTS starts from a sharper prior, which means the search is more efficient, which means the visit distribution is sharper still.

[Figure: Policy before and after MCTS distillation]

You can see why this is so much better than naive winner-imitation. The MCTS-distilled policy benefits from search at every state, not just at terminal positions where you found out you won.
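The training loss described above can be written out for a single replay-buffer entry. This is a minimal pure-Python sketch rather than a training-ready implementation: the function names, the equal weighting of the two terms (the text only says "scaled to taste"), and the toy numbers are assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def distillation_loss(policy_logits, value_pred, visit_counts, outcome_z):
    """Return (total, value_mse, policy_cross_entropy) for one (s, pi_MCTS, z) tuple."""
    total_visits = sum(visit_counts)
    pi_mcts = [n / total_visits for n in visit_counts]  # soft target from search
    log_pi_theta = log_softmax(policy_logits)
    policy_ce = -sum(p * lq for p, lq in zip(pi_mcts, log_pi_theta))
    value_mse = (value_pred - outcome_z) ** 2
    return value_mse + policy_ce, value_mse, policy_ce


# Toy example: MCTS spent 80% of its visits on the first of three moves.
visits = [80, 10, 10]
uniform = distillation_loss([0.0, 0.0, 0.0], 0.5, visits, 1.0)
matched = distillation_loss(
    [math.log(0.8), math.log(0.1), math.log(0.1)], 0.5, visits, 1.0
)
print(f"policy loss, uniform network: {uniform[2]:.3f}")
print(f"policy loss, matched network: {matched[2]:.3f}")
```

The policy term falls as the network's distribution approaches the visit distribution, and it bottoms out at the entropy of the MCTS target rather than at zero.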
The win-rate curves below illustrate the gain: even with a single forward pass and no MCTS at inference, a network trained on MCTS targets is much stronger than one trained on game outcomes alone. With MCTS layered back on at inference, you get another lift on top of that.

[Figure: Win rate curves for policy and MCTS agents]

The dashed line is a raw policy network with no MCTS at all. The blue line is the same network with MCTS layered on. The red line is a network that has been distilled on MCTS targets, with MCTS again layered on at inference. Even at zero MCTS sims, the distilled network is much stronger. Distillation has packed the search into the forward pass.

Eric calls AlphaGo "elegant" several times during the lecture, and this is what he means. You are always operating in a regime where the supervision signal is clean and dense, because MCTS is giving you a strictly better label at every state you visit, not just at the few states that happened to lead to a win. As he puts it near the end of the podcast: in AlphaGo, you never have to solve the exploration problem of "how do I get to a non-zero success rate." Every step is a hill-climb on a beautiful supervised signal.

5. Why naive REINFORCE plateaus

To see why AlphaGo/MCTS distillation is special, the lecture detours through what doesn't work. Suppose you skipped MCTS entirely and just did naive policy-gradient self-play: play a league of policy checkpoints against each other, find the games where one side won, and reinforce the actions in those games.

Eric's worked example: two evenly matched policies play 100 games of 300 moves each. By luck, one of them wins 51 to 49. Imagine only one of those 51 wins came from a genuinely better move; the other 50 are statistical noise. The naive REINFORCE update wants to upweight every action in every winning game.
So you get one useful gradient buried inside ~30,000 noisy labels.

The variance math:

$$ \hat g = R \sum_{t=1}^T S_t \quad\Rightarrow\quad \mathrm{Var}(\hat g) = \sum_t \mathrm{Var}(R S_t) + 2\sum_{i<j}\mathrm{Cov}(R S_i, R S_j), $$

where $S_t = \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t})$. The $T(T-1)/2$ covariance terms give you $O(T^2)$ worst-case variance in sequence length. This is the "quadratic in $T$" point that comes up later in the LLM discussion.

A correction worth flagging. In the podcast Eric attributes the quadratic blowup to the multi-step formulation of the policy gradient and suggests this is why LLM labs prefer single-step RL. After recording he issued an erratum: the variance grows quadratically with sequence length regardless of whether you formulate the gradient over the full sequence or per-token. In fact, with per-token rewards, multi-step RL has lower variance than single-step. The real reason LLM labs do single-step is that they only have sequence-level rewards (did the code pass, did the answer help), so the per-token formulation gives you the same thing. The takeaway is that credit assignment in long sequences is quadratically hard in the worst case, not that the choice of formulation is what causes the blowup.

What MCTS does differently is sidestep the credit-assignment problem entirely. Instead of "this game was won, copy these moves," MCTS says "at every state you visited, here is a strictly better move than the one you played." Every visited state becomes a dense supervision target. There is no noise to dig the signal out of.

The classical fixes for RL variance (advantage estimation $R(a) - b(s)$, TD learning, Generalized Advantage Estimation) all try to reduce variance by subtracting an estimate of the average performance from the return. They are correct as far as they go. But they are reducing the variance of an already-noisy signal. MCTS replaces the signal entirely with a denser one.

6. MCTS, NFSP, and search in two directions

MCTS is not the only way to assign every visited state a better action. Another option is Neural Fictitious Self-Play (NFSP), used to great effect in DeepMind's AlphaStar and OpenAI Five.

[Figure: MCTS versus NFSP search directions]

In NFSP, you fix an opponent $\pi_b$ and train a "best response" policy $\pi_a$ against it using model-free RL (PPO, V-MPO, Q-learning, take your pick). The reward signal is 1 if $\pi_a$ wins, 0 otherwise. You repeat across different opponents, and you use each best response as a label provider for the corresponding states.

Both MCTS and NFSP produce the same thing: for every state $s$ in the replay buffer, an improved action $a^*$ for the student policy to imitate. The difference is where the improved action comes from. MCTS rolls forward in simulated futures and uses the value function to score imagined leaves. NFSP runs actual rollouts to terminal reward, then propagates the win/lose signal backward through TD-style updates over states the agent has actually visited.

MCTS searches forward over imagined trajectories; NFSP searches backward over realized trajectories.

The recipe is "label every state in your buffer with a search-improved action and supervise on that." MCTS is the cheapest way to do that in Go because the game is fully observable and you can simulate forward. In imperfect-information games, NFSP achieves the same thing backward.

7. Why MCTS does not transfer to LLMs

The DeepSeek-R1 paper reported that they could not get MCTS to work for LLM reasoning. Eric's diagnosis identifies two structural failures:

Unbounded breadth. Go has at most 361 legal moves per ply. The space of "possible next thoughts in a reasoning trace" is essentially unbounded. PUCT's $\sqrt{N(s)}/(1+N(s,a))$ structure assumes you will visit the same child multiple times to refine its $Q$. With language, you almost never sample the exact same continuation twice.
The visit-count machinery breaks down.

Depth is unbounded and the value head is hard to train. In Go, $V_\theta(s)$ is the probability of winning a fully specified game from a fully specified position. In LLM reasoning, what is the value of a half-finished proof? Of a partially written function? You cannot easily train a value head on partial reasoning trajectories because there is no clean termination condition and no clean reward.

You could compensate for either one of these. Compensating for both at once is much harder. (People have tried, with various Tree-of-Thoughts variants.) PUCT is a heuristic tuned for the size and depth of Go. It does not gracefully scale to the combinatorics of language.

What does work in current LLMs is something that looks like reasoning without an explicit tree structure: models try one approach, notice it isn't working, back up, try another. This emerged from training rather than being explicitly built. Eric does not rule out a comeback for forward search in LLM reasoning, just probably not as PUCT over tokens. (MuZero-style methods on continuous control are still being pushed.)

A relevant footnote from the conversation: in 2021, Andy Jones published Scaling Scaling Laws with Board Games, which showed you can trade off training compute against test-time compute in MCTS-driven board games at predictable rates. This is the test-time scaling paradigm later popularized by o1-class reasoning models, anticipated several years early on a board-game bot. The catch is that scaling laws only emerge once the underlying recipe is working; if your data is bad or your architecture is wrong, scaling laws will give you confidently extrapolated nonsense. Eric started autogo partly to see if you could Bitter-Lesson your way to a strong Go bot using only scaling laws. His honest answer: you cannot, because you need a working system first to generate the data scaling laws fit to.

8. Off-policy training and DAgger

One of the more surprising things about AlphaGo Zero in practice is that its replay buffer is effectively off-policy and it works anyway. By the time you do a gradient step, most (state, action) pairs in the batch were generated by older versions of the policy. RL researchers normally worry about this; off-policy is a known source of instability.

Working through the definitions in my notebook:

- Off-policy means the agent learns an optimal strategy by evaluating and improving a policy different from the one that generated the actions. It learns from what other policies (or past selves) did. Q-learning, deep deterministic policy gradient, and soft actor-critic are classic off-policy methods. They are sample-efficient and exploratory but can be unstable.
- On-policy means the agent only learns from the actions its current policy produces. PPO is the canonical example.

The puzzle with AlphaGo Zero is: why is the off-policy replay buffer stable?

The answer is DAgger (Dataset Aggregation, from Stéphane Ross's imitation learning work). The failure mode of naive imitation learning (behavior cloning) is that you train only on the optimal trajectory's states; the first time you drift off, you are in a state your training data does not cover, and errors compound. DAgger augments the training set with off-optimal states labeled with the expert action that would funnel you back.

[Figure: DAgger dataset aggregation diagram]

AlphaGo's replay buffer is naturally a DAgger dataset. The states in the buffer are mostly along the policy's typical trajectories, with some drift. Every state in the buffer carries an MCTS-derived label that, by construction, points toward winning. So even when you sample old states, the supervision teaches the network to funnel toward the manifold of winning trajectories.

The failure mode that would break it is if the buffer contained states the current policy would never visit. Then you train on regions of state space irrelevant to actual play.
AlphaGo-Zero-style systems keep this in check by mixing in fresh self-play data.

Eric pushed this further in autogo: he ran an experiment where, instead of MCTS-ing every move of a self-play game, he randomly sampled board states and re-ran MCTS on each with the current network. This is much more like a robotics off-policy setup, where a Bellman updater continuously rethinks "what should I have done from this state?" It works. The MCTS labels still funnel back to winning play even from random board states.

This is also the strongest connection between AlphaGo and modern robotics off-policy learning. QT-Opt and similar systems do exactly this: a pusher writes transitions into a replay buffer, a Bellman updater continuously recomputes $Q$ targets, and a trainer does supervised regression against those targets. AlphaGo just uses search instead of Bellman backups to produce its targets.

9. RL is more information-inefficient than you think

The most quotable section of the podcast comes when Dwarkesh and Eric discuss bits per sample. The standard worry about RL is sample inefficiency: you have to roll out a whole trajectory to get any signal. As agents take on multi-day tasks, samples per FLOP keep falling.

The less-appreciated problem is bits per sample. Imagine an untrained LLM facing the prompt "the sky is..." with vocabulary size 100K. Under supervised learning, you are told the answer is "blue" and your cross-entropy loss is $-\log_2 p_{\text{blue}}$, a large number when your model puts low probability on "blue." Under RL, you have to sample "blue" before you get any signal. With random initialization, your probability of sampling it is roughly $1/100{,}000$. You will need on the order of 100K samples to stumble onto "blue" once. Until you do, every sample teaches you nothing.

[Figure: Bits per sample comparison]

Formally: at pass rate $p$, supervised learning provides $-\log_2 p$ bits per sample (huge when $p$ is tiny).
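A small numeric sketch of the comparison this section sets up: the surprisal $-\log_2 p$ of a supervised label against the binary entropy of a pass/fail outcome, which is the most a single RL sample with a binary reward can carry. The function names and the chosen pass rates are illustrative assumptions.

```python
import math


def surprisal_bits(p: float) -> float:
    """Bits delivered by being told the right answer when the model gave it probability p."""
    return -math.log2(p)


def binary_entropy_bits(p: float) -> float:
    """Entropy of a pass/fail outcome with pass rate p: an upper bound for binary-reward RL."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


# 1e-5 corresponds to the 1/100,000 "blue" example from the text.
for p in (1e-5, 1e-3, 0.1, 0.5):
    print(f"p={p:<8g} supervised={surprisal_bits(p):6.2f} bits   "
          f"binary-reward RL<={binary_entropy_bits(p):.5f} bits")
```

At the 1-in-100,000 pass rate the supervised label is worth more than 16 bits while the pass/fail outcome is worth a small fraction of a bit, and the two only meet at $p = 0.5$.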
RL on a binary outcome provides the binary entropy

$$ -(p \log_2 p + (1-p)\log_2(1-p)), $$

which peaks at $p=0.5$ (a coin flip is one bit) and goes to zero at both extremes. You spend almost all of training in the small-$p$ regime, where RL gives you nearly zero bits per sample while SL would give you a lot.

This is the deepest reason AlphaGo's loop is special. Every supervision step is a soft target on the visit distribution, not a one-hot on a single best move. Soft targets carry far more information per sample than one-hot labels: this is the dark-knowledge distillation argument. So AlphaGo's policy network gets the maximum bits per sample at every step. And because MCTS is a strictly better labeler than the current policy, the supervision signal is never flat.

A line from the podcast to remember: in AlphaGo, you don't train the policy network to imitate the MCTS action, you train it to imitate the MCTS distribution. The action is one-hot; the distribution is dense. This is dark knowledge.

For this to work, three things have to be true simultaneously: the value function has to be cheap and concrete, the action space has to be small enough for PUCT to behave, and you need a fast simulator. Go has all three. Coding, multi-step reasoning, and most economically valuable tasks have none of the three.

10. What this means for automated AI research

The closing chapter of the podcast pivots from algorithms to research workflow. Eric used coding agents extensively for autogo. His honest report is that current frontier models are good at hill-climbing once a track is defined. They can run a hyperparameter sweep, identify that gradients are small in some layer, suggest a code change, run an experiment, and propose follow-ups. He calls this "grad-student-like" execution on a fixed objective.

What they cannot yet do, and what Eric repeatedly bumps into, is lateral thinking.
When the current track is wrong (the metric is plateauing, the infra has a subtle bug, the framing of the problem is off), the models will keep grinding the wrong axis instead of stepping back to first principles. The actual research insight tends to come from the human noticing "wait, this whole branch of experiments is downstream of a misassumption."

This is the inverse of the AlphaGo lesson. AlphaGo works because MCTS gives the policy network a teacher that is always better, on every state. Current LLM coding agents have no such teacher. They have a reward (did the test pass) and a long horizon over noisy intermediate choices. They are in the high-variance, low-bits-per-sample regime that Eric just spent two hours explaining.

If you want a single takeaway from the rebuild-AlphaGo exercise, it is this: figuring out how to give an agent a per-state teacher (some equivalent of MCTS's "here is a strictly better action than the one you played") probably offers more leverage than scaling RL on sparse terminal rewards. Whether that teacher comes from forward search, backward TD, distillation from a stronger model, or something we have not invented yet is the open question. But the AlphaGo recipe makes it very clear what we are missing.

Thank you to Eric Jang and Dwarkesh Patel.

Eric's interactive AlphaGo tutorial is the canonical reference, with full code at autogo on GitHub. His erratum on policy-gradient variance is short and worth reading.

How Frontier LLMs are Trained and Served

2026-05-03
23 min read · 4,599 words · Tags: artificial intelligence, computer science, personal, writing

This article is based on my handwritten notes from Reiner Pope's blackboard-style interview with Dwarkesh Patel, plus cross-checks against the published transcript and Dwarkesh's flashcards made in preparation for this episode of the podcast.

I would suggest opening this up alongside the video as a study companion.

Some interviews are conversations. Reiner Pope's session with Dwarkesh isn't really a conversation; it's a summary of how the whole AI economy works, squeezed onto a single blackboard.

In this article you will learn the answers to the questions listed below.

- Why "Slow Mode" doesn't exist as a product.
- Why MoE layers map cleanly onto a rack.
- Why pipeline parallelism doesn't really save you much in inference and why Ilya said it's not wise.
- Why frontier models are over-trained ~100× past Chinchilla optimal.
- Why Gemini 3.1 charges 50% more above 200K context and why output tokens cost 5× input tokens.

Chapter 1: How batch size affects token cost and speed

First and foremost, don't worry about the math! There's only one idea. A chip running a model is doing two things at once: computing, and moving data around. One of these is always the bottleneck. If you can write down how long each takes, you can predict almost everything else.

The whole lecture is built on one inequality:

$$ t \;\geq\; \max\!\left(t_{\text{compute}},\; t_{\text{mem}}\right) $$

Here $t$ is the time for one forward pass. The two terms $t_{\text{compute}}$ and $t_{\text{mem}}$ (mem = memory) expand as:

$$ t_{\text{compute}} = \frac{B \cdot N_{\text{active}}}{\text{FLOPS}} \qquad t_{\text{mem}} = \frac{N_{\text{total}} + B \cdot \text{len}_{\text{ctx}} \cdot \text{bytes}_{\text{tok}}}{\text{mem\_bw}} $$

B is batch size, in other words the number of sequences alive in one forward pass. Not "users." Not "concurrent sessions." Sequences in flight at the moment the model executes the same matmul (matrix multiplication) once.
N_active is the active parameter count, the weights actually multiplied per token. N_total is the total parameter count, which means everything sitting in HBM (high-bandwidth memory) that has to be paged in.

The compute term ignores attention itself; that is a deliberate simplification. The memory term has two sub-terms: a constant for weight fetches, and a term linear in both batch and context for KV-cache fetches. KV cache = B × length of context × bytes per token. mem_bw is memory bandwidth.

Everything else in this post falls out of those equations.

The latency curve

[Figure: Latency versus batch size]

- Compute grows linearly from the origin.
- KV fetch also grows linearly with B, but with a slope set by context length and bytes per token.
- Weight fetch is a flat constant: it's N_total / mem_bw. It doesn't care how many sequences ride along.

The actual latency is the max of compute time and the sum of the two memory terms. On the plot, that's the bold red line: it hugs the memory curve at small batches, then hands off to the compute curve once compute becomes the bottleneck.

The two takeaways from this plot:

1. There is a hard latency floor. It is N_total / mem_bw. You cannot serve faster than the time it takes to drag every weight from HBM into the compute units once. This is the lower bound, and it's why "Slow Mode" doesn't really exist as a distinct product in LLMs. Providers are already serving as cheaply as they can to stay competitive, and most are subsidizing inference on top of that.

2. The crossover is the goal. At the point where the memory line and the compute line cross, you are simultaneously memory-bound and compute-bound. Either side of that point you're leaving silicon idle. Hitting that intersection is the operational sweet spot.

From latency to cost: The cost-per-token plot

Cost is a different question from latency. The customer doesn't pay for time; they pay for rental seconds amortized over tokens served.
So divide each curve by B:

[Figure: Cost per token versus batch size]

- The compute curve was linear → becomes flat.
- KV fetch was linear → also flat.
- Weight fetch was constant → becomes a hyperbola, plummeting as you grow B.

At B = 1 cost is at its peak (one token shouldering the entire weight fetch). At large B the weight fetches amortize away and cost collapses onto the compute floor.

Two things never amortize: compute (every token gets its own matmul) and KV fetches (every sequence brings its own context).

Solving for the optimal batch size

Equate t_compute with the weight-fetch portion of t_mem, ignoring KV for now:

$$ \frac{N_{\text{total}}}{\text{mem\_bw}} = \frac{B \cdot N_{\text{active}}}{\text{FLOPS}} $$

Rearrange so all hardware sits on one side, all model on the other:

$$ \underbrace{\frac{\text{FLOPS}}{\text{mem\_bw}}}_{\text{hardware}} \;=\; \underbrace{\frac{B \cdot N_{\text{active}}}{N_{\text{total}}}}_{\text{model}} $$

The ratio FLOPS / mem_bw is asking: for every byte of memory you can move per second, how many math operations can the chip do per second?

On a Blackwell GPU, roughly:

- FLOPS ≈ 4,500 trillion FP4 multiplies per second
- mem_bw ≈ 8 trillion bytes per second
- Each FP4 weight is half a byte

So the chip can do about 4,500 / 8 ≈ 560 multiplies for every byte of memory bandwidth. But each FP4 weight is only half a byte, so 560 × 0.5 ≈ 280. Round to ~300. This number has barely budged from A100 → H100 → B100; FLOPS and bandwidth scaled together.

So:

$$ \boxed{\;B \geq 300 \times \frac{1}{\text{sparsity}}\;} $$

For DeepSeek V3, 32 of 256 experts are active per token, giving N_total/N_active = 8, which is what gives B ≥ 300 × 8 = 2,400. In practice operators run 2-3× higher because real efficiency lags the roofline. Round number: a 20-millisecond train carrying 2,000-3,000 tokens.

The 20-millisecond train

Why 20ms specifically? Because of HBM drain time: capacity divided by bandwidth is about 20ms on essentially every modern HBM generation.
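The arithmetic in this chapter (the roofline inequality, the batch size at which compute time catches up with weight-fetch time, and HBM drain time as capacity over bandwidth) can be checked in a few lines. This is a sketch under stated assumptions: the function names are made up here, the model sizes in the check are hypothetical values chosen only to give a total-to-active ratio of 8, and bytes per weight is made explicit where the formulas in the text fold it into the rounding. The 288 GB and 20 TB/s figures are the Rubin-generation numbers quoted in this chapter.

```python
def min_batch_to_hide_weight_fetch(flops, mem_bw, bytes_per_weight, total_over_active):
    """Batch size at which compute time equals weight-fetch time (KV cache ignored)."""
    return (flops / mem_bw) * bytes_per_weight * total_over_active


def forward_pass_time(batch, n_active, n_total, ctx_len, kv_bytes_per_token,
                      flops, mem_bw, bytes_per_weight):
    """Lower bound on one forward pass: max(compute time, memory time)."""
    t_compute = batch * n_active / flops
    t_mem = (n_total * bytes_per_weight
             + batch * ctx_len * kv_bytes_per_token) / mem_bw
    return max(t_compute, t_mem)


def drain_time_seconds(hbm_capacity_bytes, mem_bw):
    """Time to read all of HBM exactly once."""
    return hbm_capacity_bytes / mem_bw


# Blackwell-class figures quoted in the text.
FLOPS = 4500e12   # FP4 multiplies per second
MEM_BW = 8e12     # bytes per second
FP4_BYTES = 0.5   # bytes per weight

batch = min_batch_to_hide_weight_fetch(FLOPS, MEM_BW, FP4_BYTES, total_over_active=8)

# Hypothetical model sizes, chosen only so that N_total / N_active = 8.
t = forward_pass_time(batch, n_active=1e11, n_total=8e11, ctx_len=0,
                      kv_bytes_per_token=0, flops=FLOPS, mem_bw=MEM_BW,
                      bytes_per_weight=FP4_BYTES)

rubin_drain = drain_time_seconds(288e9, 20e12)
print(batch, t, rubin_drain)
```

Without rounding 280 up to 300, the crossover batch comes out at 2,250 rather than 2,400, which is the same ballpark. The drain time is what sets the train interval.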
On the Rubin generation it is closer to ~288 GB / ~20 TB/s ≈ 15ms.Why does this matter? Because in 20 ms you can read all of HBM exactly once. You don't want to read it twice in one pass, the weight matrices are read-only, and you don't want to re-fetch your KV cache. So 20 ms is the natural cycle time.A "train" departs every drain-time. Any sequences ready when the train pulls in, board it. If the train fills, the rest wait. If it's half-empty, it leaves anyway. Worst-case queueing latency: ~40ms. First train you miss, plus the time to ride the next one.Tokens per second, not concurrent users"Concurrent users" is a fuzzy concept; tokens per second is precise:$$ \text{tok/s} = \frac{B}{t_{\text{train}}} \approx \frac{2{,}000}{15\text{ms}} \approx 128{,}000;\text{tok/s} $$Gemini's reported traffic is in the hundreds of millions of tokens per second globally. So one optimally-batched serving cell handles roughly one one-thousandth of Gemini's load. Which means: to compete commercially you need at least one one-thousandth of Gemini's traffic. Below that you can't even fill the train.Chapter 2: How MoE models are laid out across GPU racksHere's a mixture-of-experts (MoE) layer:MoE layer layoutTokens enter, a router activates k of E experts per token - where E is the total number of experts in the layer and k is how many fire for each token (DeepSeek: 32 of 256). The ratio k/E is what we call sparsity.Each expert is a full MLP (multi-layer perceptron). It does three things in sequence (to put it simply - "expand → think → compress"):1. Up-projection: Multiply the token's vector by a tall matrix to expand it into a much higher-dimensional space. If the token is a 4,000-dim vector, the up-proj might lift it to 16,000 dimensions.2. Nonlinearity: Apply a function like ReLU, GELU, or SwiGLU element-wise (these are different activation functions which can be ignored right now). 
This is where the layer can actually compute non-trivial functions; without a nonlinearity, two stacked matrix multiplies would just collapse into one.
3. Down-projection: Multiply by another matrix that brings the vector back down to the original dimension (4,000). Now the result is the same shape as the input and can flow into the next layer.

The outputs of the chosen experts sum together, and that sum is added back onto the residual stream.

To put it concretely:

new_token = old_token + MoE(old_token)

old_token is the residual stream, the running vector that flows through the entire model, end to end. MoE(old_token) is what this layer contributes. Every attention and MLP layer reads from the residual stream and adds its contribution back. The model's final answer is read off the residual stream once all layers have written into it.

The standard layout is expert parallelism: different experts on different GPUs. DeepSeek has 256 experts; a Blackwell rack has 72 GPUs (use 64 for divisibility, ignore the other eight). That's 4 experts per GPU. Routing then becomes an all-to-all traffic pattern: any GPU's token can be sent to any GPU's expert.

This is why one rack is a natural boundary. Within a rack, NVLink/NVSwitch connects every GPU to every other GPU at full bandwidth, a perfect fit for all-to-all. Cross a rack and you drop onto the scale-out network at roughly 8× lower bandwidth.

Scale-up vs scale-out

[Figure: Scale-up versus scale-out]

Three vendors, three names for the same thing:

| Provider | Scale-up (intra-rack) | Scale-out (inter-rack) |
| --- | --- | --- |
| NVIDIA | NVLink / NVSwitch | Ethernet (RoCE) / InfiniBand |
| AMD | Infinity Fabric | Ethernet / InfiniBand |
| Google | Inter-Chip Interconnect (ICI) | Ethernet |

Scale-up bandwidths are in the multi-TB/s range per GPU with hundreds-of-nanoseconds latency. Blackwell NVLink is around 1.8 TB/s/GPU.
Scale-out per GPU is 400-800 Gbps: about 3× slower out of the rack than in it for Blackwell, and ~8× slower in the general bandwidth comparison.

NVIDIA GPUs can send data directly from one GPU to another.

TPUs route differently: data has to hop across intermediate TPUs inside a pod to reach a particular target. Starting with TPU v4 (2021), Google added Optical Circuit Switches between TPU blocks, dynamically reconfiguring which blocks are physical neighbors on a per-job basis.

Why scale-up domains keep growing

| Generation | GPUs in scale-up | Form factor |
|---|---|---|
| Hopper | 8 | Tray |
| Blackwell | 72 | Rack |
| Rubin | ~500 | Rack (much denser) |

Hopper-to-Blackwell was mostly a product decision: switch from trays to racks. Blackwell-to-Rubin is roughly a 4× density increase coming from more aggressive cable routing and power delivery. What constrains a rack is power, weight, cooling, and the bend radius of the cables themselves. Modern racks push every one of those to the physical limit.

The macro implication is the answer to "why did model sizes only recently start scaling again?" GPT-4 was rumored to be a trillion-ish parameters in 2023. Meaningfully larger models only started shipping in the last six months. The constraint wasn't the algorithm; it was that until the scale-up domain got big enough to host a multi-trillion-param model and its KV cache for thousands of sequences, you couldn't serve it economically.

Active parameters are limited by compute cost. Total parameters are limited by scale-up size.

That's the equation Google had a head start on: their TPU pods have had large scale-up domains for years.

Chapter 3: Pipeline parallelism, micro-batching, and the bubble

Expert parallelism handles one MoE layer. To stretch a model deeper than one rack, you reach for pipeline parallelism: layer 1 on rack 1, layer 2 on rack 2, and so on.
Tensor parallelism, cutting along the hidden dimension or FFN (Feed-Forward Network) dimension, is a third option in principle, but with experts now small, the math no longer pays off. Tensor-parallel cuts force frequent all-reduce / all-gather operations inside every transformer block, and unless dimensions are huge and the interconnect very fast, the communications overhead eats the benefit.

Layer parallelism and expert parallelism, by contrast, split the model into self-contained computational units; each device does meaningful work before having to talk. Layers and experts are the "chunky" dimensions of a transformer, and chunky cuts give better compute-to-communication ratios.

When does pipelining hurt scale-up?

Set up the ratio of scale-up time to scale-out time. We want scale-up to dominate since it's the precious resource:

$$ \frac{t_{\text{scale-up}}}{t_{\text{scale-out}}} \;=\; \frac{1}{8}\,(\text{BW penalty}) \;\cdot\; n_{\text{activated experts}} \;\cdot\; n_{\text{layers per stage}} \;\cdot\; 2 \;\geq\; 1 $$

The factor of 2 is for the all-to-all up and the all-to-all down.

The product of the three positive terms needs to beat the 8× bandwidth penalty. Number of activated experts alone is often 8. Add a few layers per pipeline stage and you're comfortably above the bar.

This means: pipelining racks together is fine for forward passes; the all-to-all communication stays inside racks, and only the residual stream crosses rack boundaries.

Inference: No bubble

[Figure: Pipeline inference]

In inference there's no backward pass. Each rack runs its layer; tokens stream forward; the moment a rack finishes batch 0 it picks up batch 1. Set n_micro_batches = n_pipeline_stages and the wraparound is seamless. No bubble. For latency this is neutral: the full forward pass takes the same time whether the layers live on one rack or four, because pipeline stages run sequentially on a given inference.
Pipelining just buys you memory capacity per rack.

Training: The bubble is unavoidable

[Figure: Pipeline parallelism training bubble]

In training the pipeline has to fill, then drain. The forward pass fills the pipe; then you hit a hard stop (between F3 and B3) because backward needs the entire batch's gradient at once; then the backward pass drains it. The hatched regions are the bubble: racks doing nothing while the pipe fills or drains.

Why do we hard stop? Because there's an optimal batch size for ML convergence (smaller batches give fresher gradients) and a competing optimum for total training time (smaller batches are worse from a systems perspective). The chosen batch size sits somewhere between these two, creating the trade-off. Once chosen, you do all of it forward, then all of it backward, which is what creates the bubble in the first place.

Chapter 4: Why Ilya said, "As we now know, pipelining is not wise."

The pipeline bubble has clever workarounds: schedules called zero bubble and one-forward-one-backward interleave the directions to keep racks busy. But this is the part where the famous Ilya line hits: "As we now know, pipelining is not wise."

Memory capacity is per rack. If your model doesn't fit, pipelining lets you split it across racks. The famous Ilya line is about the architectural debt it accumulates. Things like Kimi's residual-attention (where each block attends to multiple prior layers' residuals) assume those residuals are co-located, which becomes very hard to implement when they live on different stages. Interleaved sliding-window vs global attention layers create load imbalance. Every constraint slows down research iteration, and in a frontier lab this is a cardinal sin.

The bigger memory equation

Total memory per system:

$$ \text{Capacity}_{\text{mem}} \;=\; N_{\text{total}} \;+\; B \cdot \text{len}_{\text{ctx}} \cdot \text{bytes}_{\text{tok}} $$

Per GPU, with E = expert parallelism (e.g. 64) and P = pipelining (e.g.
4 racks):

$$ c_{\text{mem}} \;=\; \frac{N_{\text{total}}}{E \cdot P} \;+\; \frac{b \cdot \text{len}_{\text{ctx}} \cdot \text{bytes}_{\text{tok}}}{E} $$

Note the second term has only E in the denominator, not E · P. The Ps cancel each other out. Here's the how and why: pipelining lets P racks each hold a different chunk of weights, so the weight footprint per rack drops by P. But each rack now needs P different sequences alive in its slot simultaneously to stay busy (that's what micro-batching is). So the KV-cache footprint per rack is multiplied by P from the micro-batches and divided by P from sharding the cache across stages, for a net change of zero.

Formally, B = n_micro · b, and n_micro = P (you need that many micro-batches in flight to keep the pipeline busy). Pipelining shrinks the weight footprint per GPU but does nothing for the KV cache footprint. Once P ≥ 2, KV cache becomes the dominant memory bottleneck per GPU.

This is exactly DeepSeek's published recipe for V3 inference and what frontier labs are presumably doing: maximize expert parallelism inside a single scale-up domain and use very little pipelining. Frontier inference is probably running on a single scale-up.

The hidden bandwidth win from bigger scale-up

There's another reason scale-up matters even more than capacity:

$$ t_{\text{mem, weights}} = \frac{N_{\text{total}}}{\text{mem\_bw}} $$

And mem_bw here is the aggregate memory bandwidth of every GPU loading weights in parallel, which equals (scale-up size) × (per-GPU BW). Per-GPU bandwidth grew about 1.5-2× per generation. Scale-up size grew 9× from Hopper to Blackwell (8 to 72 GPUs). Most of the latency improvement came from having more HBM ports loading weights at once, not from faster HBM.

Bigger scale-up results in lower-latency inference, which makes longer context lengths feasible.
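The two relations in this chapter (per-GPU memory and aggregate weight-fetch time) can be sketched in Python as follows. This is an illustrative sketch only: the function names are invented, the 1 TB weight footprint, micro-batch size and bytes-per-token are placeholder values, and the 3.35 TB/s Hopper-class per-GPU bandwidth is an assumption that does not come from the text.

```python
def per_gpu_memory(n_total_bytes, b, len_ctx, bytes_tok, E, P):
    """Return (weight_bytes, kv_bytes) per GPU.

    Weights shard across E * P GPUs; the KV term only shrinks with E,
    because the P micro-batches in flight cancel the P-way sharding.
    """
    weights = n_total_bytes / (E * P)
    kv_cache = b * len_ctx * bytes_tok / E
    return weights, kv_cache


def weight_fetch_ms(n_total_bytes, scale_up_size, per_gpu_bw):
    """Time to stream all weights once using the aggregate HBM bandwidth."""
    aggregate_bw = scale_up_size * per_gpu_bw
    return n_total_bytes / aggregate_bw * 1e3


N_TOTAL = 1e12  # hypothetical 1 TB of weights

# Pipelining over 4 racks cuts the weight share per GPU but not the KV share.
w1, kv1 = per_gpu_memory(N_TOTAL, b=32, len_ctx=100_000, bytes_tok=1700, E=64, P=1)
w4, kv4 = per_gpu_memory(N_TOTAL, b=32, len_ctx=100_000, bytes_tok=1700, E=64, P=4)
print(f"weights per GPU: {w1:.3e} -> {w4:.3e} bytes, KV per GPU: {kv1:.3e} -> {kv4:.3e} bytes")

# 8-GPU domain (assumed 3.35 TB/s each) vs 72-GPU domain (8 TB/s each, from the text).
small = weight_fetch_ms(N_TOTAL, 8, 3.35e12)
large = weight_fetch_ms(N_TOTAL, 72, 8e12)
print(f"weight fetch: {small:.1f} ms -> {large:.1f} ms")
```

Under these assumptions most of the speedup comes from the 9× larger domain, with the per-GPU bandwidth gain contributing a smaller factor.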
We're still bounded by the KV-fetch term. Sparse attention helps, at the expense of quality, but it doesn't break the memory wall, which is part of why context lengths have plateaued at 100-200K for the past two years.

Chapter 5: Reinforcement learning and over-training beyond Chinchilla

Where the 6ND number comes from

Every estimate of training cost (Chinchilla's, GPT-4's, anything in a scaling-law paper) runs through the same formula: 6ND, where N is active parameters and D is training tokens. Here's where the 6 comes from.

For a single multiply-add, the forward pass is 2 FLOPs per parameter per token (one multiply + one add). The backward pass is 2× the forward pass, because you compute gradients with respect to both input matrices in the matrix multiplication. So 4 FLOPs per parameter per token for backward, or 2 + 4 = 6 in total.

Three buckets, one cost equation

Total compute cost across the three stages of model life:

$$ C_{\text{total}} \;=\; \underbrace{6 N_{\text{act}} D_{\text{PT}}}_{\text{pre-train}} \;+\; \underbrace{(2 \text{ to } 6)\, N_{\text{act}} D_{\text{RL}} \cdot \text{ineff}}_{\text{RL}} \;+\; \underbrace{2 N_{\text{act}} D_{\text{inf}} \cdot \text{ineff}}_{\text{inference}} $$

The RL coefficient is 2-6 depending on whether you backward-pass on every rollout. The inefficiency factor accounts for the fact that decode (used in RL rollouts and inference) runs at much lower MFU (Model FLOPs Utilization) than prefill or pure training.

The equality heuristic

When costs trade off, one term grows while another shrinks. The minimum tends to sit where the two terms are equal. The intuition is that at any other point, you're paying more on one side than you're saving on the other, which means there's still a free improvement available.

Quick example: f(x) = ax + b/x has its minimum at x = √(b/a), where both terms equal √(ab).
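For readers who want the intermediate step, here is a short derivation of that quick example, assuming a, b > 0 and x > 0. Set the derivative to zero:

$$ f'(x) = a - \frac{b}{x^{2}} = 0 \;\;\Longrightarrow\;\; x^{*} = \sqrt{\frac{b}{a}} $$

Plugging back in, the two terms are equal at that point:

$$ a x^{*} = \sqrt{ab}, \qquad \frac{b}{x^{*}} = \sqrt{ab}, \qquad f(x^{*}) = 2\sqrt{ab} $$

Since f''(x) = 2b / x³ > 0 for x > 0, this stationary point is a minimum.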
This generalizes loosely to any sum of a growing and a shrinking term, which is what training cost (grows with model size) plus inference cost (shrinks with model size, since smaller models serve cheaper) looks like.

Applying it here: if a frontier lab is optimizing reasonably well, the three buckets (pre-training, RL, and inference) should satisfy:

$$ C_{\text{PT}} \approx C_{\text{RL}} \approx C_{\text{inf}} $$

N_active cancels. With decode running at roughly 1/3 of training MFU (so ineff ≈ 3 as a cost multiplier) and an RL coefficient near the low end of its range (about 3), you arrive at:

$$ D_{\text{PT}} \approx 1.5 \cdot D_{\text{RL}} \approx D_{\text{inf}} $$

These three numbers should all be in the same ballpark: pre-training tokens, RL tokens, and lifetime inference tokens.

Plugging in some real numbers

- Inference: ~50M tokens/s globally, deployed for 2 months before obsolescence. → D_inf ≈ 2.6 × 10¹⁴ tokens (~200T, rounding down).
- Pre-training (rumored): ~150T tokens. Same order of magnitude. The heuristic checks out.
- Active parameters: ~100B.
- Chinchilla optimum: D_chinchilla ≈ 20 × N_active = 2T tokens.
- Frontier ratio: 200T / 2T = 100× past Chinchilla.

Why no one trains Chinchilla-optimal anymore

Chinchilla minimizes training compute for a given final loss. But in production:

- The training bill arrives once. The inference bill arrives for the entire deployment lifetime.
- A smaller model trained on 50× more data is slightly worse per training-FLOP (because Chinchilla was the optimum under that constraint) but vastly cheaper to serve. So you over-train.

The outcome: every model you ship is sized so that the data you trained it on roughly equals the data it will produce in its lifetime. The training corpus and the output stream are roughly the same size. That's a strange and beautiful symmetry, and it's a direct consequence of the equality heuristic.

Chapter 6: Deducing long context memory costs from API pricing

Chapter 6 is Chapter 1 with a different x-axis.

Remember the latency plot from Chapter 1?
Compute linear in batch, memory mostly flat at small batch then linear at large batch, with the regime transition where they cross? Replace the x-axis with context length and the same plot reappears, just rotated. The crossover at 200K is the same crossover, applied to a different sweep variable. API pricing tracks it because Google's costs do.

The 200K crossover in Gemini

Gemini 3.1 charges 50% more above 200K context length. Here's why:

[Figure: Cost versus context length]

- t_compute = B·N_act / FLOPS is essentially flat in len_ctx.
- t_mem rises linearly with len_ctx (the KV-fetch term grows).

The pricing tier tracks the underlying cost envelope. The kink is where memory time crosses compute time. At the crossover:

$$ \frac{B \cdot \text{len}_{\text{ctx}} \cdot \text{bytes}_{\text{tok}}}{\text{mem\_bw}} = \frac{B \cdot N_{\text{act}}}{\text{FLOPS}} $$

Solve for bytes per token:

$$ \text{bytes}_{\text{tok}} = \frac{\text{mem\_bw}}{\text{FLOPS}} \cdot \frac{N_{\text{act}}}{\text{len}_{\text{ctx}}} = \frac{1}{300} \cdot \frac{100\text{B}}{200\text{K}} \approx 1.7\;\text{KB / tok} $$

And we can decompose that:

$$ \text{bytes}_{\text{tok}} = N_{\text{attn layers}} \cdot 2 \cdot d_{\text{head}} \cdot N_{\text{KV heads}} $$

With d_head (the dimension of each head's vector) = 128 and a small KV-head count (1-8), 1.7 KB is consistent with dense attention plus heavy cross-layer KV-cache reuse, the Character.AI / Gemma trick where the global KV is shared across all attention layers. Sparse attention could get there too. Either way, pricing has leaked architectural information.

Competitive pricing pressure forces every API tariff to track its underlying cost structure pretty tightly: if your prices drift too far above your real costs, a competitor with lower costs will eat your margin.

Why output tokens cost 5× input tokens

[Figure: Decode versus prefill]

Prefill processes the entire prompt in parallel. Many tokens per pass. Highly parallelizable. Compute-limited.

Decode generates one token per pass, autoregressively.
Each token loads the full weights + KV cache. Memory-bandwidth limited.

On the time-per-token chart:

- t_compute / len_pass is flat.
- t_mem / len_pass is a hyperbola.

Both regimes rent the same GPU for the same dollar. What changes is how much of the GPU's FLOPS budget is actually used per second. Prefill saturates the FLOPS; it's compute-bound. Decode underuses them, because the chip is sitting idle waiting for HBM. The 5× pricing ratio is just the ratio of effective FLOPS-utilization between the two regimes.

At len_pass = 1 (decode), memory dominates and the cost is ≈ 5×. At large len_pass (prefill), the curve collapses to the compute floor and the cost is ≈ 1×.

Put differently, decode runs at roughly 20% of the model-FLOPs-utilization of prefill.

Why cache hits are 10× cheaper

There are three places to materialize a token's KV cache:

| Strategy | Retrieval cost (per token) | Hold cost (per second) |
| --- | --- | --- |
| Rematerialize (recompute) | N_act / FLOPS × GPU $/s | 0 |
| Store in HBM | ≈ 0 (already there) | bytes_tok / HBM_capacity × GPU $/s |
| Store in DDR / Flash / Disk | bytes_tok / DDR_bw | bytes_tok × DDR $/s |

The optimal tier matches your hold time to the tier's drain time (capacity / bandwidth):

| Tier | Drain time |
| --- | --- |
| HBM | ~20 ms |
| DDR | ~1-10 s |
| Flash (SSD) | ~1 minute |
| Spinning disk (HDD) | ~18-22 hours |

The fact that some providers offer 5-minute cache pricing AND 1-hour cache pricing strongly implies the back tiers are flash + spinning disk.

The million-token wall

Why doesn't anyone ship a million-token-context model? The compute term grows quadratically in attention but with such a small slope you only feel it in the multi-million token range.
The actual binding constraint is memory bandwidth: the linear-in-context KV-fetch term.

Sparse attention turns linear into roughly √len_ctx (a key DeepSeek result, thank you open source). It's a big improvement but not an unlimited one: push sparsity too hard and quality collapses.

Empirically, frontier context lengths plateaued at 100-200K for the last two years, which tells you that this is the cost-balanced point. To get to a 100M-token context, the kind that Dario has argued could replace continual learning, you'd need a memory-wall breakthrough that doesn't currently exist.

Chapter 7: The legible AI economy

The memory wall is the spine and the bottleneck of the whole industry.

Once you internalize the two roofline equations and the magic ~300 ratio, a surprising amount of the AI economy becomes legible:

| Question | Answer the equations give |
| --- | --- |
| Optimal batch size? | ~300 / sparsity |
| How fast can you serve, minimum? | N_total / mem_bw |
| How over-trained are frontier models? | ~100× past Chinchilla |
| What's the bytes-per-token of Gemini's KV cache? | ~1.7 KB |
| Why are output tokens 5× input? | Decode MFU ≈ 1/5 prefill MFU |
| Why no 1M-token contexts? | The memory wall has no roof |
| Why are slow storage tiers showing up in pricing? | Hold times match each tier's drain time, flash for minutes, disk for many hours |
| Why do MoE layers map to a rack? | All-to-all matches NVLink topology exactly |

The thing I keep coming back to is the equality result: pre-training tokens ≈ inference tokens. Every model you ship is sized so that the data you trained it on equals the data it will produce in its lifetime, give or take. The training corpus and the output stream are roughly the same size. That's a strange and beautiful symmetry, and it's a direct consequence of two heuristic equations on a blackboard.

What will break first?
If sparse attention cracks the memory wall, context lengths leap to 1M. If the FLOPS-to-bandwidth ratio shifts, the optimal batch size moves with it.

This episode was my favorite from Dwarkesh, and Reiner Pope deserves massive praise for explaining all these intricate details so cleanly. Everything being laid out on a blackboard was also great for studying along. (Don't worry Dwarkesh, the investment for the new studio was definitely worth it.)

Anyway. Some thank yous before I sign off.

Thank you, Dwarkesh, for fostering an environment where sharing of knowledge from varying disciplines is encouraged and explored in depth. You walk the line between technical and non-technical perfectly. Thank you to Reiner Pope for sharing his understanding and knowledge on this topic; he's a brilliant speaker and clear thinker.

Claude Code vs. Codex

2026-01-13
10 min read, 1,814 words. Tags: artificial intelligence, computer science, personal, writing

I hadn't touched anything code related for over a year by this point. My daily drivers back then were Cursor paired with Sonnet 3.7/3.5. So when I started work on building a CRM for my firm I was eager to play with all the new toys that were available to me. Opus 4.5 had come out and everyone was raving about how good it was.

I paid $20 and purchased a pro plan on Cursor. I rarely used agent mode in the past and always stuck to ask mode because agent mode often broke my codebase and needed a couple passes of cleaning up and manual tinkering to arrive at my intended outcome. I did the same this time and mainly stuck to ask mode, but Cursor would always prompt me to switch to agent mode so it could implement the changes after I reviewed the code. So, little by little I dipped my toes into using agent mode.

I started out by using Opus 4.5 and was very happy with the results. Opus was fast and diligent in its work; however, after 1.5 days I looked at my usage and my jaw dropped. I had blown through more than 50% of my allotted monthly usage. There was no way I could daily drive Opus 4.5 in Cursor at this pace. So, the next day I switched to Sonnet 4.5. This is where things started breaking: the model kept making mistakes or overlooking my instructions. Despite my efforts, I had burned through my monthly usage by day 3, and if I wanted to use more I'd need to pay. I tried out Composer-1 (Cursor's own coding model) but it was too outdated. It made countless mistakes and took me out of the flow. I reverted all the changes I made while using Composer-1. I had tasted the capabilities of Opus 4.5 and I just couldn't turn back.

https://x.com/cursor_ai/status/1999147953609736464

One thing I really enjoyed using in Cursor was their visual editor, which they introduced on the 11th of December '25. You could simply click on a component on your website and refer to it directly, changing its color, positioning or size. (You can see the full demo above.)
This made tweaking the small things on the website seamless. You didn't need to go through hundreds or thousands of lines of code to find where a certain button is and change it, or clumsily describe to the LLM which button you're talking about. You can just refer to the component directly and instruct the LLM on the changes.

<p align="center"> <img src="/images/center-the-div-clanker.jpeg" /> </p>

At that time, Windsurf (acquired by Cognition, the company behind Devin) came out with their new model SWE-1.5. The model was free to use in Windsurf so I jumped at the opportunity to try it out. I worked with it for a day and concluded that it wasn't up to my standards now that I was spoiled by Opus 4.5. Not even Sonnet 4.5 could quench my thirst, so it was no surprise when SWE-1.5 fell short too. I uninstalled Windsurf that day since it didn't provide me any value at that point.

I already had a Claude subscription and had heard of Claude Code but I hadn't tried it out. So, I said why not, booted up my terminal and installed the Claude Code CLI. I again used Opus 4.5 in Claude Code but something felt different from Cursor although I was using the exact same model. I had this feeling that Claude Code understood my instructions much better and provided the results I wanted. I wrote a CLAUDE.md and gave it my instructions on how I wanted the CRM to look and feel along with the features I wanted and security requirements. The user experience was quite enjoyable: the little nonsensical words it would say while thinking, the tips it would show me, its internal chain of thought, and the color choices were all very pleasing.

Claude Code showed me a clear to-do list and what it was working on, as well as all the diffs showing how it changed my current code. What struck me most wasn't the model; it was the same Opus 4.5 I'd used in Cursor. It was how Claude Code talked to the model. In other words, the harness that drives the model underneath. The transparency helped too.
Watching it tick through a to-do list, seeing the reasoning fragments surface while it worked, made me trust the output more.

Once in a while I'd give it a list of instructions to execute and it would create a to-do list. It would work on the to-do list step by step and tell me afterwards that it had implemented all my instructions, but it was clearly lying. It would forget to implement one task and tell me that it had already done it when it clearly hadn't. I started noticing the flaws. I also noticed that I would burn through my 5-hour rate limit in 2 to 3 prompts then have to wait for it to reset. During the Christmas holidays all major AI labs gave 2x rate limits, which was great for me, but after the holidays I was back to waiting and staring at the terminal.

In the meantime, while waiting for rate limits, I downloaded Antigravity (Google's IDE rivaling Cursor) which had my beloved Opus 4.5. I resumed my work and hit rate limits on Antigravity quite quickly as well. Although it wasn't as good as Cursor, it was free and provided some of the best models on the market (except for GPT-5.2). I tried using Gemini 3 Pro on my codebase and it broke everything instantly. After a couple of prompts, I was worried for Gemini's sanity. It was like a locked up prisoner of war that had a mental disorder. I closed Antigravity, tucked Gemini back into the bookcase, and called it a night. Never again.

Meanwhile I tried using Claude Code paired with Sonnet 4.5 so I wouldn't blow through my rate limits as fast as with Opus, but it was fruitless. I'd prompt Sonnet and it would make the changes I asked for, then I would need to prompt it 2-3 times to fix what it did or broke, so I ended up using the same number of tokens as if I had used Opus to begin with. I went back to Opus after this experience because all roads led to the same result.

I wanted to try Codex paired with GPT-5.2 since I heard that it was better at handling difficult problems.
So, I opened up the AI overlord's website and paid my $20 for the opportunity of using Codex. At this point, I was hesitant about giving long instructions to Opus 4.5 through Claude Code because I knew that it would not implement some of them, so I kept my prompts succinct and specific.

At first glance, Codex is a lot uglier than Claude Code. No fun, no colorful play on words or anything, just pure work. You also have to be much more specific in your prompts; it doesn't get what you're saying as well as Claude Code does on its first try. It also doesn't have as good a harness as Claude Code, but it is an absolute workhorse. It writes better and cleaner code. Which isn't to say that Opus 4.5's code was bad, but GPT-5.2 ekes out just a little bit more; however, the great improvement over Opus and Claude Code is the reliability. It doesn't forget your instructions as much as Opus does, and this reliability is something I care about a lot. OpenAI is also much more generous with their rate limits compared to Anthropic. I feel like I could actually get a decent amount of work done with Codex compared to getting rate limited instantly while using Claude Code. (Anthropic and OpenAI are both hemorrhaging money on the $20 subscription plan from users like me.) On the same $20 plan I'd say OpenAI gives 4-5x more usage than Anthropic. Opus is also more verbose, so it spends more tokens saying the same thing GPT-5.2 does. If we're talking about long form writing, Opus 4.5 is a better model.

For the reasons listed above, I use a combination of both Claude Code and Codex for different uses. If I'm refactoring the codebase or implementing a long instruction, such as translating all the webpages and UI components from English to Turkish, I use Codex. When I'm designing the frontend and tweaking around with the UI I use Claude Code with the frontend-design plugin.
Claude Code offers a better user experience and multi-turn conversation whereas Codex is very good at difficult problems and reliability. While both tools still forget to implement instructions and have a long way to go in terms of capabilities, the progress that has happened in a year is mind-blowing. As always, just to reiterate, this is the worst that these models are going to be. It's going to get cheaper, faster, more intelligent and capable as the years go on.

I didn't really write this post about how to use Claude Code or Codex to 100x your productivity or anything. I just wanted to write about my personal experiences using the new models as well as the new CLI tools, since I've had more than a year away from these models/tools. If you'd like an in-depth guide on what Claude Code can do, Sankalp has two great articles that go into detail on what is available if you scratch below the surface. (Skills, plugins, sub-agents, compaction, MCP, hooks etc.)

https://sankalp.bearblog.dev/my-claude-code-experience-after-2-weeks-of-usage/

https://sankalp.bearblog.dev/my-experience-with-claude-code-20-and-how-to-get-better-at-using-coding-agents/

Remember, these are all my personal experiences. Your mileage may differ from mine, and that's perfectly fine.

What was left on the cutting room floor: I used to use Warp terminal, which was great because it had built-in AI features and it would automatically fire off terminal commands. After using it for extended periods, though, I noticed what a power-hungry app it really was. I have since switched to Ghostty and talked with an engineer at Warp. He told me energy usage was on their priority list, so I may revisit it in the future. I also tried Mistral's CLI tool Vibe, which uses Devstral 2 under the hood, but I didn't use it enough to give it a fair shake, to be perfectly honest.