How Frontier LLMs are Trained and Served
2026-05-03
This article is based on my handwritten notes from Reiner Pope's blackboard-style interview with Dwarkesh Patel, plus cross-checks against the published transcript and Dwarkesh's flashcards made in preparation for the episode. I suggest opening it alongside the video as a study companion.

Some interviews are conversations. Reiner Pope's session with Dwarkesh isn't really a conversation; it's a summary of how the whole AI economy works, squeezed onto a single blackboard.

By the end of this article you will know the answers to these questions:

- Why "Slow Mode" doesn't exist as a product.
- Why MoE layers map cleanly onto a rack.
- Why pipeline parallelism doesn't really save you much in inference, and why Ilya said it's not wise.
- Why frontier models are over-trained ~100× past Chinchilla optimal.
- Why Gemini 3.1 charges 50% more above 200K context, and why output tokens cost 5× input tokens.

## Chapter 1: How batch size affects token cost and speed

First and foremost, don't worry about the math! There's only one idea. A chip running a model is doing two things at once: computing, and moving data around. One of these is always the bottleneck. If you can write down how long each takes, you can predict almost everything else.

The whole lecture is built on one inequality:
$$
t \;\geq\; \max\!\left(t_{\text{compute}},\; t_{\text{mem}}\right)
$$
where t is the time for one forward pass. The two terms, t_compute and t_mem (mem = memory), expand as:
$$
t_{\text{compute}} = \frac{B \cdot N_{\text{active}}}{\text{FLOPS}}
\quad\quad
t_{\text{mem}} = \frac{N_{\text{total}} + B \cdot \text{len}_{\text{ctx}} \cdot \text{bytes}_{\text{tok}}}{\text{mem\_bw}}
$$
B is the batch size: the number of sequences alive in one forward pass. Not "users." Not "concurrent sessions." Sequences in flight at the moment the model executes the same matmul (matrix multiplication) once. N_active is the active parameter count: the multipliers actually used per token. N_total is the total parameter count: everything sitting in HBM (high-bandwidth memory) that has to be paged in. mem_bw is memory bandwidth.

The compute term ignores attention itself; that's a deliberate simplification. The memory term has two sub-terms: a constant for weight fetches, and a term linear in both batch and context for KV-cache fetches. KV cache size = B × context length × bytes per token.

Everything else in this post falls out of those equations.

### The latency curve

*Figure: latency versus batch size.*

- Compute grows linearly from the origin.
- KV fetch also grows linearly with B, but with a slope set by context length and bytes per token.
- Weight fetch is a flat constant: N_total / mem_bw. It doesn't care how many sequences ride along.

The actual latency is the max of the summed memory terms and compute. On the plot, that's the bold red line: it hugs the memory curve at small batches, then hands off to the compute curve once compute becomes the bottleneck.

The two takeaways from this plot:

1. There is a hard latency floor: N_total / mem_bw. You cannot serve faster than the time it takes to drag every weight from HBM into the compute units once. This lower bound is why "Slow Mode" doesn't really exist as a distinct product in LLMs. Providers are already serving as cheaply as they can to stay competitive, and most are subsidizing inference on top of that.
2. The crossover is the goal. When the slope of t_KV matches the slope of t_compute, you are simultaneously memory-bound and compute-bound. On either side of that point you're leaving silicon idle. Hitting that intersection is the operational sweet spot.

### From latency to cost: the cost-per-token plot

Cost is a different question from latency. The customer doesn't pay for time; they pay rental seconds amortized over tokens served. So divide each curve by B:

*Figure: cost per token versus batch size.*

- The compute curve was linear → becomes flat.
- KV fetch was linear → also flat.
- Weight fetch was constant → becomes a hyperbola, plummeting as you grow B.

At B = 1, cost blows up (one token shouldering the entire weight fetch). At large B the weight fetches amortize away and cost collapses onto the compute floor.

Two things never amortize: compute (every token gets its own matmul) and KV fetches (every sequence brings its own context).
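To make the two curves concrete, here is a minimal sketch of the roofline model in Python. The hardware figures are the rough Blackwell numbers quoted later in this chapter; the model shape is an illustrative DeepSeek-V3-like configuration, and the context length and KV bytes-per-token are assumptions, not anyone's production numbers.

```python
# Minimal roofline sketch: latency and cost-per-token vs batch size.
# Hardware numbers are rough Blackwell figures; model shape is illustrative.

FLOPS = 4.5e15         # ~4,500 TFLOPS of FP4 compute
MEM_BW = 8e12          # ~8 TB/s of HBM bandwidth
N_ACTIVE = 37e9        # active params per token (DeepSeek-V3-like)
N_TOTAL = 671e9        # total params resident in HBM
LEN_CTX = 10_000       # assumed tokens of context per sequence
BYTES_TOK = 1.7e3      # KV-cache bytes per token (derived in Chapter 6)
BYTES_PER_PARAM = 0.5  # FP4 weights are half a byte

def step_time(batch: int) -> float:
    """Time for one forward pass: max of compute time and memory traffic."""
    t_compute = batch * N_ACTIVE / FLOPS               # matmul work
    weight_fetch = N_TOTAL * BYTES_PER_PARAM / MEM_BW  # flat in batch
    kv_fetch = batch * LEN_CTX * BYTES_TOK / MEM_BW    # linear in batch
    return max(t_compute, weight_fetch + kv_fetch)

for batch in (1, 64, 512, 2_400, 10_000):
    t = step_time(batch)
    print(f"B={batch:>6}: {t*1e3:6.1f} ms/step, "
          f"{t/batch*1e6:9.2f} us/token")  # cost/token ~ time/token
```

Run it and you can watch the us/token column collapse as B grows: the hyperbola from the plot, in four lines of arithmetic.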
### Solving for the optimal batch size

Equate t_compute with the weight-fetch portion of t_mem, ignoring KV for now:

$$
\frac{N_{\text{total}}}{\text{mem\_bw}} = \frac{B \cdot N_{\text{active}}}{\text{FLOPS}}
$$
Rearrange so all hardware sits on one side, all model on the other:
$$
\underbrace{\frac{\text{FLOPS}}{\text{mem\_bw}}}_{\text{hardware}} \;=\; \underbrace{\frac{B \cdot N_{\text{active}}}{N_{\text{total}}}}_{\text{model}}
$$
The ratio FLOPS / mem_bw asks: for every byte of memory the chip can move per second, how many math operations can it do per second? On a Blackwell GPU, roughly:

- FLOPS ≈ 4,500 trillion FP4 multiplies per second
- mem_bw ≈ 8 trillion bytes per second
- Each FP4 weight is half a byte

So the chip can do about 4,500 / 8 ≈ 560 multiplies for every byte of memory bandwidth. But each FP4 weight is only half a byte, so 560 × 0.5 ≈ 280. Round to ~300. This number has barely budged from A100 → H100 → B100; FLOPS and bandwidth scaled together.

So:
$$
\boxed{\;B \,\geq\, 300 \times \frac{1}{\text{sparsity}}\;}
$$
For DeepSeek V3, 32 of 256 experts are active per token, giving N_total/N_active ≈ 8, which is what gives B ≥ 300 × 8 = 2,400. In practice operators run 2-3× higher because real efficiency lags the roofline. Round number: a 20-millisecond train carrying 2,000-3,000 tokens.

### The 20-millisecond train

Why 20 ms specifically? Because of HBM drain time: capacity divided by bandwidth is about 20 ms on essentially every modern HBM generation. On the Rubin generation it is closer to ~288 GB / ~20 TB/s ≈ 15 ms.

Why does this matter? Because in 20 ms you can read all of HBM exactly once. You don't want to read it twice in one pass: the weight matrices are read-only, and you don't want to re-fetch your KV cache. So 20 ms is the natural cycle time.

A "train" departs every drain-time. Any sequences ready when the train pulls in board it. If the train fills, the rest wait. If it's half-empty, it leaves anyway. Worst-case queueing latency is ~40 ms: the first train you miss, plus the time to ride the next one.

### Tokens per second, not concurrent users

"Concurrent users" is a fuzzy concept; tokens per second is precise:
$$
\text{tok/s} = \frac{B}{t_{\text{train}}} \approx \frac{2{,}000}{15\,\text{ms}} \approx 133{,}000\ \text{tok/s}
$$
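A quick sanity check of that arithmetic, with the hardware ratio and sparsity as the only inputs. These are the illustrative figures from above, not measured ones:

```python
# Optimal batch size and per-cell throughput from the two magic numbers.
# ~300 is the Blackwell FLOPS-to-bytes ratio derived above; 8 is
# DeepSeek-V3's N_total / N_active. Both are rough, illustrative figures.

HW_RATIO = 300        # useful multiplies per byte of HBM bandwidth
INV_SPARSITY = 8      # N_total / N_active (32 of 256 experts)
DRAIN_TIME_S = 0.015  # HBM capacity / bandwidth (~15-20 ms)

batch_min = HW_RATIO * INV_SPARSITY  # -> 2,400; the post rounds to 2,000-3,000
batch = 2_000                        # round number used in the equation above
tok_per_s = batch / DRAIN_TIME_S     # one "train" departs per drain time

print(f"minimum compute-bound batch: {batch_min}")
print(f"throughput: {tok_per_s:,.0f} tok/s per serving cell")  # ~133,000
```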
Gemini's reported traffic is in the hundreds of millions of tokens per second globally. So one optimally-batched serving cell handles roughly one one-thousandth of Gemini's load. Which means: to compete commercially you need at least one one-thousandth of Gemini's traffic. Below that you can't even fill the train.

## Chapter 2: How MoE models are laid out across GPU racks

Here's a mixture-of-experts (MoE) layer:

*Figure: MoE layer layout.*

Tokens enter, and a router activates k of E experts per token, where E is the total number of experts in the layer and k is how many fire for each token (DeepSeek: 32 of 256). The ratio k/E is what we call sparsity.

Each expert is a full MLP (multi-layer perceptron). It does three things in sequence (put simply: "expand → think → compress"):

1. Up-projection: multiply the token's vector by a tall matrix to expand it into a much higher-dimensional space. If the token is a 4,000-dim vector, the up-proj might lift it to 16,000 dimensions.
2. Nonlinearity: apply a function like ReLU, GELU, or SwiGLU element-wise (these are different activation functions; the distinction can be ignored for now). This is where the layer can actually compute non-trivial functions; without a nonlinearity, two stacked matrix multiplies would just collapse into one.
3. Down-projection: multiply by another matrix that brings the vector back down to the original dimension (4,000). Now the result is the same shape as the input and can flow into the next layer.

The outputs of the chosen experts are summed, and that sum is added back onto the residual stream. Concretely:

`new_token = old_token + MoE(old_token)`

old_token is the residual stream: the running vector that flows through the entire model, end to end. MoE(old_token) is what this layer contributes. Every attention and MLP layer reads from the residual stream and adds its contribution back. The model's final answer is read off the residual stream once all layers have written into it. (A minimal sketch of this layer in code follows below.)

The standard layout is expert parallelism: different experts on different GPUs. DeepSeek has 256 experts; a Blackwell rack has 72 GPUs (use 64 for divisibility, ignore the other eight). That's 4 experts per GPU. Routing then becomes an all-to-all traffic pattern: any GPU's token can be sent to any GPU's expert.

This is why one rack is a natural boundary. Within a rack, NVLink/NVSwitch connects every GPU to every other GPU at full bandwidth, a perfect fit for all-to-all. Cross a rack and you drop onto the scale-out network at roughly 8× lower bandwidth.
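Here is the promised sketch of that layer in numpy: top-k routing, an expand → nonlinearity → compress MLP per expert, and the residual add. Dimensions, expert counts, and random weights are toy values chosen for readability, not anything a real model uses.

```python
import numpy as np

# Toy dimensions for readability; real models are far larger.
D_MODEL, D_HIDDEN = 8, 32  # residual width, expert hidden width
N_EXPERTS, TOP_K = 4, 2    # E experts total, k active per token

rng = np.random.default_rng(0)
w_router = rng.normal(size=(D_MODEL, N_EXPERTS)) * 0.1
w_up = rng.normal(size=(N_EXPERTS, D_MODEL, D_HIDDEN)) * 0.1
w_down = rng.normal(size=(N_EXPERTS, D_HIDDEN, D_MODEL)) * 0.1

def moe_layer(token: np.ndarray) -> np.ndarray:
    """One MoE layer: route to top-k experts, sum their weighted
    outputs, and add that sum back onto the residual stream."""
    scores = token @ w_router              # router logits, one per expert
    top = np.argsort(scores)[-TOP_K:]      # indices of the k winners
    weights = np.exp(scores[top])
    weights /= weights.sum()               # softmax over the winners only
    out = np.zeros_like(token)
    for w, e in zip(weights, top):
        h = np.maximum(token @ w_up[e], 0.0)  # up-project, then ReLU
        out += w * (h @ w_down[e])            # down-project, weighted sum
    return token + out                        # residual add

token = rng.normal(size=D_MODEL)
print(moe_layer(token))  # same shape as the input, ready for the next layer
```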
### Scale-up vs scale-out

*Figure: scale-up versus scale-out.*

Three vendors, three names for the same thing:

| Providers | Scale-up (intra-rack) | Scale-out (inter-rack) |
| --------- | ----------------------------- | ---------------------------- |
| NVIDIA | NVLink / NVSwitch | Ethernet (RoCE) / InfiniBand |
| AMD | Infinity Fabric | Ethernet / InfiniBand |
| Google | Inter-Chip Interconnect (ICI) | Ethernet |

Scale-up bandwidths are in the multi-TB/s range per GPU with hundreds-of-nanoseconds latency; Blackwell NVLink is around 1.8 TB/s per GPU. Scale-out is 400-800 Gbps per GPU. Depending on exactly what you compare, the gap is about 3× for Blackwell's in-rack vs out-of-rack links, and ~8× in the general bandwidth comparison this post uses as its working penalty.

NVIDIA GPUs can send data directly from one GPU to another. TPUs route differently: a message may need to traverse intermediate TPUs inside a pod to reach a particular target. Starting with TPU v4 (2021), Google added Optical Circuit Switches between TPU blocks, dynamically reconfiguring which blocks are physical neighbors on a per-job basis.

### Why scale-up domains keep growing

| Generation | GPUs in scale-up | Form factor |
|---|---|---|
| Hopper | 8 | Tray |
| Blackwell | 72 | Rack |
| Rubin | ~500 | Rack (much denser) |

Hopper-to-Blackwell was mostly a product decision: switch from trays to racks. Blackwell-to-Rubin is roughly a 4× density increase coming from more aggressive cable routing and power delivery. What constrains a rack is power, weight, cooling, and the bend radius of the cables themselves. Modern racks push every one of those to the physical limit.

The macro implication answers "why did model sizes only recently start scaling again?" GPT-4 was rumored to be a trillion-ish parameters in 2023; models meaningfully larger only started shipping in the last six months. The constraint wasn't the algorithm. Until the scale-up domain got big enough to host a multi-trillion-parameter model and its KV cache for thousands of sequences, you couldn't serve such a model economically.

Active parameters are limited by compute cost. Total parameters are limited by scale-up size.

That's the equation Google had a head start on: their TPU pods have had large scale-up domains for years.

## Chapter 3: Pipeline parallelism, micro-batching, and the bubble

Expert parallelism handles one MoE layer. To stretch a model deeper than one rack, you reach for pipeline parallelism: layer 1 on rack 1, layer 2 on rack 2, and so on. Tensor parallelism, cutting along the hidden dimension or FFN (feed-forward network) dimension, is a third option in principle, but with experts now small, the math no longer pays off. Tensor-parallel cuts force frequent all-reduce / all-gather operations inside every transformer block, and unless the dimensions are huge and the interconnect very fast, the communication overhead eats the benefit.

Layer parallelism and expert parallelism, by contrast, split the model into self-contained computational units; each device does meaningful work before having to talk. Layers and experts are the "chunky" dimensions of a transformer, and chunky cuts give better compute-to-communication ratios.

### When does pipelining hurt scale-up?

Set up the ratio of scale-up time to scale-out time. We want scale-up to dominate, since it's the precious resource:
$$
\frac{t_{\text{scale-up}}}{t_{\text{scale-out}}} \;=\; \frac{1}{8}\,(\text{BW penalty}) \;\cdot\; n_{\text{activated experts}} \;\cdot\; n_{\text{layers per stage}} \;\cdot\; 2 \;\geq\; 1
$$
The factor of 2 is for the all-to-all up and the all-to-all down.

The product of the three positive terms needs to beat the 8× bandwidth penalty. The number of activated experts alone is often 8. Add a few layers per pipeline stage and you're comfortably above the bar.

This means pipelining racks together is fine for forward passes: the all-to-all communication stays inside racks, and only the residual stream crosses rack boundaries.

### Inference: no bubble

*Figure: pipeline inference.*

In inference there's no backward pass. Each rack runs its layer; tokens stream forward; the moment a rack finishes batch 0 it picks up batch 1. Set n_micro_batches = n_pipeline_stages and the wraparound is seamless. No bubble. For latency this is neutral: the full forward pass takes the same time whether the layers live on one rack or four, because pipeline stages run sequentially on a given inference. Pipelining just buys you memory capacity per rack.

### Training: the bubble is unavoidable

*Figure: pipeline-parallelism training bubble.*

In training the pipeline has to fill, then drain. The forward pass fills the pipe; then you hit a hard stop (between F3 and B3) because backward needs the entire batch's gradient at once; then the backward pass drains it. The hatched regions are the bubble: racks doing nothing while the pipe fills or drains.

Why the hard stop? Because there's an optimal batch size for ML convergence (smaller batches give fresher gradients) and a competing optimum for total training time (smaller batches are worse from a systems perspective). The chosen batch size sits somewhere between these two. Once chosen, you do all of it forward, then all of it backward, which is what creates the bubble in the first place.
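Two back-of-envelope checks in code: the scale-up/scale-out ratio from the inequality above, and the size of the training bubble. The bubble expression (P - 1) / (P - 1 + M) for P stages and M micro-batches is the standard textbook formula, not something derived in the talk.

```python
# Check 1: does all-to-all traffic stay comfortably inside the rack?
BW_PENALTY = 1 / 8    # scale-out is ~8x slower than scale-up
N_ACTIVE_EXPERTS = 8  # experts activated per token
LAYERS_PER_STAGE = 4  # transformer layers per pipeline stage

ratio = BW_PENALTY * N_ACTIVE_EXPERTS * LAYERS_PER_STAGE * 2  # x2: up + down
print(f"scale-up / scale-out time ratio: {ratio:.1f} (want >= 1)")

# Check 2: fraction of a training step lost to the pipeline bubble,
# using the standard (P - 1) / (P - 1 + M) expression.
for stages, micro_batches in [(4, 4), (4, 16), (8, 8), (8, 64)]:
    bubble = (stages - 1) / (stages - 1 + micro_batches)
    print(f"P={stages}, M={micro_batches:>2}: bubble = {bubble:.0%}")
```

More micro-batches shrink the bubble but never eliminate it, which is exactly why the schedules in the next chapter exist.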
## Chapter 4: Why Ilya said, "As we now know, pipelining is not wise."

The pipeline bubble has clever workarounds: schedules called "zero bubble" and "one-forward-one-backward" interleave the two directions to keep racks busy. But this is where the famous Ilya line hits: "As we now know, pipelining is not wise."

Memory capacity is per rack. If your model doesn't fit, pipelining lets you split it across racks. The Ilya line is about the architectural debt this accumulates. Something like Kimi's residual attention (where each block attends to multiple prior layers' residuals) assumes those residuals are co-located, which becomes very hard to implement when they live on different stages. Interleaved sliding-window vs global attention layers create load imbalance. Every constraint slows down research iteration, and in a frontier lab that is a cardinal sin.

### The bigger memory equation

Total memory per system:

$$
\text{Capacity}_{\text{mem}} \;=\; N_{\text{total}} \;+\; B \cdot \text{len}_{\text{ctx}} \cdot \text{bytes}_{\text{tok}}
$$
Per GPU, with E = expert parallelism (e.g. 64) and P = pipelining (e.g. 4 racks):
$$
c_{\text{mem}} \;=\; \frac{N_{\text{total}}}{E \cdot P} \;+\; \frac{b \cdot \text{len}_{\text{ctx}} \cdot \text{bytes}_{\text{tok}}}{E}
$$
Note the second term has only E in the denominator, not E · P. The Ps cancel. Here's why: pipelining lets P racks each hold a different chunk of weights, so the weight footprint per rack drops by P. But each rack now needs P different micro-batches alive simultaneously to stay busy (that's what micro-batching is), so the KV-cache footprint per rack is multiplied by P from the micro-batches and divided by P from sharding the cache across stages. Net zero. Formally, B = n_micro · b, and n_micro = P (you need that many micro-batches in flight to keep the pipeline busy).

Pipelining shrinks the weight footprint per GPU but does nothing for the KV-cache footprint. Once P ≥ 2, the KV cache becomes the dominant per-GPU memory bottleneck.

This is exactly DeepSeek's published recipe for V3 inference, and presumably what frontier labs are doing: maximize expert parallelism inside a single scale-up domain and use very little pipelining. Frontier inference is probably running on a single scale-up.
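The cancellation is easy to verify numerically. A sketch with DeepSeek-V3-ish weight numbers and an assumed batch and context (the absolute values are illustrative; the point is that the KV column doesn't move as P changes while the weight column shrinks):

```python
# Per-GPU memory footprint as pipelining P varies, expert parallelism E fixed.
# Illustrative numbers; the point is the shape, not the absolute values.
N_TOTAL_BYTES = 671e9 * 0.5  # total params at FP4 (half a byte each)
E = 64                       # expert parallelism (GPUs per rack)
B_PER_STAGE = 2_000          # assumed micro-batch size b per pipeline slot
LEN_CTX, BYTES_TOK = 100_000, 1.7e3

for P in (1, 2, 4, 8):
    weights = N_TOTAL_BYTES / (E * P)  # weights sharded P more ways
    # B = P * b micro-batches in flight, KV sharded across the P stages:
    kv = (P * B_PER_STAGE) * LEN_CTX * BYTES_TOK / (E * P)
    print(f"P={P}: weights {weights/1e9:5.2f} GB/GPU, "
          f"KV cache {kv/1e9:5.2f} GB/GPU")
```

With these numbers, weights drop from ~5.2 GB to ~0.7 GB per GPU as P grows, while the KV cache stays pinned at ~5.3 GB: KV dominates from P = 2 onward.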
### The hidden bandwidth win from bigger scale-up

There's another reason scale-up matters even more than capacity:

$$
t_{\text{mem, weights}} = \frac{N_{\text{total}}}{\text{mem\_bw}}
$$
Here mem_bw is the aggregate memory bandwidth of every GPU loading weights in parallel, which equals (scale-up size) × (per-GPU bandwidth). Per-GPU bandwidth grew about 1.5-2× per generation; scale-up size grew 8× from Hopper to Blackwell. Most of the latency improvement came from having more HBM ports loading weights at once, not from faster HBM.

Bigger scale-up means lower-latency inference, which makes longer context lengths feasible. We're still bounded by the KV-fetch term; sparse attention helps, at some expense of quality, but it doesn't break the memory wall, which is part of why context lengths have plateaued at 100-200K for the past two years.

## Chapter 5: Reinforcement learning and over-training beyond Chinchilla

### Where the 6ND number comes from

Every estimate of training cost (Chinchilla's, GPT-4's, anything in a scaling-law paper) runs through the same formula: 6ND, where N is active parameters and D is training tokens. Here's where the 6 comes from.

The forward pass costs 2 FLOPs per parameter per token: one multiply plus one add per multiply-accumulate. The backward pass is 2× the forward pass, because you compute gradients with respect to both input matrices of each matrix multiplication. So backward is 4 FLOPs per parameter per token, and 2 + 4 = 6 in total.
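In code, the whole cost model is one line. Plugging in the rumored figures used later in this chapter (~100B active parameters, ~150T pre-training tokens; rumors, not disclosures), with an assumed training MFU:

```python
# Pre-training compute from the 6ND rule of thumb.
N_ACTIVE = 100e9     # active parameters (rumored, order of magnitude)
D_PRETRAIN = 150e12  # pre-training tokens (rumored)
CHIP_FLOPS = 4.5e15  # Blackwell-class peak (the FP4 figure from Chapter 1)
MFU = 0.40           # assumed model-FLOPs utilization for training

flops = 6 * N_ACTIVE * D_PRETRAIN
chip_seconds = flops / (CHIP_FLOPS * MFU)
chip_years = chip_seconds / (365 * 24 * 3600)
print(f"training FLOPs: {flops:.1e}")  # ~9e25
print(f"chip-years at {MFU:.0%} MFU: {chip_years:,.0f}")  # ~1,600
```

Roughly 1,600 chip-years, i.e. a ~20,000-chip cluster busy for about a month, which is the right order of magnitude for a frontier pre-training run.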
### Three buckets, one cost equation

Total compute cost across the three stages of a model's life:

$$
C_{\text{total}} \;=\; \underbrace{6 N_{\text{act}} D_{\text{PT}}}_{\text{pre-train}} \;+\; \underbrace{(2 \text{ to } 6)\, N_{\text{act}} D_{\text{RL}} \cdot \text{ineff}}_{\text{RL}} \;+\; \underbrace{2 N_{\text{act}} D_{\text{inf}} \cdot \text{ineff}}_{\text{inference}}
$$
The RL coefficient is 2-6 depending on whether you backward-pass on every rollout. The inefficiency factor accounts for the fact that decode (used in RL rollouts and inference) runs at much lower MFU (model-FLOPs utilization) than prefill or pure training.

### The equality heuristic

For costs that trade off, one term grows while another shrinks, and the minimum tends to sit where the two terms are equal. The intuition: at any other point you're paying more on one side than you're saving on the other, which means there's free movement available.

Quick example: f(x) = ax + b/x has its minimum at x = √(b/a), where both terms equal √(ab). This generalizes loosely to any sum of a growing and a shrinking term, which is exactly what training cost (grows with model size) plus inference cost (shrinks with model size, since smaller models serve cheaper) looks like.
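A brute-force numerical check of that claim, with arbitrary a and b:

```python
# The minimum of f(x) = a*x + b/x sits where the two terms are equal.
a, b = 3.0, 48.0
xs = [x / 1000 for x in range(1, 20_000)]      # dense grid over x > 0
x_best = min(xs, key=lambda x: a * x + b / x)  # brute-force minimum
print(f"numeric argmin: {x_best:.3f}")         # -> ~4.0
print(f"sqrt(b/a):      {(b / a) ** 0.5:.3f}")  # -> 4.0
print(f"terms at min:   {a * x_best:.2f} vs {b / x_best:.2f}")  # equal
```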
Applying it here: if a frontier lab is optimizing reasonably well, the three buckets (pre-training, RL, and inference) should satisfy:

$$
C_{\text{PT}} \approx C_{\text{RL}} \approx C_{\text{inf}}
$$
N_act cancels. With ineff ≈ 1/3 you arrive at:
$$
D_{\text{PT}} \approx 1.5 \cdot D_{\text{RL}} \approx D_{\text{inf}}
$$
Pre-training tokens, RL tokens, and lifetime inference tokens should all be in the same ballpark.

### Plugging in some real numbers

- Inference: ~50M tokens/s globally, deployed for 2 months before obsolescence → D_inf ≈ 2.6 × 10¹⁴ tokens, roughly 260T; call it ~200T.
- Pre-training (rumored): ~150T tokens. Same order of magnitude; the heuristic checks out.
- Active parameters: ~100B.
- Chinchilla optimum: D_Chinchilla ≈ 20 × N_active = 2T tokens.
- Frontier ratio: ~200T / 2T = 100× past Chinchilla.

### Why no one trains Chinchilla-optimal anymore

Chinchilla minimizes training compute for a given final loss. But in production, the training bill arrives once; the inference bill arrives for the entire deployment lifetime. A smaller model trained on 50× more data is slightly worse per training-FLOP (because Chinchilla was the optimum under that constraint) but vastly cheaper to serve. So you over-train.

The outcome: every model you ship is exactly large enough that the data you trained it on roughly equals the data it will produce in its lifetime. The training corpus and the output stream are roughly the same size. That's a strange and beautiful symmetry, and it's a direct consequence of the equality heuristic.
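The same arithmetic as a script, using the figures above (all of them order-of-magnitude rumors rather than disclosures):

```python
# Lifetime inference tokens vs pre-training tokens vs Chinchilla optimum.
TOK_PER_S = 50e6                 # global inference traffic (rumored)
LIFETIME_S = 2 * 30 * 24 * 3600  # ~2 months before obsolescence
N_ACTIVE = 100e9                 # active parameters (rumored)
D_PRETRAIN = 150e12              # pre-training tokens (rumored)

d_inf = TOK_PER_S * LIFETIME_S
d_chinchilla = 20 * N_ACTIVE     # Chinchilla: ~20 tokens per parameter
print(f"lifetime inference tokens: {d_inf:.1e}")       # ~2.6e14
print(f"pre-training tokens:       {D_PRETRAIN:.1e}")  # same ballpark
print(f"over-training factor:      {D_PRETRAIN / d_chinchilla:.0f}x "
      f"past Chinchilla")  # ~75x with these inputs; the post rounds to ~100x
```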
## Chapter 6: Deducing long-context memory costs from API pricing

Chapter 6 is Chapter 1 with a different x-axis.

Remember the latency plot from Chapter 1? Compute linear in batch; memory mostly flat at small batch, then linear at large batch; a regime transition where they cross. Replace the x-axis with context length and the same plot reappears, just rotated. The crossover at 200K is the same crossover, applied to a different sweep variable. API pricing tracks it because Google's costs do.

### The 200K crossover in Gemini

Gemini 3.1 charges 50% more above 200K context length. Here's why:

*Figure: cost versus context length.*

- t_compute = B · N_act / FLOPS is essentially flat in len_ctx.
- t_mem rises linearly with len_ctx (the KV-fetch term grows).

The pricing tier tracks the underlying cost envelope; the kink is where memory time crosses compute time. At the crossover:

$$
\frac{B \cdot \text{len}_{\text{ctx}} \cdot \text{bytes}_{\text{tok}}}{\text{mem\_bw}} = \frac{B \cdot N_{\text{act}}}{\text{FLOPS}}
$$
Solve for bytes per token:
$$
\text{bytes}_{\text{tok}} = \frac{\text{mem\_bw}}{\text{FLOPS}} \cdot \frac{N_{\text{act}}}{\text{len}_{\text{ctx}}} = \frac{1}{300} \cdot \frac{100\text{B}}{200\text{K}} \approx 1.7\;\text{KB/tok}
$$
And we can decompose that:
$$
N_{\text{bytes/tok}} = N_{\text{attn layers}} \cdot 2 \cdot d_{\text{head}} \cdot N_{\text{KV heads}}
$$
With d_head (the per-head vector dimension) = 128 and a small KV-head count (1-8), 1.7 KB is consistent with dense attention plus heavy cross-layer KV-cache reuse, the Character.AI / Gemma trick where the global KV is shared across all attention layers. Or with sparse attention. Either way, pricing has leaked architectural information.

Competitive pricing pressure forces every API tariff to track its underlying cost structure fairly tightly: if your prices drift too far above your real costs, a competitor with lower costs will eat your margin.
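The crossover algebra in code, solving for bytes per token and sanity-checking the decomposition. The layer and head counts below are hypothetical values chosen to land near 1.7 KB, not Gemini's actual architecture:

```python
# Solve the 200K-context crossover for KV-cache bytes per token.
HW_RATIO = 300  # FLOPS per byte of memory bandwidth (Chapter 1)
N_ACT = 100e9   # active parameters (rumored)
LEN_CTX = 200e3  # context length where the pricing kink sits

bytes_tok = (1 / HW_RATIO) * (N_ACT / LEN_CTX)
print(f"bytes per token: {bytes_tok:,.0f}")  # ~1,700 -> ~1.7 KB

# One hypothetical decomposition that lands near 1.7 KB (FP8 cache, heavy
# cross-layer KV sharing); Gemini's real shape is unknown. Note the extra
# bytes-per-element factor, which the formula above leaves implicit.
eff_layers, d_head, kv_heads, bytes_elem = 3, 128, 2, 1
decomp = eff_layers * 2 * d_head * kv_heads * bytes_elem  # x2: K and V
print(f"decomposition:   {decomp:,} bytes")  # 1,536, same ballpark
```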
### Why output tokens cost 5× input tokens

*Figure: decode versus prefill.*

Prefill processes the entire prompt in parallel: many tokens per pass, highly parallelizable, compute-limited. Decode generates one token per pass, autoregressively: each token loads the full weights plus KV cache, so it's memory-bandwidth limited.

On the time-per-token chart:

- t_compute / len_pass is flat.
- t_mem / len_pass is a hyperbola.

Both regimes rent the same GPU for the same dollar. What changes is how much of the GPU's FLOPS budget is actually used per second. Prefill saturates the FLOPS; it's compute-bound. Decode underuses them, because the chip sits idle waiting on HBM. At len_pass = 1 (decode), memory dominates and the cost is ≈ 5×; at large len_pass (prefill), the curve collapses onto the compute floor and the cost is ≈ 1×.

So the 5× ratio in pricing is telling you the MFU ratio between the two regimes: decode runs at roughly 20% of prefill's model-FLOPs utilization.

### Why cache hits are 10× cheaper

There are three places to materialize a token's KV cache:

| Strategy | Retrieval cost (per token) | Hold cost (per second) |
| ------------------------------- | -------------------------- | ------------------------------------ |
| Rematerialize (recompute) | N_act / FLOPS × GPU $/s | 0 |
| Store in HBM | ≈ 0 (already there) | bytes_tok / HBM_capacity × GPU $/s |
| Store in DDR / Flash / Disk | bytes_tok / DDR_bw | bytes_tok × DDR $/s |

The optimal tier matches your hold time to the tier's drain time (capacity / bandwidth):

| Tier | Drain time |
| ------------------- | ------------ |
| HBM | ~20 ms |
| DDR | ~1-10 s |
| Flash (SSD) | ~1 minute |
| Spinning disk (HDD) | ~18-22 hours |

The fact that some providers offer both 5-minute and 1-hour cache pricing strongly implies the back tiers are flash plus spinning disk.
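Drain time is just capacity over bandwidth. A sketch with ballpark per-device numbers (capacities and bandwidths vary widely by part; these are rough figures for scale, not any provider's hardware):

```python
# Drain time = capacity / bandwidth, per storage tier.
# Ballpark per-device figures; exact numbers vary widely by part.
tiers = {
    "HBM": (192e9, 8e12),   # ~192 GB at ~8 TB/s
    "DDR": (512e9, 4e11),   # ~512 GB at ~400 GB/s
    "SSD": (500e9, 7e9),    # ~500 GB at ~7 GB/s
    "HDD": (20e12, 2.5e8),  # ~20 TB at ~250 MB/s
}
for name, (capacity, bandwidth) in tiers.items():
    seconds = capacity / bandwidth
    print(f"{name}: {seconds:,.3g} s drain time")
# -> HBM ~0.024 s, DDR ~1.3 s, SSD ~71 s, HDD ~80,000 s (~22 h)
```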
### The million-token wall

Why doesn't anyone ship a million-token-context model? The compute term grows quadratically in attention, but with such a small slope that you only feel it in the multi-million-token range. The actual binding constraint is memory bandwidth: the linear-in-context KV-fetch term.

Sparse attention turns that linear term into roughly √len_ctx (a key DeepSeek result; thank you, open source). It's a big improvement, but not an infinite one, and push sparsity too hard and quality collapses.

Empirically, frontier context lengths have plateaued at 100-200K for the last two years, which tells you this is the cost-balanced point. To get to a 100M-token context, the kind Dario has argued could replace continual learning, you'd need a memory-wall breakthrough that doesn't currently exist.

## Chapter 7: The legible AI economy

The memory wall is the spine, and the bottleneck, of the whole industry. Once you internalize the two roofline equations and the magic ~300 ratio, a surprising amount of the AI economy becomes legible:

| Question | Answer the equations give |
| ------------------------------------------------- | ------------------------------------------------------------------------------- |
| Optimal batch size? | ~300 / sparsity |
| How fast can you serve, minimum? | N_total / mem_bw |
| How over-trained are frontier models? | ~100× past Chinchilla |
| What's the bytes-per-token of Gemini's KV cache? | ~1.7 KB |
| Why are output tokens 5× input? | Decode MFU ≈ 1/5 prefill MFU |
| Why no 1M-token contexts? | The memory wall: KV fetch grows with context |
| Why are slow storage tiers showing up in pricing? | Hold times match each tier's drain time, flash for minutes, disk for many hours |
| Why do MoE layers map to a rack? | All-to-all matches NVLink topology exactly |

The thing I keep coming back to is the equality result: pre-training tokens ≈ inference tokens. Every model you ship is exactly large enough that the data you trained it on equals the data it will produce in its lifetime, give or take. The training corpus and the output stream are roughly the same size. That's a strange and beautiful symmetry, and it's a direct consequence of two heuristic equations on a blackboard.

What will break first? If sparse attention cracks the memory wall, context lengths leap to 1M. If the FLOPS-to-bandwidth ratio shifts, the optimal batch size moves with it.

This episode was my favorite from Dwarkesh, and Reiner Pope deserves massive praise for explaining all these intricate details so cleanly. Everything being laid out on a blackboard was also great for studying along. (Don't worry, Dwarkesh, the investment in the new studio was definitely worth it.)

Anyway. Some thank-yous before I sign off.

Thank you, Dwarkesh, for fostering an environment where sharing knowledge across disciplines is encouraged and explored in depth; you walk the line between technical and non-technical perfectly. And thank you to Reiner Pope for sharing his understanding of this topic; he's a brilliant speaker and a clear thinker.