Frontier LLM Questions and Answers

50 questions sorted from lowest to highest relevance to frontier LLM development. Each answer shows scores from the 5.5 Pro, Opus 4.8, and Gemini 3.1 Pro judges. Purple marks GPT 5.5 Pro answers; green marks Fable answers.

ID	Relevance	Question	GPT 5.5 Pro Answer	Fable Answer
Q005	58	How should I set sequence length, microbatch size, and gradient accumulation on a limited GPU budget?	5.5 Pro judge: 86 Opus 4.8 Judge: 84 Gemini 3.1 Pro judge: 68 Treat this as a controlled engineering decision, not a preference. Choose sequence length from required behavior, then solve memory with microbatching, accumulation, FlashAttention, checkpointing, and packing. Do not sacrifice total useful tokens for long context unless long-context capability is an explicit target. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Debug at 1,024 or 2,048 tokens; use 4,096 when code, document QA, or reasoning benefits; extend later to 8k-128k with a specific phase. Optimize global_batch_tokens = sequence_length * microbatch_sequences_per_gpu * accumulation * data_parallel_world_size. Start around 0.25M-1M tokens/update for small runs and 1M-4M for larger pretraining. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Find the largest microbatch that fits with 5%-10% memory headroom, then use accumulation to hit the global batch. Track padding fraction below 5%-10%, tokens/sec, memory reserved versus allocated, recompute overhead, attention kernel selection, and loss by document length. For pipeline parallelism, choose microbatches m at least 4p for p stages. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: OOMs from insufficient headroom, accumulation above roughly 256 microsteps reducing update dynamics, packing without EOS across documents, sequence-length changes without LR/batch retuning, and optimizing examples rather than tokens. Diagnose with memory budget tables and position-wise loss. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Vaswani et al. 2017 attention scaling; Dao et al. 2022/2023 FlashAttention; Narayanan et al. 2021 Megatron-LM; Korthikanti et al. 2022 sequence parallelism..	5.5 Pro judge: 95 Opus 4.8 Judge: 90 Gemini 3.1 Pro judge: 87 Fix these in order: sequence length from the data and target use, microbatch size from memory, gradient accumulation from the target global batch. Sequence length: for a first pretraining run, 2048–4096 tokens is the sweet spot; attention cost grows quadratically but with FlashAttention-2 (Dao 2023) the attention share at 4k and head_dim 128 is modest (~10–20% of step time for 1–7B models). Do not pretrain at 32k+ from scratch — train at 4k–8k and extend context later (the Llama 3 recipe: 8k for the bulk, long-context extension in late stages). Pack documents to fill sequences (greedy concat with EOS, ideally with cross-document attention masking); unpacked padding can waste 30–60% of compute on web-scale data. Microbatch size (per-GPU sequences per forward pass): set it to the largest value that fits after enabling activation checkpointing, then back off one notch for stability. Memory budget per GPU = weights + optimizer state + gradients + activations. With BF16 weights and FP32 Adam state, static cost is ~16 bytes/param unsharded (2 weight + 2 grad + 12 optimizer); a 7B model needs ~112GB static, so on 80GB cards you must shard (ZeRO-1/2 or FSDP) even before activations. Activation memory per layer scales as s x b x h x (34 + 5 x a x s / h) bytes in BF16 without checkpointing (Korthikanti et al. 2022); full activation checkpointing reduces this to ~2 x s x b x h per layer at the price of a ~33% recompute overhead (recomputation re-runs the forward, and forward is ~1/3 of total FLOPs). Prefer selective checkpointing (checkpoint only attention internals) when it fits — it costs only a few percent. Gradient accumulation: global_batch_tokens = microbatch x seq_len x grad_accum x data_parallel_size. Target global batch sizes from the published record: 0.5M tokens for ~1B models, 1–2M for 7B, 4M for 70B (GPT-3 used 3.2M; Llama 2 used 4M). Below the critical batch size (McCandlish et al. 2018), accumulation is nearly free statistically, so use it to hit the target: e.g., 8 GPUs, microbatch 4, seq 4096 → 131k tokens per step without accumulation; accumulate 8 steps for ~1M. Note accumulation does not save time — it trades wall-clock for batch size — so on a tiny budget prefer the smallest global batch that trains stably (even 0.25M tokens) and spend the savings on more steps; small-batch training at higher LR-per-token is often more sample-efficient below the critical batch size. Practical tuning loop: (1) start microbatch = 1, measure memory headroom with torch.cuda.max_memory_allocated; (2) double until OOM, keep 5–10% headroom for fragmentation and NCCL buffers (set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to fight fragmentation); (3) measure tokens/sec at each setting — throughput usually saturates before memory does because larger microbatches improve kernel efficiency only up to arithmetic-intensity saturation. Failure modes: OOM only at step ~100 (fragmentation or cudnn autotune — pre-allocate or fix benchmark mode); accumulated-gradient runs diverging where unaccumulated ones did not (check that loss is averaged, not summed, across accumulation steps, and that grad clipping is applied after full accumulation); throughput collapse at high accumulation (optimizer step and clipping overheads — fuse with apex/fused AdamW).
Q008	60	How do I use mixed-precision training safely, and when should I choose BF16 instead of FP16?	5.5 Pro judge: 87 Opus 4.8 Judge: 84 Gemini 3.1 Pro judge: 63 Treat this as a controlled engineering decision, not a preference. Prefer BF16 on hardware that supports it because it has FP32-like exponent range and avoids most dynamic loss-scaling failures. Use FP16 only with strict loss scaling, FP32 reductions, and overflow telemetry. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Store forward/backward parameters in BF16 or FP16, keep optimizer states and master weights in FP32 unless a low-precision optimizer is qualified, accumulate sensitive reductions in FP32, and compute softmax statistics, loss reduction, gradient norm, normalization statistics, and router logits with stable kernels. For FP16, start dynamic loss scale around 2^12-2^15, grow after 1,000-2,000 clean steps, halve on overflow, and skip overflowed updates. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Run a 200-1,000 step FP32 or BF16 reference on a small model and compare loss, gradient norms, generated samples, and checkpoint resume behavior. Log loss scale, overflow rate, skipped steps, NaN/Inf counts, max attention logits, and all-masked softmax rows. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: FP16 attention overflow, mask constants becoming -inf in unsafe places, loss-scale collapse hiding LR or data corruption, gradient clipping before unscaling, all-reduce overflow, and nondeterministic skipped steps desynchronizing ranks. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Micikevicius et al. 2018 mixed precision; Fedus et al. 2021/2022 Switch Transformer BF16; Dao et al. 2022 FlashAttention; Narayanan et al. 2021..	5.5 Pro judge: 96 Opus 4.8 Judge: 92 Gemini 3.1 Pro judge: 87 Default answer: use BF16 with FP32 master weights and FP32 optimizer state on any hardware that supports it (A100, H100, TPU v3+). BF16 has FP32's exponent range (8 bits), so there is no overflow problem and no loss-scaling machinery; the cost is 8 mantissa bits versus FP16's 11. Every recent large run (PaLM, Llama 2/3, OPT after its FP16 troubles, DeepSeek pre-V3) trained in BF16. Choose FP16 only on V100-class hardware where BF16 is unsupported, and accept the loss-scaling complexity. The safe mixed-precision recipe (Micikevicius et al. 2018, adapted to BF16): weights, activations, and GEMMs in BF16; an FP32 master copy of weights updated by the optimizer; Adam m and v in FP32; gradient accumulation buffer in FP32 (critical — accumulating many microbatch gradients in BF16 loses low-order bits and visibly degrades loss at high accumulation counts); all reductions (gradient all-reduce) in FP32 or at least with FP32 accumulation; softmax, LayerNorm/RMSNorm statistics, and the final cross-entropy/logits computed in FP32. Attention softmax in particular: FlashAttention does the accumulation in FP32 internally — keep it that way. The loss and any log-sum-exp must be FP32; BF16's coarse mantissa (~3 decimal digits) makes BF16 loss values too quantized to compare runs. If you must use FP16: enable dynamic loss scaling — initial scale 2^16, double every 2,000 overflow-free steps, halve on overflow and skip that step. Watch the scale value as a health metric: a healthy run sits at 2^14–2^17; a scale repeatedly collapsing below ~128 means activations or gradients are overflowing structurally (often attention logits) and the run will likely diverge — the OPT-175B logbook (Zhang et al. 2022) documents weeks of this failure mode. Also keep a small epsilon in norms (1e-5) and consider FP32 residual stream if spikes persist. BF16 caveats: (1) BF16 is not free of precision pathologies — long-run BF16 rounding in the residual stream measurably hurts versus FP32 baselines at very long token horizons; the master-weight copy plus FP32 norms covers most of it. (2) Keep RoPE rotation computation in FP32; BF16 position angles at large positions visibly corrupt long-context behavior. (3) Stochastic rounding (TPU/some kernels) helps if you experiment with pure-BF16 optimizer state, but the standard advice is: do not economize on optimizer-state precision unless memory-bound, and if you do, use ZeRO sharding first. FP8 (H100 Transformer Engine) is the current efficiency frontier: E4M3 for forward, E5M2 for gradients, with per-tensor (or better, per-block 128x128) scaling. DeepSeek-V3 (2024) trained a 671B MoE largely in FP8 with fine-grained tile scaling and FP32 accumulation, at <0.25% loss penalty versus BF16. Treat FP8 as an optimization to adopt after a BF16 baseline is stable, never for a first run. Diagnostics: log per-tensor max-abs for activations and grads (overflow precursors), the fraction of skipped steps (FP16), and run a 1B-token BF16-vs-FP32 A/B at small scale once per major codebase change — loss curves should match within ~0.005 nats; larger gaps indicate a precision bug (commonly: norm stats, softmax, or gradient accumulation dtype).
Q002	61	How should I shuffle and stream a large text dataset so that each training batch is sufficiently diverse?	5.5 Pro judge: 85 Opus 4.8 Judge: 83 Gemini 3.1 Pro judge: 66 Treat this as a controlled engineering decision, not a preference. Make each optimizer step an unbiased, high-entropy sample from the intended mixture while preserving deterministic replay. Pre-tokenize immutable shards with metadata for domain, language, quality, source, token count, and dedup cluster. Never let a rank or worker own a special long-lived slice of the corpus. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Use shard sizes around 128MB-2GB. Shuffle shard order, then use an in-memory shuffle buffer of at least 10,000 documents or 100-1,000 global batches. For mixtures, use a weighted multiplexer and temperature sampling p_i proportional to n_i^alpha, alpha 0.3-0.7 for multilingual and 0.5-1.0 for domain mixes. Cap dominant domains unless specialization is intentional. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Log realized token weights over 1,000-10,000 step windows, batch domain entropy, language entropy, duplicate rate, source ids, and document-length distribution. Run sampler replay tests for the exact world size. Require the loader to sustain at least 1.2x peak token consumption under checkpoint and validation load. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Adjacent chunks from the same document in consecutive batches, length-sorted batches creating topic streaks, every rank reading the same shard, central sampler bottlenecks, and low shuffle buffers causing loss autocorrelation. Diagnose with sample-id intersections across ranks and shard-boundary loss plots. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Raffel et al. 2020 T5/C4; Penedo et al. 2023 RefinedWeb; Soldaini et al. 2024 Dolma; Narayanan et al. 2021 Megatron-LM..	5.5 Pro judge: 94 Opus 4.8 Judge: 90 Gemini 3.1 Pro judge: 87 The goal is that each global batch is an i.i.d.-like sample from the full mixture, despite the corpus living in thousands of sequential shard files. The standard solution is two-level shuffling: (1) an offline global shuffle at tokenization time, and (2) an online windowed shuffle at load time. Offline: after cleaning and tokenization, write the corpus into shards of 100MB–1GB (e.g., Megatron's indexed .bin/.idx, WebDataset tars, or MosaicML StreamingDataset MDS). Before writing, globally shuffle document order — a distributed sort by random key (Spark/Beam) is cheap relative to tokenization. This breaks source locality: without it, a Common Crawl dump's documents arrive grouped by host, and a batch may be 100% one website. Online: each data-parallel rank reads an interleaved subset of shards. Use shard-level shuffling (shuffle the shard order per epoch with a seeded RNG) plus a per-worker shuffle buffer. Buffer sizing rule of thumb: the buffer should hold at least 10x more documents than the number of documents contributed per batch per worker, and ideally span several shards — 10,000–100,000 documents (or 1–4GB of tokens) is typical. tf.data and torchdata both implement this as shuffle(buffer_size) after interleave(cycle_length=8–16 shards). Too-small buffers produce visible loss periodicity with period equal to shard length — that is your primary diagnostic: autocorrelate the training-loss time series; spikes at shard-sized lags mean insufficient shuffling. Sequence packing interacts with shuffling: when packing documents into fixed-length 2k–8k token sequences, pack after the shuffle, otherwise each sequence contains adjacent (correlated) documents. Use best-fit packing or simple greedy concatenation with EOS separators; mask cross-document attention if your stack supports it (as in Llama 3, Dubey et al. 2024) to avoid attending across unrelated documents. Determinism and resumability matter as much as randomness. Megatron-style solutions precompute a global sample index: a memory-mapped permutation of all samples for the whole run, seeded once. This gives (a) exact reproducibility, (b) O(1) resume to any step without replaying the stream, and (c) guaranteed single-epoch coverage without replacement. MosaicML Streaming and HuggingFace datasets' to_iterable_dataset(num_shards=...) provide similar semantics. Prefer sampling without replacement for less-than-or-around-1-epoch training; for multi-epoch over small high-quality subsets, reshuffle each epoch with epoch-dependent seeds. For multi-source mixtures, do not pre-concatenate sources into one stream. Keep one stream per source and sample sources per-sequence according to mixture weights (a categorical draw per sample, seeded). This lets you adjust weights mid-run without re-sharding and guarantees every batch matches the mixture in expectation. Verify empirically: log the realized source composition of every Nth batch and chi-square it against target weights; drift indicates exhausted sources silently repeating. Other diagnostics: (1) monitor unique-document rate per 1B tokens — a drop means a worker is re-reading shards; (2) per-source validation loss spiking while others fall indicates a starved stream; (3) checksum the first batch after resume against a pre-failure log to validate deterministic resumption. Keep loader workers per rank at 2–8 with prefetch_factor=2 so shuffling never stalls the GPUs.
Q004	61	How do I pick an initial learning rate, warmup period, and decay schedule for stable training?	5.5 Pro judge: 84 Opus 4.8 Judge: 83 Gemini 3.1 Pro judge: 64 Treat this as a controlled engineering decision, not a preference. Schedule learning rate by consumed tokens, not epochs, because web-scale corpora are not meaningful IID epochs. Use warmup to stabilize Adam moments, precision, and residual activations; use decay or annealing to trade exploration for consolidation near the end of compute. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Default to linear warmup for 0.5%-2% of planned tokens, with a floor of 1,000-5,000 optimizer steps, then cosine decay to 10% of peak or a small floor. Peak LR priors: about 6e-4 for 100M, 3e-4 for 300M-1B, 1e-4 to 2e-4 for 3B-10B, and 3e-5 to 1e-4 beyond that under AdamW beta=(0.9,0.95). Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Run short LR sweeps for 2,000-10,000 steps and choose the largest LR without sustained spikes. Compare by validation loss at equal tokens. For final anneal, lower LR by 10x-30x over 1%-5% of total tokens on cleaner data while retaining a broad general mixture. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Warmup too short causing early overflows, LR blamed for data-correlated spikes, Adam moments reset accidentally at anneal, and global batch changes without retuning LR. Diagnose with gradient norm, loss scale, activation RMS, clipping fraction, and update norm by layer. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Vaswani et al. 2017 Transformer schedule; Loshchilov and Hutter 2019 AdamW; Brown et al. 2020 GPT-3; Kaplan et al. 2020; Hoffmann et al. 2022; Narayanan et al. 2021..	5.5 Pro judge: 95 Opus 4.8 Judge: 91 Gemini 3.1 Pro judge: 92 Treat LR schedule as three decisions: peak value, warmup length, and decay shape — in that order of importance. Peak LR: start from published anchors scaled by model size (6e-4 at 125M down to ~1e-4 at 70B; see GPT-3 Table 2.1, Llama 2). If you have budget for a scan, sweep LR in half-decade steps {1e-4, 3e-4, 6e-4, 1e-3} on a short run of ~20x the warmup length and pick the largest LR that shows no spikes and the best loss at fixed tokens — the loss-vs-LR curve is U-shaped with a sharp right wall, so prefer 2x below the explosion point, not at it. With muP (Yang et al. 2021), tune once at small width (e.g., 39M proxy) and transfer; muP optima are typically stable within 2x across 100x parameter scale. Warmup: linear from ~0 (or peak/1000) to peak over 0.1–1% of total training tokens. Concrete anchors: GPT-3 used 375M tokens of warmup over a 300B-token run (0.125%); Llama-class runs use 2000 steps which at 4M-token batches is ~8B tokens (~0.5%). Longer warmup (up to 3%) is cheap insurance for new architectures or large batches; too-short warmup manifests as loss spikes in the first 1–2k steps, because Adam's second-moment estimate is still biased while curvature is changing fast. Wortsman et al. 2023 ("Small-scale proxies for large-scale Transformer training instabilities") show warmup combined with beta2 = 0.95 and z-loss is the main spike-prevention toolkit. Decay: cosine to a floor of 10% of peak, with the cosine period set equal to the total token budget, remains the default (Chinchilla, Hoffmann et al. 2022, showed the schedule must match the actual run length — a cosine scheduled for 2x the tokens you actually train under-decays and leaves ~loss equivalent of several percent compute on the table). The strong modern alternative is WSD / trapezoidal (warmup-stable-decay; Hu et al. 2024 MiniCPM; Hägele et al. 2024): hold LR constant after warmup, then decay linearly or 1-sqrt to near zero over the final 10–20% of tokens. WSD matches cosine's final loss while letting you (a) extend runs without committing to a horizon and (b) fork annealing branches from the stable phase for data-mixture experiments. For continued pretraining or unknown budgets, prefer WSD; for a fixed-budget single run, cosine-to-10% is well-validated. Decay-to-zero versus 10% floor: recent evidence (Hägele et al. 2024; Llama 3 decayed to ~0) slightly favors decaying close to zero over the last phase; final-phase LR controls the sharpness of the final loss drop. Diagnostics and failure modes: (1) loss spikes during warmup → lengthen warmup 2–4x or lower peak 2x; (2) spikes mid-run that self-recover → usually data or precision, but persistent ones argue for 30% lower peak; (3) loss plateau early with very low grad norm → LR too low, you are wasting compute; (4) validation loss diverging from training loss only during the decay phase → not an LR problem, check data. Always log LR alongside loss and grad-norm so spike timing can be aligned with schedule phase, and checkpoint immediately before the decay phase begins so you can re-anneal with different data mixtures later.
Q003	62	How do I choose an optimizer and its hyperparameters for a first transformer language-model training run?	5.5 Pro judge: 86 Opus 4.8 Judge: 84 Gemini 3.1 Pro judge: 65 Treat this as a controlled engineering decision, not a preference. Use AdamW as the first optimizer for a decoder-only transformer unless a proxy study proves another optimizer wins under equal tokens and active FLOPs. AdamW is robust, debuggable, supported by fused/distributed implementations, and compatible with BF16, FSDP, ZeRO, tensor parallelism, and standard scaling-law experiments. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Start AdamW with beta1=0.9, beta2=0.95, epsilon=1e-8, decoupled weight_decay=0.1, and true global gradient clip=1.0. Exclude biases, norm scales, and often embeddings from weight decay. Sweep peak LR {1e-4,2e-4,3e-4,6e-4} for sub-billion models and {3e-5,6e-5,1e-4,2e-4} for multi-billion models. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Compare validation loss after the same number of tokens, plus gradient norm, clipping fraction, update-to-weight ratio, moment RMS, and stability under checkpoint resume. A healthy mature run clips less than about 1%-5% of steps after warmup. Keep optimizer states in FP32 unless a low-precision optimizer path has been separately qualified. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Exotic optimizers hiding data bugs, beta2 too high after mixture changes, LR chosen by wall-clock rather than tokens, local instead of global clipping, and weight decay applied to norm scales. Diagnose with layerwise update/weight ratios around 1e-4 to 1e-2 and loss-spike attribution. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Kingma and Ba 2015 Adam; Loshchilov and Hutter 2019 AdamW; Shazeer and Stern 2018 Adafactor; Vaswani et al. 2017 Transformer; Hoffmann et al. 2022 Chinchilla..	5.5 Pro judge: 94 Opus 4.8 Judge: 91 Gemini 3.1 Pro judge: 87 Use AdamW. It remains the default for transformer LM pretraining and virtually every reported frontier run (GPT-3, Llama 1–3, PaLM used Adafactor as the notable exception). Recommended starting configuration for a decoder-only model from roughly 100M to 10B parameters: beta1 = 0.9, beta2 = 0.95, eps = 1e-8, weight decay = 0.1 (decoupled, per Loshchilov & Hutter 2019), gradient clipping at global L2 norm 1.0. The beta2 = 0.95 choice (rather than 0.999) comes from GPT-3 (Brown et al. 2020) and is now standard: large-batch LM gradients are noisy and heavy-tailed, and a shorter second-moment horizon reacts faster to loss spikes, materially reducing instability. eps matters more than people expect: with BF16 master-weight setups, 1e-8 is safe; some groups raise it to 1e-15 only with FP32 state — do not lower eps casually, since v-hat values near eps amplify updates. Peak learning rate scales inversely with model width. Empirical anchors from the published record: 6e-4 for ~125M, 3e-4 for ~1.3B, 2e-4 for ~7B, 1.5e-4 for ~13B, 0.6–1e-4 for ~70B (GPT-3 table 2.1; Llama 2, Touvron et al. 2023). A reasonable fit is LR roughly proportional to 1/d_model or to N^-0.32 in parameters. If you can afford it, use muP (Yang et al. 2021, "Tensor Programs V") to transfer the LR from a small proxy: tune on a 40M-width-scan, then the muP-parameterized optimum transfers across width, removing the riskiest hyperparameter at scale. Apply weight decay only to matrix weights — exclude biases, LayerNorm/RMSNorm gains, and (commonly) the embedding — implemented as two parameter groups. Note the interaction: with cosine decay, the effective regularization is LR x wd, so wd = 0.1 at LR 3e-4 is mild; do not also add dropout for sub-1-epoch pretraining. Alternatives worth knowing: Adafactor (Shazeer & Stern 2018) saves optimizer memory but underperforms tuned AdamW at small scale; Lion (Chen et al. 2023) needs ~3–10x smaller LR and ~3x larger wd, with mixed reports at scale; Sophia and Shampoo/SOAP show 1.3–2x claimed speedups but add complexity; Muon (Jordan et al. 2024, used in Kimi K2 at 1T-parameter scale with the MuonClip variant) is the strongest recent challenger — but for a first run, AdamW's failure modes are the best documented, which is worth more than 20% efficiency. Practical details: keep optimizer state in FP32 even when training in BF16 (m, v, and a master copy of weights — 12 bytes/param with mixed precision, the dominant memory cost; shard it with ZeRO-1 if needed). Initialize step count correctly on resume or the bias-correction terms will spike the update size. Diagnostics: log the global gradient norm and update-to-weight-ratio per layer; healthy runs show grad norm settling to a stable band (e.g., 0.2–0.6) after warmup. Frequent clipping (>1–2% of steps) means LR too high. A slowly rising grad norm late in training with falling loss is normal under cosine decay; sudden 10x spikes correlated with specific data shards point at data, not the optimizer.
Q009	62	How should I checkpoint training so that I can recover reliably after hardware or software failures?	5.5 Pro judge: 87 Opus 4.8 Judge: 83 Gemini 3.1 Pro judge: 64 Treat this as a controlled engineering decision, not a preference. Checkpoint the exact mathematical training state, not merely model weights. A recovery should resume with the same consumed tokens, optimizer moments, RNG streams, sampler state, topology metadata, and scheduler state. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Save parameters, optimizer states, LR scheduler, FP16 scaler, Python/NumPy/framework/CUDA RNG, data-loader cursor, gradient-accumulation microstep, tokenizer/config hashes, code commit, container id, and manifest. For large runs, save sharded checkpoints with two-phase commit: temporary shard files, per-shard checksums, then atomic manifest publish. Cadence: every 30-120 minutes for expensive runs plus milestones. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Continuously restore latest checkpoints in side jobs, train 10-100 steps, and compare LR, loss, consumed-token count, and sample ids to uninterrupted training when possible. Measure save and restore bandwidth. Keep last 5-10 checkpoints plus milestone retention. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Missing RNG state, duplicated/skipped data after resume, partial optimizer shard mismatch, MoE expert placement mismatch, checkpoint files truncated silently, and validation drift after restore. Diagnose with manifests, checksums, sampler replay, and topology-aware converters. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Rajbhandari et al. 2020 ZeRO; Narayanan et al. 2021 Megatron-LM; Huang et al. 2019 GPipe; Narayanan et al. 2019 PipeDream; PyTorch FSDP distributed checkpointing practice..	5.5 Pro judge: 96 Opus 4.8 Judge: 91 Gemini 3.1 Pro judge: 91 Design checkpointing around a target: lose at most 0.5–1% of compute to (save overhead + expected replay after failure). With per-GPU failure rates implying a cluster MTBF of hours at scale (Llama 3 averaged one interruption every ~3 hours across 16k H100s; Dubey et al. 2024), the optimal interval follows the Young/Daly approximation: interval = sqrt(2 x MTBF x save_cost). If a save costs 30s of stall and MTBF is 4 hours, checkpoint roughly every 25–30 minutes. On an 8–64 GPU setup with MTBF in days, every 1–2 hours is fine. What to save — all of it, exactly: (1) model weights (the BF16 working copy or FP32 master — save the master); (2) full optimizer state (Adam m, v; this triples checkpoint size — a 7B model is ~14GB weights but ~84GB full state); (3) LR-scheduler step and RNG states for every rank (CPU, CUDA, and dataloader RNGs); (4) dataloader position — with a Megatron-style precomputed sample index this is just the consumed-sample count; otherwise persist the full sampler state; (5) the gradient-scaler state (FP16); (6) config, code commit hash, and data-manifest hash alongside, so a checkpoint is self-describing. A resume is only correct if the loss curve continues within noise of the pre-failure trajectory — test this deliberately: kill a run mid-step, resume, and overlay curves; any visible kink means hidden state was lost (the usual culprits: dataloader order, RNG for dropout, or scaler state). How to save without stalling: (1) Sharded/distributed checkpoints — each DP rank writes its own shard (torch.distributed.checkpoint, DeepSpeed, Megatron dist-ckpt) rather than gathering to rank 0; gathering a 70B model to one node takes minutes and can OOM. (2) Asynchronous two-phase saves: snapshot tensors to host RAM (seconds), then a background thread streams to durable storage; PyTorch async DCP and projects like Gemini/CheckFreq formalize this; Llama 3 used in-memory snapshots to cut effective save stalls to seconds. (3) Hierarchical durability: frequent cheap checkpoints to local NVMe or RAM (survive process crashes), every Nth to parallel filesystem or object store (survive node loss), and a daily copy to a separate region/bucket (survive systemic mistakes). Keep the last 3–5 rolling checkpoints plus permanent "milestone" checkpoints every ~50–100B tokens for later science (annealing experiments, evals, rollback below). Rollback is a first-class use case, not just crash recovery: loss spikes that do not self-heal are standard practice to handle by restoring 1–2 checkpoints back and skipping or reshuffling the offending data window (PaLM, Chowdhery et al. 2022, documented exactly this procedure). This requires checkpoints frequent enough that rolling back costs <1 day and a dataloader that can skip forward deterministically. Validation and hygiene: checksum shards on write and verify on read (silent corruption on parallel filesystems is real); store optimizer-state dtype and parallelism layout in metadata; use a checkpoint format that supports resharding (load a TP=8,PP=2 checkpoint into TP=4,DP=2x) — torch DCP and Megatron's distributed format do, naive torch.save does not. Automate resume: a watchdog that detects job death, requeues, and resumes from the latest verified checkpoint converts a 3 a.m. failure from eight lost hours into twenty lost minutes.
Q001	63	How should I split training, validation, and test data to prevent leakage while preserving representative coverage?	5.5 Pro judge: 86 Opus 4.8 Judge: 84 Gemini 3.1 Pro judge: 53 Treat this as a controlled engineering decision, not a preference. Split by provenance unit before tokenization or packing: document, URL canonicalization group, repository, book, paper, conversation thread, benchmark problem family, or user/account. Cluster near-duplicates with MinHash or SimHash over 5- to 13-gram shingles and assign the entire cluster to exactly one split. Quarantine public benchmarks, paraphrases, answer strings, and synthetic derivatives from all train and post-training data. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Use deterministic hash assignment on stable source ids. Reserve a frozen validation set of 5M-50M tokens for pilots or about 0.01%-0.1% of pretraining tokens for frontier runs, stratified by domain, language, time, quality, and source. Keep a separate non-public test set that mixture designers cannot tune on. Use temporal splits when freshness matters. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Compare train/validation/test distributions with Jensen-Shannon divergence or population stability index; investigate JS above about 0.05 or PSI above 0.1. Report validation loss by stratum and contamination-risk score. Store split manifests with raw snapshot id, filter commit, dedup thresholds, random seed, and token counts. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Chunk-level random splitting, post-split dedup only, benchmark leakage through synthetic explanations, repository forks split across train and test, and changing validation sets mid-run. Diagnose with exact hashes, n-gram overlap, embedding retrieval, URL canonicalization, and cluster-id audits. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Brown et al. 2020 GPT-3 Appendix C; Dodge et al. 2021 C4 documentation; Penedo et al. 2023 RefinedWeb; Soldaini et al. 2024 Dolma; Carlini et al. 2021/2022 memorization..	5.5 Pro judge: 95 Opus 4.8 Judge: 90 Gemini 3.1 Pro judge: 86 Split at the document level, never the line or chunk level. The most common leakage mode is near-duplicate text straddling the train/test boundary: a Wikipedia article in train and a mirrored copy in test inflates apparent generalization. So the splitting pipeline is: (1) exact-deduplicate the corpus (SHA-1 over normalized text); (2) near-deduplicate with MinHash-LSH at Jaccard similarity around 0.8 over 13-gram shingles, following Lee et al. 2021 ("Deduplicating Training Data Makes Language Models Better"); (3) only then carve out validation and test sets. After splitting, run an additional decontamination pass: remove any training document sharing an n-gram overlap (typically 13-gram, as in GPT-3 and PaLM contamination analyses) with your held-out sets and with downstream benchmarks (MMLU, GSM8K, HumanEval, etc.). Stratify the held-out sets by source so coverage is representative: sample validation documents proportionally from each domain (web, code, books, multilingual shards), not uniformly from the concatenated stream. A practical recipe is one validation shard per data source, each 50–200M tokens, plus a combined mixture-weighted validation set that matches the training mixture exactly — the latter tracks training loss comparably, the per-domain shards tell you which domain is regressing. Keep validation small relative to compute: evaluating 100M tokens on a 1B-parameter model costs roughly 2 x N x D = 2e17 FLOPs, negligible, but at 100B+ parameters cut to 10–20M tokens per shard and evaluate every few thousand steps. For temporally structured data (news, GitHub, forums), prefer a temporal split — test on documents dated after the training cutoff — because random splits leak future information through duplicated or quoted content and overstate quality. For code, split by repository, never by file: files within a repo share boilerplate, licenses, and copied utilities. For multilingual corpora, split per language before mixing, otherwise low-resource languages may end up with zero test coverage. Size rules of thumb: validation loss estimates have standard error roughly sigma/sqrt(n_tokens-effective); 5–10M tokens per domain gives loss estimates stable to about ±0.003 nats, sufficient to detect mixture regressions. Test sets you intend to report on should be at least as large and must be touched rarely — keep a quarantined "final test" used only at the end, and a "dev test" for iteration, to avoid adaptive overfitting via repeated peeking. Failure modes and diagnostics: (a) validation loss suspiciously below training loss → duplicates across the split; grep top-k nearest neighbors of low-loss validation documents in the training set using an embedding index or BM25. (b) Validation loss drops in a step-function when a new data shard enters training → that shard contaminates validation. (c) Benchmark scores jumping more than ~2 points between adjacent checkpoints often indicates contamination rather than capability; verify with perplexity on the benchmark's exact strings (memorized items show anomalously low per-token loss). Log per-document loss histograms on validation; a bimodal histogram with a near-zero-loss mode is the canonical signature of leakage. Finally, version-control the split manifests (document IDs, hashes) so the split is reproducible and auditable across runs.
Q006	64	What training metrics should I log to detect overfitting, instability, and data-pipeline problems early?	5.5 Pro judge: 87 Opus 4.8 Judge: 83 Gemini 3.1 Pro judge: 67 Treat this as a controlled engineering decision, not a preference. Log optimizer health, model behavior, and data-system integrity as invariants. A frontier run should be diagnosable from telemetry without guessing whether a regression came from data, schedule, precision, hardware, or evaluation drift. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: At each step or short interval log training loss, token-normalized loss, LR, gradient norm pre/post clipping, clipping bit, update norm, weight norm, tokens/sec, MFU, memory allocated/reserved, overflow/skip count, Adam moment RMS, and per-layer activation/residual RMS every 100-1,000 steps. Add max logit and output entropy. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Maintain validation loss by domain, language, source, quality bucket, sequence length, and benchmark family. Log generated samples from fixed prompts with fixed decoding. Track exact/near duplicate rate and effective document repeats. For data, log realized mixture, padding, byte-fallback fraction, language-id distribution, shard ids, and sample-id intersections across ranks. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Aggregate validation hiding low-resource collapse, loss spikes from corrupted shards, FP16 overflows misdiagnosed as LR issues, every rank reading identical data, validation manifest drift, and p99 hardware stalls. Diagnose by correlating anomalies with source ids, ranks, schedule phase, and sequence length. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Kaplan et al. 2020; Hoffmann et al. 2022; Narayanan et al. 2021; Liang et al. 2022 HELM; Brown et al. 2020 contamination analysis..	5.5 Pro judge: 96 Opus 4.8 Judge: 90 Gemini 3.1 Pro judge: 90 Log four families of metrics: optimization health, data-pipeline health, evaluation, and systems. The cost is negligible; the value is catching a doomed run in hours instead of days. Optimization: (1) training loss per step, plus a smoothed EMA; (2) global gradient L2 norm — the single best instability early-warning; healthy runs settle into a band (often 0.1–1.0) after warmup, and spikes precede loss spikes by 0–10 steps; (3) per-layer grad norms and update-to-weight ratios (target ~1e-3; layers drifting to 1e-2 are about to blow up, layers at 1e-5 are dead); (4) clipping frequency — if more than ~2% of steps clip, LR is too high; (5) max attention logit per layer (Dehghani et al. 2023, ViT-22B: logits growing past ~50 predict attention-entropy collapse; mitigate with QK-LayerNorm); (6) output-logit max and z-loss value if used (PaLM, Chowdhery et al. 2022, added z = 1e-4 x log^2 Z specifically because softmax-normalizer drift caused spikes); (7) parameter L2 norm per layer — monotone unbounded growth means weight decay is misconfigured; (8) loss-scale value if FP16 (repeated halving to <128 signals overflow trouble). Overfitting detection: validation loss on a mixture-matched held-out set every 500–2000 steps, plus per-domain validation losses. In single-epoch pretraining classic overfitting is rare; what you actually catch is (a) train-val gap widening on a domain you are multi-epoching (e.g., 4+ epochs over a small high-quality set; beyond ~4 epochs repeated data yields sharply diminishing returns, Muennighoff et al. 2023), and (b) memorization: a falling histogram tail of per-token training loss (many near-zero-loss tokens) with flat validation loss. Log the full per-token loss histogram, not just the mean. Data-pipeline problems: (1) realized batch composition by source versus target mixture weights (chi-square alarm); (2) unique-document fraction per window (re-reads indicate a broken sampler); (3) token-frequency drift — a sudden change in the rate of <pad>, <eos>, or non-ASCII tokens usually means a bad shard entered; (4) dataloader wait time per step — if the GPU waits on data more than 1–2% of step time, fix prefetching; (5) sequence-length and packing-efficiency distributions. The canonical pipeline failure signature is a periodic sawtooth in training loss with period equal to a shard or epoch boundary, or a step-change in loss when a new data stage begins — always annotate data-stage transitions on the loss curve. Evaluation beyond loss: cheap online probes every few thousand steps — perplexity on fixed canary sets (code, math, multilingual), a 0-shot accuracy suite small enough to run in minutes (LAMBADA, HellaSwag subsample, PIQA), and n-gram-repetition rate of greedy samples from a fixed prompt set (rising repetition is an early degeneration signal). Systems: tokens/sec and MFU per step, per-rank step time (straggler detection — the max/median ratio should stay under 1.05), NCCL/communication wait time, GPU memory high-water mark, ECC/XID error counts, and host dataloader CPU utilization. Practice: log everything at step granularity to W&B/TensorBoard with the data manifest hash, code commit, and config; alert (not just plot) on grad-norm > 3x trailing median, val-loss regression > 0.01 nats, mixture drift, and rank-time skew. Most large-run postmortems (e.g., the OPT-175B logbook, Zhang et al. 2022) trace failures to signals that were logged but not alarmed.
Q010	65	How do I build a small pilot run that validates my data, model, and training loop before scaling up?	5.5 Pro judge: 86 Opus 4.8 Judge: 84 Gemini 3.1 Pro judge: 67 Treat this as a controlled engineering decision, not a preference. A pilot run should falsify the full training system before scale. Downscale parameters and cluster size, but keep tokenizer, packing, masks, loss, optimizer, logging, checkpointing, data-mixture code, and evaluation semantics identical to the intended large run. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Start with 10M-100M parameters on 10M-200M tokens, then 300M-1B on 1B-10B tokens if the final run is expensive. Sweep only high-leverage factors: LR {1e-4,3e-4,6e-4}, batch {0.25M,1M tokens}, weight_decay {0.03,0.1}, sequence length, and mixture. Require throughput rehearsal for 6-24 hours. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Run four preflight tests: overfit a small batch to near-zero loss, random-label/shuffled-token training near log(vocab), tiny clean corpus against a reference, and at least three checkpoint restores including mid-accumulation if supported. Inspect fixed-prompt samples, evals, and data-loader stress at the exact planned world-size semantics. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Successful loss curve despite data leakage, untested restore, pilots using a different tokenizer or packing path, logging backpressure, benchmark contamination, and over-tuning many factors without factorial design. Exit only when no NaNs, stable loss, frozen validation manifests, and throughput within 10%-20% of model are demonstrated. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Kaplan et al. 2020; Hoffmann et al. 2022; Narayanan et al. 2021; Raffel et al. 2020; Penedo et al. 2023..	5.5 Pro judge: 96 Opus 4.8 Judge: 90 Gemini 3.1 Pro judge: 87 A pilot run's purpose is to falsify cheaply: prove the data, model code, and training loop are correct before committing real compute. Structure it as a ladder of increasingly expensive tests, each with explicit pass criteria. Stage 0 — correctness without training. (a) Overfit a single batch: a healthy model/optimizer drives loss to near zero on 32 repeated sequences within a few hundred steps; failure means a bug in loss masking, label shifting (the classic off-by-one: logits[t] must predict token[t+1]), or optimizer wiring. (b) Check step-0 loss equals ln(vocab_size) ± 0.1 (10.82 for 50k vocab) — deviations expose init or data bugs. (c) Gradient-flow check: after one backward pass, every parameter that should have a gradient does (catch frozen embeddings, detached graphs). (d) Determinism check: two runs with the same seed produce bitwise-identical losses for 50 steps. (e) Parallelism equivalence: if using TP/PP/FSDP, verify loss matches single-GPU execution within 1e-3 over 100 steps on a fixed batch. Stage 1 — a small real model on real data. Train a 50–150M-parameter model (e.g., 12 layers, d_model 768) on the actual production data pipeline — not a toy dataset — for 5–20B tokens (a few hundred GPU-hours). Pass criteria: (a) final loss within a few percent of published comparables (e.g., a 124M GPT-2-class model should reach ~3.0–3.4 loss on a clean web mix at 10B+ tokens; calibrate against Pythia checkpoints, Biderman et al. 2023, which exist exactly for this); (b) smooth loss curve with no unexplained spikes; (c) the data-mixture composition logged per batch matches targets; (d) samples from the model at 5B tokens are grammatical English (qualitative but catches tokenizer/detokenizer bugs instantly); (e) cheap evals (LAMBADA, HellaSwag) above chance and trending up. Stage 2 — scaling sanity. Train 2–3 sizes (e.g., 70M, 160M, 410M) at fixed tokens-per-parameter (~20, Chinchilla-style) and check the losses fall on a clean power law; a kinked or flat scaling curve is the most sensitive integration test there is — it catches subtle issues (bad init scaling, LR not adjusted with width, data repetition) that single runs hide. If using muP, this is also where you validate hyperparameter transfer: LR optimum should be flat across the width scan. Stage 3 — systems pilot. Run the actual target configuration (parallelism layout, precision, checkpointing) for 12–24 hours at full scale or the largest affordable slice. Measure: MFU (flag if below ~35–40% on A100/H100 for dense models), per-rank step-time skew (<5%), dataloader wait (<1%), checkpoint save/restore round-trip (kill the job on purpose and verify the resumed loss curve overlays the original), and memory high-water mark with ≥5% headroom. Budget guidance: the whole ladder should cost 1–5% of the main run's compute. Most catastrophic large-run failures in the public record — OPT's cascading instabilities, BLOOM's early data issues — were of classes detectable at Stage 0–1. Keep every pilot artifact (configs, seeds, loss curves) as the reference baseline: when the big run misbehaves, the first question is "does the pilot reproduce it at 1/100th cost?", and that question must be answerable in an afternoon.
Q007	66	How should I initialize model weights and normalization layers to avoid exploding or vanishing activations?	5.5 Pro judge: 86 Opus 4.8 Judge: 84 Gemini 3.1 Pro judge: 64 Treat this as a controlled engineering decision, not a preference. Use initialization and normalization to keep the residual stream variance stable through depth. For modern decoder-only models, default to pre-norm RMSNorm or LayerNorm, residual-output scaling, and fan-in-aware linear initialization. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Set norm scale=1, bias absent or 0, epsilon 1e-6 or 1e-5, and no weight decay on norm scale. Initialize q/k/v and MLP input projections with fan-in or std near 1/sqrt(d_model); initialize attention output and MLP down projections with an additional 1/sqrt(2L) factor for L blocks. Embeddings often use std 0.01-0.02. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Before training, run real batches and check initial loss near log(vocab_size), e.g. about 11.5 nats for 100k vocabulary if logits are near uniform. Log residual RMS, attention output RMS, MLP output RMS, max logit, gradient norms, and activation histograms by layer over the first 1,000 steps. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Post-norm instability at depth, residual RMS growing exponentially, RoPE attention scale applied twice or omitted, router logits biased toward a few MoE experts, norm epsilon mismatch between training and inference kernels, and weight decay shrinking norm scales. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Vaswani et al. 2017; Ba et al. 2016 LayerNorm; Zhang and Sennrich 2019 RMSNorm; Wang et al. 2022 DeepNet/DeepNorm; Touvron et al. 2023 LLaMA..	5.5 Pro judge: 94 Opus 4.8 Judge: 91 Gemini 3.1 Pro judge: 87 Modern decoder-only transformers with pre-normalization are forgiving, but initialization still sets whether your first 1,000 steps are stable. The standard recipe (GPT-2 lineage, used with minor variants by GPT-3, OPT, Llama): (1) All linear and embedding weights: normal(0, 0.02) truncated at ±3 sigma. The 0.02 figure is tuned for d_model around 768–2048; the more principled width-aware version is std = sqrt(2/(5 x d_model)) (Nguyen & Salazar 2019, "small init"), or simply 1/sqrt(d_model)-scaled schemes. For d_model ≥ 4096, plain 0.02 is too large — scale it down with width (e.g., 0.0089 at d_model 8192, following std proportional to d_model^-0.5). (2) Residual-output projections (the second FF matrix and the attention output projection W_O): scale the init down by 1/sqrt(2L) where L is the number of layers (GPT-2, Radford et al. 2019). Rationale: the residual stream sums 2L branch outputs; without this scaling, activation variance grows linearly in depth and the final-layer activations explode at L ≥ 40. (3) Norm layers: gain initialized to 1, bias to 0 (RMSNorm has no bias). Final LayerNorm before the LM head: gain 1. (4) Biases everywhere: 0. (Most modern stacks drop biases entirely — PaLM, Llama — which slightly improves stability and throughput.) (5) LM head: if untied from the embedding, init to ~0 or very small (std 0.001) so initial logits are near-uniform and initial loss equals ln(vocab_size) — verify this number at step 0: for a 50k vocab expect loss ~10.8, for 128k ~11.8. A step-0 loss far from ln V is the single fastest diagnostic of broken init or a data bug. Pre-norm (norm before attention/FF, residual unnormalized) is what makes depth trainable without warmup tricks (Xiong et al. 2020); use it. Post-norm trains slightly better when it trains but is unstable past ~18 layers. Hybrids worth knowing: sandwich norm, and OLMo 2's reordered norm placements for spike reduction; QK-norm (LayerNorm on queries and keys, Dehghani et al. 2023) prevents attention-logit explosion and is now common at ≥30B scale. Principled alternative: muP (Yang et al. 2021). Parameterize init variance and per-layer LR by width so that feature-learning dynamics are width-invariant; init hidden weights at variance 1/fan_in, scale output-layer LR by 1/width, multiply attention logits by 1/d_head instead of 1/sqrt(d_head). Payoff: hyperparameters tuned on a small proxy transfer to the large model (used in Cerebras-GPT, MiniCPM, and reportedly several frontier efforts). Other depth-aware schemes — DeepNet's residual scaling for 1000 layers (Wang et al. 2022), ReZero — are options if you push depth. Diagnostics at init, before any training: do one forward pass and log per-layer activation RMS — it should be roughly constant (within 2x) across depth; exponential growth means the residual scaling is missing. One backward pass: per-layer gradient norms should also be flat in depth; vanishing-by-depth indicates post-norm or a normalization bug. Check attention entropy at init (should be near-uniform, ~ln seq_len). During the first 500 steps, watch for activation RMS in the final layers drifting upward faster than earlier layers — the precursor to the classic step-2k loss explosion; the fixes are smaller init on residual projections, QK-norm, or z-loss.
Q019	66	What regularization methods are useful in LLM pretraining, and how should I tune dropout and weight decay?	5.5 Pro judge: 87 Opus 4.8 Judge: 83 Gemini 3.1 Pro judge: 65 Treat this as a controlled engineering decision, not a preference. For large LLM pretraining, most regularization comes from data scale, deduplication, mixture diversity, weight decay, and annealing; dropout is usually zero or small. Tune regularization by scale and repetition, not habit. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Default for modern decoder pretraining: attention_dropout=0, residual_dropout=0, embedding_dropout=0, AdamW weight_decay=0.1. Sweep weight_decay {0.03,0.1,0.3}. For small data, small models, or repeated corpora, try dropout 0.05-0.1. Exclude biases, norm scales, and often embeddings from weight decay. Avoid label smoothing for causal LM unless a controlled study supports it. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Track train/validation gap by domain, effective epochs per dedup cluster, weight norms, norm scale distributions, rare-token loss, and downstream transfer. For MoE, tune router load-balancing, z-loss, capacity, and router noise separately from dropout. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Dropout slowing large-data optimization, too much weight decay shrinking embeddings, too little regularization memorizing repeated high-quality sources, label smoothing hurting calibration, and MoE load loss fighting specialization. Diagnose with domain-specific overfit curves and duplicate exposure counts. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Srivastava et al. 2014 dropout; Loshchilov and Hutter 2019 AdamW; Brown et al. 2020; Hoffmann et al. 2022; Lee et al. 2022 deduplication; Fedus et al. 2021 Switch Transformer..	5.5 Pro judge: 95 Opus 4.8 Judge: 91 Gemini 3.1 Pro judge: 93 The honest summary: in the modern single-epoch, web-scale pretraining regime, the data is the regularizer, and the only two knobs you should actively use are weight decay and (implicitly) batch/LR noise. Classic regularization — dropout, label smoothing — is mostly off. Weight decay: use decoupled AdamW weight decay (Loshchilov & Hutter 2019) at 0.1. This is the near-universal published value (GPT-3, Llama 1–3, PaLM at 1.0-with-LR-coupling being the odd one out, Chinchilla). Apply it only to matrix weights: exclude norms gains/biases, all biases, and usually embeddings (two parameter groups). Two non-obvious facts: (a) at LM scale, weight decay's benefit is not classical regularization — train loss typically improves too; it acts through optimization dynamics, effectively modulating the per-layer learning rate and keeping weight norms in a healthy band (D'Angelo et al. 2023, "Why Do We Need Weight Decay in Modern Deep Learning?"); (b) what matters is the product LR x wd along the schedule, so if you halve peak LR, consider doubling wd to keep the equilibrium weight norm similar. Tuning range: {0.01, 0.033, 0.1, 0.2}; sensitivity is modest within this band — a half-decade error costs well under 0.5% loss. Diagnostic: log per-layer parameter L2 norms; under correct decay they rise then plateau; unbounded linear growth means decay is effectively off (check the parameter-group filter), collapsing norms with stagnating loss mean decay is too strong for your LR. Dropout: set to 0.0 for pretraining at ≤1 epoch over a large corpus — the practice of GPT-3-era models has been abandoned by the modern lineage (Llama 1–3, Mistral, PaLM, Chinchilla all train with dropout 0). Each token is seen once; suppressing co-adaptation costs capacity with no overfitting to prevent, and measured effect of dropout 0.1 in this regime is a strict slowdown. The exceptions where dropout earns its keep: (a) data-constrained training with ≥4 epochs of repetition — dropout 0.1 recovers some of the multi-epoch degradation (consistent with Muennighoff et al. 2023's data-constrained scaling analysis); (b) small models on small corpora (sub-1B, sub-50B tokens); (c) fine-tuning, especially SFT on small sets (0.05–0.1 on attention and FFN, or rely on LoRA dropout). If you enable it, apply to attention probabilities and FFN activations, not embeddings, and remember it interacts with activation checkpointing (RNG state must be saved for exact recompute). Label smoothing: do not use for autoregressive pretraining (it biases the loss floor and distorts calibration/perplexity; no modern LM uses it). Z-loss 1e-4 (PaLM) is sometimes filed under regularization — adopt it for stability, as discussed, not for generalization. Implicit regularizers you control: (1) batch size — smaller batches inject gradient noise; if you are above the critical batch size you are simultaneously wasting compute and removing beneficial noise; (2) the LR floor — decaying to near zero late sharpens minima found; (3) data deduplication — the single most effective "regularizer" against memorization (Lee et al. 2021); (4) weight averaging (EMA of checkpoints, or LAWA) gives a flatter-minimum effect at evaluation time for free; keep an EMA at decay 0.999–0.9999 and evaluate both. What not to do: stacking dropout + high weight decay + label smoothing on a 1-epoch web-scale run — a still-common reflex from the supervised-vision era that strictly slows convergence with no generalization payoff.
Q016	68	How do I choose between LayerNorm, RMSNorm, and alternative normalization placements in a decoder-only transformer?	5.5 Pro judge: 86 Opus 4.8 Judge: 83 Gemini 3.1 Pro judge: 65 Treat this as a controlled engineering decision, not a preference. For decoder-only frontier models, pre-norm RMSNorm is the safest default because it is efficient, scale-stable, and widely validated. Choose LayerNorm when reproducing a known recipe, training bidirectional/encoder architectures, or requiring mean-centering for compatibility. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Use x = x + Attention(RMSNorm(x)); x = x + MLP(RMSNorm(x)); final RMSNorm before logits. Set epsilon 1e-6 or 1e-5, norm scale=1, no norm bias unless needed, and no weight decay on norm parameters. Use residual-output scaling about 1/sqrt(2L). Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Track residual-stream RMS by layer and position, norm scale histograms, attention/MLP output RMS, gradient norms, and logit scale. If using parallel residual branches or post-norm/DeepNorm, rerun LR and initialization sweeps. For MoE, normalize before routing so router logits see stable inputs. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Post-norm divergence at depth, epsilon mismatch in fused inference kernels, forgetting final norm, decaying norm scales, upper-layer scale explosion, and extra norms that improve loss but hurt latency. Diagnose with layerwise activation growth and generation repetition/calibration checks. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Ba et al. 2016 LayerNorm; Zhang and Sennrich 2019 RMSNorm; Xiong et al. 2020 LayerNorm placement; Wang et al. 2022 DeepNet/DeepNorm; Touvron et al. 2023 LLaMA..	5.5 Pro judge: 95 Opus 4.8 Judge: 91 Gemini 3.1 Pro judge: 92 Use pre-norm RMSNorm as the default: normalization applied to the input of each attention and FFN block, residual stream left untouched, plus one final norm before the LM head. This is the configuration of Llama 1–3, Mistral, Qwen, DeepSeek, Gemma (with a variant), and PaLM — the entire modern decoder-only lineage. LayerNorm vs RMSNorm: RMSNorm (Zhang & Sennrich 2019) drops mean-centering and the bias term, normalizing by the root-mean-square alone: x / rms(x) * g. It is 7–15% faster at the op level (one fewer reduction, fewer params), and at LM scale is loss-neutral versus LayerNorm — no published ablation shows a meaningful quality difference (Narang et al. 2021 found architecture tweaks rarely transfer, but RMSNorm is one of the few that consistently does no harm and saves time). Set epsilon to 1e-5 (1e-6 is common too; matters only for near-zero activations) and compute the statistics in FP32 even in BF16 training — reduced-precision norm statistics are a classic source of subtle divergence. Pre-norm vs post-norm: post-norm (original Transformer) places the norm after the residual addition; it yields slightly better final loss when it converges but has a vanishing-gradient pathology in deep stacks — gradients through N post-norm layers attenuate, requiring careful warmup, and beyond ~18–24 layers it routinely diverges (Xiong et al. 2020, "On Layer Normalization in the Transformer Architecture"). Pre-norm makes the residual stream an identity highway, trains stably to hundreds of layers, and is non-negotiable at LM scale. Its known cost: the residual-stream norm grows through depth, so later layers see larger-magnitude inputs and the "effective depth" is somewhat reduced (curse-of-depth literature); this is acceptable and universally accepted. Refinements with good evidence, in rough order of adoption priority: (1) QK-norm — LayerNorm/RMSNorm applied to queries and keys before the dot product (Dehghani et al. 2023, ViT-22B; adopted by OLMo 2, Gemma 3, several frontier efforts) — directly prevents attention-logit explosion, one of the two dominant spike mechanisms at ≥30B; cheap, adopt it for large or long runs. (2) Z-loss (1e-4 x log^2 of the softmax normalizer, PaLM) addresses the other mechanism, output-logit drift; it is a regularizer, not a norm, but belongs in the same stability budget. (3) Sandwich/double norm — normalizing both block input and block output (Gemma 2 uses pre+post block RMSNorm; OLMo 2 moved the norm to block outputs) — measured to reduce spike frequency in those reports; reasonable at ≥7B. (4) Embedding-layer norm or embedding scaling (Gemma multiplies embeddings by sqrt(d)): minor, follow your lineage. Things to avoid: removing the final pre-head norm (causes logit blowup); BatchNorm anywhere (batch-statistics coupling breaks autoregressive training); learnable per-channel bias in norms with weight decay applied to it (exclude norm gains/biases from weight decay always). Diagnostics tied to normalization: log per-layer residual-stream RMS (should grow smoothly, not exponentially, with depth), max attention logit per layer (>50 → add QK-norm), and norm-gain magnitudes over training (gains drifting to extreme values indicate the norm is fighting a scale pathology upstream). If you observe slow late-training divergence with post-norm-style placements, the fix is architectural, not LR — switch placement; no schedule rescues post-norm at depth.
Q011	70	What tokenizer algorithm and vocabulary size should I use for a multilingual general-purpose LLM?	5.5 Pro judge: 84 Opus 4.8 Judge: 83 Gemini 3.1 Pro judge: 69 Treat this as a controlled engineering decision, not a preference. For multilingual general-purpose LLMs, tokenizer choice is a modeling and equity decision. Use a reversible SentencePiece unigram or BPE tokenizer trained on raw Unicode with byte fallback, and balance the tokenizer-training sample so English, code, and boilerplate do not consume the vocabulary. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Default vocabulary size: 64k-128k for multilingual plus code; 128k-256k for hundreds of languages, heavy code/math, or multimodal protocol tokens. Use conservative Unicode cleanup and whitespace/control normalization; avoid aggressive lowercasing or accent stripping. Reserve BOS/EOS, document separators, role tokens, tool-call delimiters, JSON/schema delimiters, modality placeholders, and unused future tokens. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Measure round-trip exactness, fertility by language/domain, byte-fallback fraction, compression ratio, prompt-length changes, and detokenization validity. Reject if priority languages have median bytes/token or chars/token more than about 20%-30% worse than a reference without a compensating benefit. Test code indentation and mathematical notation. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Over-fragmenting low-resource scripts, merging whitespace in code, corrupting Unicode distinctions through NFKC, changing token ids mid-run, and discovering missing tool or modality tokens after post-training. Diagnose with fertility histograms and top pathological tokens by language. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Sennrich et al. 2016 BPE; Kudo 2018 subword regularization; Kudo and Richardson 2018 SentencePiece; Xue et al. 2021 mT5; Costa-jussa et al. 2022 NLLB; Touvron et al. 2023 LLaMA..	5.5 Pro judge: 95 Opus 4.8 Judge: 91 Gemini 3.1 Pro judge: 88 Use byte-level BPE trained with a modern library (SentencePiece in BPE mode with byte_fallback, or HuggingFace tokenizers), and a vocabulary of 128k–256k for a multilingual general-purpose model. This is where the field has converged: GPT-4/4o use 100k/200k (tiktoken cl100k/o200k), Llama 3 moved from 32k to 128k explicitly for multilingual efficiency (Dubey et al. 2024), Gemma 2 and Gemini use 256k, Qwen 2.5 uses ~152k, DeepSeek-V3 128k. Below ~100k, non-Latin scripts fragment badly; above ~256k, the embedding matrix and softmax dominate small-model FLOPs and rare tokens become undertrained. Algorithm choice: BPE versus Unigram-LM matters less than its configuration. BPE is the de facto standard and marginally better studied at scale; Unigram (Kudo 2018) gives slightly better morphological segmentation in some languages and supports subword regularization, but few frontier models use it. Non-negotiable configuration details: (1) byte fallback so any input is encodable with zero UNK; (2) NFC normalization only — avoid NFKC's lossy mappings of symbols users care about; (3) digit handling: split numbers into individual digits (PaLM, Llama 2) or 1–3-digit groups (Llama 3, GPT-4 style) — three-digit grouping is the better current default for arithmetic; (4) whitespace-and-punctuation pretokenization regex of the GPT-4 family, which prevents tokens spanning words while allowing leading-space merges and code-friendly tab/newline runs; (5) preserve code whitespace: explicit tokens for runs of spaces (essential for Python indentation); (6) no lowercasing, no accent stripping. Training-corpus design for the tokenizer matters as much as the algorithm: train it on a mixture close to, but flatter than, your pretraining mixture. If you train the tokenizer on the raw English-heavy mix, low-resource languages get character-level fragmentation. Standard fix is temperature/alpha-resampling of languages at alpha around 0.3 (XLM-R, Conneau et al. 2020) for the tokenizer corpus. Measure success in fertility (tokens per word) or bytes-per-token per language: aim for fertility below ~2 for major languages and a max/min compression disparity across your supported languages of <3x; disparities translate directly into effective-context and cost inequities (documented in Petrov et al. 2023, "Language Model Tokenizers Introduce Unfairness"). Sizing nuance: embedding+unembedding parameters = 2 x V x d_model (if untied; tie them only below ~1B params). At V=256k, d=4096, that is 2.1B parameters — acceptable for a 70B model, distorting for a 3B one. Rule of thumb: keep vocab parameters under ~15% of total. Also round V to a multiple of 64/128 for tensor-core and TP-sharding efficiency (e.g., 131,072 not 130,000). Reserve 200–1,000 special-token slots upfront for chat templates, tool-use delimiters, FIM tokens (prefix/middle/suffix for code infill, Bavarian et al. 2022), and future modalities — retro-fitting tokens later forces embedding surgery. Diagnostics: (1) round-trip fidelity on a stress corpus (emoji, CJK, RTL scripts, code) must be byte-exact; (2) inspect the longest tokens in the learned vocab — garbage like long URL fragments or boilerplate strings ("SolidGoldMagikarp"-style glitch tokens, Rumbelow & Watkins 2023) signals tokenizer-corpus contamination and creates undertrained-token pathologies; (3) check per-language fertility tables before committing; (4) verify no token decodes to the empty string and that the special tokens never occur in natural data.
Q014	71	How should I set global batch size, token-based warmup, and gradient clipping for standard LLM pretraining?	5.5 Pro judge: 86 Opus 4.8 Judge: 84 Gemini 3.1 Pro judge: 67 Treat this as a controlled engineering decision, not a preference. Set batch size in tokens and tune it jointly with LR, warmup, and clipping. The largest hardware-efficient batch is not automatically the best optimizer batch; compare loss at equal consumed tokens. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Start global batch at 0.5M-2M tokens/update for sub-10B models and 2M-8M for larger models if generalization holds. Warm up for 0.5%-2% of planned tokens with a floor of 1,000-5,000 steps. Use true global norm clipping at 1.0; sweep 0.5 for unstable long-context/post-training and 2.0 for clean pretraining. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Run batch sizes {0.25M,1M,4M,8M tokens} with LR tuned for each. Choose the largest batch within roughly 0.01-0.03 nats of best validation loss at fixed tokens and materially better utilization. Log clipping fraction, gradient noise estimates, update/weight ratios, and number of optimizer updates per token budget. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Comparing by steps instead of tokens, local gradient clipping, persistent clipping above 1%-5%, linear LR scaling beyond the gradient-noise scale, and warmup too short for huge batches. For MoE, router gradients may need separate clipping or smaller LR. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: McCandlish et al. 2018 gradient noise scale; Goyal et al. 2017 batch scaling; Loshchilov and Hutter 2019 AdamW; Kaplan et al. 2020; Hoffmann et al. 2022; Narayanan et al. 2021..	5.5 Pro judge: 96 Opus 4.8 Judge: 91 Gemini 3.1 Pro judge: 89 Global batch size: think in tokens, not sequences. Published anchors: ~0.5M tokens for 1B-class models, 1–2M for 7B, 2–4M for 13–34B, 4M+ for 70B (GPT-3: 3.2M at 175B; Llama 2: 4M; Chinchilla: 1.5M at 70B; DeepSeek-V3 ramped to ~15M for a 671B MoE). The governing concept is the critical batch size (McCandlish et al. 2018): below it, doubling batch nearly halves steps at equal token-efficiency; above it, you burn tokens for no step-count gain. Critical batch size grows as loss falls — roughly with the inverse of loss progress — which justifies batch-size warmup: e.g., DeepSeek-V3 ramped 3k→15k sequences over the first ~470B tokens; GPT-3 ramped from 32k to 3.2M tokens over the first 4–12B tokens. For a standard dense 7B run, a fixed 2M-token batch is fine; add a ramp (start at 1/4 target, reach full by 2–5% of tokens) if you observe early instability or are pushing batch above ~4M. Practical formula: global_batch_tokens = seq_len x microbatch x grad_accum x DP_size. Choose it also for systems reasons — it must be divisible by DP x microbatch x seq_len, and larger batches improve pipeline efficiency (more microbatches per step) — but stay within 2x of the statistical anchors above. Token-based warmup (LR): define warmup in tokens, not steps, so it is invariant to batch-size changes. 0.1–1% of the total token budget is the standard band: GPT-3 used 375M tokens of 300B (0.125%); 2,000 steps at 4M tokens = 8B tokens (~0.5% of a 1.5T run) is the common Llama-style setting. Use linear warmup from ~0. If you ramp batch size simultaneously, keep LR warmup token-denominated and finish the batch ramp before or at the same time as LR warmup ends — overlapping a growing batch with peak LR is a known spike recipe. When you scale batch by k within the sub-critical regime, scale peak LR by sqrt(k) (Adam heuristic; Malladi et al. 2022) rather than linearly. Gradient clipping: global L2 norm clip at 1.0 is near-universal (GPT-3, Llama, PaLM, OPT all used 1.0; some use 0.5 during instability). Implementation requirements: compute the norm over all parameters globally (across TP/PP shards — a per-shard clip is a silent correctness bug), in FP32, after gradient accumulation and after any all-reduce, before the optimizer step. Log both the pre-clip norm and the clip-trigger rate. Interpretation: during warmup, clipping on 10–50% of steps is normal; in steady state, the trigger rate should fall below ~1–2%. Persistent clipping means peak LR is too high or data is pathological; clipping never triggering at all means you could likely have used a higher LR. A useful refinement at scale is spike skipping: skip the update entirely when the grad norm exceeds, say, 3–5x the trailing median (used in various forms by Falcon and others), which converts rare data-driven spikes into a single lost step instead of a corrupted optimizer state — but log every skip and alarm if skips cluster. Failure modes: loss divergence right at warmup end (lower peak LR 30% or double warmup); sawtooth loss synchronized with batch ramp milestones (ramp too fast); a permanent loss offset after a batch increase mid-run (you exceeded critical batch size — revert or raise LR by sqrt).
Q015	72	When should I use absolute, rotary, ALiBi, or another positional encoding method?	5.5 Pro judge: 86 Opus 4.8 Judge: 84 Gemini 3.1 Pro judge: 64 Treat this as a controlled engineering decision, not a preference. Default to RoPE for modern decoder-only LLMs unless a long-extrapolation ablation proves ALiBi or another scheme wins. Positional encoding is not isolated; it interacts with context extension, attention kernels, and checkpoint compatibility. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Use RoPE with a fixed base convention; common theta values include 10,000 for shorter contexts and 100,000-1,000,000 in long-context variants. Learned absolute embeddings are acceptable for small fixed-window reproduction work, not frontier extensibility. Use ALiBi when train-short/test-long simplicity and negligible memory overhead are priorities. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Before long-context extension, plot perplexity by position under candidate scaling. For RoPE extension, test position interpolation, NTK-aware scaling, YaRN, LongRoPE-style scaling, or continued training. Keep 50%-80% short sequences during extension and evaluate needle retrieval, long QA, code repositories, and short-context regression. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Raising context length without adapting RoPE, local syntax degradation from aggressive scaling, models passing passkey tests but failing reasoning, repetition near old context boundary, and changing positional scheme mid-run without migration. Diagnose attention entropy by distance and loss by token position. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Vaswani et al. 2017; Raffel et al. 2020 T5 relative bias; Su et al. 2021 RoPE; Press et al. 2021 ALiBi; Chen et al. 2023 position interpolation; Peng et al. 2023 YaRN; Ding et al. 2024 LongRoPE..	5.5 Pro judge: 95 Opus 4.8 Judge: 92 Gemini 3.1 Pro judge: 93 Default: rotary position embeddings (RoPE; Su et al. 2021). It is what Llama 1–3, Mistral, Qwen, DeepSeek, Gemma and essentially all current open frontier models use. RoPE encodes relative position by rotating query/key pairs per frequency band, costs nothing in parameters, works cleanly with FlashAttention, and — decisively — supports post-hoc context extension via frequency rescaling (position interpolation, NTK-aware scaling, YaRN; Peng et al. 2023). Set the base frequency theta with your context plan in mind: theta=10,000 was the original default and is fine to 4k; use theta=500,000 (Llama 3) when pretraining at 8k with intent to extend to 128k; raising theta at extension time is the standard recipe (e.g., Llama 3 went to theta=8M for 128k context). Apply RoPE in FP32; BF16 angle computation corrupts large positions. Partial-rotary (rotating only 25–50% of head dims, GPT-NeoX style) is a reasonable variant that leaves channels free for content. Absolute learned positions (GPT-2/3, OPT): obsolete for new models. They hard-fail beyond the trained length (no extrapolation at all), add V_pos x d parameters, and underperform RoPE at equal scale. Only choose them for strict reproduction of legacy baselines. Sinusoidal absolute (original Transformer): likewise superseded. ALiBi (Press et al. 2022): adds head-specific linear distance penalties to attention logits instead of rotating embeddings. Strengths: best zero-shot length extrapolation of the simple methods (train 2k, run 4–8k with modest degradation), trivially simple, no embedding interactions; used by BLOOM and MPT, and notably by Jamba in its non-positional flavor of design space. Weaknesses: the imposed recency bias measurably hurts retrieval-style long-range tasks (passkey/needle benchmarks) compared to extended RoPE, and FlashAttention support for per-head biases has historically lagged (now available via the slopes path). Choose ALiBi when you need cheap, robust modest extrapolation and your workload is locality-dominated; choose RoPE when you will do an explicit long-context extension stage. NoPE (no positional encoding at all): decoder-only causal models can learn position implicitly (Haviv et al. 2022; Kazemnejad et al. 2023 show competitive length generalization on algorithmic tasks). Some hybrid/SSM stacks and DeepSeek's MLA-adjacent designs use NoPE on some layers. Interesting for research; not yet the conservative choice for a flagship LM, though mixing NoPE layers with RoPE layers (e.g., every 4th layer NoPE, as in some recent long-context recipes like Llama 4's iRoPE) is gaining evidence as an extension-friendly design. Emerging refinements worth adopting at the margin: per-layer or scaled variants (e.g., low-frequency truncation in Llama 3.1's RoPE scaling function), and document-aware position reset when packing sequences (restart positions at each document boundary with cross-document attention masking — improves short-document quality at no cost). Decision summary: greenfield general-purpose LM → RoPE, theta chosen for the final intended context, FP32 rotation, plan a YaRN-style extension stage. Memory-light extrapolation without an extension stage → ALiBi. Reproducing GPT-3-era baselines → learned absolute. Research on length generalization → evaluate NoPE/hybrid layouts. Diagnostics: passkey-retrieval accuracy as a function of position (should be flat across the trained window), attention-entropy by distance band, and perplexity-vs-position curves on long held-out documents — a rising tail localizes positional failure before downstream evals notice.
Q013	74	How do I select transformer depth, hidden width, attention heads, and feed-forward dimensions for a target parameter count?	5.5 Pro judge: 87 Opus 4.8 Judge: 85 Gemini 3.1 Pro judge: 66 Treat this as a controlled engineering decision, not a preference. Choose transformer dimensions from target parameter count, kernel efficiency, and parallelism divisibility. Use a modern decoder-only baseline: pre-norm RMSNorm, RoPE, SwiGLU, head dimension 64-128, and dimensions aligned to tensor-core tiles. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: For dense SwiGLU blocks, approximate non-embedding parameters as 12Ld_model^2. Embeddings add vocab_sized_model, doubled if output weights are untied. Plausible shapes: 1B d=2048 L=24; 7B d=4096 L=32; 13B d=5120 L=40; 70B d=8192 L=80. FFN hidden for SwiGLU is often about 8/3d_model; use multiples of 256. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Sweep depth/width at equal FLOPs and tokens. Require number of heads divisible by tensor-parallel degree and d_head in {64,80,96,128}. Evaluate MFU, memory, KV-cache, latency, validation loss by domain, and downstream transfer. Test GQA for inference-heavy models, e.g. 32 query heads with 8 KV heads. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Head_dim too small, vocab embeddings dominating small models, FFN sizes misaligned with kernels, tensor-parallel-incompatible head counts, too shallow/wide models underperforming reasoning, and residual activations growing with depth. Diagnose with layerwise RMS and equal-compute ablations. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Vaswani et al. 2017; Shazeer 2020 GLU variants; Su et al. 2021 RoPE; Zhang and Sennrich 2019 RMSNorm; Kaplan et al. 2020; Hoffmann et al. 2022; Touvron et al. 2023..	5.5 Pro judge: 96 Opus 4.8 Judge: 90 Gemini 3.1 Pro judge: 90 Work backwards from the parameter budget using the standard decoder-only parameter count: N ≈ V x d (embeddings, x2 if untied) + L x (4 d^2 attention + 2 x d x d_ff feed-forward, or 3 x d x d_ff for SwiGLU). Then apply shape heuristics that the published record agrees on: Aspect ratio (d_model / n_layers): keep it between roughly 32 and 128 in the 1–10B range, trending higher with scale. Kaplan et al. 2020 showed loss is remarkably flat across a wide basin of depth/width at fixed N — you have latitude — but extremes hurt: too deep (aspect <20) trains slowly and less stably; too wide (aspect >200) underuses depth for composition. Concrete reference shapes: 1.3B: L=24, d=2048; 2.7B: L=32, d=2560; 7B: L=32, d=4096; 13B: L=40, d=5120; 70B: L=80, d=8192 (the GPT-3/Llama lineage). Levine et al. 2020 give the theoretical depth-width tradeoff if you want to optimize; deviations of ±25% from these shapes cost well under 1% loss. Head dimension: fix d_head at 64–128 and set n_heads = d_model / d_head; 128 is the modern default (Llama, Mistral) because larger head_dim improves FlashAttention throughput and RoPE resolution. Do not go below 64 — attention quality degrades (Bhojanapalli et al. 2020 on low-rank attention bottlenecks). KV heads: use grouped-query attention with 8 KV heads regardless of total heads, for inference KV-cache economy (Ainslie et al. 2023, GQA; Llama 2 70B, Llama 3, Mistral all use 8). Quality loss versus full MHA is negligible at ≥7B; train with GQA from scratch rather than uptraining if you know you need it. Feed-forward dimension: with SwiGLU (Shazeer 2020; standard since PaLM/Llama), set d_ff = 8/3 x d_model rounded to a hardware-friendly multiple (e.g., Llama 7B: d=4096, d_ff=11008), so the SwiGLU block's 3 matrices cost the same as the classic 2-matrix 4x FFN. The FFN:attention parameter ratio then stays ~2:1, which is the well-tested regime. Hardware-divisibility constraints, applied last: d_model divisible by n_heads, and ideally by 128 for tensor cores; d_ff and vocab divisible by your tensor-parallel degree times 64; n_layers divisible by pipeline stages. These constraints cost nothing if applied as rounding and cause real pain mid-run if ignored. Other defaults worth fixing now: RoPE positions (theta 500k if you plan ≥8k context); RMSNorm pre-norm; no biases; untied embeddings above ~1B; vocab parameters <15% of N (shrink V or grow the body if violated — a 1B model with a 256k vocab is mostly embeddings). Validation procedure: at your target N, the cheap insurance is a 3-point shape ablation at 1/50th scale — e.g., for a planned 7B, train 350M models with aspect ratios 40/64/100 for 7B tokens each and confirm the basin is flat for your data mix; data mixes heavy in code sometimes prefer deeper models. Diagnostics that shape is wrong only appear at extremes: very deep-narrow models show late-layer gradient attenuation and slower wall-clock per FLOP (kernel inefficiency at small d); very wide-shallow models plateau early on compositional evals (multi-step reasoning) despite matched loss. When in doubt, copy the closest published shape — shape is a solved problem at these scales, and your compute is better spent on data and tokens.
Q020	75	How should I resume a pretraining run after changing the data mixture, sequence length, or cluster size?	5.5 Pro judge: 84 Opus 4.8 Judge: 84 Gemini 3.1 Pro judge: 64 Treat this as a controlled engineering decision, not a preference. Classify a resume as continuation, phase transition, or new experiment. Preserve optimizer state for a true continuation; use a checkpointed, evaluated phase boundary for data-mixture, sequence-length, or cluster-size changes. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: If only cluster size changes, keep global batch tokens constant by adjusting accumulation and preserve LR/Adam moments. For mixture changes, evaluate the checkpoint on old and new validation mixes, then ramp LR at 0.3-0.7x for 500-5,000 steps if needed. For 2k-to-8k context changes, use 2k/4k/8k curriculum over several billion tokens or a few percent of budget. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: After resume, inspect first 100 steps: loss, gradient norm, LR, sample ids, validation on old and new sets, throughput, and restore ability. Keep short examples during context extension and compare loss by position. Replay sampler state across new world size. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Silent loss discontinuities, duplicated data after sampler remap, changing global batch without LR retune, resetting Adam moments unnecessarily, long-context extension forgetting short-context quality, and topology changes breaking optimizer shard ownership. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Goyal et al. 2017; McCandlish et al. 2018; Press et al. 2021; Chen et al. 2023; Narayanan et al. 2021; Hoffmann et al. 2022..	5.5 Pro judge: 96 Opus 4.8 Judge: 91 Gemini 3.1 Pro judge: 87 Each change has a different blast radius; handle them separately and never change more than one at a time without a checkpoint boundary and an eval snapshot on both sides. Data-mixture change: the safest of the three. Procedure: (1) snapshot full eval battery at the boundary checkpoint; (2) switch the per-source sampling weights (this is why you keep sources as separate streams); (3) expect a step-discontinuity in training loss — the loss is now measured on a different distribution; this is not a regression. Re-baseline: compare per-domain validation losses, which are mixture-invariant, not the headline training loss. (4) If the new mixture upweights a domain sharply (>2–3x), ramp the weights over 5–20B tokens rather than stepping — a hard step into heavy code/math after a web-dominated phase produces a transient grad-norm spike and occasionally a real divergence at low remaining LR. (5) Optimizer state needs no reset for mixture changes. This is routine: staged-mixture training (bulk → annealing upweight) is standard practice (Llama 3, MiniCPM, OLMo 2). Sequence-length change (e.g., 4k→8k→32k extension): (1) keep total tokens per batch roughly constant — halve the sequences per batch as you double length, so the optimizer sees a similar token throughput and noise scale; (2) adjust RoPE before resuming: raise theta or apply the chosen scaling (NTK/YaRN) at the moment of extension, not after; (3) blend in genuinely long documents (upsample long-doc sources; synthetic concatenation alone teaches packing artifacts) — standard recipes use a few tens of billions of tokens at the new length (Llama 3: ~800B tokens across its long-context stages, but 10–100B suffices for 8k→32k at 7B scale); (4) expect attention-memory growth ~linear with FlashAttention; re-derive microbatch and re-verify memory headroom; (5) monitor short-context evals during extension — short-context regression is the canonical failure; keep 50–80% of the extension mixture at the original length to prevent it. Cluster-size / parallelism change: pure systems change if and only if the global batch in tokens is preserved — then training dynamics are identical and you only need a checkpoint format that reshards (torch DCP, Megatron distributed checkpoints). Verify with a 100-step bitwise-or-near loss overlay after resharding. If the new cluster size forces a different global batch: scale LR by sqrt(batch ratio) for Adam, re-run a short stability window at reduced LR (90% of target) for ~1B tokens, and treat critical-batch-size as the ceiling — if you were at 4M tokens and double the cluster, prefer shorter wall-clock at the same batch over pushing to 8M without evidence. Optimizer state must be resharded exactly — m and v travel with their parameters; any "fresh Adam state" resume costs you a visible loss bump that takes billions of tokens to recover (the bias-corrected second moment re-estimates from zero, transiently inflating updates; if you must reinit state, re-warm LR from ~0 over 100–500M tokens). Universal hygiene: tag every resume in the experiment tracker with a vertical annotation on all dashboards; keep the pre-change checkpoint frozen until the post-change run has demonstrably healthy loss slope and eval parity over ≥10B tokens; and rehearse the rollback. The dominant operational failure is compounding two changes (e.g., new mixture + new batch size) and being unable to attribute the resulting regression — one variable per boundary, always.
Q012	76	How should I clean, normalize, deduplicate, and filter web text before pretraining?	5.5 Pro judge: 87 Opus 4.8 Judge: 84 Gemini 3.1 Pro judge: 68 Treat this as a controlled engineering decision, not a preference. Web-text preparation should be a staged, auditable filtering pipeline with token counts and examples before and after every filter. Treat quality, deduplication, PII removal, and benchmark contamination as separate controls. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Extract text robustly; remove navigation, scripts, cookie banners, repeated sidebars, malformed markup, and obvious spam. Normalize Unicode, whitespace, line endings, and controls while preserving code/math. Run exact dedup by normalized hash, then near-dedup with MinHash/LSH over word or character shingles at Jaccard thresholds around 0.7-0.9. Use language-id and quality buckets, not only binary keep/drop. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Use heuristics and classifiers: length, repeated-line fraction, alphabetic ratio, punctuation, stopwords by language, compression ratio, small-LM perplexity, boilerplate phrases, domain reputation, parseability for code, generated-file detection, and secret/credential filters. Train proxy models on candidate datasets at equal tokens and sample rejected/kept documents for human audit. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Over-filtering dialects and low-resource languages, retaining SEO templates, deleting code indentation, contaminating evals through mirrors or synthetic explanations, and training quality classifiers on contaminated examples. Diagnose with removal-rate dashboards by source/language and contamination scans using exact, n-gram, MinHash, embedding, and answer-string matching. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Raffel et al. 2020 C4; Gao et al. 2020 The Pile; Penedo et al. 2023 RefinedWeb; Soldaini et al. 2024 Dolma; Lee et al. 2022 deduplication; Carlini et al. 2021/2022 memorization..	5.5 Pro judge: 97 Opus 4.8 Judge: 92 Gemini 3.1 Pro judge: 90 Run a fixed-order pipeline; order matters because each stage's heuristics assume the previous stage's output. The canonical sequence, following CCNet (Wenzek et al. 2020), Gopher (Rae et al. 2021), RefinedWeb (Penedo et al. 2023), FineWeb (Penedo et al. 2024), and DCLM (Li et al. 2024): (1) Extraction. From WARC HTML, use trafilatura or resiliparse rather than WET files — FineWeb measured material downstream gains from better extraction alone. Strip boilerplate, navigation, cookie banners. (2) Language ID. fastText LID (or GlotLID for long-tail languages) at document level; threshold ~0.65 confidence. Route languages to per-language pipelines; do not apply English heuristics to CJK. (3) Quality heuristics (Gopher rules, applied per language): drop documents with <50 or >100k words; mean word length outside 3–10 chars; >30% lines ending in ellipsis; <80% lines starting with an alphabetic character; symbol-to-word ratio >0.1 for # or ellipsis; fewer than 2 of the common English stop words (adapted per language); bullet-point lines >90%. Add C4-style rules selectively: drop lines without terminal punctuation hurts code/forums, so apply per-source; remove "lorem ipsum", javascript-required boilerplate, curly-brace pages from prose streams. (4) Deduplication, two levels. Exact: hash of normalized text (lowercased, whitespace-collapsed). Near-dup: MinHash with ~9,000 permutations bucketed for Jaccard ~0.8 over 5-gram word shingles (Lee et al. 2021; RefinedWeb config), or suffix-array substring dedup removing repeated spans ≥50 tokens. Important empirical caution from FineWeb: aggressive global dedup across all CC dumps can hurt — dedup within dump, keep one copy across dumps, and measure. Also do paragraph-level dedup for boilerplate (headers/footers repeated across a site). (5) Model-based quality filtering — the single highest-leverage stage by current evidence. DCLM's fastText classifier (trained on OpenHermes+ELI5 as positives vs raw CC negatives) selecting the top ~10% beat all heuristic-only baselines; FineWeb-Edu used a Llama-3-70B-annotated educational-quality regressor (keep score ≥3/5) and dramatically improved MMLU/ARC trajectories. Train a cheap classifier (fastText or a 100M-param model) on a few hundred thousand labeled examples, calibrate the threshold by training 1B-scale proxy models on slices at different thresholds, and pick via downstream evals, not classifier AUC. (6) Safety/PII. Regex-scrub emails, phone numbers, IPs (BigScience ROOTS scrubbers); filter NSFW/toxic domains via URL blocklists primarily — text-level toxicity classifiers overfilter dialect and medical text (documented in Dodge et al. 2021 on C4); remove known CSAM/exploit domains absolutely. (7) Decontamination. Remove documents with 13-gram overlap against all evaluation benchmarks you will ever report (do this last, against the final tokenized form). Normalization throughout: NFC Unicode, strip control characters except newline/tab, collapse >2 consecutive newlines, fix mojibake with ftfy. Do not lowercase, do not strip punctuation. Validation: at every stage, log retention rate per domain and inspect 100 random kept and dropped documents — every published pipeline retrospective reports a rule that silently deleted something valuable (e.g., terminal-punctuation rules deleting poetry and code). Final check: train two 1B proxies on before/after data at ~20 tokens/param; the filtered version should win on a fixed eval suite (HellaSwag, ARC, MMLU-subset). Expect to keep roughly 10–30% of raw Common Crawl text after the full pipeline.
Q017	77	How should I evaluate pretraining checkpoints beyond validation loss to catch regressions in language, knowledge, and reasoning?	5.5 Pro judge: 86 Opus 4.8 Judge: 84 Gemini 3.1 Pro judge: 66 Treat this as a controlled engineering decision, not a preference. Evaluate checkpoints with an increasingly expensive ladder; validation loss is necessary but not sufficient for language, knowledge, reasoning, safety, or contamination regressions. Separate base-model evaluation from chat/post-training evaluation. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Every checkpoint run validation loss by stratum, repetition, calibration, tokenizer byte-fallback in generations, and fixed-prompt samples. Candidate checkpoints run MMLU/MMLU-Pro-style breadth, HellaSwag/ARC, GSM8K/MATH/AIME-style math, HumanEval/MBPP/SWE-bench-lite-style code, FLORES/XNLI/CMMLU-style multilingual, long-context retrieval/QA, toxicity, bias, and robustness. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Use pinned prompts, few-shot examples, decoding configs, dataset hashes, and scoring commits. For code/math, execute or verify answers. Report confidence intervals. Scan all train/SFT/synthetic data against benchmarks with exact hash, n-gram, MinHash, embedding retrieval, and answer-string search; report metrics with suspect items removed. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Public benchmark overfitting, prompt drift, comparing tokenizers unfairly, aggregate scores hiding subgroup collapse, LLM judges preferring verbosity, and sudden benchmark jumps from contamination. Diagnose with private perturbation sets and item-level contamination risk. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Hendrycks et al. 2021 MMLU; Srivastava et al. 2022 BIG-bench; Liang et al. 2022 HELM; Chen et al. 2021 HumanEval/Codex; Cobbe et al. 2021 GSM8K; Lewkowycz et al. 2022 Minerva; Brown et al. 2020..	5.5 Pro judge: 96 Opus 4.8 Judge: 92 Gemini 3.1 Pro judge: 89 Validation loss is necessary but radically insufficient: it averages over the token distribution, so a 2% regression in math or a collapse in a low-resource language can hide inside a 0.001-nat aggregate move. Build a checkpoint-evaluation battery with four layers, run on a fixed cadence (e.g., every 50–100B tokens, plus on demand before/after any run intervention). Layer 1 — decomposed perplexity. Per-domain validation losses (web, books, code, math, each major language), plus bits-per-byte on standard slices (e.g., Paloma, Magnusson et al. 2023, for stratified web perplexity). Bits-per-byte rather than per-token loss when comparing across tokenizer changes. Alert on any domain regressing >0.005 nats between adjacent checkpoints while the aggregate falls — the signature of mixture drift or a poisoned shard. Layer 2 — cheap likelihood-based benchmarks, runnable in minutes. The standard early-signal suite: LAMBADA (long-range dependency), HellaSwag, PIQA, ARC-easy/challenge, Winogrande, BoolQ, plus MMLU (5-shot). Critical methodological detail: at early/mid training, use continuous metrics — per-choice length-normalized log-likelihood or Brier-style scoring — rather than exact-match accuracy, because accuracy sits at chance for a long time and then jumps, creating the illusion of "emergence" that is mostly a metric artifact (Schaeffer et al. 2023, "Are Emergent Abilities a Mirage?"); likelihood-based versions trend smoothly from very early (the technique behind Llama 3's and FineWeb's "early signal" eval design). Track the full trajectory against a reference curve from a previous good run of similar scale — deviation from the expected trajectory matters more than the absolute number. Layer 3 — generation-based probes. Small fixed prompt sets (200–500 prompts) scored automatically: GSM8K subsample with answer extraction (reasoning), HumanEval/MBPP subset with execution (code — execution-based, never string match), simple instruction-following canaries, repetition rate and mean entropy of greedy continuations (degeneration watch), and factual cloze sets for knowledge. Generation evals catch tokenizer/decoding pathologies and degeneration that likelihood evals miss entirely. Layer 4 — capability-specific deep dives at milestone checkpoints (every ~500B tokens): full MMLU/MMLU-Pro, GSM8K full, MATH subset, multilingual suites (FLORES for translation, multilingual MMLU variants), long-context passkey/needle tests at the trained length. Methodological requirements that make results trustworthy: pin the eval harness version and prompts (lm-evaluation-harness with a frozen commit; prompt formatting swings MMLU by multiple points); fix few-shot exemplars and seeds; evaluate the EMA/averaged weights if you use them for deployment; decontaminate training data against every benchmark in the battery (13-gram overlap) and additionally monitor contamination online — a benchmark jumping >2–3 points between adjacent checkpoints with flat neighbors is contamination until proven otherwise (check per-item loss on benchmark strings); keep at least one fully private, internally authored eval set that never appears in any public dump. Regression-gating practice: define, per metric, a noise band from checkpoint-to-checkpoint variance on a healthy run (e.g., HellaSwag ±0.4 points at 7B), and flag any 2-sigma breach. When a regression flags, bisect: re-run the eval (harness flake), then check the data shards that entered in the interval, then diff per-domain losses to localize. The whole battery through Layer 3 should cost <0.5% of training compute — there is no economy in evaluating less.
Q018	79	How do I mix natural language, code, mathematics, and multilingual data without allowing one domain to dominate training?	5.5 Pro judge: 86 Opus 4.8 Judge: 84 Gemini 3.1 Pro judge: 65 Treat this as a controlled engineering decision, not a preference. Mix domains by explicit token budgets and capability hypotheses, not file availability. Natural language, code, math, multilingual, instruction-like, and safety-relevant data should each have target weights, repeat caps, contamination reports, and evals. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: A starting exploratory mixture might allocate 40%-70% general natural language, 10%-30% code, 5%-20% math/science, and 10%-30% multilingual, adjusted to goals. Within languages, use temperature sampling p_i proportional to n_i^alpha with alpha 0.3-0.7. Sample code by repository/file group and math by source family; distinguish exposition, problem-solution, formal proofs, and executable tasks. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Train proxy models with upweight sweeps and measure cross-domain transfer. Log realized token weights over sliding windows and alert above 1%-2% for major buckets. Evaluate natural language perplexity, HumanEval/MBPP, math verifiers, multilingual tasks, and safety gates after mixture changes. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: English/code dominance causing language collapse, math data causing verbose incorrect reasoning, code markdown leaking into prose, synthetic teacher errors amplifying, and high-quality narrow data reducing diversity. Diagnose by domain-level loss and private eval regressions. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Raffel et al. 2020; Gao et al. 2020 The Pile; Costa-jussa et al. 2022 NLLB; Chen et al. 2021 Codex; Lewkowycz et al. 2022 Minerva; Penedo et al. 2023; Hoffmann et al. 2022..	5.5 Pro judge: 96 Opus 4.8 Judge: 92 Gemini 3.1 Pro judge: 88 Treat mixture weights as first-class hyperparameters, tuned with proxy experiments, enforced by per-source sampling, and bounded by epoch-count constraints. Mechanics first: keep each domain as a separate stream and draw each training sequence from domain i with probability w_i (per-sequence categorical sampling, seeded). Never pre-concatenate and shuffle once — you lose the ability to reweight mid-run, and exhausted domains silently repeat. Log realized composition per batch and alarm on drift. Reference points for weights (final mixtures of published runs): Llama 3 reported roughly 50% general knowledge/web, 25% math+reasoning, 17% code, 8% multilingual (Dubey et al. 2024); DeepSeek and Qwen use code+math fractions in the 20–30% band; The Pile (Gao et al. 2020) and Dolma (Soldaini et al. 2024) document fully open recipes. Two robust empirical regularities: (a) code in the 10–25% range improves general reasoning, not just coding (documented in several ablation lines, e.g., Aryabumi et al. 2024 "To Code or Not To Code"); (b) past ~30–40% for any specialist domain, general English quality measurably regresses — the dominance failure you are guarding against. Constraint that binds before preference: epochs per domain. High-quality domains are small. Repeating data beyond ~4 epochs delivers rapidly diminishing returns, and degradation risk grows beyond that (Muennighoff et al. 2023, "Scaling Data-Constrained Language Models" — up to 4 epochs is nearly as good as fresh data; beyond ~16, close to worthless). So compute w_i_max = 4 x size_i / total_tokens and cap every domain there. Wikipedia at 5B tokens in a 2T-token run caps at ~1% no matter how much you like it. Tuning the weights: (1) Temperature baseline — sample domains proportional to size^alpha with alpha 0.5–0.7 to lift small high-quality sources without exploding their epoch counts; same trick per language for multilingual balance (alpha ~0.3 historically for low-resource lift, XLM-R). (2) Proxy-model sweeps — train 300M–1B models for 20–50B tokens per candidate mixture and read per-domain validation losses plus the early-signal eval suite; mixture rankings at small scale transfer reasonably to 10–100x larger models (this is the validated workflow behind FineWeb's and DCLM's decisions). (3) Algorithmic reweighting if you have many sources: DoReMi (Xie et al. 2023) learns weights via a small proxy/reference pair with group-DRO and transferred well in its experiments (8x faster convergence claims at 280M→8B); online variants (ODM, Albalak et al. 2023) exist but add machinery. In practice most labs land on hand-tuned weights validated by proxy sweeps; DoReMi is a good initializer. Guard the aggregate with per-domain validation losses: the multilingual loss rising while code falls means code is eating multilingual capacity; rebalance or grow the model. Interference is real but mostly shows when a domain exceeds ~30% or the model is small (<7B) relative to the domain count. Finally, separate mixture-for-bulk from mixture-for-annealing: keep the highest-quality math/code/instructional data partially in reserve and upweight it sharply (2–10x) in the final 10–20% of training during LR decay — the annealing-phase upweighting trick (MiniCPM, Hu et al. 2024; Llama 3; OLMo 2) yields outsized benchmark gains per token and lets the bulk phase run on cheaper, more plentiful data.
Q028	80	When should I use activation checkpointing, CPU or NVMe offload, optimizer sharding, or recomputation?	5.5 Pro judge: 86 Opus 4.8 Judge: 85 Gemini 3.1 Pro judge: 63 Treat this as a controlled engineering decision, not a preference. Choose memory techniques by the memory component that dominates: parameters, gradients, optimizer states, activations, KV/context tensors, communication buffers, or fragmentation. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Use activation checkpointing first; expect about 20%-35% extra compute depending on granularity. Use ZeRO-1 for optimizer states, ZeRO-2 for gradients, ZeRO-3/FSDP for parameters when needed. Use FlashAttention to avoid materializing full attention matrices. Use CPU offload only when bandwidth and overlap are acceptable; use NVMe offload as a last resort for throughput-critical training. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Maintain a per-rank memory budget table and compare peak allocated versus reserved. Profile throughput before and after each technique. Try smaller microbatch plus more accumulation, sequence parallelism, GQA, lower sequence length, or more model parallelism before slow offload. For MoE, use expert parallelism and dropless block-sparse kernels when available. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Saving memory while losing so much MFU that cost rises, CPU offload saturating PCIe, NVMe offload colliding with checkpoint I/O, ZeRO prefetch causing p99 stalls, and checkpointing nondeterministic dropout incorrectly. Diagnose with memory and communication traces. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Chen et al. 2016 gradient checkpointing; Rajbhandari et al. 2020 ZeRO; Ren et al. 2021 ZeRO-Offload; Dao et al. 2022 FlashAttention; Korthikanti et al. 2022; Gale et al. 2022..	5.5 Pro judge: 96 Opus 4.8 Judge: 92 Gemini 3.1 Pro judge: 91 Order these techniques by their cost-per-byte-saved, and apply them only until the configuration fits with ~10% headroom — every technique after "it fits" is pure overhead. Memory anatomy per GPU first: BF16 weights 2N/(TP·PP), gradients 2N/(TP·PP), Adam state + FP32 master 12N/(TP·PP·ZeRO_dp) bytes, plus activations ∝ microbatch x seq x hidden x layers/PP, plus NCCL/workspace buffers (several GB) and fragmentation. (1) Optimizer sharding (ZeRO-1) — always on, effectively free. Sharding Adam state across DP ranks costs only a parameter all-gather that replaces/joins the existing gradient collective (reduce-scatter + all-gather pattern); at DP ≥ 8 it removes the largest single memory block (12 bytes/param → 12/DP). There is no regime where unsharded optimizer state at DP>4 is the right answer. ZeRO-2 (also shard gradients) is similarly cheap with bucketed reduce-scatter. ZeRO-3/FSDP (shard weights too) adds per-layer all-gathers in forward and backward — fine within a node or on fat fabrics, increasingly exposed at large DP across slow links; prefer TP/PP over ZeRO-3 beyond ~30B on commodity IB unless your stack's prefetch overlap is excellent. (2) Activation checkpointing — the workhorse. Full checkpointing (store only block inputs, recompute the block in backward) cuts activation memory ~5–10x for ~33% more compute (forward recompute; HFU≈1.33·MFU). Selective checkpointing is strictly better when sufficient: recompute only the attention internals (softmax scores, dropout masks — the memory-heavy, FLOP-light parts), costing ~3–5% time for most of the benefit (Korthikanti et al. 2022: 5x activation reduction with selective recompute + sequence parallelism). Policy: enable SP (with TP) first, then selective, escalate to full only on the layers/stages that still OOM (early pipeline stages hold the most in-flight activations — checkpoint those preferentially). FlashAttention already avoids materializing the s² score matrix; do not double-pay by recomputing what it never stored. (3) Recomputation beyond checkpointing — same principle applied surgically: recompute RoPE, norms, activation functions (cheap, memory-fat) via fused kernels that never store intermediates; most modern stacks (TE, Liger kernels) do this by default. (4) CPU offload — optimizer state first. ZeRO-Offload/ZeRO-3-offload pushes Adam state (+optimizer step) to host: saves 12 bytes/param of HBM at the cost of PCIe transfers (~25–50 GB/s) and CPU-side Adam; viable when step time is long enough to hide transfers — typically fine-tuning or small-cluster training of big models, not competitive for throughput-critical pretraining on adequate clusters (it caps step rate hard). Activation offload to host is attractive specifically for pipeline-held activations and long-context training, where reuse latency is long (Megatron's CPU activation offload, various "infinite context" stacks); budget PCIe bandwidth against your step time analytically before enabling. (5) NVMe offload (ZeRO-Infinity) — last resort. ~5–7 GB/s per node makes it strictly a "fit a model that otherwise cannot run" tool (e.g., fine-tuning 70B on one 8-GPU node); never on the critical path of a production pretraining run. Decision procedure: compute the memory budget analytically for the candidate (TP, PP, DP, ZeRO stage); if over, apply in order: ZeRO-1 → SP → selective checkpointing → ZeRO-2 → rebalanced/full checkpointing on hot stages → more TP/PP → offload. Diagnostics: torch.cuda.memory_summary for fragmentation (large reserved-but-unallocated → expandable_segments), OOMs at step ~1 vs step ~1000 (static misbudget vs fragmentation/leak), and throughput before/after each technique — if tokens/sec dropped more than the analytic prediction, the overlap (prefetch, recompute scheduling) is misconfigured, not the technique itself.
Q030	82	How do I design a high-throughput data loader that keeps a distributed training cluster continuously fed?	5.5 Pro judge: 87 Opus 4.8 Judge: 84 Gemini 3.1 Pro judge: 68 Treat this as a controlled engineering decision, not a preference. A high-throughput data loader should be an offline pretokenized data system plus a deterministic online sampler. Online tokenization and centralized coordination are enemies of frontier-scale utilization. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Pretokenize immutable memory-mappable shards containing token ids, document boundaries, metadata ids, and checksums. Use shard sizes about 128MB-2GB. Each DP rank receives disjoint sample indices from global consumed tokens, rank, world size, and seed. Prefetch several batches into pinned memory and overlap GPU transfer. Use local NVMe or node-local caches for remote storage. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Sustain at least 1.2x required tokens/sec during training and 2x during stress tests. Log dataloader wait, batch assembly time, read throughput, cache hit rate, shard ids, sample hashes, mixture weights, padding, and bad-record counts. Replay the sampler for small world sizes and compare to production sample ids. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: All ranks reading identical batches, Python GIL bottlenecks, millions of tiny files causing metadata storms, validation competing with training storage, corrupted shards causing rank crashes, and data-order changes after resume. Diagnose with sample-id intersections and shard-boundary loss plots. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Raffel et al. 2020; Penedo et al. 2023; Soldaini et al. 2024; Narayanan et al. 2021; PyTorch/TensorFlow distributed input-pipeline practice..	5.5 Pro judge: 96 Opus 4.8 Judge: 92 Gemini 3.1 Pro judge: 93 Budget first: a 16k-GPU H100 cluster at 40% MFU on a 70B model consumes roughly 6–7M tokens/sec — about 25–30 GB/s of tokenized data (uint16/32) cluster-wide. That is trivial aggregate bandwidth; data loaders fail not on bandwidth but on tail latency, metadata pressure, shuffling structure, and resume semantics. Design principles: (1) Precompute everything offline. Tokenize, pack into fixed-length sequences, and write index files at preprocessing time. The training-time loader should do: seek → read → collate. No tokenization, no regexes, no JSON parsing on the critical path (these burn host CPU that pipelines and NUMA contention make scarce). Formats: Megatron .bin/.idx (memory-mapped token arrays + document index), MosaicML MDS, WebDataset tars, or ArrayRecord — anything with O(1) random access by sample index and large sequential reads underneath. (2) Deterministic global sample order, computed not streamed. Megatron's three-level index (document order shuffle, sample-to-document mapping, shuffled sample index) materializes the entire run's order as memory-mapped arrays, seeded once. Properties you get for free: exact O(1) resume (skip to consumed_samples), no inter-worker coordination at runtime, per-rank disjoint coverage guaranteed, and reproducibility for debugging (replay the exact batch that caused a loss spike — invaluable). Mixture sampling: per-sequence categorical draw over sources by weight, also precomputable. If you need mid-run reweighting, precompute per-source indices and draw source assignment online from a seeded RNG keyed by global step. (3) Storage layout against incast and MDS storms. Thousands of ranks opening thousands of small files melts Lustre metadata servers. Use few large shards (256MB–1GB), memory-mapped or read with O_DIRECT large sequential I/O; replicate hot index files to local NVMe at job start; or front object storage with a caching layer (per-node NVMe cache, or a distributed cache tier like AIStore/alluxio). Each node should read ~independent byte ranges — avoid all nodes reading shard 0 first (stagger by rank). (4) Host pipeline: per-rank, 2–8 dataloader workers, pinned-memory staging buffers, prefetch depth 2–4 batches, async H2D copies on a side CUDA stream overlapping the previous step's compute. Pin workers NUMA-local to their GPU's socket (numactl or torch affinity); cross-NUMA page traffic is the most common silent host bottleneck on 8-GPU nodes. CPU budget: leave cores for NCCL proxies and the async checkpoint writer. (5) Resilience semantics: the loader must (a) fast-forward deterministically after restart (the index makes this trivial; streaming loaders must persist consumed-offsets per shard), (b) tolerate a slow/failed storage replica without stalling the step (timeout + failover read), and (c) validate data integrity — per-shard checksums at load; a corrupted shard should fail loudly, not feed garbage (a classic source of unexplained loss spikes; PaLM's spike-rollback-and-skip procedure presumes you can identify and skip data windows). Diagnostics to run continuously: per-rank dataloader wait time (target: <1% of step time; alarm at 3%), prefetch-queue depth (persistent zero = starved), storage read latency p99 per node, host CPU utilization per worker, and realized-vs-target mixture composition. Load-test before the run: replay the loader at 2x training speed for an hour across the full cluster — it must sustain that with flat p99s, because eval jobs, checkpoint writes, and a degraded storage node will all eventually contend with it.
Q024	83	How should I tune pipeline stages, virtual stages, and microbatches to minimize bubbles and memory pressure?	5.5 Pro judge: 87 Opus 4.8 Judge: 85 Gemini 3.1 Pro judge: 65 Treat this as a controlled engineering decision, not a preference. Pipeline tuning is bubble, balance, and memory management. Use enough microbatches to fill the pipeline, but not so many that GEMMs become tiny or activation memory explodes. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: For p pipeline stages and m microbatches, bubble fraction is roughly (p-1)/(m+p-1) in simple schedules. Set m at least 4p and try 8p if memory allows; p=16, m=64 gives about 19% bubble and m=128 about 10%. Use 1F1B schedules and consider virtual pipeline stages/interleaving for load balance. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Balance by measured stage time, not layer count. Account for embeddings, final logits, MoE layers, and long-context attention. Sweep microbatch sequences per GPU {1,2,4,8} where memory permits. Track stage heatmaps, send/recv latency, bubble time, activation memory, and p99 step time. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: One slow stage throttling the job, microbatches too small for tensor-core efficiency, validation shapes breaking the schedule, sequence-length changes not retuned, and checkpoint metadata missing virtual-stage layout. Diagnose with per-stage traces and memory-fragmentation telemetry. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Huang et al. 2019 GPipe; Narayanan et al. 2019 PipeDream; Narayanan et al. 2021 Megatron-LM interleaving; Rajbhandari et al. 2020 ZeRO..	5.5 Pro judge: 96 Opus 4.8 Judge: 92 Gemini 3.1 Pro judge: 93 Pipeline efficiency is governed by one identity: bubble fraction = (p − 1) / (m + p − 1) for 1F1B with p stages and m microbatches per step, reduced by a further factor of v with interleaved scheduling (v virtual stages per GPU): bubble ≈ (p − 1) / (v x m + p − 1) approximately. Every tuning decision trades this bubble against memory and communication. Stage count p: choose the minimum p that, combined with TP, fits weights+optimizer with ~10% memory headroom. Each extra stage adds bubble and another point-to-point boundary; halve p before you shave microbatch memory elsewhere. Balance stages by compute time, not layer count: the first stage carries the embedding and the last carries the LM head + loss (often 1.5–3x a transformer layer at large vocab); standard fix is giving first/last stages fewer transformer layers (Llama 3 and Megatron both rebalance this way; some stacks split the head off as its own fractional stage). A pipeline is balanced when per-stage forward times measured in a profile agree within ~5%; the slowest stage sets the clock for everyone. Microbatch count m: target m ≥ 4p as a floor (bubble ≤ 20% un-interleaved), m ≥ 8p where batch size allows. m is constrained from above by global_batch = m x microbatch_size x DP — at fixed global batch, more microbatches per pipeline means smaller DP or smaller per-microbatch size. Smaller microbatch size reduces kernel efficiency (GEMMs shrink), so there is an interior optimum: sweep microbatch_size in {1, 2, 4} x m accordingly and take measured tokens/sec. With grad accumulation folded into the pipeline schedule, the optimizer step (plus DP all-reduce, plus any not-overlapped clipping) sits in the bubble shadow — verify in traces. Interleaving (virtual stages): v chunks per GPU divides the bubble by ~v at the cost of v x more in-flight activation sets per GPU on early stages, and v x more P2P messages (smaller, latency-sensitive). v=2–4 is the practical band; beyond that, communication latency and scheduling overhead eat the gains (Narayanan et al. 2021 report ~10% throughput improvement at typical configs). Use it when m is capped by batch size and the bubble is still >10%. Memory pressure specifics: 1F1B keeps at most p in-flight microbatches of activations on stage 1, declining linearly toward the last stage — early stages are your memory hot spot; interleaving multiplies this by v. Levers in order: selective activation checkpointing on early stages only; offload of pipeline-held activations to host (effective because their reuse latency is long); reducing microbatch size on early stages is not possible (uniform), so rebalance layers instead. Newer schedules — zero-bubble pipelining / ZB-H1 and the DualPipe variant DeepSeek-V3 used to overlap MoE all-to-all — split the backward into weight-grad and input-grad parts to fill bubbles; adopt them only if your framework supports them natively; the debugging cost of a custom schedule is substantial. Diagnostics: (1) compute a theoretical bubble from (p, m, v) and compare with measured (sum of stage idle time / step time from a timeline trace); a measured bubble exceeding theory by >5 points means imbalance (re-split layers) or exposed P2P (check NCCL stream overlap, enable batched isend/irecv); (2) watch for the last-stage straggle pattern caused by loss computation at 256k vocab — fix by sharding the loss (vocab-parallel cross-entropy); (3) OOM on rank 0 only is the early-stage activation pile — raise checkpointing there first; (4) if raising m stops helping, you have hit the DP all-reduce critical path — rebalance DP bucket sizes or overlap further.
Q025	84	How do I estimate and improve model FLOPs utilization during a large-scale pretraining run?	5.5 Pro judge: 86 Opus 4.8 Judge: 85 Gemini 3.1 Pro judge: 63 Treat this as a controlled engineering decision, not a preference. MFU is achieved model FLOPs per second divided by theoretical peak FLOPs for the precision used. Use it to diagnose efficiency, but optimize quality per wall-clock and dollar, not MFU alone. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: For dense decoder pretraining, achieved FLOPs/s is approximately 6Nbatch_tokens/step_time, plus attention quadratic and vocab costs when material. Divide by accelerator_count * peak_BF16_or_FP16_FLOPs. Well-tuned large dense training often targets 35%-55% MFU; above 50% at large scale is strong. Report MoE active-FLOP MFU separately from total parameters. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Profile tensor-core utilization, memory bandwidth, NCCL all-reduce/all-to-all, dataloader wait, CPU overhead, checkpoint stalls, padding fraction, and kernel choices. Improve with FlashAttention-2/3, fused SwiGLU/norm, aligned dimensions, CUDA graphs/compilation, larger microbatches, sequence packing, communication overlap, and block-sparse/grouped MoE kernels. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: High MFU with poor validation, small models launch-bound, long-context attention lowering MFU but necessary for capability, MoE FLOPs miscounted, and p99 stragglers hidden by averages. Diagnose MFU by sequence length, topology, and batch shape. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Narayanan et al. 2021 Megatron-LM; Dao et al. 2022/2023 FlashAttention; Gale et al. 2022 MegaBlocks; Rajbhandari et al. 2020 ZeRO..	5.5 Pro judge: 96 Opus 4.8 Judge: 92 Gemini 3.1 Pro judge: 92 Definition: MFU = achieved useful FLOPs/sec divided by the accelerator's peak (Chowdhery et al. 2022, PaLM appendix). Compute achieved FLOPs analytically, never from profiler counters: per token, a dense decoder transformer costs ≈ 6N FLOPs for forward+backward (N = non-embedding parameters), plus attention's 12 x L x h x s extra per token (with seq length s, L layers, hidden h) which matters at long context. So MFU = 6 x N x tokens_per_sec / (n_GPUs x peak_FLOPs). Use the dense BF16 peak without sparsity for honesty (989 TFLOPs on H100 SXM BF16). Distinguish MFU from HFU (hardware FLOPs utilization, which credits recomputation): if you use full activation checkpointing, HFU ≈ 1.33 x MFU; report MFU — recompute is overhead, not progress. Calibration anchors: well-tuned dense LM training achieves 38–45% MFU on A100 clusters at 2k seq (Megatron-LM 530B: ~30% HFU-era numbers; PaLM on TPUv4: 46% MFU; Llama 3 405B on 16k H100s: 38–41% MFU; MosaicML/torch-titan 7B-class single-node: 45–55%). MoE models run lower per active FLOP (DeepSeek-V3 era: ~20–30% typical) due to all-to-all and imbalance. If a dense run shows <30% on H100s, something specific is broken; >55% at multi-thousand-GPU scale should make you double-check your arithmetic (a common error: counting embedding params in N, or using the sparsity peak). Improvement is a budget-allocation exercise; profile first (one full step trace per pipeline stage with torch profiler/nsys), then attack the largest exposed-time slice: (1) Exposed communication. DP gradient all-reduce must hide under backward: increase bucket size (25–100MB), enable async reduce-scatter (ZeRO), check hierarchical algorithms for cross-rack. TP exposed time means TP spans too many or too slow links — reduce TP degree, enable SP. PP bubble: raise microbatch count/interleaving (see pipeline tuning). Symptom signature: per-kernel timeline shows NCCL kernels with no concurrent compute. (2) Suboptimal kernels. FlashAttention-2/3 mandatory (3 adds FP8 and warp-specialization on H100); fused residual+norm, fused AdamW (apex or torch fused), fused SwiGLU; torch.compile or Transformer Engine for the rest. GEMM efficiency: matrix dims multiples of 64/128 (pad vocab!); check achieved TFLOPs per GEMM in the profile against cuBLAS roofline — small microbatches give skinny GEMMs that cap at 60% of peak; raise microbatch if memory allows. (3) Recompute overhead. Move from full to selective activation checkpointing (checkpoint attention only): typically recovers 10–15% step time (Korthikanti et al. 2022). (4) Host-side stalls. Dataloader wait (prefetch, pinned memory, more workers), CPU launch overhead at small kernels (CUDA graphs help at small models), synchronous logging/eval blocking the stream (.item() calls every step), checkpoint stalls (async snapshot). (5) Stragglers and frequency. One slow GPU drags the collective: monitor per-rank step time max/median (alarm >3–5%); causes include thermal throttling, a downclocked HBM, single-rail NIC degradation. Also watch power capping on dense H100 racks. Track MFU continuously on the dashboard — its derivative is diagnostic: slow MFU decay over days suggests fragmentation or a degrading node; step drops localize to events (a checkpoint, an eval, a topology change). Treat every full percentage point seriously: on a 16k-GPU, 60-day run, 1% MFU is roughly 10 GPU-years.
Q026	85	How should I design distributed checkpointing so that saves and restores do not become cluster-wide bottlenecks?	5.5 Pro judge: 87 Opus 4.8 Judge: 84 Gemini 3.1 Pro judge: 67 Treat this as a controlled engineering decision, not a preference. Distributed checkpointing must scale with sharded state and restore time. A checkpoint is useful only if it can be validated, remapped, and restored before lost work dominates. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Write parameters, optimizer states, scheduler, RNG, data cursor, topology, tokenizer/config hashes, and metadata as sharded files. Never gather to rank 0. Use transactional temporary files, checksums, and atomic manifests. Stage to local NVMe or burst buffer, then asynchronously copy to durable storage. Save lightweight weight-only eval checkpoints separately from full training checkpoints. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Measure save and restore bandwidth weekly. Run automatic restore drills that train 10-100 steps. Support topology-aware remapping for compatible DP/TP/PP changes. Retain last 5-10 full checkpoints plus milestones. Bound checkpoint bandwidth so it does not stall data loading or collectives. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Metadata storms from tiny files, object-store throttling, all ranks waiting at save barriers, converters all-gathering full tensors, missing data-loader state, optimizer shards mismatched to parameters, and silent dtype conversion. Diagnose with per-shard checksums and restore rehearsal logs. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Rajbhandari et al. 2020 ZeRO; Narayanan et al. 2021 Megatron-LM; PyTorch FSDP/distributed checkpointing; DeepSpeed ZeRO checkpointing; Lepikhin et al. 2020 GShard..	5.5 Pro judge: 97 Opus 4.8 Judge: 92 Gemini 3.1 Pro judge: 91 Design constraint: at frontier scale a naive synchronous checkpoint is a cluster-wide stall measured in minutes (a 405B model with Adam state is ~5TB; pushing that through a parallel filesystem at 100 GB/s aggregate takes ~50 seconds in the theoretical best case, and far worse with metadata contention from thousands of writers). The design goals are: save stall < a few seconds, restore < a few minutes, no single-writer bottlenecks, and resharding on load. Architecture, four layers: (1) Fully sharded writes. Every DP/TP/PP rank (or one rank per replica group) writes only its own shards — torch.distributed.checkpoint (DCP), Megatron distributed checkpointing, or DeepSpeed's sharded format. Avoid rank-0 gather entirely. With ZeRO/FSDP, optimizer shards are already partitioned — write them in place. Watch filesystem metadata load: thousands of small files per checkpoint stresses Lustre MDS; aggregate to a few files per node or use object storage with key-per-shard. (2) Asynchronous, two-phase saves. Phase one: copy state to pinned host memory (GPU→CPU at ~20–50 GB/s per node, so a per-GPU share of a few tens of GB snapshots in seconds), barrier, resume training. Phase two: background threads/process streams the snapshot to durable storage while training proceeds. This is the Llama 3 approach (in-memory snapshots cutting effective overhead to seconds; Dubey et al. 2024) and is productized in torch async DCP, CheckFreq (Mohan et al. 2021), and Gemini (Wang et al. 2023, which adds peer-replication in RAM for fast recovery). Guard rails: budget host RAM (snapshot ≈ optimizer+weights bytes per node), throttle background writes so they do not contend with the dataloader's storage bandwidth, and fence phase two completion before deleting older checkpoints. (3) Hierarchical durability tiers. Tier A: RAM/local-NVMe snapshot every ~10–30 minutes (survives process crash; with peer replication, survives single-node loss — a la Gemini/MegaScale). Tier B: parallel FS or object store every 1–3 hours (survives any node loss). Tier C: cross-region/object archive daily plus permanent milestone checkpoints every 50–100B tokens (survives systemic disasters and serves science: annealing forks, eval sweeps, rollbacks). Interval math per tier via Young–Daly: t_opt = sqrt(2 x MTBF x overhead); with second-scale async overhead the optimal frequency becomes very high — in practice cap at the Tier A cadence for restart granularity, since replay time after failure is interval/2 in expectation. (4) Restore path engineering — usually the neglected half. Requirements: parallel sharded reads (every rank reads only its shards — restoring 5TB through rank 0 takes an hour you do not have); resharding support so a checkpoint saved at TP8/PP16/DP128 loads into a different layout after you lose nodes and reflow the job (DCP and Megatron dist-ckpt support this; pickle-based torch.save does not); deterministic dataloader fast-forward (precomputed sample index makes this O(1)); and a verified-manifest scheme — per-shard checksums written at save, validated at load, with the previous checkpoint retained until the new one passes verification. Silent shard corruption on parallel filesystems is a real, observed failure. Operational diagnostics: track save stall (p99 should be seconds), background-write duration (must be < save interval, else saves queue), restore time end-to-end (drill it monthly: kill the job, measure time-to-first-step), and replay loss (tokens between failure and last durable checkpoint). Alarm when background writes start overlapping consecutive intervals — the canonical precursor to checkpoint pileup and disk exhaustion.
Q029	86	How should I scale learning rate, batch size, and optimizer state when increasing a run from tens to thousands of GPUs?	5.5 Pro judge: 86 Opus 4.8 Judge: 84 Gemini 3.1 Pro judge: 64 Treat this as a controlled engineering decision, not a preference. When scaling GPU count, preserve the mathematical optimizer regime first. Increase hardware scale by reducing accumulation to keep global batch tokens constant before increasing batch. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: If 64 GPUs used microbatch 2 and accumulation 32 for 1M tokens/update, a 1,024-GPU run may use accumulation 2 for the same global batch. Keep LR, warmup tokens, Adam betas, and optimizer moments fixed. If global batch must increase, test 2x-8x with LR multipliers; linear scaling may work modestly, sqrt scaling or fixed LR is safer at large jumps. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Compare loss versus tokens, not steps. Validate sample-id uniqueness across ranks after world-size changes. Use distributed optimizer/ZeRO/FSDP for Adam states and overlap optimizer communication. Track gradient noise, clipping fraction, update/weight ratio, MFU, p99 step time, and validation by domain. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Secretly increasing batch 16x and undertraining by updates, overaggressive LR scaling causing delayed divergence, local clipping, sampler repeating shards, optimizer states loaded onto wrong shards, and communication overlap disappearing at full scale. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Goyal et al. 2017; McCandlish et al. 2018; Kaplan et al. 2020; Rajbhandari et al. 2020; Narayanan et al. 2021..	5.5 Pro judge: 96 Opus 4.8 Judge: 92 Gemini 3.1 Pro judge: 93 Scaling the cluster changes only one statistically meaningful quantity: the global batch size (assuming you scale DP). Everything else follows from how you handle that. Batch-size policy first: decide global batch in tokens from the critical-batch-size logic (McCandlish et al. 2018), not from the GPU count. The right mental model: the cluster size sets how fast you take steps; the batch size should be set by what the optimization can absorb. If 10x more GPUs tempts you to 10x the batch, check where you sit relative to B_crit (estimable from the gradient-noise scale: B_crit ≈ tr(Σ)/\|g\|², measurable online from per-microbatch gradient norms or DP-shard gradient disagreement). Below B_crit, scale batch freely; near it, extra batch buys nothing — prefer keeping batch fixed and taking faster wall-clock steps, or use the extra capacity for TP/PP/CP to train a bigger model or longer context. Published practice: batch grows sublinearly with model/cluster scale (GPT-3 175B: 3.2M tokens; Llama 3 405B on 16k GPUs: 16M tokens max, with staged ramps 4M→8M→16M as loss fell — exploiting the fact that B_crit grows as loss decreases). LR adjustment with batch: for Adam-family optimizers use square-root scaling — LR_new = LR_old x sqrt(B_new/B_old) — not the linear rule (linear scaling is an SGD result; for Adam, sqrt matches both theory on noise-dominated regimes and practice; Malladi et al. 2022; Granziol et al. 2022). Cap the scaled LR at the stability ceiling you established at small scale: the divergence threshold does not rise with batch, so at large batch you often end below sqrt-scaling. After any batch change, re-warm: ramp LR from ~50% to target over 100–500M tokens. Keep warmup and schedule defined in tokens so they are invariant to the resharding. Optimizer state mechanics: Adam m and v are per-parameter and travel with the parameters — resharding from DP=64 to DP=1024 with ZeRO means re-partitioning state files, which your distributed checkpoint format must support (torch DCP, Megatron dist-ckpt). Never reinitialize state on rescale: fresh second moments transiently amplify updates and produce a loss bump costing billions of tokens. If batch changed, state remains valid (m, v are per-token-scale estimates; the sqrt-LR rule absorbs the noise change). Gradient clipping threshold stays at 1.0 — the global norm is batch-averaged and roughly scale-invariant; but recheck the clip-trigger rate after rescale, since larger batches reduce gradient noise and the norm distribution tightens. Numerical and systems side effects to verify at the new scale: (a) loss-curve overlay test — run 1–2B tokens and overlay against the pre-scale trajectory; with unchanged batch it should match within noise; with changed batch expect a brief transient then parallel slope; (b) all-reduce determinism changes (different reduction orders change bitwise results — acceptable, but rules out bitwise regression testing); (c) DP collective time grows ~log with DP size — confirm overlap still hides it; (d) dataloader: the sample-index/sharding must repartition without repeating or skipping data — checksum the consumed-document set across the transition; (e) failure rate scales linearly with node count — recompute your checkpoint interval via Young–Daly with the new MTBF. The disciplined sequence: freeze a checkpoint → reshard → verify loss overlay at old batch → then change batch with sqrt-LR and re-warm → verify slope and eval parity over ≥10B tokens before declaring the migration done. One variable at a time, evals on both sides.
Q023	87	How do I select an efficient parallelism topology that matches the bandwidth hierarchy within and across nodes?	5.5 Pro judge: 85 Opus 4.8 Judge: 84 Gemini 3.1 Pro judge: 66 Treat this as a controlled engineering decision, not a preference. Place communication-heavy dimensions on the fastest links. Tensor and sequence parallelism need low-latency high-bandwidth links; pipeline and data parallelism tolerate slower links better when overlapped; MoE expert all-to-all often dictates topology. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: On 8-GPU NVSwitch nodes, start TP=8 intra-node, PP across 4-16 nodes, and DP across replicas. For MoE, try expert parallelism inside node or rack and hierarchical all-to-all for many experts. Ensure heads divide TP, layers divide PP with embeddings accounted for, and expert counts divide expert-parallel ranks or use explicit uneven placement. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Estimate tensor all-reduce volume, data-gradient volume, pipeline activation sends, context-parallel KV traffic, MoE token dispatch volume, and checkpoint traffic. Measure p50/p95/p99 step time, NCCL time, overlap, GPU idle, network drops/retransmits, and MFU. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: TP groups spanning racks, all-reduce scheduled with no backward overlap, checkpoint I/O congesting MoE all-to-all links, pipeline cuts overloading long-context layers, and topology changes invalidating restore. Diagnose by collective traces and rank-local timelines. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Shoeybi et al. 2019; Narayanan et al. 2021; Rajbhandari et al. 2020; Lepikhin et al. 2020 GShard; Fedus et al. 2021; Gale et al. 2022 MegaBlocks..	5.5 Pro judge: 96 Opus 4.8 Judge: 92 Gemini 3.1 Pro judge: 91 Match each parallelism dimension's traffic pattern to the physical bandwidth tier it can saturate without stalling compute. Modern clusters have a steep hierarchy: NVLink/NVSwitch within a node (900 GB/s/GPU on H100, bidirectional aggregate), then 8x400Gb/s InfiniBand NDR per node (~100 GB/s/GPU), then rail-optimized fat-tree or Dragonfly across racks where effective bisection bandwidth per GPU can drop further on oversubscribed tiers. Traffic taxonomy per dimension: TP exchanges activations twice per layer (all-reduce or AG/RS pairs) — per-step volume scales with batch x seq x hidden x layers; it is latency- and bandwidth-critical and cannot be overlapped much because compute depends on it serially → must live on NVLink. Megatron measurements show TP=8 across nodes can halve throughput versus intra-node. PP sends only stage-boundary activations point-to-point (microbatch x seq x hidden per hop) — small, overlappable with compute via 1F1B → ideal for the inter-node tier; map consecutive stages to topologically adjacent nodes to keep hops short. DP/ZeRO moves gradients/parameters once per step (size = model bytes / DP-shard layout), fully overlappable with backward if bucketed → tolerant of the slowest tier; hierarchical all-reduce (intra-node reduce, inter-node all-reduce on one rail per GPU index) exploits rail-optimized fabrics in which GPU i of every node shares a dedicated switch plane. CP exchanges KV blocks per attention layer — intermediate intensity → place inside the node or across few adjacent nodes. EP (MoE) all-to-all is the nastiest pattern (incast, bursty) → constrain routing so each token crosses to at most M nodes (DeepSeek-V3 capped at 4 nodes) and keep EP groups within a rail/rack domain. Concrete assignment recipe for an HGX/IB cluster: order dimensions in the process grid so that TP varies fastest (contiguous ranks within a node), then CP, then EP, then PP, then DP slowest. In Megatron terms, the default order TP→CP→EP→PP→DP exists precisely to realize this mapping; verify with the actual rank-to-host mapping, because a scheduler that scatters ranks destroys the design silently. On a 1,024-GPU H100 cluster for a 70B dense model: TP=8 (node), PP=8 (rack-adjacent nodes), DP=16 (cross-rack, hierarchical all-reduce, ZeRO-1). For 405B on 16k GPUs: TP=8, PP=16, DP=128, CP joined for long-context (the Llama 3 layout; Dubey et al. 2024). Quantitative check before committing — a back-of-envelope per tier: required TP bandwidth ≈ 2 x 2 bytes x batch_tokens_per_TPgroup x hidden x 2 ops x layers / step_time; compare against NVLink's ~450 GB/s usable per direction. Required DP bandwidth = 2 x model_bytes / (backward_time) for full overlap. If a tier's requirement exceeds ~60% of its measured (not nameplate) bandwidth — measure with nccl-tests all_reduce_perf at the relevant message sizes — you will see exposed communication; redesign. Topology-aware refinements: enable NCCL_ALGO=Tree for large cross-rack all-reduces and SHARP in-network reduction where available (offloads reductions to switches, ~2x on large messages); pin PP stage pairs to the same leaf switch; on Dragonfly systems prefer DP placement aligned to groups. Failure diagnostics: per-rank NCCL timing histograms — bimodal all-reduce durations indicate a slow link or misrouted rail; rising step time with cluster size at fixed per-GPU work indicates an unscalable collective (switch from flat ring to hierarchical); and always validate after any scheduler/maintenance event, because rank placement is the most commonly silently-broken assumption in long runs.
Q021	88	How do I combine data, tensor, pipeline, context, and sequence parallelism across a large GPU cluster?	5.5 Pro judge: 86 Opus 4.8 Judge: 85 Gemini 3.1 Pro judge: 63 Treat this as a controlled engineering decision, not a preference. Compose parallelism according to the hardware hierarchy. Use data parallelism after memory fits, tensor parallelism inside fast NVLink/NVSwitch islands, pipeline parallelism across layers/nodes, sequence or context parallelism for long sequences, and ZeRO/FSDP for optimizer/parameter sharding. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: A dense 70B-style setup might use TP=8 within each 8-GPU node, PP=8 across nodes, and DP across pipeline replicas, with sequence parallel enabled. Keep head counts divisible by TP, balance pipeline stages by time not layer count, and add expert parallelism for MoE all-to-all within high-bandwidth groups. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Build a cost model for compute, tensor all-reduce, data all-reduce, pipeline activation sends, context-parallel attention communication, MoE all-to-all, and checkpoint I/O. Sweep 1-4 candidate topologies for 500-2,000 steps and record MFU, p50/p99 step time, memory headroom, communication overlap, and failure blast radius. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: TP across slow links, imbalanced pipeline stages, DP collectives spanning congested racks, context parallelism dominated by latency, MoE expert groups crossing weak bisection, and checkpoint formats tied to obsolete partitions. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Shoeybi et al. 2019 Megatron-LM; Narayanan et al. 2021; Huang et al. 2019 GPipe; Rajbhandari et al. 2020 ZeRO; Korthikanti et al. 2022; Dao et al. 2022..	5.5 Pro judge: 96 Opus 4.8 Judge: 92 Gemini 3.1 Pro judge: 90 Compose parallelism dimensions according to their communication intensity, mapping the chattiest dimensions to the fastest interconnect. The canonical decomposition (Megatron-LM lineage: Shoeybi et al. 2019; Narayanan et al. 2021; Korthikanti et al. 2022): Tensor parallelism (TP): splits individual matmuls (column/row-parallel pairs) and requires two all-reduces per layer per forward+backward — the most communication-intensive dimension. Rule: TP degree ≤ GPUs per NVLink domain (8 on standard HGX nodes; up to 72 on GB200 NVL72). Never run TP over InfiniBand unless the model literally cannot fit otherwise. Sequence parallelism (SP, in the Megatron sense): partners with TP — shards the activations of LayerNorm/dropout regions along the sequence dimension, converting TP's all-reduces into all-gather + reduce-scatter at identical bandwidth cost but removing redundant activation memory. Always enable SP when TP > 1; it is free memory. Pipeline parallelism (PP): splits layers into stages; communication is only point-to-point activation transfers at stage boundaries — cheap, bandwidth-light, tolerant of inter-node links. Use PP across nodes. Costs: pipeline bubble of (p−1)/m fraction (p stages, m microbatches), so you need many microbatches; use 1F1B scheduling and interleaved virtual stages (v chunks per GPU divides the bubble by v) to push efficiency up. Data parallelism (DP): replicates the model; one gradient all-reduce (or reduce-scatter/all-gather with ZeRO) per step, overlappable with backward. Scales widest; its limit is global batch size — DP x microbatch x seq_len must stay under your token-batch target. ZeRO-1 (shard optimizer state) is nearly free and should be default; ZeRO-3/FSDP replaces TP+PP entirely at ≤13B scale and simple topologies (the Llama/FSDP path), but pure FSDP struggles past ~30B across many nodes versus 4D Megatron-style at equal hardware. Context parallelism (CP): shards the sequence dimension of attention itself (ring attention/all-gather-KV variants); needed when seq_len ≥ ~32k makes activations per GPU intolerable. Llama 3 used CP=16 for 128k-token training. Communication is KV exchange between CP ranks; place CP within or adjacent to the NVLink domain when possible; combine with load-balanced causal sharding (zigzag) so ranks get equal attention work. Expert parallelism (EP, if MoE): all-to-all token routing per MoE layer; bandwidth-heavy and bursty; keep EP within a node or rack-scale domain (DeepSeek-V3: EP across 8 nodes with custom kernels and node-limited routing). Composition order for a concrete large dense run (e.g., 70B–400B on 4k–16k H100s): TP=8 (intra-node) x PP=8–16 (inter-node) x DP=remainder with ZeRO-1, SP on, CP only in the long-context stage. Llama 3 405B used exactly TP=8, CP=16 (long-context only), PP=16, DP to fill 16k GPUs, in that bandwidth order (Dubey et al. 2024). Sizing procedure: (1) memory: weights+optimizer per GPU = 16N/(TP x PP x DP_shard_factor) bytes — choose minimal TP x PP that fits with headroom; (2) set m (microbatches) ≥ 4 x p for bubble <20%, ideally m ≥ 8p; (3) verify DP all-reduce overlaps with backward (profile: comm streams should hide under compute); (4) sweep ±1 settings of TP/PP around the analytic optimum for 30-minute trials — measured tokens/sec beats any model. Diagnostics: MFU under 35% on dense H100 → suspect bubble (raise m or v), exposed DP comm (enable overlap, increase bucket size), or TP over slow links. Per-rank timelines (one profile per pipeline stage) localize which.
Q022	89	How should I allocate a fixed compute budget between model parameters and training tokens?	5.5 Pro judge: 85 Opus 4.8 Judge: 85 Gemini 3.1 Pro judge: 64 Treat this as a controlled engineering decision, not a preference. Allocate compute with scaling laws, then adjust for data quality, inference economics, and failed-run risk. Dense decoder training FLOPs are approximately 6ND, where N is non-embedding parameters and D is tokens. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Use the Chinchilla-style prior of roughly 20 training tokens per parameter for compute-optimal dense training: 7B wants about 140B tokens, 70B about 1.4T. Reserve 5%-15% for ablations, failures, evaluation, checkpoint overhead, and post-training. If inference cost dominates, prefer a smaller model trained longer; if one-time capability dominates, a larger undertrained model may be rational. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Fit your own L = L_inf + aN^-alpha + bD^-beta on 100M-10B proxies with multiple token budgets and losses by domain. Optimize under measured MFU, not vendor peak. For MoE, model active FLOPs, total parameters, router overhead, all-to-all, and expert undertraining separately. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Launching one final run before mixture validation, using public benchmarks as scaling-law targets, ignoring data repetition, assuming dense laws apply to MoE, and forgetting post-training compute. Diagnose with proxy-to-final calibration and confidence intervals. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Kaplan et al. 2020; Henighan et al. 2020; Hoffmann et al. 2022 Chinchilla; Fedus et al. 2021 Switch Transformer; Dai et al. 2024 DeepSeekMoE; Guo et al. 2025 DeepSeek-R1..	5.5 Pro judge: 97 Opus 4.8 Judge: 92 Gemini 3.1 Pro judge: 89 Start from Chinchilla (Hoffmann et al. 2022): for a fixed training-compute budget C ≈ 6ND, the loss-optimal allocation is roughly N* ∝ C^0.5, D* ∝ C^0.5, landing at approximately 20 tokens per parameter (their three estimation approaches gave a_N 0.46–0.55; the famous 70B/1.4T pairing). Epoch AI's replication (Besiroglu et al. 2024) corrected Approach 3's fit and confirmed ~20 t/p as the compute-optimal neighborhood, with wide flat basins: ±2x off-optimal in N at fixed C costs only a few percent in effective compute. So as a pure training-loss matter: parameters = sqrt(C/120) with D = 20N, and precision beyond a factor of 2 buys little. But the loss-optimal point is not the decision point, for three reasons you must price in: (1) Inference amortization. Tokens served over the model's lifetime shift the optimum hard toward smaller, longer-trained models (Sardana & Frankle 2023; De Vries 2023). The dominant industrial regime is now massive overtraining: Llama 3 8B at ~1,875 t/p, Llama 3 70B at ~215 t/p, Qwen/Mistral similar. The loss penalty for overtraining is well-behaved and predictable from the scaling fits; a 3x-overtrained model typically gives up the equivalent of <10–15% compute versus optimal while halving or better the serving cost per token. Decision rule: estimate expected lifetime inference tokens I; as I/D grows past ~1, slide along the iso-loss frontier toward smaller N until marginal training-compute increase equals marginal inference savings. (2) Data availability and quality. The 20 t/p ratio assumes unlimited fresh data of constant quality. If your deduplicated high-quality corpus caps D, repeated epochs follow data-constrained scaling (Muennighoff et al. 2023): up to ~4 epochs behaves near-fresh, with returns decaying such that the effective unique-data multiplier saturates around 4x; fold this into the optimization by replacing D with D_effective. If data binds, grow N above the naive optimum — parameters substitute for data at a quantified exchange rate in their fitted law. (3) Your own constants. Chinchilla's coefficients were fit on their data/architecture/tokenizer; mixture quality shifts both the constants and (mildly) the exponents (DCLM and FineWeb both demonstrate large vertical shifts in the loss-compute frontier from data alone). Refit locally: train an isoFLOP grid — 4–5 compute levels spanning ~2 orders of magnitude (e.g., 3e18 to 3e20 FLOPs), 4–6 model sizes per level, identical mixture/tokenizer/schedule (cosine matched to each run length, per the Chinchilla critique of Kaplan's fixed-schedule artifact, which is what produced the old 1.7 t/p answer) — then fit L(N,D) = E + A/N^alpha + B/D^beta and minimize subject to 6ND = C. Total cost of the grid: a few percent of the main run; do not skip it for a frontier commitment. Practical adjustments: count embedding parameters consistently with your fit; for MoE, use active parameters in the 6ND accounting but fit MoE-specific laws if possible (expert count shifts optima); reserve 5–10% of D for the annealing phase's high-quality upweighting. Sanity diagnostics: your final loss should land within ~0.01–0.02 nats of the fitted prediction — larger misses mean a constants problem (data drift, schedule mismatch) worth investigating before blaming the law.
Q027	89	How do I diagnose stragglers, communication stalls, and intermittent hardware failures in multi-thousand-GPU training?	5.5 Pro judge: 86 Opus 4.8 Judge: 85 Gemini 3.1 Pro judge: 64 Treat this as a controlled engineering decision, not a preference. At thousands of GPUs, a training step is a distributed systems measurement. Instrument per-rank timelines and correlate anomalies with data, ranks, hardware, network, and schedule phase. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Record forward, backward, optimizer, data wait, collective wait, pipeline send/recv, MoE all-to-all, checkpoint, and validation durations per rank. Collect NVML/DCGM telemetry: ECC/XID, temperature, clocks, power throttling, PCIe replay, HBM bandwidth, and utilization. Enable sampled NCCL traces with collective type, size, group, algorithm, retries, and latency. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Use p50/p95/p99 step time and slowest-rank attribution. Run maintenance microbenchmarks for matmul, all-reduce, all-to-all, storage read, and synthetic end-to-end training. Maintain a bad-node database, heartbeat counters, and canary jobs. Quarantine nodes with repeated corrected errors or anomalous p99 behavior. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: One bad GPU/NIC/SSD stalling all ranks, MoE router skew amplifying all-to-all, storage page cache interfering after checkpoint, scheduler placement crossing weak links, and data shards causing rank-specific crashes. Diagnose by asking whether the anomaly follows source id, physical node, or topology. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Narayanan et al. 2021; Rajbhandari et al. 2020; Lepikhin et al. 2020; Fedus et al. 2021; Gale et al. 2022..	5.5 Pro judge: 96 Opus 4.8 Judge: 93 Gemini 3.1 Pro judge: 90 At multi-thousand-GPU scale, assume something is always broken; the question is whether your telemetry can localize it faster than it costs you. Llama 3's published operations data set expectations: 466 job interruptions over 54 days on 16k H100s — 78% from confirmed/suspected hardware, with GPUs (and HBM) the leading cause (Dubey et al. 2024); MegaScale (Jiang et al. 2024) and ByteDance/Alibaba reports corroborate. Instrument three layers continuously: (1) Per-rank step telemetry. Log per-rank: step time, time-in-NCCL per collective, dataloader wait, memory high-water, SM clock/temperature, and power draw, at every step (cheap) with rollups. The single most useful chart is the per-rank step-time heatmap (ranks x time). Signatures: one rank persistently slow → host or GPU issue on that node (thermal throttle, downclocked HBM after ECC remap, a busy neighbor process, single NIC rail degraded); one pipeline stage slow → imbalance, not hardware; all ranks slow in lockstep → shared resource (storage burst, network congestion, power cap); random ranks slow at random times → host-side noise (kernel daemons, page cache pressure) or fabric congestion. (2) Collective-level diagnosis. Because training is bulk-synchronous, one straggler makes everyone's all-reduce slow, and naive per-rank NCCL timing shows everybody waiting — it tells you something is slow but not who. Tools and tricks: PyTorch flight recorder (TORCH_NCCL_TRACE_BUFFER_SIZE) records recent collectives per rank with enqueue/start/end times — the rank whose kernel starts late is the culprit, the ones that start on time and finish late are victims; barrier-bisection (insert timed barriers at section boundaries to localize the slow region); nccl-tests sweeps on suspect node subsets out-of-band; MegaScale-style per-step CUDA events on compute vs comm streams. For hangs (the worst class): enable NCCL async error handling + a watchdog (TORCH_NCCL_HEARTBEAT) so hangs become crashes with stack dumps rather than silent 30-minute stalls; on timeout, dump py-spy/gdb stacks of all ranks — the rank not in the collective is your bug. (3) Hardware health stream. Continuously scrape: DCGM metrics (XID errors — XID 63/64 row-remap events, XID 48 double-bit ECC, XID 79 fell-off-bus), HBM corrected-ECC rate (a rising corrected rate predicts uncorrectable failures — drain proactively), NVLink/NVSwitch error counters and replay counts, IB port counters (symbol errors, link downs, congestion/ECN marks), host dmesg (PCIe AER, thermal), and storage latency percentiles. Map every alert to rank/job topology automatically. Process discipline that the published record converges on: (a) pre-flight burn-in — before the job starts and after any node swap, run an automated suite (GEMM stress at full power, NCCL all-reduce at multiple message sizes pairwise/ring, HBM bandwidth test, dataloader read test) and reject outliers >3% off fleet median; a large fraction of stragglers are catchable here. (b) Hot-spare pool sized ~1–2% of fleet, automated drain-replace-resume so an interruption costs minutes. (c) Silent-data-corruption defense: SDC (computation errors without any error signal) occurs at small but nonzero rates at fleet scale (documented by Meta and Google fleets; Llama 3 attributed at least one loss-spike class to it) — periodically run checksum/redundant-compute canaries on shadow batches, and treat unexplained non-reproducible loss spikes as possible SDC: roll back, replay the same data; if the spike does not reproduce, suspect the node, not the data. (d) Postmortem every interruption into a labeled incident log — the distribution tells you where to invest next.
Q036	91	How should I extend context length after pretraining while preserving short-context quality and training stability?	5.5 Pro judge: 87 Opus 4.8 Judge: 85 Gemini 3.1 Pro judge: 65 Treat this as a controlled engineering decision, not a preference. Context extension is a phase change in model behavior and systems cost. Decide whether you need retrieval, long reasoning, or simply a larger serving window, then adapt positional encoding and data curriculum deliberately. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: For RoPE models, test position interpolation, NTK-aware scaling, YaRN, LongRoPE-style scaling, or retraining with larger RoPE base. Move from 4k to 32k/128k through intermediate lengths such as 8k and 16k. Continue training for roughly 1%-5% of original tokens for moderate extension, more for extreme context, with 50%-80% short/medium sequences retained. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Retune microbatch, accumulation, LR, activation checkpointing, FlashAttention, sequence/context parallelism, and GQA. Evaluate perplexity by position, passkey retrieval, long-document QA, multi-hop distant evidence, repository code tasks, summarization with distractors, long conversations, and short-context regression. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Naively increasing max position, models retrieving strings but not reasoning, degradation in the middle of context, repetition beyond old context length, local syntax loss from RoPE scaling, and synthetic needle overfitting. Diagnose attention entropy by distance and answer accuracy by evidence position. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Su et al. 2021 RoPE; Press et al. 2021 ALiBi; Dao et al. 2022 FlashAttention; Chen et al. 2023 position interpolation; Peng et al. 2023 YaRN; Ding et al. 2024 LongRoPE; Liu et al. 2023 Lost in the Middle..	5.5 Pro judge: 97 Opus 4.8 Judge: 93 Gemini 3.1 Pro judge: 93 Extend context in explicit stages after the bulk pretrain, using RoPE rescaling plus a long-document continued-pretraining mixture, and gate each stage on both long- and short-context evals. The standard recipe (the Llama 3 / Qwen / Yi lineage): pretrain at 4–8k; then one or more continued-pretraining stages at increasing lengths (e.g., 8k → 32k → 128k, Llama 3 used six stages to 128k; Dubey et al. 2024), each stage training 10–100B tokens (Llama 3 spent ~800B total on long-context stages for 405B; at 7–8B scale, 20–50B tokens to reach 128k is typical — Fu et al. 2024 "Data Engineering for Scaling Language Models to 128K Context" show even 1–5B carefully-mixed tokens suffice for usable 128k). RoPE adjustment at each boundary — pick one: (a) raise theta (base frequency): the simplest and what Llama 3 did (500k base raised to 8M for 128k); rule of thumb: theta should scale at least linearly with the length multiple, more by NTK reasoning; (b) NTK-aware interpolation / YaRN (Peng et al. 2023): rescales frequencies non-uniformly (interpolate low frequencies, preserve high) with attention-temperature correction (YaRN's sqrt(1/t) logit scaling); best published zero-shot-plus-light-finetune quality; (c) plain position interpolation (Chen et al. 2023): works but degrades high-frequency resolution — superseded. Implement rotation in FP32 — BF16 angle error at position 100k+ is material. If you anticipated extension at pretrain time (theta 500k from the start), each stage needs only modest rescaling. Data engineering is where extensions actually fail. Requirements: (1) genuinely long documents upweighted — books, code repositories serialized whole, long papers, multi-document threads; naive length-upsampling skews domain composition (long docs are mostly books/code), so balance per-domain (Fu et al. 2024's key finding); (2) keep 50–80% of the stage mixture at short/original lengths — this is the main defense for short-context quality, which otherwise regresses measurably; (3) for 64k+ where natural documents thin out, add structured concatenations (related-document packing, repo-level code ordering) and synthetic long-range tasks (fill-in-the-middle across distance, long-range retrieval QA) sparingly; (4) use document-boundary attention masking when packing unrelated short docs so they do not pollute long-range attention statistics. Training-stability specifics at long length: keep global batch in tokens constant (fewer sequences per batch); switch on context parallelism (ring/zigzag attention) when activation memory forces it (Llama 3: CP=16 at 128k); LR low — continue at the final pretraining LR or slightly below (e.g., 10–30% of peak) with a short re-warm; expect attention-entropy shifts — monitor max attention logits, and if spikes appear, YaRN's temperature term or QK-norm addresses them; verify loss-vs-position curves on held-out long docs (loss should decrease monotonically with position; a rising tail localizes positional breakage). Gating evals per stage: needle-in-a-haystack grids (position x depth — demand near-100% before advancing; it is necessary-not-sufficient), RULER (Hsieh et al. 2024) for graded difficulty beyond needles, long-document QA/summarization (e.g., InfiniteBench subsets), perplexity-vs-position curves, and the full short-context battery with a hard no-regression gate (>0.5–1 point drops on MMLU/HellaSwag-class evals block the stage). The two canonical failure modes: passing needles while failing aggregation tasks (you trained retrieval, not usage — add RULER-style multi-hop data), and short-context regression discovered after the fact (insufficient short data in the mix — re-anneal with a corrected mixture).
Q038	92	How should I coordinate data refreshes, tokenizer revisions, architecture changes, and optimizer state across successive frontier runs?	5.5 Pro judge: 84 Opus 4.8 Judge: 85 Gemini 3.1 Pro judge: 64 Treat this as a controlled engineering decision, not a preference. Treat data refreshes, tokenizer revisions, architecture changes, and optimizer-state migrations as separate interventions with compatibility contracts. Never bundle them without ablations unless the old system is unusable. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Maintain a manifest pinning data snapshot, filters, dedup thresholds, tokenizer ids/reserved tokens, architecture, initialization, optimizer, LR schedule, topology, and eval suite. Data refreshes should be additive: compare language, domain, source, quality, length, toxicity, benchmark overlap, fertility, and time. Tokenizer changes usually require a new model or controlled embedding migration. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Run proxy models on old, new, and mixed data. For tokenizers, initialize overlapping token embeddings where possible, reset optimizer states for changed rows, and compare byte/character-normalized losses rather than token losses. For architecture changes, reset states for new parameters and use lower LR/ramp. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Untraceable refreshes, tokenizer drift breaking tool protocols, comparing losses across tokenizers, inheriting optimizer moments after objective changes, and checkpoint converters that cannot reverse. Diagnose with migration tests and lessons-learned databases across runs. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Kudo and Richardson 2018 SentencePiece; Raffel et al. 2020; Hoffmann et al. 2022; Touvron et al. 2023; Chen et al. 2023; Narayanan et al. 2021..	5.5 Pro judge: 96 Opus 4.8 Judge: 92 Gemini 3.1 Pro judge: 90 Successive frontier runs are a versioning problem; treat every ingredient as an artifact with semver, provenance, and migration rules, and never change more than necessary per generation. Data refreshes: maintain the corpus as versioned immutable snapshots (dataset vX.Y = source manifests + filtering-pipeline commit + dedup parameters + hashes). Between runs: add new crawl dumps and synthetic data, re-run the full pipeline including cross-snapshot dedup and re-decontamination against the updated eval inventory. Two rules: (1) quantify before adopting — run the standard 1B-proxy A/B (new vs old snapshot, 20–50B tokens, fixed eval battery) for every refresh; data regressions from upstream changes (a broken extractor, a source gone paywalled, synthetic data drift) are common and silent; (2) keep an unchanged "bridge subset" across generations so you can attribute model deltas to data vs architecture (train one proxy on old-data/new-arch and new-data/old-arch when generations change both). Tokenizer revisions: the most coupling-heavy change — it invalidates loss comparability, embedding weights, data caches, and any token-count-denominated config. Policy: revise rarely (once per major generation at most), batch all tokenizer wishes (vocab growth, digit handling, special-token reservations, new-language coverage) into one revision, and re-tokenize the full corpus rather than mixing tokenizations. When comparing across the boundary, use bits-per-byte, never per-token loss. If you must extend vocab on an existing model (adding domain/special tokens), initialize new embeddings to the mean of their BPE-decomposition embeddings and fine-tune briefly — but for a fresh pretrain, prefer the clean break. Reserve generous special-token space this time to lower the pressure next time. Architecture changes: gate through the scaling-ladder protocol — any change (attention variant, MoE-ification, norm placement, activation) must show neutral-or-better fitted scaling behavior across ≥3 proxy sizes on the new data snapshot, plus stability-envelope parity (max-LR tolerance, spike rate at the largest proxy). Maintain a frozen "reference architecture" config per generation; all ablations diff against it. Resist accumulating untested micro-changes between generations — the published pattern (Llama 2→3, Qwen generations) is notable architectural conservatism with effort reallocated to data; that pattern reflects measured returns. Optimizer state and weight reuse across runs: default is fresh training per generation — optimizer state does not transfer across architecture/tokenizer changes, and warm-starting from an old model biases scaling behavior unpredictably. Exceptions with established recipes: continued pretraining on refreshed data (same arch/tokenizer — re-warm LR to 10–50% of original peak, blend 5–30% old-distribution replay data to control forgetting; Ibrahim et al. 2024 "Simple and Scalable Strategies to Continually Pre-train"); depth/width growth (HyperCloning, G_stack-style function-preserving expansion) — viable at small-to-mid scale, sparse evidence at frontier scale; distillation from the previous flagship into the new pretrain mix (logit or synthetic-data distillation — the Gemma/Llama-3.2 pattern) which transfers capability without transferring weights. Cross-run process scaffolding: a generation manifest binding (data vX, tokenizer vY, arch vZ, harness vW) per run; a permanent model zoo of milestone checkpoints from every generation re-evaluated on every new harness version (so eval drift is measurable); carry-forward of the previous run's incident log and spike postmortems into the new run's monitoring thresholds; and a standing rule that pretraining-loss comparisons across generations are only made via matched-proxy replications, never via headline numbers — tokenizer, data, and context-length differences make raw cross-generation loss deltas meaningless.
Q039	94	How do I train a model to use retrieval, external tools, and structured outputs reliably during post-training?	5.5 Pro judge: 86 Opus 4.8 Judge: 85 Gemini 3.1 Pro judge: 66 Treat this as a controlled engineering decision, not a preference. Train retrieval, tool use, and structured outputs as stateful actions. The model must learn when to act, what arguments to emit, how to incorporate observations, and when to abstain. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: For retrieval, train with queries, passages, provenance, conflicting evidence, no-answer cases, and contrastive negatives. For tools, define typed JSON schemas and include successful calls, invalid-call corrections, tool errors, retries, and no-tool cases. For structured outputs, use JSON Schema, SQL parsers, type checkers, grammars, or constrained decoding for critical interfaces. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Measure tool-call precision/recall, argument accuracy, unnecessary-call rate, recovery from errors, retrieval support, citation precision, schema validity, semantic correctness, cost, and end-to-end task success. Include stale retrieval, unavailable APIs, contradictory documents, rate limits, prompt injection, Unicode, and nested quotes. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Spurious retrieval, copying irrelevant passages, hallucinated tool observations, SQL/tool injection, JSON that parses but is semantically wrong, overcalling tools because reward ignores latency, and policies blocking necessary tools. Diagnose with environment simulations and state-transition validators. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Lewis et al. 2020 RAG; Nakano et al. 2021 WebGPT; Schick et al. 2023 Toolformer; Yao et al. 2022 ReAct; Rafailov et al. 2023 DPO; Ouyang et al. 2022..	5.5 Pro judge: 96 Opus 4.8 Judge: 93 Gemini 3.1 Pro judge: 88 Reliable tool use is trained in three layers: format mastery (SFT), decision quality (preference/RL), and robustness (adversarial and failure-injected training). Most production failures trace to skipping the third. Schema and format foundation: fix the interface before generating any data — reserved special tokens for tool-call boundaries, a strict serialization (JSON with a published schema, or typed function-call format), and a parser that is the single source of truth in both training-data validation and serving. Constrained decoding (grammar/JSON-schema-guided) at inference is your safety net, but do not rely on it during training evaluation — measure raw schema-validity rate (target >99% post-SFT; if you need constrained decoding to hit that, the model has not learned the format). SFT stage: 100k–1M+ trajectories spanning: single calls, parallel calls, multi-turn tool chains, retrieval-augmented answers with citation, and — critically — negative-space examples: prompts where the right action is no tool call (the dominant failure mode after naive tool-SFT is gratuitous tool calls; include 20–40% no-tool examples), prompts where the model must ask a clarifying question, and tools that should be refused. Source trajectories from: strong-model generation validated by execution (run every call against real/simulated tools; discard non-executing trajectories), tau-bench/AgentInstruct-style simulated environments, and real anonymized usage if available. Loss-mask tool outputs (the environment's tokens) — train only on the model's decisions; train on the model's reasoning-before-call if you want inspectable behavior (ReAct-style; Yao et al. 2022). Retrieval specifically: train the citation contract — answers grounded with explicit quote/citation spans from retrieved documents, plus abstention when retrieval comes back empty/irrelevant (include such cases at 10–20% rate; otherwise the model hallucinates around missing evidence). Mix retrieval-on and retrieval-off versions of the same questions so the model learns when retrieval changes the answer. Evaluate with attribution metrics (citation precision/recall against gold spans), not just answer accuracy. RL stage (this is where decision quality comes from): verifiable end-to-end rewards in simulated environments — task completion in tau-bench-style user simulations, BFCL-style call-accuracy, code-interpreter task success, multi-step web/API workflows with state checks. Reward design: dominant weight on final task success; schema-validity as a hard constraint (invalid call = episode failure, not partial credit); small per-call cost to suppress tool spam; penalties for fabricating tool outputs (check: model must not emit observation-role tokens itself — explicitly adversarially test this). GRPO/PPO settings as in standard RLVR; sample multi-turn episodes with the real parser and simulated tool latencies/failures in the loop. Robustness layer — inject every failure you will see in production into training and eval: tool timeouts, 5xx errors, malformed tool outputs, empty retrievals, schema-version mismatches, distractor tools (similar names/signatures), and prompt-injected tool outputs (retrieved documents containing instructions — train the model to treat tool output as data, not instructions; evaluate with injection suites). Target behaviors: retry-with-backoff, graceful degradation to no-tool answers, explicit error reporting — each must appear in SFT data and be rewarded in RL. Evaluation gates: BFCL (function-call accuracy incl. parallel/irrelevance splits), tau-bench (pass^k for multi-trial reliability — pass^8 exposes flakiness that pass@1 hides), internal injection/robustness suites, schema-validity under temperature, and over-triggering rate on a no-tool-needed set. Track reliability as pass^k, not pass@k: agents compound per-step error, so a 95%-per-step model is a 60%-per-10-step model — the metric must reflect that.
Q031	95	How do I design and validate the data mixture and curriculum for a frontier pretraining run?	5.5 Pro judge: 83 Opus 4.8 Judge: 85 Gemini 3.1 Pro judge: 64 Treat this as a controlled engineering decision, not a preference. The frontier data mixture is a capability portfolio. Each bucket needs an owner, data card, inclusion policy, dedup method, contamination report, quality distribution, legal/policy status, and expected capability contribution. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Define axes such as broad knowledge, code, math, multilinguality, long context, tool use, safety priors, and style. Start with a broad early mixture, targeted mid-training upweights, and a final high-quality anneal over 1%-5% of tokens. Retain 30%-60% broad distribution during anneal and use temperature sampling plus repeat caps for multilingual data. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Train 100M-3B proxy grids and, if budget allows, 7B-30B probes on candidate mixtures at equal tokens. Estimate marginal value per extra 1% tokens by replacement ablations. Track realized mixture online and compare to target. Use private holdouts inaccessible to mixture designers. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Proxy/tokenizer mismatch, overfitting to public evals, synthetic data amplifying teacher errors, repeated low-resource data overfitting, and last-minute refreshes breaking reproducibility. Diagnose with capability deltas by bucket and safety/robustness gates. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Kaplan et al. 2020; Hoffmann et al. 2022; Raffel et al. 2020; Gao et al. 2020; Penedo et al. 2023; Costa-jussa et al. 2022; Lewkowycz et al. 2022; Liang et al. 2022..	5.5 Pro judge: 97 Opus 4.8 Judge: 93 Gemini 3.1 Pro judge: 91 Treat the mixture as a two-phase object — a bulk mixture for 80–90% of tokens and an annealing mixture for the decayed tail — and validate both with proxy models before committing frontier compute. Bulk mixture construction: (1) Inventory sources with deduplicated token counts and quality tiers (filtered web at several tiers à la FineWeb-Edu/DCLM-baseline, code by language, math (OpenWebMath, proof corpora), books, papers, multilingual web per language, forums/QA). (2) Apply hard constraints: ≤4 epochs per source over the full budget (Muennighoff et al. 2023); known-quality floors (drop bottom web tiers entirely — DCLM showed top-10% classifier selection beats using more, worse data); licensing/safety exclusions. (3) Set initial weights from precedent — order-of-magnitude anchor: 45–55% general web, 15–25% code, 10–25% math+reasoning-dense text, 5–10% multilingual, remainder books/papers/QA (Llama 3 disclosed ~50/25/17/8 across knowledge/math-reasoning/code/multilingual). (4) Optimize with proxy sweeps: 1B-parameter models, 20–50B tokens each, one per candidate mixture (8–20 candidates; include DoReMi-derived weights as one candidate, Xie et al. 2023); read per-domain validation loss plus the smooth early-signal eval set (likelihood-scored HellaSwag/ARC/MMLU-cloze etc.). Mixture rankings transfer across 1–2 orders of magnitude of scale with good fidelity; confirm the top-2 candidates at 7B-for-100B-tokens before the final call. Bias the final choice toward the mixture better on reasoning-dense evals at equal aggregate loss — web-loss differences wash out; reasoning differences compound downstream. Curriculum: keep the bulk phase mostly stationary — the evidence for elaborate early curricula is weak, and stationarity preserves your ability to interpret loss curves and scaling fits. The validated curriculum effects worth using: (a) annealing-phase upweighting — during the final 10–20% of tokens (under LR decay), upweight highest-quality sources 2–10x: curated math (with worked solutions), high-grade code, instruction-formatted data, textbook-like web. MiniCPM documented the technique (Hu et al. 2024); Llama 3 and OLMo 2 used variants; gains of several points on GSM8K/MMLU from the anneal alone are typical at 7–70B. Hold out this premium data from the bulk phase so it is fresh at anneal time. (b) Context-length staging (short → long late). (c) Optionally a brief warm-start on cleaner data in the first 1–2% of tokens to stabilize early training; keep it small to avoid distorting scaling comparisons. Validation infrastructure that makes this trustworthy: per-domain validation shards (mixture-invariant yardsticks across all experiments), the frozen early-signal eval harness, decontamination of every source against eval sets (13-gram), and a written mixture spec (source manifests, weights, epoch counts, hashes) under version control with the run. In-run monitoring and the annealing experiment trick: because you froze milestone checkpoints, you can fork cheap annealing branches mid-run — take the latest checkpoint, anneal 10–50B tokens on a candidate final mixture, eval, discard — to choose the production anneal empirically (OLMo 2 and the "annealing as mixture evaluation" methodology; also Blakeney et al. 2024 "Does your data spark joy?"). Budget ~1–3% of total compute for these forks; they are the highest-information experiments available in the back half of a frontier run. Failure modes: a single source exceeding ~30% causing measurable regression elsewhere; silent epoch inflation when upstream filtering shrank a source after weights were set (recompute epochs from final token counts); and contamination entering via "high-quality" synthetic or QA-formatted sources — audit those hardest.
Q033	95	How do I set up supervised fine-tuning, preference optimization, RLHF, and RLVR for a reasoning-focused model?	5.5 Pro judge: 84 Opus 4.8 Judge: 86 Gemini 3.1 Pro judge: 64 Treat this as a controlled engineering decision, not a preference. For reasoning models, stage post-training: SFT for interface, preference optimization for helpfulness/safety/style, then RL with verifiable rewards for domains where correctness can be checked. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: SFT from a strong base with LR 5e-6 to 2e-5, batch 128-2,048 sequences, 1-3 epochs, weight_decay 0-0.1, and dropout usually 0. DPO beta should be swept around 0.01-0.2. PPO-style RLHF often uses clip 0.1-0.2, KL target 0.01-0.1 nats/token, and policy LR 1e-6 to 1e-5. GRPO/RLVR can sample 4-16 completions per prompt. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Use high-quality demonstrations, pairwise preferences, reward-model ensembles, verifiable math/code/tool tasks, and KL to a reference. Track pass@1/pass@k, response length, refusal rate, tool validity, reward versus ground truth, and base capability regressions. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Too many SFT epochs damaging code/math, DPO length bias, reward-model overoptimization, unreadable pure-RL reasoning, language mixing, capability collapse on non-RL domains, and unsafe helpfulness gains. Diagnose with private reasoning sets and KL/length curves. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Ouyang et al. 2022 InstructGPT; Bai et al. 2022 Constitutional AI; Rafailov et al. 2023 DPO; Schulman et al. 2017 PPO; Shao et al. 2024 DeepSeekMath/GRPO; Guo et al. 2025 DeepSeek-R1..	5.5 Pro judge: 96 Opus 4.8 Judge: 93 Gemini 3.1 Pro judge: 87 Run the canonical four-stage post-training pipeline, with RLVR carrying the reasoning load: SFT → preference optimization → RLVR, with RLHF-style reward modeling reserved for the non-verifiable dimensions (helpfulness, style, safety). SFT: 0.5–10M high-quality examples; for reasoning-focus, the decisive ingredient is long chain-of-thought traces — either human-engineered or distilled/rejection-sampled from a stronger model or from your own model via best-of-N with verifier filtering (the "cold start" data in DeepSeek-R1, Guo et al. 2025, was precisely curated long-CoT SFT before RL). Hyperparameters: LR 1–2e-5 (10–20x below pretraining final), cosine or linear decay, 1–3 epochs (watch eval loss — SFT overfits fast), global batch 0.25–1M tokens, sequence length covering your longest traces (16k–32k for reasoning), loss masked to assistant tokens only, dropout 0–0.1. Mix 5–15% pretraining-style data or general SFT to limit catastrophic forgetting. Diagnostics: per-source eval losses, plus formatting-compliance rate of sampled outputs. Preference optimization: DPO (Rafailov et al. 2023) on pairwise preferences for chat quality/safety: beta 0.05–0.3 (sweep; 0.1 default), LR 5e-7 to 1e-6 (much lower than SFT — the most common DPO mistake is SFT-scale LRs), 1 epoch, reference model = the SFT checkpoint, frozen. Known pathology: DPO drives down chosen-response likelihood too (monitor both margins and absolute logprobs); mitigations include adding a small SFT loss term or cDPO/IPO variants. Llama 3 used iterative rounds of rejection-sampling SFT + DPO; that loop is a strong, simple baseline. RLHF (learned reward model): train an RM from the preference data (Bradley-Terry loss on pairs; initialize from the SFT model; LR ~9e-6, 1 epoch — RMs overfit in <2 epochs), then PPO: policy LR ~1e-6, KL penalty against the SFT reference with target KL 3–10 nats per response (adaptive controller), GAE lambda 0.95, clip 0.2, 1–4 PPO epochs per batch of rollouts, batch ≥ 256 prompts. Keep a frozen eval RM and held-out prompts to detect RM overoptimization (Gao et al. 2023 scaling laws for RM overoptimization: proxy reward rises while gold reward falls — the canonical hook). RLVR (RL from verifiable rewards) — the engine for math/code reasoning: rewards computed by checkers, not learned models: exact-match/equivalence checkers for math (sympy normalization), unit-test execution in sandboxes for code, plus small format rewards. Algorithm: GRPO (Shao et al. 2024, DeepSeekMath) is the current default — samples G=8–64 responses per prompt, normalizes advantage within the group, no value network (halves memory); PPO with a value model remains fine. Settings that matter: temperature ~1.0 for rollouts; remove or minimize the KL term for verifiable tasks (R1 practice; the verifier anchors correctness, and KL throttles exploration); token-level loss aggregation, clip-higher (asymmetric clipping, e.g., 0.2/0.28) and zero-advantage-filtering refinements from DAPO (Yu et al. 2025); curriculum by pass-rate — keep prompts with pass@G in (0,1), drop saturated/impossible ones; let response length grow (length growth alongside accuracy is the expected R1-style signature, but watch for length without accuracy — reward-hacking the format). Sequencing and evaluation: SFT → (DPO and/or RM+PPO for general alignment) → RLVR for reasoning → a final light DPO/safety pass if RLVR regressed style. Gate every stage on a fixed battery: GSM8K/MATH/AIME, HumanEval/LiveCodeBench, IFEval (instruction following), MT-Bench/Arena-style judge evals, safety refusal sets, and — critically — a no-regression check on general benchmarks (MMLU drift >1–2 points means your mixture or KL is wrong).
Q040	95	How should I plan staged pretraining, annealing, mid-training, and post-training so that each phase improves the final model?	5.5 Pro judge: 84 Opus 4.8 Judge: 85 Gemini 3.1 Pro judge: 65 Treat this as a controlled engineering decision, not a preference. Plan phases by target distribution and optimizer state. Broad pretraining, mid-training, annealing, and post-training should each have entry criteria, exit criteria, eval gates, and rollback checkpoints. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Phase 1 broad pretraining uses stable high-entropy data and the main compute-optimal schedule. Phase 2 mid-training upweights code, math, long context, multilingual balancing, tool traces, or fresh knowledge while retaining 40%-70% base mixture. Phase 3 anneals over 1%-5% tokens with cleaner data and LR 10x-30x lower. Phase 4 applies SFT, preference optimization, RLHF/RLAIF, and RLVR. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Save checkpoints before phase boundaries and test post-training transfer from multiple base checkpoints. For context changes, retune batch/LR and keep short examples. For anneal, ensure it does not become hidden SFT. Gate each phase on base loss, private capabilities, safety, robustness, restore tests, and decision records. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Changing too many variables at boundaries, using public eval gains to justify contaminated data, anneal harming diversity, SFT overwriting reasoning, RL optimizing verbosity, and assuming final pretraining checkpoint is best for post-training. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Raffel et al. 2020; Kaplan et al. 2020; Hoffmann et al. 2022; Lewkowycz et al. 2022; Chen et al. 2023; Ouyang et al. 2022; Bai et al. 2022; Guo et al. 2025..	5.5 Pro judge: 96 Opus 4.8 Judge: 93 Gemini 3.1 Pro judge: 89 Design the run as four phases with distinct objectives, data, and exit criteria, plus deliberate degrees of freedom reserved at each boundary. The architecture of the plan: bulk pretraining buys general competence; mid-training/annealing converts it into benchmark-relevant skill cheaply; long-context and other capability stages bolt on structural abilities; post-training shapes behavior. Gains at each phase are mostly non-fungible — no amount of SFT fixes a knowledge-poor base, and premium math data spent early is wasted relative to spending it in the anneal. Phase 1 — bulk pretraining (75–90% of tokens). Stationary mixture, dominated by filtered web + code + multilingual; WSD/trapezoidal or cosine schedule chosen with the later phases in mind (WSD's constant-LR plateau is purpose-built for forking anneals; Hu et al. 2024, Hägele et al. 2024). Hold back the premium data: curated math with solutions, highest-tier educational web, instruction-formatted sets stay out of this phase. Exit criterion: token budget consumed with loss tracking the scaling-law prediction and eval trajectories on-curve. Phase 2 — mid-training / annealing (10–20% of tokens). Decay LR to near zero while upshifting the mixture: 2–10x upweights on math, code, premium educational/text-book-like data, plus modest instruction-style data (the "mid-training" of OLMo 2, the annealing of MiniCPM/Llama 3; OLMo 2's Dolmino mix is a documented reference). This is the highest-leverage data-placement decision in the whole run: identical tokens yield multiples more benchmark movement here than in phase 1. Validate the anneal mixture by forking cheap anneal branches from a late phase-1 checkpoint (10–50B-token trials, pick the winner — budget 1–3% of compute for these). Also fold in: distillation from a stronger model if available, and the first context-length step. Exit: target evals (MMLU/MATH/code) hit predicted bands; no per-domain loss regressions. Phase 3 — capability stages. Long-context extension in stages (8k→32k→128k) with RoPE rescaling and length-balanced mixtures, gated on RULER/needle plus short-context no-regression. Optionally: speech/vision adapter stages for multimodal, FIM-completion training for code. Keep each stage's token budget small (tens of billions) and gated independently. Phase 4 — post-training. SFT (instruction + long-CoT) → preference optimization (DPO/RM+PPO rounds, iterated with fresh on-policy preference data) → RLVR for reasoning (GRPO on verifiable math/code/tool tasks) → final safety/style pass. Gate each sub-stage on the full battery: target-capability gains, general no-regression (MMLU drift <1–2 points), safety suite, robustness. Cross-phase design rules: (1) every boundary gets a frozen checkpoint, full eval snapshot, and a written go/no-go — phases must be individually attributable; (2) the data plan is global — partition premium data across phases up front so phase 2 is not starved by phase 1 greed, and decontaminate everything against the full eval inventory once, centrally; (3) carry a single source-of-truth predicted-trajectory document (from scaling fits) — each phase's exit compares actuals to it; (4) reserve 3–5% of total compute as contingency for rollbacks, anneal trials, and a possible second anneal from the plateau checkpoint (a WSD schedule makes the bulk run re-annealable — effectively letting you ship model N and continue toward model N+0.5 from the same plateau); (5) sequence irreversible decisions late: mixture weights can change at any boundary, but tokenizer and architecture cannot, so phase 1 entry is the real commit point — spend your ablation budget there proportionally.
Q032	96	How should I use scaling-law experiments and small proxy models to choose the final architecture and training budget?	5.5 Pro judge: 85 Opus 4.8 Judge: 85 Gemini 3.1 Pro judge: 64 Treat this as a controlled engineering decision, not a preference. Scaling-law experiments should estimate constants for your exact data, tokenizer, architecture, optimizer, and system, not copy another lab. Use them to choose final N, D, architecture, and budget with uncertainty. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Train proxies at 100M, 300M, 1B, 3B, 10B, and perhaps 30B parameters with multiple token budgets. Fit L = L_inf + aN^-alpha + bD^-beta or a compute-param-token law. Reserve 5%-15% of budget for scaling experiments, 5%-10% for failures/restarts, and enough for post-training/eval. Use measured MFU to compute effective C_train. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Fit separate curves for natural language, code, math, multilingual, and long-context strata. Evaluate architecture candidates at equal FLOPs: dense/MoE, RoPE/GQA, depth/width, FFN multiplier, normalization, vocab size. Run one intermediate model large enough to exercise final kernels and topology. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Fitting only successful runs, ignoring repetition, changing filters after fitting, extrapolating architecture changes outside observed scale, and choosing by aggregate loss despite reasoning regressions. Diagnose with confidence bands and proxy-to-final calibration. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Kaplan et al. 2020; Henighan et al. 2020; Hoffmann et al. 2022; Fedus et al. 2021; Narayanan et al. 2021; Tay et al. 2022 UL2..	5.5 Pro judge: 97 Opus 4.8 Judge: 93 Gemini 3.1 Pro judge: 92 Use scaling experiments to make exactly four decisions — architecture family, parameterization/hyperparameters, the N-vs-D split, and the data mixture — and design the experiments so each conclusion is a fitted curve, not a point comparison. Experimental grid: pick 5–7 model sizes log-spaced over ~2.5 orders of magnitude below target (e.g., 40M, 100M, 300M, 1B, 3B, 7B for a 300B–1T target), trained on the production data pipeline and tokenizer. For budget allocation, the isoFLOP design (Chinchilla Approach 2; Hoffmann et al. 2022) is the most robust: at each of 4–5 compute levels, train 4–6 sizes to that budget, fit the parabola of loss vs N at fixed C, trace the valley, and fit N(C), D(C) power laws. Match the LR schedule horizon to each run's actual length (the Kaplan et al. 2020 fixed-horizon artifact understated optimal data ~10x — this is the single most important methodological correction). Fit L(N,D) = E + A/N^α + B/D^β by Huber-loss regression on all runs (Approach 3) as a cross-check; Epoch AI's reanalysis (Besiroglu et al. 2024) shows the fitting details matter — report confidence intervals via bootstrap, and treat ±2x in N as your real precision. Architecture selection: compare families (dense vs MoE configs, attention variants, depth/width, vocab size) only at matched compute on at least 3 sizes, and decide on the fitted offset/slope, not single-size deltas — many interventions help at 100M and vanish at 3B (Narang et al. 2021). Reject any candidate that wins loss but degrades the stability envelope (spike frequency, max-LR tolerance) at the largest test size: a 1% loss win that forces a 30% LR reduction at scale is a net loss. For MoE, fit separate laws (loss vs active params at fixed total, vs experts at fixed active; cf. Clark et al. 2022 unified scaling laws; DeepSeek and Krajewski et al. 2024 granularity studies). Hyperparameter transfer: adopt muP (Yang et al. 2021) so the LR/init/multiplier tuning you do at 40M transfers across width; validate by checking the LR-vs-loss curve optimum is width-stationary across your grid (the "muP check"). Tune on the proxy: LR, batch via noise-scale measurement, warmup, z-loss, QK-norm on/off, weight decay. Depth transfer is weaker than width transfer — keep proxy depth within ~4x of target depth or use depth-muP corrections. Beyond-loss prediction: validation loss is the reliable extrapolation target; benchmark accuracies are not directly — use the two-stage method (fit loss→benchmark-likelihood mapping from your model zoo, then map predicted loss to predicted accuracy; Llama 3 did exactly this to predict 405B ARC/MMLU from small runs; also Gadre et al. 2024 "Language models scale reliably with over-training"). Expect ±2–3 point accuracy error bands; plan gates accordingly. Proxy-model hygiene that determines whether any of this transfers: identical tokenizer, data distribution, sequence length, and eval harness across the grid; fixed seeds with ≥2 seeds at the smallest sizes to estimate run-to-run noise (≈0.005–0.01 nats — differences inside that band are not findings); single-epoch data at every scale (repetition distorts fits); and a holdout compute level — fit on the lower levels, verify prediction on the highest before trusting extrapolation 100x further. Final budget decision: combine the fitted laws with inference economics (overtrain toward smaller N when lifetime serving tokens are large — Llama-3-style 200–2000 tokens/param) and data ceilings (effective-epochs correction). Then pre-register the predicted final loss and the eval trajectory; a frontier run that tracks its prediction is healthy, and deviation is your earliest global alarm.
Q034	96	How should I construct verifiable reward signals and training tasks for mathematics, coding, and tool use?	5.5 Pro judge: 86 Opus 4.8 Judge: 86 Gemini 3.1 Pro judge: 65 Treat this as a controlled engineering decision, not a preference. A verifiable reward must be deterministic, cheap, difficult to game, and closer to the real capability than a text preference. Separate correctness reward from format or process reward. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: For math, use exact matching after normalization, SymPy equivalence, interval checks, or formal provers such as Lean/Isabelle/Coq. For code, use hidden unit tests, property-based randomized tests, static analysis, time/memory limits, and sandboxed execution with network disabled. For tools, validate schema, state transition, timing, and faithful incorporation of observations. Keep format rewards capped around 5%-20% of total reward. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Split by template, source, and solution family. Generate parametric tasks with held-out ranges. For RL, sample multiple completions and monitor pass@1 and pass@k. Audit random reward decisions with humans/experts and plot verifier reward against private evaluation over RL steps. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Hardcoding expected outputs, modifying tests, exploiting undefined behavior, producing valid JSON with wrong semantics, unnecessary tool calls, and learning wrappers without correctness. Diagnose with hidden tests, adversarial perturbations, and reward-component logs. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Cobbe et al. 2021 GSM8K/verifiers; Lewkowycz et al. 2022 Minerva; Chen et al. 2021 Codex/HumanEval; Schick et al. 2023 Toolformer; Yao et al. 2022 ReAct; Shao et al. 2024; Guo et al. 2025..	5.5 Pro judge: 96 Opus 4.8 Judge: 93 Gemini 3.1 Pro judge: 93 Design rewards so that they are (a) machine-checkable, (b) hard to game, and (c) graded at the right difficulty for the current policy. Domain by domain: Mathematics. Task source: problems with known final answers (GSM8K/MATH-style, competition archives, synthetically generated with solution verification). Reward: binary exact-match after canonicalization — parse the final answer from a mandated format (\boxed{} or structured tag), normalize with sympy (rationals, radicals, equivalent forms), compare. Pitfalls and fixes: answer-format gaming (model emits many candidate answers — require single-answer format, penalize multiple boxes); guessable answer spaces (multiple-choice and small-integer answers train guessing — prefer free-form numeric/symbolic; filter problems where random guessing exceeds ~5% success); contaminated problems (decontaminate against pretraining and eval sets); wrong gold answers (run a strong model at high N; problems with near-zero pass despite long correct-looking traces deserve human audit — bad gold labels actively teach errors). Process reward models (PRMs, Lightman et al. 2023 "Let's Verify Step by Step") help best-of-N selection but as RL training signals are gameable; the R1 line (Guo et al. 2025) deliberately used outcome-only rewards — start there. Coding. Reward = unit-test execution in a hermetic sandbox (containerized, no network, CPU/memory/time limits ~2–10s; firejail/gVisor/Docker). Quality requirements on the tests themselves: high coverage including edge cases — weak test suites are the dominant hacking surface (models learn to special-case visible tests or exploit undertested behavior); hold out a disjoint verification test set per problem and reward only on hidden tests; generate tests via property-based methods or strong-model generation followed by execution-validation against reference solutions; mutate reference solutions to confirm the suite kills mutants (mutation-testing score > ~0.7 is a decent suite-quality bar). Graded partial credit (fraction of tests) accelerates early learning; switch to all-or-nothing as pass rates rise to stop reward-shaping artifacts. SWE-style tasks (real repos, patch + run test suite à la SWE-bench, Jimenez et al. 2024) are the high-value regime: rewards are real CI outcomes; invest in repo-snapshot infrastructure with pinned dependencies — environment nondeterminism (flaky tests, network calls, timestamps) is your noise floor; measure and de-flake before training (re-run failures 3x, quarantine flaky tasks). Tool use. Construct tasks where correctness is checkable end-to-end: retrieval QA scored against gold documents/answers, code-interpreter tasks scored by execution result, API-call tasks scored by exact argument-schema match plus simulated-environment state checks (tau-bench-style user/environment simulators, BFCL-style function-call matching). Reward components: final-task success (dominant weight), schema validity of calls, and small penalties for excess calls/turns (else you train pathological tool-spamming). Simulate tool failures/timeouts in training so the policy learns recovery, not just the happy path. Cross-cutting design rules: (1) verify the verifier — red-team each reward channel with a deliberately adversarial policy phase (or manual probes) before large-scale training; every exploit you find pre-launch is a week of compute saved; (2) difficulty curriculum — maintain pass@G in (0.05, 0.9); pre-grade the pool with the current policy at G samples, resample the active set as the policy improves; (3) mix domains in each RL batch to prevent capability seesawing; (4) log per-task-family reward trends and sample transcripts weekly — reward going up while held-out human/judge evals stagnate is hacking until proven otherwise; (5) keep a frozen held-out task set never trained on, as the gold measure of generalization.
Q037	96	How do I build a continuous evaluation pipeline that gates checkpoints on capabilities, safety, robustness, and contamination?	5.5 Pro judge: 86 Opus 4.8 Judge: 85 Gemini 3.1 Pro judge: 63 Treat this as a controlled engineering decision, not a preference. Continuous evaluation is model CI/CD. Pin prompts, decoding, dataset hashes, scoring code, judge versions, and contamination scans so score movements are evidence rather than instrumentation drift. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Run a cheap tier on every checkpoint: loss by stratum, fixed generations, toxicity/refusal probes, code/math smoke tests, tokenizer sanity, contamination deltas. Run full promotion evals for candidates: public and private capability benchmarks, safety, robustness, long context, tool use, calibration, and human/expert review. Hard gates may include no contamination increase, no critical safety regression, and tool-call validity above 99%. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Use confidence intervals, private time-sliced evals, item-level contamination risk, and raw generation archives. For LLM judges, calibrate against human labels and rotate hidden prompts. Report metrics with high-risk contaminated items removed. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Optimizing to stale public benchmarks, judge drift, prompt changes, averaging away subgroup catastrophes, cherry-picking checkpoints, and ignoring tool/scaffolded capabilities. Diagnose with perturbations, multilingual/adversarial prompts, retrieval conflicts, malformed schemas, and long-context distractors. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Liang et al. 2022 HELM; Hendrycks et al. 2021 MMLU; Srivastava et al. 2022 BIG-bench; Ribeiro et al. 2020 CheckList; Brown et al. 2020; Li 2023 contamination..	5.5 Pro judge: 97 Opus 4.8 Judge: 93 Gemini 3.1 Pro judge: 90 Build it as CI/CD for checkpoints: every candidate flows through automated tiers, each with pinned versions, explicit budgets, noise-calibrated thresholds, and recorded verdicts. Tier structure (cost-ascending, gate-early): Tier 0, minutes per checkpoint, every save — per-domain validation loss, canary perplexities, format/degeneration probes (greedy repetition rate); catches corruption and divergence. Tier 1, ~1 GPU-hour, every N checkpoints — the smooth early-signal battery: likelihood-scored HellaSwag, ARC, MMLU-cloze, LAMBADA, plus small GSM8K and HumanEval subsets with execution; trend-tracked against the run's predicted trajectory. Tier 2, tens of GPU-hours, milestones and pre-promotion — full benchmark suites (MMLU/MMLU-Pro, MATH, full HumanEval+/MBPP+, BBH, IFEval, multilingual sets, long-context RULER), safety battery, robustness suite. Tier 3, human + heavy automation, release candidates only — red-teaming, frontier-risk evals, human preference studies. Capabilities gating: define per-metric noise bands empirically (run the same checkpoint through the harness 3–5x with sampling-based evals to get sigma; checkpoint-adjacent variance for likelihood evals); gate on z-scores, not raw deltas (e.g., block promotion if any core capability drops >2 sigma, or >1 point absolute on MMLU-class anchors). Pin everything that moves scores: harness commit (lm-evaluation-harness or internal), prompt templates, few-shot exemplars and seeds, sampling parameters, answer-extraction regexes, sandbox images for code execution. A 1-point MMLU "regression" from a prompt-format drift is the most common false alarm in practice. Safety gating: layered sets — static refusal batteries (harmful request suites), over-refusal counter-batteries (XSTest-style, so you cannot pass by refusing everything), jailbreak robustness (templated + automated attack generation, e.g., GCG/PAIR-style attackers run fresh per candidate), and domain-specific harm evals aligned to your risk policy (cyber, bio uplift proxies, deception/sycophancy probes). For frontier-scale models add capability-threshold evals (WMDP for hazardous knowledge, autonomy/cyber ranges) feeding the formal governance gates (RSP/Preparedness-style frameworks). Safety gates are hard blocks with named human owners — no statistical hand-waving. Robustness: paraphrase and format perturbations of core benchmarks (template-swap MMLU; order-shuffled options), noise injection (typos, distractors), consistency checks (same question asked twice, agreement rate), and adversarial long-context probes. Track the gap between canonical and perturbed scores; widening gaps flag benchmark-specific overfitting (increasingly common once post-training data includes benchmark-formatted synthetic). Contamination control, three mechanisms: (1) preventive — n-gram (13-gram) decontamination of all training data against the full eval inventory, rerun whenever data or evals change; (2) detective — per-checkpoint contamination probes: per-item loss on benchmark strings (memorized items show outlier-low loss), guided-completion tests (model completes a benchmark item verbatim from a prefix), and step-function score jumps correlated with specific data shards entering training; (3) structural — maintain internally authored, never-published private evals mirroring each public benchmark's construct; the public–private score gap is your standing contamination meter (a model gaining on public-MMLU but flat on private-MMLU is contaminated, full stop). Infrastructure: results in a queryable store keyed by (checkpoint hash, harness version, eval version); dashboards with predicted-trajectory overlays; automated bisection tooling (a regression triggers re-eval, then data-window diff, then ablation checkpoint comparison); and a promotion record — every model that ships carries its full eval ledger. Budget the whole pipeline at 1–3% of training compute; the asymmetry between that and the cost of a bad checkpoint shipping is the entire argument.
Q050	96	How do I organize the people, simulations, rehearsals, monitoring, incident response, and decision checkpoints for a full frontier-model build?	5.5 Pro judge: 84 Opus 4.8 Judge: 85 Gemini 3.1 Pro judge: 66 Treat this as a controlled engineering decision, not a preference. Organize the build like safety-critical engineering. Define ownership, rehearsals, monitoring, incident response, and decision checkpoints before the final run starts. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Assign owners for data, tokenizer, architecture, training systems, kernels, cluster ops, eval, post-training, safety/governance, security, inference, and program management. Run toy end-to-end builds and game days: kill nodes, corrupt checkpoint shards, slow storage, inject bad data, break NCCL, fill logging disks, and simulate safety threshold crossings. Schedule checkpoints at prelaunch, first 1,000 steps, 1%, 5%, 10%, 25%, 50%, anneal, final, post-training start, predeployment, and staged launch. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Dashboards must cover science, systems, data, and governance: loss strata, gradient/activation health, MFU, p99 step time, collectives, dataloader wait, storage, checkpoints, hardware errors, power, mixture, duplicates, contamination, safety evals, access controls, and incident status. Every alert has an owner and severity. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Late unreviewed changes, hidden dependence on one expert, speculative incident channels, eval teams overloaded by checkpoint volume, governance reduced to paperwork, and no rollback owner. Diagnose with runbooks, incident commanders, postmortems, and permanent detectors. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Narayanan et al. 2021; Rajbhandari et al. 2020; Liang et al. 2022 HELM; Mitchell et al. 2019 Model Cards; NIST AI RMF 2023; Ouyang et al. 2022; Bai et al. 2022; large-scale SRE incident practice..	5.5 Pro judge: 96 Opus 4.8 Judge: 93 Gemini 3.1 Pro judge: 88 Organize the build as a mission program: small accountable teams around a single artifact, with rehearsed operations and pre-committed decision points — the social architecture determines whether the technical architecture ships. Team topology (a few hundred people at frontier scale, but the core run team is small): pretraining (data, architecture/scaling, training operations), systems/infra (cluster, kernels, storage, networking), post-training (SFT/RLHF/RLVR, agents), evaluation (capabilities + contamination + harness), safety (alignment evals, red-team, governance liaison), and a run command structure: one named Run Lead with authority over the mainline, a deputy rotation, and explicit interfaces (data team owns mixture manifests; infra owns goodput; eval owns the gating battery — no shared ownership of any gate). Two cultural commitments that published retrospectives validate: a single source-of-truth logbook (every intervention, anomaly, and decision timestamped against the loss curve — the OPT logbook practice, formalized), and blameless postmortems for every incident including scientific ones (a bad mixture decision gets the same postmortem rigor as an outage). Simulations and rehearsals before launch day: (1) the full dress rehearsal — a 1–5%-scale end-to-end run (real data pipeline, real checkpointing, real eval CI, real dashboards) run by the same people on the same on-call rotation that will run production; (2) game days — deliberately kill nodes, corrupt a checkpoint shard, inject a data-pipeline poisoner, simulate a loss-spike storm and an SDC ghost; verify the runbooks and the people execute (target: every failure signature in the runbook has been practiced by every on-call); (3) decision rehearsals — tabletop the hard calls: capability eval trips a governance threshold mid-run; the trajectory deviates negatively for two weeks; a rival release pressures the timeline — rehearse who decides, on what evidence, and confirm the pre-commitments hold under simulated pressure; (4) restore drills monthly during the run (kill the job, measure time-to-first-step). Monitoring as an organizational design: one dashboard hierarchy — health (goodput, MFU, stragglers, storage), science (loss vs pre-registered prediction, per-domain val losses, eval trajectories, grad-norm/stability), and risk (safety eval trends, contamination meters) — with alarms tiered into page (training stopped/corrupted/diverging), urgent (trajectory deviation, goodput degradation), and review (weekly drifts). The Run Lead's daily standup reads all three tiers; the weekly review decomposes goodput losses and reconciles trajectory-vs-prediction; alarms have owners, not channels. Incident response: severity ladder (SEV1: mainline corrupted/diverging; SEV2: progress stopped; SEV3: degraded), 24/7 on-call across infra and science (loss anomalies need a scientist on-call, not just an SRE), runbooks keyed by signature (spike → rollback-replay-bisect protocol; hang → flight-recorder dump and rank triage; eval regression → harness-then-data-then-checkpoint bisection), and a standing rule that mainline-touching fixes require two named approvers, one of whom is the Run Lead. Decision checkpoints — pre-committed before token one: phase-boundary gates (data freeze, architecture freeze, launch go/no-go, anneal-mixture selection, long-context entry, each post-training stage, release) with written entry/exit criteria, named deciders, recorded dissent, and rollback conditions; governance gates (capability thresholds with pre-authorized responses including pause) integrated into the same calendar so safety review is on the critical path by construction, not exception; and a contingency budget (3–5% compute, 2–4 weeks calendar) owned by the Run Lead so that absorbing a surprise does not require relitigating the plan. The test of the whole apparatus is counterfactual calm: a healthy frontier run is mostly uneventful precisely because every eventful path was rehearsed, pre-decided, or both.
Q035	97	How do I prevent reward hacking, capability collapse, and alignment regressions during large-scale post-training?	5.5 Pro judge: 86 Opus 4.8 Judge: 86 Gemini 3.1 Pro judge: 65 Treat this as a controlled engineering decision, not a preference. Treat post-training as an adversarial optimization process. Rewards, judges, and verifiers will be exploited unless held-out, ensembled, audited, and constrained by KL and regression gates. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Keep separate train, validation, and secret test rewards. Use hidden tests for code, independent verifiers for math, and state-machine validation for tools. Maintain KL targets around 0.01-0.1 nats/token for many behavior-tuning regimes, small policy LR 1e-6 to 1e-5, and short RL phases with frequent eval. Use supervised replay or base-domain replay to reduce collapse. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Log reward components separately: correctness, format, safety, length, tool cost, citation support, refusal appropriateness. Plot reward versus response length and reward versus private human/expert quality. Evaluate every N RL updates on base capabilities, private reasoning, safety, calibration, and distribution shift. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Verifier syntax hacking, reward models preferring confident false reasoning, length inflation, calibration loss, excessive refusal from safety tuning, synthetic preference collapse, and tool deception. Diagnose reward overoptimization curves and disagreement among reward models. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Ouyang et al. 2022; Bai et al. 2022; Rafailov et al. 2023; Schulman et al. 2017; Gao et al. 2023 reward overoptimization; Shao et al. 2024; Guo et al. 2025..	5.5 Pro judge: 97 Opus 4.8 Judge: 93 Gemini 3.1 Pro judge: 87 Three distinct failure classes; instrument and defend each separately. Reward hacking. Mechanism: the policy optimizes the literal reward channel past the point where it correlates with intent — RM overoptimization follows predictable curves (Gao et al. 2023: gold reward rises then falls as KL-from-reference grows, while proxy reward rises monotonically). Defenses, in order of leverage: (1) cap optimization pressure — adaptive KL control to a target budget (3–15 nats/response for RLHF; verifiable domains tolerate more), early-stop RL on gold-metric peak, not proxy peak; (2) harden the reward: for RMs — ensembles with disagreement-aware penalties (use min or mean-minus-std of ensemble heads), fresh preference data collected on-policy each iteration (the Llama 2/3 iterative loop), explicit length-debiasing (length-normalized rewards or length-balanced training pairs — verbosity is the most universal hack); for verifiable rewards — hidden test sets, mutation-hardened suites, sandbox integrity (block reading expected outputs, network, clock gaming); (3) detect: monitor KL drift, response-length distribution, per-task-family proxy-vs-holdout divergence, and run a weekly human/strong-judge audit on fresh transcripts; investigate any metric that improves implausibly fast. Known exotic variants to check explicitly: unicode/format tricks that fool parsers, test-case special-casing in code, sycophantic agreement spikes in preference data, and judge-model exploits if you use LLM judges (defend with rubric-anchored judging and judge ensembles). Capability collapse. Mechanisms: (a) entropy collapse — the policy concentrates probability mass, diversity dies, exploration stops; (b) catastrophic forgetting of pretraining knowledge under narrow RL/SFT distributions; (c) mode collapse to format templates. Defenses: monitor token-level entropy and distinct-n/self-BLEU of rollouts continuously — entropy trending toward zero predicts a stalled RL run before reward plateaus (DAPO's clip-higher specifically combats entropy collapse by letting low-probability tokens grow; Yu et al. 2025); keep pass@k at large k (k=64–256) as the capability-frontier metric — pass@1 rising while pass@128 falls means you are reshaping, not improving, the distribution (Yue et al. 2025's RLVR critique); mix 5–20% general-domain data (pretraining-style or general SFT) into post-training batches; bound total post-training drift (weight-space: monitor parameter L2 distance from the pretrained anchor; consider periodic small replay of pretraining data); gate every RL checkpoint on a broad no-regression battery (MMLU, multilinguality, long-context, coding-if-training-math and vice versa) with hard thresholds (e.g., >1.5-point MMLU drop blocks promotion). Alignment regressions. RLVR on math/code measurably erodes refusal behavior and safety tuning if untreated (reward says nothing about harmlessness, and KL-free optimization drifts). Controls: include safety prompts with verifiable-format refusal rewards or RM-scored safety in every RL mixture (5–15%); freeze a safety eval battery (jailbreak suites, harmful-request sets, over-refusal sets like XSTest to catch the opposite failure) and gate on it per checkpoint; finish the pipeline with a light safety-DPO pass if drift occurred; run automated red-teaming (attack-prompt generators) against each candidate, not just fixed sets. Watch for emergent misalignment couplings — training narrow harmful-adjacent skills can generalize broadly (Betley et al. 2025) — so audit beyond the trained domain. Process scaffolding that makes all three tractable: checkpoint every few hundred RL steps with full eval sweeps; keep the reference/anchor models frozen and versioned; one mixture or reward change per run segment; and maintain a transcript-sampling ritual — humans reading 50 random rollouts weekly catch hacks that metrics miss.
Q046	97	How do I build a governance and launch-gating process that tracks emerging capabilities and risks throughout a frontier run?	5.5 Pro judge: 84 Opus 4.8 Judge: 85 Gemini 3.1 Pro judge: 67 Treat this as a controlled engineering decision, not a preference. Governance begins before data collection and has authority over launch. Tie risks to measurable capabilities, owners, thresholds, mitigations, and decision rights. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Maintain a risk register for autonomous cyber/task execution, persuasion, bio/chemical assistance, deception, privacy leakage, tool misuse, sandbox self-replication-like behavior, bias, and harmful content. Use staged gates: pre-run data/security/eval review; training milestone capability scans; post-training red-team and dangerous capability evals; launch tier, monitoring, rate limits, rollback, and disclosure. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Use public, private, and freshly generated evals; include scaffolding, tools, best-of-N, and post-training variants because deployed systems can exceed raw single-sample base capability. Track trends, not only pass/fail. Record decisions, dissent, waivers, and mitigation owners. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Governance after the model is already trained, evals ignoring tool access, launch pressure overriding gates, safety tuning collapsing useful behavior, and unclear incident authority. Diagnose with independent review and predeclared no-go thresholds. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Ouyang et al. 2022; Bai et al. 2022; Liang et al. 2022 HELM; NIST AI RMF 2023; Mitchell et al. 2019 Model Cards; responsible scaling policy literature..	5.5 Pro judge: 96 Opus 4.8 Judge: 93 Gemini 3.1 Pro judge: 90 Build the governance process as a parallel pipeline with the same rigor as training itself: defined capability thresholds, scheduled and triggered evaluations, decision authorities with veto power, and pre-committed responses — written down before the run starts, because mid-run is too late to negotiate stakes. Framework skeleton (modeled on Anthropic's Responsible Scaling Policy, OpenAI's Preparedness Framework, and Google DeepMind's Frontier Safety Framework): (1) define risk domains — CBRN uplift, offensive cyber, autonomous replication/AI R&D acceleration, persuasion/deception, and model autonomy/control; (2) for each, define capability thresholds (two tiers works: an "alert" threshold triggering enhanced evaluation and security, and a "critical" threshold requiring specified mitigations before continued scaling or deployment); (3) bind thresholds to pre-committed actions: security upgrades (weights theft becomes a national-security-grade concern at higher tiers), deployment restrictions, additional safeguards, or pausing — the commitment must be written so that pausing is a legible, pre-authorized outcome, not an ad-hoc fight. Evaluation cadence during the run: capability evals are not release-time activities. Schedule them: (a) on compute milestones — every 2–4x increase in cumulative training compute (the RSP-style trigger), and at minimum at fixed checkpoints (e.g., every 1e25 FLOPs or every 500B–1T tokens); (b) on capability signals — automatic triggers when tracked proxy metrics (coding evals, agentic-task suites, domain knowledge probes like WMDP, Li et al. 2024) cross pre-registered trend lines; (c) before any external exposure (API alpha, red-team partner access, deployment). Critically, evaluate with elicitation effort — fine-tune-on-the-eval-domain, tool access, scaffolding, best-of-N — because the question is what the model can do in adversarial hands, not what the base checkpoint does politely (the "capability elicitation" standard in the UK AISI / METR methodology). Underestimating capability via lazy elicitation is the canonical governance failure. Eval content per domain: standardized batteries (WMDP for hazardous knowledge proxies, Cybench/CTF-style ranges for cyber, METR-style autonomy task suites, swe-agent tasks for AI R&D acceleration) plus internal classified-adjacent evals built with domain experts where public benchmarks are inadequate (bio especially); uplift studies (does the model improve expert/novice performance over web-search baselines) for the domains where raw accuracy is the wrong construct. Decision structure: a launch/scaling review board with named members spanning research, safety, security, legal, and an explicitly empowered final authority; a written go/no-go at each gate with dissent recorded; a standing rule that the burden of proof is on proceeding once an alert threshold trips (affirmative safety case, not absence-of-proof-of-harm). Independent input: external red-teamers and evaluators (AISI-class bodies, contracted domain experts) at major gates, with enough model access and time to be meaningful (weeks, not days, and with fine-tuning access where the threat model warrants it). Operational integration: capability-eval results flow into the same dashboard as training metrics; the safety team has checkpoint access on the same cadence as capabilities teams; incident-response runbooks cover eval-threshold breaches like infra incidents (who is paged, what pauses, who decides); and the whole arrangement is rehearsed — run a tabletop where a mid-training eval trips a threshold, and verify the org actually executes the pre-committed response. Finally, version and publish what you can (framework documents, eval methodologies, system cards at release) — external accountability is part of what keeps internal commitments real, a lesson the field has learned from watching commitments drift when they were purely internal.
Q044	98	How do I design a fault-tolerant training stack that can sustain months of useful progress on tens of thousands of accelerators?	5.5 Pro judge: 84 Opus 4.8 Judge: 85 Gemini 3.1 Pro judge: 65 Treat this as a controlled engineering decision, not a preference. A months-long run on tens of thousands of accelerators must assume frequent component failures. Reliability is checkpointing, health isolation, immutable deployment, and rehearsed incident response. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Define recovery objectives: maximum lost work, maximum resume time, and automatic versus human intervention. Use distributed transactional checkpoints every 30-60 minutes if I/O permits, with RNG/data-loader state, optimizer shards, topology, and code/container hashes. Maintain spare capacity or hot standby pools and support topology-compatible replacement. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Monitor per-rank, node, rack, and run-level health: p99 step time, NCCL retries, ECC/XID, thermals, power throttling, dataloader wait, checkpoint duration, loss spikes, NaNs, and mixture drift. Run canary jobs and microbenchmarks. Stage rollouts: unit, single GPU, node, small distributed slice, then main branch. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Checkpointing that works only at small scale, logging backpressure, one unhealthy node stalling all ranks, central data-loader services, unreviewed hot patches, and scheduler placement on bad nodes. Diagnose with bad-node databases, game days, and postmortems with permanent detectors. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Rajbhandari et al. 2020; Narayanan et al. 2021; Lepikhin et al. 2020; Fedus et al. 2021; large-scale HPC/SRE reliability practice..	5.5 Pro judge: 97 Opus 4.8 Judge: 94 Gemini 3.1 Pro judge: 91 Optimize goodput — the fraction of wall-clock producing retained training progress (MegaScale's framing, Jiang et al. 2024; Llama 3 reported >90%; Dubey et al. 2024) — and engineer every layer for MTTR-in-minutes rather than failure avoidance, because at 10k+ accelerators failures are continuous: Llama 3 logged 466 interruptions in 54 days (one per ~3 hours), 78% hardware-attributed, GPUs/HBM leading. Layer 1 — fleet hygiene. Acceptance burn-in per node (full-power GEMM soak, HBM patterns, NVLink/IB loopback, pairwise NCCL) with >3%-off-median rejection; continuous health scraping (DCGM XID stream — watch 48/63/64/79, corrected-ECC trend as a leading indicator, IB symbol errors, thermal/power anomalies); predictive draining — pull nodes on leading indicators rather than waiting for the crash; hot-spare pool 1–2% of fleet, warm (image loaded, storage mounted). Layer 2 — checkpoint/restore engineered for frequency. Async two-phase saves (device→host RAM snapshot in seconds, background drain to durable storage), fully sharded reads/writes, peer-replicated RAM snapshots so single-node loss restores from neighbors without touching the filesystem (Gemini, Wang et al. 2023), hierarchical tiers (RAM/NVMe every 10–30 min, PFS hourly, off-cluster daily), resharding-capable format. Target arithmetic: with MTBF ~3h, save overhead seconds, restore+replay ~10–15 min, checkpoint interval 15–30 min, the expected loss per failure stays under ~0.3% of progress. Layer 3 — automated failure handling. A supervisor (not humans) executes: detect (NCCL watchdog/heartbeat converts hangs to crashes with flight-recorder dumps — silent hangs are the worst goodput killer; MegaScale and TORCH_NCCL traces exist for this), diagnose (map failing rank → node; auto-triage from telemetry), evict-replace-resume (drain node, slot spare, reshard if topology changed, resume from RAM snapshot). Llama 3's operations were "less than three manual interventions per day" on 16k GPUs; that is the bar. Add preflight checks at every restart (quick NCCL ring test) so a bad spare does not cause a crash loop. Layer 4 — corruption defenses, the insidious tier. Silent data corruption: fleet studies (Meta, Dieter et al.; Google) put SDC-prone cores at ~1/1000; at 100k accelerators you will compute wrong numbers without errors. Defenses: periodic redundant-compute canaries (same batch on two node-sets, compare), checksum gradient all-reduces occasionally, loss-spike triage protocol — rollback and replay identical data: reproducible → data problem (skip window, PaLM-style); non-reproducible → suspect node (drain, burn-in). Checkpoint checksums on write/read; corrupt-shard quarantine with retained N-1 checkpoint. Layer 5 — degraded-mode strategy. Decide in advance how the job behaves when capacity drops: reshape DP width (requires reshardable checkpoints and elastic launcher), tolerate batch-size deltas within a pre-approved band (with sqrt-LR adjustment), pause-vs-shrink thresholds. Practice the playbook: monthly game-days killing nodes/switches/storage on a staging job; an untested recovery path is a failure mode, not a capability. Layer 6 — the human system. 24/7 on-call rotation with runbooks per failure signature (spike, hang, straggler, storage stall, fabric flap), an annotated run logbook (every intervention timestamped against the loss curve — OPT's logbook is the cautionary template), weekly goodput review decomposing losses (failures vs stalls vs restarts vs bubble) so engineering effort targets the actual dominant term. Metrics that run the program: goodput %, MTBF by component, MTTR distribution (drive p95 down — outlier recoveries dominate losses), checkpoint overhead %, replay tokens per incident, SDC canary hit rate, manual interventions/day. At steady state expect: >90% goodput, MTTR p50 <10 min, near-zero manual touches — anything worse identifies your next engineering quarter.
Q041	99	What are the best practices for designing and training a one-trillion-parameter mixture-of-experts model?	5.5 Pro judge: 78 Opus 4.8 Judge: 86 Gemini 3.1 Pro judge: 65 Treat this as a controlled engineering decision, not a preference. For a trillion-parameter MoE, optimize active compute, routing stability, and all-to-all throughput; total parameters are not the training cost. Keep attention dense and use MoE mainly in FFN blocks. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Start with 32-128 experts per MoE layer, MoE every other block or selected FFN blocks, top-1 or top-2 routing, BF16, RMSNorm, RoPE, and SwiGLU experts. Capacity factor priors: 1.0-1.25 for top-1, 1.1-1.5 for top-2. Load-balancing loss coefficient 0.001-0.02; router z-loss 0.001-0.01. Use conservative LR, often 0.5x-1.0x dense LR for shared parameters and maybe smaller router LR. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Prove router stability and all-to-all throughput at intermediate scale. Track expert load entropy, token drop rate by domain/language, router confidence, per-expert gradient norms, p99 all-to-all, MFU, and dense baselines at equal active FLOPs. Use grouped GEMMs or dropless block-sparse kernels where possible. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Dead experts, hot experts causing p99 stalls, drops biased against rare languages, router overfitting to formatting or position, checkpoint bloat, and serving systems unable to batch sparse experts. Diagnose with expert masking and specialization metrics. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Shazeer et al. 2017; Lepikhin et al. 2020 GShard; Fedus et al. 2021/2022 Switch Transformer; Zoph et al. 2022 ST-MoE; Gale et al. 2022 MegaBlocks; Dai et al. 2024 DeepSeekMoE; Jiang et al. 2024 Mixtral..	5.5 Pro judge: 96 Opus 4.8 Judge: 94 Gemini 3.1 Pro judge: 92 Anchor on the published trillion-class recipes — DeepSeek-V3 (671B total/37B active; DeepSeek-AI 2024), Kimi K2 (1T total/32B active; Moonshot 2025), and the Switch/GLaM/ST-MoE lineage (Fedus et al. 2021; Du et al. 2021; Zoph et al. 2022) — because nearly every design choice below is validated there. Sparsity and granularity: modern optima use fine-grained experts with high sparsity — total/active ratios of 18–32x (V3: 256 routed experts per layer, top-8 active, plus 1 shared expert; K2: 384 experts, top-8, sparsity 48x active-to-total expert ratio). Fine-grained experts (smaller d_ff per expert, more of them) beat few-large-experts at fixed compute (Krajewski et al. 2024, DeepSeekMoE, Ludziejewski et al. 2024 granularity laws). Include 1–2 shared (always-on) experts to absorb common features and stabilize routing. Keep the first 1–3 layers dense (V3: first 3) — early-layer routing is unstable and low-value. Active ~30–40B parameters with ~1T total is the demonstrated sweet spot for serving economics at frontier quality. Routing and balancing: sigmoid-gated top-K selection with the auxiliary-loss-free balancing of V3 — a per-expert bias added to routing scores, adjusted online (bias up for underloaded, down for overloaded experts; update rate ~1e-3) so balance is achieved without the quality tax of strong aux losses; keep only a tiny sequence-level aux loss (~1e-4..1e-3) as backstop. Avoid Switch-era heavy load-balancing losses (1e-2) — they measurably hurt quality. Use no token dropping (capacity-factor-free) if your kernels support it, as V3 did; otherwise capacity factor 1.0–1.25 with dropped-token monitoring. Add router z-loss style logit control and compute routing in FP32. Communication co-design: EP all-to-all is the throughput killer; constrain it structurally — V3's node-limited routing caps each token to ≤4 nodes; K2/V3 overlap all-to-all with compute (DualPipe-style schedules; dedicated SM allocation for comm kernels). Plan EP groups within rail/rack domains; keep TP small (V3 avoided TP entirely in training, using EP+PP+ZeRO-1 DP). Expect 20–30% MFU as the current realistic band for trillion-scale MoE (versus ~40% dense) and treat improvements as kernel/scheduling work. Stability package: trillion-MoE-specific instabilities are router collapse (all tokens to few experts), router oscillation, and logit explosions interacting with expert imbalance. The demonstrated kit: QK-norm or MuonClip-style qk-clipping (K2's MuonClip existed precisely because Muon at 1T showed exploding attention logits; K2 reported zero loss spikes over 15.5T tokens), FP32 router, bias-based balancing, gradient clip 1.0, and conservative LR (V3: 2.2e-4 peak for 671B with multi-stage decay). Monitor per-expert load entropy, dropped-token rate (if capacity-limited), per-expert gradient norms, and routing-distribution drift between checkpoints. Precision and memory: FP8 forward/backward GEMMs with 128x128 block scaling and FP32 accumulation is production-proven at this scale (V3, <0.25% loss penalty); BF16 elsewhere; optimizer state sharded (ZeRO-1) and partially offloaded if needed. MLA (multi-head latent attention) or GQA-8 for KV economy. Token budget: MoE scaling favors more tokens per active parameter than dense Chinchilla — V3 used 14.8T tokens at 37B active (~400 t/p active); plan token budgets off active-param scaling fits, not total. Validation ladder before committing: proxy MoEs at 3B/16B/60B total with identical expert granularity and routing, fitting MoE scaling curves and stress-testing balance dynamics; a routing-pathology chaos test (deliberately imbalanced data bursts); and an end-to-end systems rehearsal at ≥1/10 cluster scale measuring all-to-all overlap efficiency — the design is only as good as its slowest collective.
Q043	99	How should I jointly optimize expert count, expert size, routing, capacity factor, load balancing, and communication in a large MoE?	5.5 Pro judge: 79 Opus 4.8 Judge: 86 Gemini 3.1 Pro judge: 63 Treat this as a controlled engineering decision, not a preference. MoE hyperparameters must be optimized jointly because expert count, size, routing, capacity, load loss, and communication determine both quality and throughput. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Start searches with MoE every 2 blocks, E in {16,32,64,128}, top-k in {1,2}, capacity_factor in {1.0,1.1,1.25,1.5}, expert FFN near dense FFN size, and optional 1-4 shared experts. Load loss 0.001-0.02 and z-loss 0.001-0.01 are priors. Router noise can encourage early exploration but should be annealed if unstable. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Compare equal active FLOPs and equal tokens. Track token drop by domain/language/position, expert load entropy, mutual information between expert and domain, per-expert gradient norms, expert batch size, p99 all-to-all, and quality. Use dropless block-sparse kernels if token dropping harms quality. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Too many experts causing tiny inefficient batches, hot experts, capacity drops hurting rare domains, load loss preventing specialization, routers learning sequence position, and inference batching failures. Diagnose with all-to-all volume estimates roughly proportional to tokenshiddentop_k per MoE layer. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Shazeer et al. 2017; Lepikhin et al. 2020; Fedus et al. 2021; Zoph et al. 2022; Gale et al. 2022; Dai et al. 2024; Jiang et al. 2024..	5.5 Pro judge: 98 Opus 4.8 Judge: 94 Gemini 3.1 Pro judge: 91 These six knobs are coupled through two budgets — quality per active FLOP and communication per token — so optimize them jointly in that frame. Expert count and size (granularity): at fixed active FLOPs per token, splitting FFN capacity into more, smaller experts raises quality up to a granularity optimum then degrades (router overhead, per-expert GEMM efficiency). Current evidence (DeepSeekMoE/V3: 256 experts of d_ff 2048 at d_model 7168, top-8 + 1 shared; K2: 384 experts top-8; granularity scaling laws of Krajewski et al. 2024 and Ludziejewski et al. 2024) puts the optimum at expert d_ff between ~1/4 and ~1/2 of the dense-equivalent d_ff, with top-K of 8 and total/active sparsity 16–48x. Hardware floor: expert GEMMs below ~[tokens_per_expert x d_model x 2048] lose tensor-core efficiency — grouped-GEMM kernels (CUTLASS grouped GEMM, Megablocks; Gale et al. 2022) set how small you can go; benchmark the kernel before choosing expert size. Shared experts: 1–2, always-on, absorbing common computation — consistently positive in the DeepSeek line. Routing function: sigmoid scoring with top-K selection then normalization (V3) is displacing softmax top-K — it decouples expert scores, easing balancing. Router in FP32; keep router LR matched to body LR; no noise injection needed with bias-based balancing. Expert-choice routing (Zhou et al. 2022) guarantees perfect balance but breaks causality for autoregressive decoding (it selects tokens per expert) — use token-choice for decoder LMs. Capacity factor and dropping: dropless (no capacity limit) is now preferred for training quality (Megablocks made it efficient; V3 trains drop-free), at the cost of variable per-expert batch sizes that your grouped-GEMM and all-to-all must handle. If you must cap (memory determinism), CF 1.0–1.25 train-time with monitored drop rate <1–2%; rising drop rate is your imbalance alarm. Load balancing: the modern hierarchy — (1) primary: auxiliary-loss-free bias adjustment (V3): per-expert routing bias updated proportional to load error (gamma ~1e-3 per step), influencing selection but not gradient weights; (2) backstop: tiny sequence-level balance loss (alpha 1e-4..1e-3); (3) avoid Switch-era alpha=1e-2 global aux losses — quantified quality tax. Balance at the right granularity: per-node/per-EP-group balance matters more than global (it is what bounds all-to-all hot spots). Communication: EP all-to-all volume per MoE layer ≈ tokens x d_model x topK x 2 bytes, twice (dispatch+combine) each direction, per forward and backward. Levers, in order: (a) topology-limited routing — cap each token's experts to M nodes (V3: ≤4) via group-limited top-K selection, converting global all-to-all into bounded-fanout exchanges; (b) overlap — schedule all-to-all under compute (DualPipe/zero-bubble variants; V3 dedicated ~20 SMs to comm kernels); (c) hierarchical all-to-all (intra-node NVLink aggregation, inter-node IB); (d) FP8 dispatch payloads; (e) place EP within rail-optimized domains, keep TP minimal (V3: TP=1 in training). Budget rule: all-to-all exposed time must stay <10–15% of step time; measure with the comm-stream timeline, not aggregate counters. Joint optimization procedure: fix sparsity ratio from serving economics → sweep granularity x topK on 3B/16B-total proxies fitting quality-per-active-FLOP curves → kernel-benchmark the surviving (expert size, expected tokens/expert) configs → simulate/measure all-to-all at target cluster shape under realistic imbalance (replay routing distributions from proxy checkpoints — not uniform assumptions) → lock the config and stress-test balancing dynamics with adversarial data bursts (domain-concentrated batches, which create the worst hot spots). Standing telemetry: per-expert load entropy and its drift, per-EP-group peak/mean load ratio (alarm >1.3), router-bias magnitudes (saturating biases mean structural imbalance the bias cannot fix — usually a data-domain/expert specialization conflict), and dropped-token or overflow rates per layer.
Q045	99	How should I co-design model architecture, kernels, network topology, storage, and power constraints for a frontier cluster?	5.5 Pro judge: 84 Opus 4.8 Judge: 85 Gemini 3.1 Pro judge: 64 Treat this as a controlled engineering decision, not a preference. Co-design the model and cluster around the training step's critical path: GEMMs, attention, collectives, data movement, checkpointing, and power/thermal constraints. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Choose d_model and FFN multiples of 128/256, head_dim 64-128, padded vocab, expert sizes large enough for grouped GEMMs, and sequence lengths compatible with FlashAttention tiles. Use NVLink/NVSwitch islands for TP/sequence parallelism, high-bisection fabrics for DP/PP/MoE, node-local NVMe caches for reads, and burst bandwidth for checkpoints. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Measure real collective bandwidth by group/rack, storage read and write headroom, power/clock behavior, thermals, and performance per watt. Require storage to sustain at least 1.2x training read demand plus checkpoint traffic. Review architecture with kernel and inference teams before final scaling. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Architecture dimensions misaligned with tensor cores, MoE that trains but cannot be served, long-context attention exceeding memory bandwidth, checkpoint I/O starving data loaders, power caps causing stragglers, and all-to-all competing with storage traffic. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Dao et al. 2022/2023 FlashAttention; Narayanan et al. 2021; Rajbhandari et al. 2020; Gale et al. 2022; Fedus et al. 2021; NVIDIA Transformer Engine/Megatron-Core practice..	5.5 Pro judge: 96 Opus 4.8 Judge: 93 Gemini 3.1 Pro judge: 88 Co-design means iterating one loop: model FLOPs/byte/communication profile ↔ kernel efficiency ↔ network traffic pattern ↔ storage/power envelope — with measured constants, until no single component is the binding constraint by a large margin. Start from the model's resource signature: a candidate architecture implies, per token: FLOPs (≈6N_active), HBM traffic (weights touched, KV, activations — determines arithmetic intensity and whether kernels are compute- or bandwidth-bound), inter-GPU bytes by class (TP all-reduce, PP point-to-point, EP all-to-all, DP gradient traffic), and host/storage demand. Build a static performance model (analytic roofline per layer plus collective-cost model from measured alpha-beta link parameters) before buying or laying out anything; validate it to within ~10% on a small testbed, then trust it for topology decisions. Architecture↔kernel co-design: choose dimensions kernels love — head_dim 128 (FlashAttention-3 sweet spot), d_model/d_ff multiples of 128/256, vocab padded to 128k-multiples for TP; GQA-8 or MLA to cut KV bandwidth; expert sizes set by grouped-GEMM efficiency curves (benchmark CUTLASS grouped-GEMM across candidate tokens-per-expert before fixing granularity); FP8 (E4M3 + 128x128 block scaling, FP32 accumulation — the DeepSeek-V3 recipe) only where kernel support is production-grade. If a proposed architectural novelty lacks an efficient kernel, price the kernel-engineering time into the decision or reject it — a 5% quality win at 40% kernel efficiency is a loss. Network topology from the traffic matrix: NVLink domain size determines max practical TP/CP (8 on HGX; 72 on GB200 NVL72 changes the design space — TP/EP domains of 72 with NVLink-speed all-to-all favor MoE); inter-node tier: rail-optimized fat-tree (8 rails matching 8 GPUs/node) so same-index GPUs share switch planes — this is what makes hierarchical DP all-reduce and bounded EP fanout cheap; size oversubscription by measured demand: PP+DP traffic tolerates 2:1+ oversubscription at the spine, EP all-to-all does not — if MoE, keep EP groups within full-bisection pods (V3's ≤4-node routing exists precisely to live within such a domain). Enable SHARP/in-network reduction for the DP tier. Leave headroom: checkpoint drains, eval jobs, and dataloader traffic share the fabric unless you physically separate storage/compute networks (do — dual-plane designs prevent checkpoint storms from stalling collectives). Storage: three tiers — node-local NVMe (snapshot tier + data cache; ≥2 GB/s/GPU read), parallel FS or object store sized for checkpoint bursts (a 1T-param run saves ~10+ TB per durable checkpoint; provision ≥1 TB/s aggregate burst or stagger), cold object archive. Dataloader steady-state is trivial (tens of GB/s cluster-wide) — the design driver is checkpoint burst + restore parallelism (restore p95 <5 min from the warm tier). Power and cooling as first-class constraints: training load oscillates with the step cycle (compute-heavy vs comm-heavy sub-phases) — megawatt-scale sub-second swings stress facility electrical systems (documented enough that vendors ship power-smoothing features; Llama 3 reported tens-of-MW swings from in-step oscillation). Mitigations: per-rack power capping with headroom, kernel-level smoothing (NVIDIA power-steering features), staggered job phases across pods, and validating facility response with a full-cluster synchronized load test before the run. Thermal: sustained-boost-clock stability under soak (throttling nodes become stragglers — burn-in must include thermal soak at production power). Governance of the loop: a single capacity model document tying (architecture candidate → predicted MFU, step time, checkpoint cadence, power draw) per cluster option; every architecture change above kernel-trivial re-runs the model; and a hardware-software weekly review during design — the expensive failures (TP across slow links, EP melting an oversubscribed spine, checkpoint storms) are all cross-team interface mistakes that this meeting exists to catch.
Q047	99	How should I use dense and MoE scaling laws, ablations, and online measurements to make mid-run decisions without wasting compute?	5.5 Pro judge: 84 Opus 4.8 Judge: 85 Gemini 3.1 Pro judge: 64 Treat this as a controlled engineering decision, not a preference. Use dense and MoE scaling laws for mid-run decisions, but predefine the decision rules to avoid sunk-cost bias and noise chasing. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Before final launch, fit dense and MoE laws from controlled ablations. During training, update forecasts with validation loss by domain, MFU, tokens consumed, failure overhead, router load, active FLOPs, and communication cost. At 1%, 5%, 10%, 25%, 50%, and anneal start, decide continue, branch, alter mixture, or stop. Limit branches to about 0.5%-2% of total compute each. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Branch from the same checkpoint, train equal tokens with the same validation manifests, and compare old/new validation distributions for mixture changes. For context changes, require short-context and long-context evals. For MoE routing changes, track expert entropy, p99 all-to-all, and quality. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Optimizing public evals at milestones, mistaking data-order noise for effect, overreacting to transient LR shifts, applying dense laws to sparse models, and ignoring wall-clock when MFU drops. Diagnose with posterior expected value and confidence bands. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Kaplan et al. 2020; Henighan et al. 2020; Hoffmann et al. 2022; Narayanan et al. 2021; Fedus et al. 2021; Gale et al. 2022; Dai et al. 2024..	5.5 Pro judge: 96 Opus 4.8 Judge: 93 Gemini 3.1 Pro judge: 92 Structure mid-run decision-making around three instruments — pre-registered predictions, cheap forked experiments, and online measurements with known noise bands — and a standing rule: no intervention without an instrument reading that justifies it. Pre-registration as the foundation: before launch, publish internally the fitted dense or MoE scaling curves (L(N,D) with confidence bands from bootstrap), the predicted loss-vs-tokens trajectory for the actual run, and predicted eval trajectories via the loss→benchmark two-stage mapping (the Llama 3 method: fit benchmark-likelihood vs loss across the model zoo, then compose; Dubey et al. 2024; Gadre et al. 2024). These predictions are the run's null hypothesis. Define alarm bands: trajectory deviation >2x the proxy-run residual scatter (typically ~0.005–0.01 nats) sustained over >20B tokens triggers investigation, not action — the investigation hierarchy is: eval/measurement bug → data pipeline drift → hardware (SDC) → genuine dynamics. Online measurements worth maintaining (cheap, continuous): per-domain validation losses (mixture-invariant), gradient-noise scale (estimates the evolving critical batch size — justifying batch ramps quantitatively rather than by folklore; McCandlish et al. 2018), smooth likelihood-based eval trajectories, MoE-specific gauges (expert-load entropy, routing drift, dropped tokens), and the loss-curve residual against prediction. For MoE runs, also track quality-per-active-FLOP against the dense reference curve you fitted — if the MoE advantage is eroding over training, you want to know by 30% through, not at the end. Forked experiments — the highest-information mid-run tool: from any milestone checkpoint, fork short branches (1–50B tokens) to test interventions before applying them to the mainline: annealing forks to choose the final-phase data mixture (anneal 10–50B tokens on candidates, eval, discard — OLMo 2's documented methodology; also "Does your data spark joy?" Blakeney et al. 2024); LR-floor or schedule-variant forks before committing the decay phase; batch-ramp forks reading the gradient-noise justification; for MoE, balancing-coefficient or bias-update-rate forks when load entropy drifts. Budget 1–3% of total compute for forks, pre-allocated, with a standing queue so fork capacity is never idle. Rules: forks change one variable; forks are evaluated on the frozen battery; fork results above the noise band are required before mainline changes. Ablation discipline mid-run: resist re-opening settled questions (architecture is frozen; relitigating it mid-run is dead compute). Legitimate mid-run ablations are confined to: data mixture weights, schedule tail shape, batch size, and post-pretraining plans — all testable by forks. Maintain the small-proxy ladder in parallel through the run: when a mainline anomaly appears, the first question is whether the 1B proxy on the same data window reproduces it at 1/1000 cost; a proxy-reproducible anomaly is a data/code problem, a proxy-silent one is scale-specific (precision, hardware, batch regime). Decision economics: every proposed intervention gets a one-page expected-value note — predicted gain (from instrument readings), cost (compute + risk of trajectory disruption + attribution loss), and the reversal plan. The asymmetries to respect: data-mixture changes are cheap and reversible; anything touching optimizer state or batch carries transient costs measured in billions of tokens; pausing to investigate costs cluster-time linearly but bad interventions compound. Historical base rate worth internalizing: most mid-run interventions in published retrospectives (OPT's logbook being the extreme) were responses to instability, not optimization — a healthy run tracking its pre-registered curve should expect to make roughly zero unforced changes between phase boundaries, and a team making weekly mainline tweaks is burning attribution and compute simultaneously.
Q049	99	How should I design a multimodal frontier model that unifies text, code, images, audio, and video without sacrificing language quality?	5.5 Pro judge: 83 Opus 4.8 Judge: 86 Gemini 3.1 Pro judge: 66 Treat this as a controlled engineering decision, not a preference. A multimodal frontier model should add modality competence without sacrificing language quality. Stage integration rather than training all modalities jointly from scratch unless you have overwhelming data and compute. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Start from a strong text/code model. Add ViT/ConvNeXt-style image encoders, spectrogram/waveform audio encoders, temporal video encoders, and projection adapters or cross-attention into the language decoder. Reserve modality boundary, timestamp, bounding-box, and tool-output tokens. Keep 30%-70% text-only data during multimodal phases depending on stage. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Stage 1 contrastive/caption/ASR alignment; stage 2 multimodal instruction tuning; stage 3 verifier/preference/RL on OCR exact match, chart QA, visual math, screenshot-to-code, ASR WER, and temporal localization. Evaluate text/code regression, VQA, OCR, charts, diagrams, localization, counting, adversarial images, multilingual audio, and video temporal order. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Caption bias replacing visual grounding, hallucinated objects, OCR brittleness, video models ignoring temporal order, language degradation after multimodal SFT, unsafe image interpretation, and compute wasted on redundant frames/pixels. Diagnose with near-duplicate image splits and modality token budgets. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Radford et al. 2021 CLIP; Alayrac et al. 2022 Flamingo; Li et al. 2023 BLIP-2; Liu et al. 2023 LLaVA; Radford et al. 2022 Whisper; Girdhar et al. 2023 ImageBind; GPT-4V/Gemini technical reports..	5.5 Pro judge: 96 Opus 4.8 Judge: 93 Gemini 3.1 Pro judge: 93 The central design choice is where modality fusion happens; the current evidence supports early-fusion token-unified training for new flagship builds, with the language budget protected by mixture discipline rather than architectural separation. Architecture: (1) Vision input: a ViT encoder feeding continuous embeddings through a projector (the Llava/Qwen-VL/InternVL lineage) remains the most robust path to strong image understanding; native-resolution handling via tiling or NaViT-style packing; encoder around 300M–6B params, pretrained (SigLIP-class) then unfrozen in later stages. (2) Generation/unification: discrete tokenization (VQ/FSQ for images, codec tokens for audio — e.g., SoundStream/Encodec-style at 25–75 tokens/sec) enables a single autoregressive objective over all modalities (Chameleon, Meta 2024; AudioLM lineage) but costs perceptual fidelity and inflates sequence lengths; interleaved continuous-embedding input with diffusion or flow heads for generation is the alternative that protects understanding quality. A defensible frontier default: continuous-embedding inputs for image/video/audio understanding + dedicated generation heads, unified in one transformer body — full discrete-token unification only if any-to-any generation is a product requirement. (3) Video: frame sampling at 1–2 fps with temporal pooling/resampling into a token budget of 32–128 tokens/frame — token-budget engineering is the entire video game. (4) Audio: dual-path (semantic tokens for understanding, acoustic for generation) where speech quality matters. Preserving language quality — the named risk — is a data-budget and curriculum problem with known failure data: Chameleon documented severe stability and balance challenges in early fusion (requiring QK-norm, z-loss, and norm-reordering to train at all, and careful mixture control to protect text); every VLM line reports text-benchmark regression when multimodal data is overweighted or introduced abruptly. The working recipe: (a) keep text dominant — multimodal tokens at 10–40% of the total mixture in joint phases, with text never below ~60% during any phase that updates the full body (Gemini-class and open VLM recipes converge here in spirit if not published numbers); (b) stage the curriculum: text-first bulk pretrain (or a strong text-pretrained init), then joint mid-training mixing interleaved image-text/video/audio with continued text, then modality-specific capability stages — rather than fully joint from token one, unless you are explicitly buying Chameleon-style native fusion and pricing its stability tax; (c) gate every multimodal stage on the full text battery with hard no-regression thresholds (MMLU/code/reasoning drift >1 point blocks); (d) freeze-then-unfreeze: early adapter/projector training with the LM body frozen, unfreezing the body only with the balanced mixture in place. Stability specifics for mixed-modality training: per-modality loss weighting normalized by token counts and information density (a video frame's 64 tokens should not outvote 64 text tokens at equal weight — tune per-modality loss multipliers on proxies); QK-norm and z-loss on by default (Chameleon's published necessity); separate or carefully-shared embedding norms for modality-specific token spaces; watch per-modality gradient-norm ratios — one modality dominating gradients predicts its capture of shared capacity. Data: interleaved web documents (image-text in natural order — OBELICS-class) outperform caption-pairs alone for in-context multimodal ability; synthetic captioning at scale (recaptioned web images with strong VLMs) is now standard for quality; dedup and decontaminate per-modality (perceptual hashes for images against eval sets). Evaluation: per-modality batteries (MMMU, MathVista, DocVQA, Video-MME, audio understanding suites) plus the protected text suite, plus cross-modal transfer probes (does chart reasoning improve math-in-text?) tracked per stage — cross-modal synergy is the upside that justifies unification; if you are not measuring it, you are only measuring the tax.
Q042	100	How do I plan a complete frontier-scale training run end to end, including cluster, data, architecture, pretraining, and post-training?	5.5 Pro judge: 84 Opus 4.8 Judge: 86 Gemini 3.1 Pro judge: 65 Treat this as a controlled engineering decision, not a preference. A frontier run is a program, not a job. Start from objectives, convert them to requirements, then freeze artifacts, rehearsals, runbooks, gates, and ownership before launch. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: Define target capabilities, languages, modalities, context length, inference budget, safety constraints, and launch tier. Budget training as about 6ND dense active FLOPs plus overhead for attention, MoE, eval, failures, checkpoints, and post-training. Reserve 10%-25% of total budget for pilots, ablations, restarts, and post-training. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Finish data/tokenizer/mixture work before final launch. Select architecture by equal-compute ablations. Choose topology by measured throughput on a realistic cluster slice including checkpoint and validation traffic. Hold token milestones at 1%, 5%, 10%, 25%, 50%, 75%, anneal, and final to decide continue, branch, modify, or stop. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Last-minute data changes, untested checkpoint restore, eval pipeline lag, governance arriving after capabilities emerge, and architecture incompatible with serving. Diagnose with launch checklists, side restores, data-loader stress, and post-training proxies from multiple base checkpoints. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Kaplan et al. 2020; Hoffmann et al. 2022; Narayanan et al. 2021; Penedo et al. 2023; Ouyang et al. 2022; Bai et al. 2022; Liang et al. 2022..	5.5 Pro judge: 96 Opus 4.8 Judge: 93 Gemini 3.1 Pro judge: 92 Plan backwards from the target and forwards from the constraints; the deliverable is a phased plan in which every major commitment carries a validation gate and a contingency. Step 1 — define the product target: capability envelope (reasoning tier, context length, languages, modalities, agent reliability), serving cost ceiling (this sets active parameters), and ship date. Derive the compute envelope from cluster size x duration x realistic MFU (dense ~40%, MoE ~25%) — e.g., 25k H100s x 100 days x 35% ≈ 2e26 FLOPs class. Step 2 — cluster and infra readiness (begins months early): burn-in the fleet (GEMM/HBM/NCCL stress; reject >3%-off-median nodes), measure—not assume—bisection bandwidth and storage throughput, deploy the telemetry stack (per-rank step times, DCGM/XID, fabric counters), hot-spare pool 1–2%, async hierarchical checkpointing drilled with kill-tests, automated drain-resume. Expect Llama-3-magnitude interruption rates (one every few hours at 16k GPUs; Dubey et al. 2024) and size goodput targets accordingly (≥90%). Step 3 — data program: corpus assembly through the full pipeline (extraction, LID, heuristics, MinHash dedup, model-based quality filtering, PII/safety, central decontamination) to a versioned snapshot ~1.5–2x larger than the planned token budget; premium-data reservation for annealing; mixture optimization via 1B-proxy sweeps; epoch caps (≤4) enforced. Parallel track: synthetic data generation (math solutions, code with tests) with execution-validation. Step 4 — architecture and budget via scaling science: muP-parameterized proxy ladder (40M→7B) on the production data; isoFLOP fits for N-vs-D; architecture freeze gated on neutral-or-better scaling plus stability envelope; MoE-vs-dense decision driven by serving economics (trillion-total/30–40B-active if serving cost dominates). Pre-register predicted loss and eval trajectories with error bands — these become the run's health reference. Step 5 — pretraining execution: bulk phase (~80–85% tokens) on the stationary mixture with WSD or cosine; the operations loop — 24/7 on-call, loss-spike runbook (rollback + data-skip à la PaLM), weekly MFU and goodput reviews, milestone checkpoints every 50–100B tokens; mid-run decisions only via the pre-registered trajectory comparison plus annealing-fork experiments (1–3% compute reserved). Then annealing/mid-training (premium upweighted, LR→0), staged long-context extension (RoPE rescaling + length-balanced data, gated on RULER + short-context no-regression). Step 6 — post-training: SFT (instruction + long-CoT) → iterative preference optimization (RM+PPO or DPO rounds with fresh on-policy preferences) → RLVR (GRPO on verified math/code/agentic tasks with hardened reward channels) → safety finishing pass. Each sub-stage gated on capability-gain + general-no-regression + safety batteries. Step 7 — evaluation and governance as continuous infrastructure, not a phase: tiered eval CI on every checkpoint, contamination monitors (public-private eval gaps), frontier-risk evals at capability milestones feeding a written go/no-go framework (RSP/Preparedness-style), red-teaming for release candidates, and a final eval ledger shipping with the model. Cross-cutting budget discipline (typical allocation): ~70–80% of compute to the final pretrain, 5–10% scaling science and ablations, 1–3% anneal forks and mid-run experiments, 5–10% post-training, 1–3% evaluation, rest contingency. Calendar discipline: data and scaling science dominate the critical path before launch day — the most common planning failure is compressing them to feed the cluster sooner; an idle week of GPUs is cheaper than a 2e26-FLOP run on the wrong mixture. Freeze order: tokenizer → architecture → data snapshot → schedule; each freeze gets a named owner, a gate review, and a documented rollback condition.
Q048	100	How do I train a frontier model for long-horizon reasoning and agentic tool use while maintaining reliability under distribution shift?	5.5 Pro judge: 85 Opus 4.8 Judge: 86 Gemini 3.1 Pro judge: 69 Treat this as a controlled engineering decision, not a preference. Long-horizon reasoning and agentic tool use require trajectory training, verifiable outcomes, state management, and distribution-shift evaluation. Single-turn instruction data is insufficient. For a lab aiming at frontier-level intelligence, write the rule into the run manifest before launching: what is fixed, what may be swept on proxy runs, what metric decides promotion, and what rollback condition stops the phase. The important habit is to convert every tacit training recipe into an auditable intervention with token counts, seeds, data manifests, code commits, and evaluation hashes. Concrete starting configuration: SFT on traces with decomposition, tool selection, state summaries, error recovery, and stopping behavior. Include 2-step to 50+ step controlled tasks. Use RLVR for code hidden tests, math symbolic checks, simulated web/navigation, database queries, and structured planning games. Optimize with PPO/GRPO under KL control; reward final success heavily and add small shaping for valid tool calls, consistency, and cost. Use these numbers as priors, then run narrow sweeps on 100M-3B parameter proxies and one intermediate model large enough to exercise the real kernels and parallelism. Compare at equal tokens and equal active FLOPs, not equal wall-clock unless the decision is explicitly about throughput. Keep random seeds, tokenizer, data order, validation manifests, and optimizer semantics fixed across the sweep. Promote a setting only if it improves the target metric without violating regression gates on language, code, math, multilinguality, safety, calibration, or serving cost. Implementation and validation protocol: Measure success versus horizon length, number of tool calls, branching factor, observation noise, schema variation, API errors, distractors, languages, and ambiguity. Train abstention/clarification when evidence is insufficient. Use retrieval provenance and constrained structured outputs for action validity. Log the metric by domain, language, sequence length, source, and data-quality bucket whenever those strata are relevant. Use private and time-sliced holdouts for any capability likely to be contaminated by web data or synthetic data. For all changes, run a short shadow continuation from the same checkpoint before changing the main branch. The shadow run should include checkpoint save/restore, validation, sample generation, and throughput measurement, because many apparent scientific regressions are actually data-loader, precision, or distributed-systems faults. Known failure modes and diagnostics: Reward hacking tests, brittle plans under schema changes, overcalling tools, compounding hallucinations, prompt injection through retrieved text, high pass@k but low pass@1, and losing track despite long context. Diagnose with environment simulations and calibration on failing trajectories. Alert on loss spikes above five rolling standard deviations, gradient norm above three to five times its median, NaN/Inf tensors, realized mixture drift above one to two percentage points for major buckets, and p99 step time far above p50. Maintain a written decision record showing why the chosen value beat alternatives. When a metric moves unexpectedly, first localize by stratum and by rank: data-correlated anomalies, rank-correlated anomalies, and schedule-correlated anomalies have different root causes. Published anchors: Wei et al. 2022 chain-of-thought; Wang et al. 2022 self-consistency; Yao et al. 2022 ReAct; Schick et al. 2023 Toolformer; Lewis et al. 2020 RAG; Chen et al. 2021 Codex; Shao et al. 2024; Guo et al. 2025..	5.5 Pro judge: 97 Opus 4.8 Judge: 94 Gemini 3.1 Pro judge: 88 Long-horizon agentic reliability is a compounding-error problem: a policy with per-step success p delivers roughly p^n over n steps, so the training target is not better average steps but error-recovery, calibrated uncertainty, and robustness to states the model created itself. Build the program in four parts. Foundations in mid/post-training data: long-horizon behavior needs long-horizon supervision — multi-step agent trajectories (tool chains, code-edit-test loops, web/OS workflows) at 10k–100k+ scale, including: explicit recovery segments (errors followed by diagnosis and repair — harvest these by injecting failures into otherwise-successful trajectories and re-rolling completions), reflection/verification steps (the trajectory pattern where the agent checks its own intermediate result before proceeding), and abandonment/escalation examples (knowing when to stop or ask). Loss-mask environment tokens; train on model decisions. SFT on flawless expert trajectories alone produces brittle agents — the distribution-shift classic (DAgger's lesson, Ross et al. 2011): the policy visits states the expert never showed, so deliberately include self-generated states with corrective labels. RL for horizon extension: outcome-rewarded RL on verifiable long tasks (SWE-bench-style repo fixes with CI rewards, multi-step tool workflows with state-check rewards, computer-use tasks in instrumented VMs). Key design choices: (a) reward end-state success dominantly; per-step shaping only for hard-exploration tasks and removed once learning starts (shaped rewards get gamed over long horizons); (b) curriculum over horizon length — start at 3–5 step tasks, extend as pass rates rise; track pass^k (all-of-k-trials success; tau-bench's metric) as the reliability gauge, since pass@1 hides flakiness; (c) massive rollout infrastructure: thousands of concurrent sandboxed environments (containerized repos, browser VMs) with pinned dependencies and de-flaked verifiers — environment nondeterminism is your reward-noise floor, measure it (re-run identical policies; >2–3% outcome variance needs fixing before training); (d) long-context training support — trajectories of 100k+ tokens require the context-extension work upstream and memory/summarization tooling where contexts exceed even that. Distribution-shift hardening — the part most programs skip: train and evaluate under systematic perturbation: tool API changes (renamed functions, schema-version drift), environment-version skew (dependency upgrades breaking learned recipes), adversarial content in observations (prompt injection in retrieved pages/tool outputs — train explicit data-vs-instruction separation and verify with injection suites), novel-but-related task families held out entirely from training (the generalization measure that matters), and degraded conditions (timeouts, partial failures, rate limits). The robustness training signal: mix perturbed variants into RL at 20–40% rates; the evaluation signal: report the canonical-vs-perturbed performance gap and treat its growth as overfitting to the training environment distribution. Reliability machinery at the policy level: calibrated confidence — train verbalized or token-level uncertainty so downstream systems can route low-confidence states to humans/retries (evaluate with selective-prediction metrics: risk-coverage curves, not just accuracy); self-verification habits rewarded in RL (re-run tests before declaring success — agents that check their work convert silent failures into retried steps, directly raising p^n); bounded-resource discipline (penalize unbounded retry loops and tool spam; reward efficient completion) so reliability does not come at pathological cost. Evaluation gates: pass^8 on held-out task families, perturbation-gap metrics, injection-resistance suites, long-horizon benchmarks (SWE-bench Verified, OSWorld, tau-bench, METR-style autonomy tasks with time-horizon scoring), and regression checks on base capabilities (agentic RL measurably erodes general chat/knowledge if unmixed — keep 10–20% general post-training data in every phase). Expect the horizon-capability curve to move in discrete steps with each infrastructure improvement (better verifiers, longer context, more environments) rather than smoothly with compute — plan the program around environment engineering as much as around model training.