arbdwj

Tracing Eval-Awareness Emergence Through Training of OLMo 3

Wed, 10 Jun 2026 12:00:00 +0000

A post on LessWrong tracing how a model’s awareness of being evaluated emerges across the training stages of OLMo 3: negligible during pretraining, rising with supervised fine-tuning, dropping with DPO, and rising again substantially during RL.

Read it on LessWrong →

Scaling Laws for LLM-Based Data Compression

Wed, 23 Jul 2025 10:00:00 +0000

How LLMs compress text, images, and speech, and the universal power laws that govern compression ratio as a function of model and data scale.

Read it on LessWrong →

Experiments with the Platonic Representation Hypothesis

Sun, 27 Oct 2024 10:00:00 +0000

Testing whether the Platonic Representation Hypothesis — that models across modalities converge to a shared statistical model of reality — still holds once you move beyond in-distribution data.

Read it on LessWrong →

Understanding Hidden Computations in Chain-of-Thought Reasoning

Wed, 28 Aug 2024 01:49:00 +0000

Investigating how transformers keep reasoning when chain-of-thought steps are replaced by filler tokens, using the 3SUM task — and a modified greedy decoding scheme that recovers the hidden computation with 100% consistency.

Read it on LessWrong →

Adversarial training against goal misgeneralization is ELK-hard

Fri, 24 Mar 2023 10:00:00 +0000

An argument that solving goal misgeneralization in the worst case reduces to Eliciting Latent Knowledge: any adversarial-training scheme that relies on a non-deceiving prediction head runs straight into the ELK problem.

Read it on LessWrong →

The AGI needs to be honest

Sat, 16 Oct 2021 10:00:00 +0000

On why certifying that a superintelligent system is intelligent first requires proving that it is honest, and why honesty is the hard-to-find global optimum among easy-to-find deceptive optima.

Read it on LessWrong →