v1 — Baseline Joint Training

GPT-2 Small (124M) · 50,000 steps · March 1–3, 2026 · $27.62 total

Target Achievement — 3 of 5 Met

Metric           Value    Target    Status
AR Perplexity    26.9     < 40      Met
AUROC            0.854    > 0.75    Met
ECE              0.010    < 0.05    Met
Diffusion Loss   4.13     < 4.0     Near (97% of target)
S1 Accuracy      28.7%    40%       Missed (72% of target)

Percentages indicate progress toward the unmet target.
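
AUROC and ECE track the confidence signal's discrimination and calibration, respectively. Below is a minimal sketch of how they could be computed; the variable names, equal-width binning, and use of scikit-learn are assumptions, not the run's actual evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score  # assumed dependency

def expected_calibration_error(confidences, labels, n_bins=10):
    """Equal-width-bin ECE: bin-weighted mean |empirical accuracy - mean confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = labels[in_bin].mean()        # fraction of positives in the bin
            conf = confidences[in_bin].mean()  # mean predicted confidence in the bin
            ece += in_bin.mean() * abs(acc - conf)
    return ece

# AUROC is the standard ranking metric over the same scores:
# auroc = roc_auc_score(labels, confidences)
```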

Training Trajectory

Evaluation Metrics by Step

Step     AR PPL   Diff Loss   S1 Acc   AUROC   ECE
50       22.4     7.91         3.7%    0.502   0.048
1,000    21.2     6.86         4.8%    0.559   0.003
5,000    24.3     6.11         9.3%    0.695   0.011
10,000   26.5     5.41        14.7%    0.791   0.011
20,000   28.8     4.61        22.3%    0.847   0.005
30,000   28.5     4.33        23.3%    0.860   0.009
40,000   27.4     4.21        27.0%    0.870   0.010
50,000   26.9     4.13        28.7%    0.854   0.010
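
For orientation, AR PPL is the usual exponentiated mean per-token negative log-likelihood, so the final row corresponds to an AR loss of roughly 3.29 nats/token. A one-liner, assuming the eval already reports mean NLL in nats:

```python
import math

def perplexity(mean_nll_nats: float) -> float:
    """AR perplexity: exp of the mean per-token negative log-likelihood."""
    return math.exp(mean_nll_nats)

print(perplexity(3.29))  # ~26.8, matching the final-row AR PPL of 26.9
```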

Key Observations

- AR perplexity rose from an early low of 21.2 to 28.8 by step 20k, then recovered to 26.9 by the end of the run, comfortably under the < 40 target.
- Diffusion loss fell monotonically from 7.91 to 4.13 and was still trending down at step 50k, just short of the < 4.0 target.
- S1 accuracy was still climbing at the end of training (27.0% → 28.7% over the final 10k steps), suggesting the 40% target may simply need a longer run.
- AUROC peaked at 0.870 at step 40k before settling at 0.854; ECE stayed at or below 0.011 from step 1,000 onward.

Spot Instance Cost History

#   AZ           Steps           Boot (UTC)         Event       Cost
1   us-east-1a   2.3k → 28.7k    2026-03-01 02:33   Reclaimed   $14.66
2   us-east-1f   28.8k → 31.8k   2026-03-02 12:25   Reclaimed   $2.14
3   us-east-1b   31.8k → 35k     2026-03-02 17:17   Reclaimed   $2.25
4   us-east-1b   35k → 50k       2026-03-02 22:24   Completed   $8.57
Total across 4 instances (3 reclamation recoveries): $27.62
g5.2xlarge spot at ~$0.43/hr (~63% savings vs. on-demand) · Checkpoints every 1,000 steps · Autonomous bootstrap recovery
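
The recovery pattern behind those four instances is simple: persist a checkpoint every 1,000 steps, and on every boot resume from the newest one. A minimal sketch of that idea, assuming PyTorch; the paths, file naming, and function names are illustrative, not the run's actual bootstrap script.

```python
import glob
import os
import torch

CKPT_DIR = "checkpoints"   # assumed location, not from the run
CKPT_EVERY = 1_000         # matches the cadence described above

def latest_checkpoint():
    """Return the path of the highest-step checkpoint, or None if none exist."""
    paths = glob.glob(os.path.join(CKPT_DIR, "step_*.pt"))
    if not paths:
        return None
    return max(paths, key=lambda p: int(p.split("_")[-1].split(".")[0]))

def save(step, model, optimizer):
    """Write checkpoint to a temp file, then atomically rename into place."""
    tmp = os.path.join(CKPT_DIR, f"step_{step}.pt.tmp")
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, tmp)
    os.replace(tmp, os.path.join(CKPT_DIR, f"step_{step}.pt"))

def resume(model, optimizer):
    """Restore the newest checkpoint; return the step to continue from."""
    path = latest_checkpoint()
    if path is None:
        return 0  # fresh start
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1  # continue from the step after the save
```

The atomic rename matters on spot instances: a reclamation mid-write leaves only a stale `.tmp` file behind, so the newest complete checkpoint is always loadable.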

Run Configuration

Model         GPT-2 Small (124M)
Config        tiny.yaml
Total Steps   50,000
Precision     bfloat16
GPU           NVIDIA A10G (24GB)
Instance      g5.2xlarge (spot)
λ AR / Diff   1.0 / 1.0 (combined as sketched below)
Checkpoints   50 (every 1k steps)
Data          OpenWebText
Tokenizer     GPT-2 (50,257 vocab)
Duration      ~48 hours
Total Cost    $27.62
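
The λ weights above imply a joint objective that is a weighted sum of the autoregressive and diffusion losses. A minimal sketch of one training step under that assumption, using PyTorch with bfloat16 autocast as configured; the model interface and names are illustrative, not the actual training code.

```python
import torch

LAMBDA_AR, LAMBDA_DIFF = 1.0, 1.0  # λ AR / Diff from the table above

def training_step(model, batch, optimizer):
    # `model` is assumed to return both objectives' losses for a batch.
    with torch.autocast("cuda", dtype=torch.bfloat16):  # Precision: bfloat16
        ar_loss, diff_loss = model(batch)
        loss = LAMBDA_AR * ar_loss + LAMBDA_DIFF * diff_loss
    optimizer.zero_grad(set_to_none=True)
    loss.backward()   # backward runs outside autocast, as recommended
    optimizer.step()
    return loss.item()
```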