v2 Run Report — Dual-Process LM

Target Achievement — 3 of 5 Met

Metric	v2	Target (progress)	Status	v1 (v2 change)
AR Perplexity	29.65	< 40	Met	26.9 (+10%)
AUROC	0.863	> 0.75	Met	0.854 (+1%)
ECE	0.009	< 0.05	Met	0.010 (-10%)
Diffusion Loss	4.70	< 4.0 (83%)	Missed	4.13 (+14%)
S1 Accuracy	22.0%	40% (55%)	Missed	28.7% (-23%)
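
The relative changes against v1 and the met/missed verdicts above can be rechecked with a few lines of arithmetic. The sketch below is only a sanity check on the reported numbers: the dictionary layout and the `higher_is_better` flag are illustrative and do not come from the run's tooling.

```python
# Values copied from the target summary above. The "higher_is_better" flag is
# an illustrative field, not something the report defines.
METRICS = {
    "AR Perplexity":  {"v2": 29.65, "v1": 26.9,  "target": 40.0, "higher_is_better": False},
    "AUROC":          {"v2": 0.863, "v1": 0.854, "target": 0.75, "higher_is_better": True},
    "ECE":            {"v2": 0.009, "v1": 0.010, "target": 0.05, "higher_is_better": False},
    "Diffusion Loss": {"v2": 4.70,  "v1": 4.13,  "target": 4.0,  "higher_is_better": False},
    "S1 Accuracy":    {"v2": 22.0,  "v1": 28.7,  "target": 40.0, "higher_is_better": True},
}


def pct_change(v2, v1):
    """Relative change of v2 against v1, in percent."""
    return 100.0 * (v2 - v1) / v1


def target_met(m):
    """Strict comparison against the target; no metric here sits on the boundary."""
    if m["higher_is_better"]:
        return m["v2"] > m["target"]
    return m["v2"] < m["target"]


changes = {name: round(pct_change(m["v2"], m["v1"])) for name, m in METRICS.items()}
met = [name for name, m in METRICS.items() if target_met(m)]

for name, m in METRICS.items():
    status = "Met" if target_met(m) else "Missed"
    print(f"{name}: {changes[name]:+d}% vs v1, {status}")
print(f"{len(met)} of {len(METRICS)} met")
```

Note that a positive change is a regression for the lower-is-better metrics (AR perplexity, ECE, diffusion loss) and an improvement for AUROC and S1 accuracy.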

Training Trajectory (v2 vs v1)

Evaluation Metrics by Step

Step	AR PPL	Diff Loss	S1 Acc	AUROC	ECE
1,000	21.4	6.74	5.0%	0.557	0.002
5,000	25.6	4.97	18.0%	0.824	0.003
10,000	28.4	4.37	25.1%	0.850	0.004
20,000	30.9	4.75	21.3%	0.851	0.007
30,000	31.7	4.08	28.1%	0.864	0.023
40,000	30.2	4.56	23.2%	0.857	0.007
50,000	29.65	4.70	22.0%	0.863	0.009
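
To read the table by metric rather than by step, the following sketch finds the evaluation step at which each metric was best. The rows are copied from the table; the column directions (lower or higher is better) are assumptions based on the targets stated in this report, and the helper names are illustrative.

```python
# (step, AR PPL, diffusion loss, S1 accuracy %, AUROC, ECE), copied from the table.
ROWS = [
    (1_000, 21.4, 6.74, 5.0, 0.557, 0.002),
    (5_000, 25.6, 4.97, 18.0, 0.824, 0.003),
    (10_000, 28.4, 4.37, 25.1, 0.850, 0.004),
    (20_000, 30.9, 4.75, 21.3, 0.851, 0.007),
    (30_000, 31.7, 4.08, 28.1, 0.864, 0.023),
    (40_000, 30.2, 4.56, 23.2, 0.857, 0.007),
    (50_000, 29.65, 4.70, 22.0, 0.863, 0.009),
]

# metric name -> (column index in ROWS, higher_is_better)
COLUMNS = {
    "AR PPL": (1, False),
    "Diff Loss": (2, False),
    "S1 Acc": (3, True),
    "AUROC": (4, True),
    "ECE": (5, False),
}


def best_step(metric):
    """Return the evaluation step with the best value for the given metric."""
    idx, higher_is_better = COLUMNS[metric]
    pick = max if higher_is_better else min
    return pick(ROWS, key=lambda row: row[idx])[0]


best = {metric: best_step(metric) for metric in COLUMNS}
for metric, step in best.items():
    print(f"{metric}: best at step {step:,}")
```

On these rows, diffusion loss, S1 accuracy, and AUROC all peak at step 30,000, while AR perplexity and ECE are best at step 1,000.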

Key Observations

Regression: λ_diff=2.0 Backfired on S1 Accuracy. Doubling the diffusion loss weight from v1's 1.0 was intended to lift S1 token accuracy above v1's 28.7%. Instead, S1 accuracy peaked at 28.1% (step 30k), then fell to 22.0% by step 50k. The higher λ appears to have destabilized the diffusion objective rather than strengthened it, which suggests the bottleneck is not gradient magnitude but interference between the AR and diffusion objectives.
Concern: Diffusion Loss Volatility. Diffusion loss was non-monotonic: 6.74 (step 1k) → 4.08 (step 30k, the v2 best) → 4.70 (step 50k). The climb from 4.08 back to 4.70 over the final 20k steps, following an earlier rebound to 4.75 at step 20k, indicates training instability. v1's smoother 7.91 → 4.13 descent suggests λ=1.0 gave more stable optimization.
Finding: AR Perplexity Degraded but Recovered. AR PPL peaked at 31.7 (step 30k), worse than v1's 28.8 peak, then partially recovered to 29.65 (v1 finished at 26.9). The higher diffusion weight imposed a greater tax on the AR objective. Despite this, the final PPL remains well within the < 40 target, and the recovery trend mirrors v1's behavior.
Positive: Confidence Head Robust. AUROC (0.863) and ECE (0.009) both improved slightly over v1 (0.854 / 0.010). The confidence head appears insensitive to the λ rebalancing and stayed well calibrated under both weightings tested. This is encouraging for the escalation mechanism.
Finding: Spot Resilience Validated at Scale. v2 endured 31 spot reclamations across 15 instances and 3 availability zones, 10× more interruptions than v1, without data loss. The bootstrap + S3 sync architecture handled all recoveries autonomously. Total cost stayed within budget at $31.44 (62% spot savings).
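
To make the λ discussion concrete, here is a minimal sketch of how the two weights might enter training. It assumes the total objective is a plain weighted sum of the AR and diffusion losses, which this report does not confirm, and it assumes AR perplexity is the exponential of a cross-entropy loss in nats. The function names are hypothetical. Both loss values are held at v2's final numbers purely to isolate the effect of the weight.

```python
import math


def combined_loss(ar_loss, diff_loss, lam_ar=1.0, lam_diff=2.0):
    """Weighted sum of the two objectives (assumed form, not confirmed by the report)."""
    return lam_ar * ar_loss + lam_diff * diff_loss


def diffusion_share(ar_loss, diff_loss, lam_ar, lam_diff):
    """Fraction of the combined loss contributed by the weighted diffusion term."""
    weighted_diff = lam_diff * diff_loss
    return weighted_diff / (lam_ar * ar_loss + weighted_diff)


# Assumption: PPL = exp(cross-entropy), so the final AR loss is ln(29.65).
ar_loss = math.log(29.65)
diff_loss = 4.70  # final v2 diffusion loss from the report

share_at_1 = diffusion_share(ar_loss, diff_loss, lam_ar=1.0, lam_diff=1.0)
share_at_2 = diffusion_share(ar_loss, diff_loss, lam_ar=1.0, lam_diff=2.0)

print(f"combined loss at lam_diff=2.0: {combined_loss(ar_loss, diff_loss):.2f}")
print(f"diffusion share at lam_diff=1.0: {share_at_1:.0%}")
print(f"diffusion share at lam_diff=2.0: {share_at_2:.0%}")
```

Under these assumptions the diffusion term's share of the scalar loss rises from roughly 58% to roughly 74% when λ_diff doubles. Loss share is not the same as gradient share, so this is only a rough indication of how much the rebalancing shifts emphasis away from the AR objective.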

Spot Instance Cost History

#	Type	AZ	Steps	Cost	Finalized
1	g5.2xlarge	us-east-1b	1 – 9,300	$5.11	Yes
2	g5.2xlarge	us-east-1b	9,100 – 10,600	$0.16	Yes
3	g5.2xlarge	us-east-1b	10,600 – 11,200	$1.34	Yes
4	g5.xlarge	us-east-1b	11,300	$0.00	Yes
5	g5.xlarge	us-east-1b	11,300	$0.17	Yes
6	g5.2xlarge	us-east-1a	11,100 – 12,500	$0.97	Yes
7	g5.2xlarge	us-east-1a	12,200 – 12,500	$0.31	Yes
8	g5.2xlarge	us-east-1f	12,300 – 13,200	$0.81	Yes
9	g5.xlarge	us-east-1f	13,200 – 14,500	$1.11	Yes
10	g5.2xlarge	us-east-1f	14,100 – 21,900	$4.39	Yes
11	g5.2xlarge	us-east-1b	21,000 – 24,100	$2.12	Yes
12	g5.2xlarge	us-east-1a	24,000 – 28,500	$2.54	Yes
13	g5.2xlarge	us-east-1b	28,600 – 41,000	$7.23	Yes
14	g5.xlarge	us-east-1b	41,000	$0.09	Yes
15	g5.2xlarge	us-east-1b	41,000 – 50,000	$5.10	Yes

Total (15 sessions, 31 reclamations, 3 AZs): $31.44 (62% spot savings)
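
The total can be cross-checked against the per-session rows. The sketch below sums the listed costs in integer cents to avoid floating-point drift, counts the distinct availability zones, and backs out the on-demand cost implied by the savings figure. It assumes "62% spot savings" means the spot bill was 38% of the equivalent on-demand bill, which the report does not state explicitly.

```python
# (instance type, availability zone, cost in cents), copied from the session table.
SESSIONS = [
    ("g5.2xlarge", "us-east-1b", 511),
    ("g5.2xlarge", "us-east-1b", 16),
    ("g5.2xlarge", "us-east-1b", 134),
    ("g5.xlarge", "us-east-1b", 0),
    ("g5.xlarge", "us-east-1b", 17),
    ("g5.2xlarge", "us-east-1a", 97),
    ("g5.2xlarge", "us-east-1a", 31),
    ("g5.2xlarge", "us-east-1f", 81),
    ("g5.xlarge", "us-east-1f", 111),
    ("g5.2xlarge", "us-east-1f", 439),
    ("g5.2xlarge", "us-east-1b", 212),
    ("g5.2xlarge", "us-east-1a", 254),
    ("g5.2xlarge", "us-east-1b", 723),
    ("g5.xlarge", "us-east-1b", 9),
    ("g5.2xlarge", "us-east-1b", 510),
]

total_cents = sum(cents for _, _, cents in SESSIONS)
zones = sorted({az for _, az, _ in SESSIONS})

REPORTED_TOTAL = 31.44  # dollars, from the report
SPOT_SAVINGS = 0.62     # assumed to mean spot cost = (1 - 0.62) * on-demand cost
implied_on_demand = REPORTED_TOTAL / (1 - SPOT_SAVINGS)

print(f"sessions: {len(SESSIONS)}, zones: {zones}")
print(f"sum of listed costs: ${total_cents / 100:.2f}")
print(f"implied on-demand cost: ${implied_on_demand:.2f}")
```

The listed rows sum to $31.45, one cent above the reported $31.44, which is consistent with each row having been rounded independently.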

Configuration

Model

GPT-2 Small (124M)

Config

tiny.yaml

Total Steps

50,000

Precision

bfloat16

GPU

NVIDIA A10G (24GB)

Instance

g5.2xlarge / g5.xlarge (spot)

λ AR / Diff

1.0 / 2.0

Checkpoints

50 (every 1k steps)

Data

OpenWebText

Tokenizer

GPT-2 (50,257 vocab)

Duration

~72 hours

Total Cost