v3 — Dual-Process LM Run Report

gpt2 · 50,000 steps · 2026-03-08 to 2026-03-12 · $40.48 total

Target Achievement — 3 of 5 Met

Metric          Value   Target   Status
AR Perplexity   27.99   < 40     Met
AUROC           0.870   > 0.75   Met
ECE             0.012   < 0.05   Met
Diffusion Loss  4.16    < 4.0    Near
S1 Accuracy     26.5%   40%      Missed
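
The "3 of 5 Met" tally can be reproduced from the final-step values. The sketch below is illustrative only: the helper name `evaluate_targets` is hypothetical, the ">=" direction for S1 Accuracy is an assumption (the report states only "Target: 40%"), and the rule behind the "Near" status is not given in the report, so it is not modeled.

```python
import operator

# (metric, final value, comparison, target) copied from the summary above.
# ASSUMPTION: S1 Accuracy is treated as "at least 40%"; the report does not
# state the direction of that target.
TARGETS = [
    ("AR Perplexity", 27.99, operator.lt, 40.0),
    ("AUROC", 0.870, operator.gt, 0.75),
    ("ECE", 0.012, operator.lt, 0.05),
    ("Diffusion Loss", 4.16, operator.lt, 4.0),
    ("S1 Accuracy (%)", 26.5, operator.ge, 40.0),
]


def evaluate_targets(targets):
    """Return {metric: bool} indicating whether each target is satisfied."""
    return {name: bool(cmp(value, target)) for name, value, cmp, target in targets}


results = evaluate_targets(TARGETS)
met = sum(results.values())
print(f"{met} of {len(results)} met")  # prints "3 of 5 met"
```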

Training Trajectory (v3 vs v2)

Evaluation Metrics by Step

Step AR PPL Diff Loss S1 Acc AUROC ECE
1,000 21.29 6.75 5.0% 0.557 0.006
2,000 22.53 6.54 6.4% 0.613 0.004
3,000 23.17 6.38 7.6% 0.638 0.010
4,000 23.71 6.29 8.5% 0.672 0.011
5,000 24.35 6.12 9.1% 0.695 0.008
6,000 24.85 6.08 9.9% 0.719 0.012
7,000 25.48 5.87 10.6% 0.739 0.009
8,000 26.29 5.60 12.6% 0.785 0.008
9,000 26.95 5.06 18.4% 0.805 0.006
10,000 27.55 4.98 19.0% 0.828 0.005
11,000 27.85 4.43 22.0% 0.853 0.010
12,000 28.12 4.31 25.0% 0.853 0.007
13,000 28.41 4.42 24.1% 0.844 0.011
14,000 28.51 4.29 24.7% 0.852 0.009
15,000 28.64 4.50 23.7% 0.864 0.005
16,000 28.66 4.38 23.5% 0.856 0.010
17,000 28.89 4.34 25.2% 0.858 0.008
18,000 28.99 4.44 23.0% 0.858 0.010
19,000 29.21 4.39 22.1% 0.866 0.011
20,000 29.22 4.24 26.8% 0.857 0.005
21,000 29.94 4.26 26.8% 0.856 0.012
22,000 29.70 3.95 28.3% 0.876 0.004
23,000 29.57 4.19 26.1% 0.861 0.006
24,000 29.53 4.31 25.2% 0.862 0.005
25,000 29.48 4.08 26.5% 0.861 0.004
26,000 29.58 4.02 27.7% 0.864 0.006
27,000 29.55 4.32 24.6% 0.866 0.011
28,000 29.40 4.51 23.8% 0.865 0.007
29,000 29.36 4.27 25.5% 0.867 0.012
30,000 29.16 4.34 24.6% 0.868 0.005
31,000 28.95 4.47 23.3% 0.871 0.003
32,000 29.04 4.35 24.2% 0.865 0.009
33,000 28.93 4.09 26.6% 0.861 0.009
34,000 28.85 4.15 25.6% 0.871 0.007
35,000 28.73 4.26 25.0% 0.863 0.006
36,000 28.59 4.46 23.5% 0.864 0.011
37,000 28.51 4.52 24.1% 0.856 0.006
38,000 28.43 4.02 28.5% 0.864 0.009
39,000 28.34 4.04 29.0% 0.863 0.011
40,000 28.27 3.76 30.3% 0.881 0.009
41,000 28.30 3.95 27.8% 0.866 0.011
42,000 28.33 3.89 29.1% 0.870 0.013
43,000 28.14 4.20 25.9% 0.869 0.010
44,000 28.07 4.40 24.9% 0.867 0.010
45,000 27.95 4.16 26.5% 0.870 0.011
46,000 28.13 3.94 28.1% 0.866 0.016
47,000 28.09 3.88 29.3% 0.870 0.014
48,000 28.04 4.19 26.1% 0.870 0.012
49,000 28.05 4.41 24.9% 0.867 0.012
50,000 27.99 4.16 26.5% 0.870 0.012
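
To query this table programmatically, the rows can be parsed as whitespace-separated fields. The following is a minimal sketch: `EvalRow` and `parse_row` are hypothetical names, and `SAMPLE` holds only four rows copied from the table above, not the full run.

```python
from typing import NamedTuple


class EvalRow(NamedTuple):
    step: int
    ar_ppl: float
    diff_loss: float
    s1_acc: float  # percent
    auroc: float
    ece: float


def parse_row(line: str) -> EvalRow:
    """Parse one row such as '1,000 21.29 6.75 5.0% 0.557 0.006'."""
    step, ar_ppl, diff_loss, s1_acc, auroc, ece = line.split()
    return EvalRow(
        step=int(step.replace(",", "")),   # drop the thousands separator
        ar_ppl=float(ar_ppl),
        diff_loss=float(diff_loss),
        s1_acc=float(s1_acc.rstrip("%")),  # drop the percent sign
        auroc=float(auroc),
        ece=float(ece),
    )


# A subset of rows copied verbatim from the table (not the full history).
SAMPLE = """\
1,000 21.29 6.75 5.0% 0.557 0.006
22,000 29.70 3.95 28.3% 0.876 0.004
40,000 28.27 3.76 30.3% 0.881 0.009
50,000 27.99 4.16 26.5% 0.870 0.012
"""

rows = [parse_row(line) for line in SAMPLE.splitlines() if line.strip()]

# Steps in the sample where the diffusion-loss target (< 4.0) was satisfied.
below_target = [r.step for r in rows if r.diff_loss < 4.0]
best = min(rows, key=lambda r: r.diff_loss)
print(below_target, best.step)  # prints "[22000, 40000] 40000"
```

Within this sample, diffusion loss is below the 4.0 target at steps 22,000 and 40,000 but not at the final step, which matches the "Near" status reported for that metric.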

Benchmark Results

Benchmark     Metric         Value
LAMBADA       S2 Accuracy    92.62%
LAMBADA       S2 Perplexity  1.45
LAMBADA       S1 Accuracy    0.0%
WikiText-103  S2 Perplexity  45.76
WikiText-103  S1 Loss        4.11

System 1 vs System 2 Comparison

ECE: 0.052
AUROC: 0.500
Mean Confidence: 0.052

System    Perplexity
System 1  7330.75
System 2  13.64
Hybrid    78.95

Threshold  Escalation Rate (%)
0.5        100.0
0.6        100.0
0.7        100.0
0.8        100.0
0.9        100.0
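
The report does not state how escalation is decided. One reading consistent with the numbers above is that a prediction escalates from System 1 to System 2 whenever System 1's confidence falls below the threshold. Under that assumption, confidences clustered near the reported mean of 0.052 would escalate at every listed threshold. The sketch below illustrates this; the function name `escalation_rate` and the confidence values are hypothetical, chosen only so that their mean is 0.052.

```python
def escalation_rate(confidences, threshold):
    """Percentage of predictions whose confidence is below `threshold`.

    ASSUMPTION: "escalation" means confidence < threshold; the report does
    not define the rule.
    """
    if not confidences:
        raise ValueError("confidences must be non-empty")
    escalated = sum(1 for c in confidences if c < threshold)
    return 100.0 * escalated / len(confidences)


# Hypothetical System 1 confidences (mean 0.052), not data from the run.
confidences = [0.03, 0.05, 0.04, 0.08, 0.06]

rates = {t: escalation_rate(confidences, t) for t in (0.5, 0.6, 0.7, 0.8, 0.9)}
for threshold, rate in rates.items():
    print(f"{threshold}  {rate}")  # every line ends in 100.0
```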

Spot Instance Cost History

#   Type        AZ          Cost    Finalized
1   g5.2xlarge  us-east-1a  $3.11   Yes
2   g5.2xlarge  us-east-1a  $10.73  Yes
3   g6.xlarge   us-east-1a  $0.24   Yes
4   g6.xlarge   us-east-1b  $0.30   Yes
5   g5.2xlarge  us-east-1b  $0.94   Yes
6   g5.xlarge   us-east-1b  $0.09   Yes
7   g5.2xlarge  us-east-1b  $0.04   Yes
8   g5.2xlarge  us-east-1b  $0.05   Yes
9   g6.2xlarge  us-east-1a  $0.08   Yes
10  g6.xlarge   us-east-1b  $0.03   Yes
11  g5.2xlarge  us-east-1b  $0.03   Yes
12  g6.2xlarge  us-east-1b  $0.40   Yes
13  g5.xlarge   us-east-1b  $0.20   Yes
14  g6.xlarge   us-east-1b  $0.05   Yes
15  g5.xlarge   us-east-1b  $0.30   Yes
16  g6.2xlarge  us-east-1b  $0.30   Yes
17  g5.xlarge   us-east-1b  $0.28   Yes
18  g5.2xlarge  us-east-1b  $2.87   Yes
19  g6.2xlarge  us-east-1a  $11.02  Yes
20  g5.2xlarge  us-east-1f  $1.31   Yes
21  g6.2xlarge  us-east-1b  $3.92   Yes
22  g6.2xlarge  us-east-1a  $3.99   Yes
23  g5.2xlarge  us-east-1b  $0.08   Yes
24  g6.2xlarge  us-east-1b  $0.07   Yes
25  g6.2xlarge  us-east-1b  $0.06   Yes
Total (25 sessions)         $40.48
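
The session costs can be totaled and grouped by instance type with exact decimal arithmetic. This is a sketch over the 25 (type, cost) pairs listed above; the variable names are hypothetical. Summing the displayed per-session values gives $40.49, one cent above the reported $40.48, presumably because each row is rounded for display.

```python
from collections import defaultdict
from decimal import Decimal

# (instance type, cost in USD) for sessions 1..25, copied from the table.
SESSIONS = [
    ("g5.2xlarge", "3.11"), ("g5.2xlarge", "10.73"), ("g6.xlarge", "0.24"),
    ("g6.xlarge", "0.30"), ("g5.2xlarge", "0.94"), ("g5.xlarge", "0.09"),
    ("g5.2xlarge", "0.04"), ("g5.2xlarge", "0.05"), ("g6.2xlarge", "0.08"),
    ("g6.xlarge", "0.03"), ("g5.2xlarge", "0.03"), ("g6.2xlarge", "0.40"),
    ("g5.xlarge", "0.20"), ("g6.xlarge", "0.05"), ("g5.xlarge", "0.30"),
    ("g6.2xlarge", "0.30"), ("g5.xlarge", "0.28"), ("g5.2xlarge", "2.87"),
    ("g6.2xlarge", "11.02"), ("g5.2xlarge", "1.31"), ("g6.2xlarge", "3.92"),
    ("g6.2xlarge", "3.99"), ("g5.2xlarge", "0.08"), ("g6.2xlarge", "0.07"),
    ("g6.2xlarge", "0.06"),
]

# Decimal avoids binary floating-point drift when adding currency amounts.
by_type = defaultdict(Decimal)
for instance_type, cost in SESSIONS:
    by_type[instance_type] += Decimal(cost)

total = sum(by_type.values(), Decimal("0"))

for instance_type, cost in sorted(by_type.items(), key=lambda kv: -kv[1]):
    print(f"{instance_type:<11} ${cost}")
print(f"{'total':<11} ${total}")
```

In this breakdown, g6.2xlarge ($19.84) and g5.2xlarge ($19.16) account for nearly all of the spend; the xlarge sessions together cost under $2.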

Configuration

Model: gpt2
Total Steps: 50,000
Precision: bfloat16
λ AR / Diff: 1.0 / 1.3
Data: OpenWebText
Total Cost: $40.48
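
The report does not say how the λ weights enter the objective. A common choice, assumed here, is a weighted sum of the autoregressive and diffusion losses. The sketch below plugs in the final-step values under two further assumptions: that AR perplexity equals exp of the mean per-token cross-entropy in nats, and that evaluation values stand in for training losses. `combined_loss` is a hypothetical helper, not the run's actual code.

```python
import math

LAMBDA_AR = 1.0    # from the configuration above
LAMBDA_DIFF = 1.3  # from the configuration above


def combined_loss(ar_loss, diff_loss, lambda_ar=LAMBDA_AR, lambda_diff=LAMBDA_DIFF):
    """ASSUMED objective: lambda_ar * L_AR + lambda_diff * L_diff."""
    return lambda_ar * ar_loss + lambda_diff * diff_loss


# ASSUMPTION: perplexity = exp(cross-entropy), so L_AR = ln(27.99).
ar_loss = math.log(27.99)
total = combined_loss(ar_loss, 4.16)
print(f"AR loss {ar_loss:.3f}, combined {total:.2f}")  # combined is about 8.74
```

Under these assumptions the diffusion term (1.3 × 4.16 ≈ 5.41) contributes more to the objective than the AR term (≈ 3.33).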