yolay committed on
Commit c9313e0 · verified · 1 Parent(s): 1a900b4

Model save

Files changed (5):
  1. README.md +67 -0
  2. all_results.json +8 -0
  3. generation_config.json +14 -0
  4. train_results.json +8 -0
  5. trainer_state.json +1809 -0
README.md ADDED
@@ -0,0 +1,67 @@
+ ---
+ library_name: transformers
+ model_name: Qwen2.5-1.5B-Open-R1-GRPO
+ tags:
+ - generated_from_trainer
+ - trl
+ - grpo
+ license: license
+ ---
+
+ # Model Card for Qwen2.5-1.5B-Open-R1-GRPO
+
+ This model is a fine-tuned version of [None](https://huggingface.co/None).
+ It has been trained using [TRL](https://github.com/huggingface/trl).
+
+ ## Quick start
+
+ ```python
+ from transformers import pipeline
+
+ question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
+ generator = pipeline("text-generation", model="yolay/Qwen2.5-1.5B-Open-R1-GRPO", device="cuda")
+ output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
+ print(output["generated_text"])
+ ```
+
+ ## Training procedure
+
+ [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/yuleiqin-tencent/huggingface/runs/5ngah7mu)
+
+ This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300).
+
+ ### Framework versions
+
+ - TRL: 0.15.0.dev0
+ - Transformers: 4.49.0.dev0
+ - PyTorch: 2.5.1
+ - Datasets: 3.2.0
+ - Tokenizers: 0.21.0
+
+ ## Citations
+
+ Cite GRPO as:
+
+ ```bibtex
+ @article{zhihong2024deepseekmath,
+     title = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
+     author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
+     year = 2024,
+     eprint = {arXiv:2402.03300},
+ }
+ ```
+
+ Cite TRL as:
+
+ ```bibtex
+ @misc{vonwerra2022trl,
+     title = {{TRL: Transformer Reinforcement Learning}},
+     author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
+     year = 2020,
+     journal = {GitHub repository},
+     publisher = {GitHub},
+     howpublished = {\url{https://github.com/huggingface/trl}}
+ }
+ ```
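The GRPO method referenced in the model card centers on a group-relative advantage: for each prompt, G completions are sampled and each completion's reward is normalized against the group's own statistics, replacing the learned value-function baseline of PPO. A sketch of the formulation from the DeepSeekMath paper (notation mine):

```latex
\hat{A}_{i} = \frac{r_i - \operatorname{mean}(\{r_1, \ldots, r_G\})}{\operatorname{std}(\{r_1, \ldots, r_G\})}
```

This group normalization is why the training logs in trainer_state.json report per-step `reward`, `reward_std`, and `kl` alongside the loss.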
all_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+     "total_flos": 0.0,
+     "train_loss": 0.13680233840041295,
+     "train_runtime": 51841.8905,
+     "train_samples": 72441,
+     "train_samples_per_second": 1.397,
+     "train_steps_per_second": 0.012
+ }
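The throughput figures above are internally consistent with the runtime and sample counts in the same file (the step count is taken from `global_step` in trainer_state.json below); a quick check:

```python
# Sanity-check the reported throughput in all_results.json.
# All inputs are copied from the logs; nothing is assumed.
train_runtime = 51841.8905   # seconds (~14.4 hours)
train_samples = 72441
global_step = 646            # from trainer_state.json

samples_per_second = train_samples / train_runtime
steps_per_second = global_step / train_runtime

print(round(samples_per_second, 3))  # matches "train_samples_per_second": 1.397
print(round(steps_per_second, 3))    # matches "train_steps_per_second": 0.012
```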
generation_config.json ADDED
@@ -0,0 +1,14 @@
+ {
+     "bos_token_id": 151643,
+     "do_sample": true,
+     "eos_token_id": [
+         151645,
+         151643
+     ],
+     "pad_token_id": 151643,
+     "repetition_penalty": 1.1,
+     "temperature": 0.7,
+     "top_k": 20,
+     "top_p": 0.8,
+     "transformers_version": "4.49.0.dev0"
+ }
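To illustrate how the `temperature`, `top_k`, and `top_p` settings above interact at sampling time, here is a toy sketch over a 5-token vocabulary (the `filter_logits` helper is mine, not the transformers implementation; it applies temperature scaling, then top-k, then nucleus truncation):

```python
import math

def filter_logits(logits, temperature=0.7, top_k=20, top_p=0.8):
    """Toy version of the filters in generation_config.json: temperature
    scaling, then top-k, then nucleus (top-p) truncation. Returns the
    renormalized probabilities of the surviving token indices."""
    # Temperature < 1 sharpens the softmax distribution.
    scaled = [l / temperature for l in logits]
    z = max(scaled)  # subtract max for numerical stability
    probs = [math.exp(s - z) for s in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]

    # top-k: keep only the k most likely token indices.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = order[:top_k]

    # top-p: among those, keep the smallest prefix reaching top_p mass.
    cum, nucleus = 0.0, []
    for i in kept:
        nucleus.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    mass = sum(probs[i] for i in nucleus)
    return {i: probs[i] / mass for i in nucleus}

# A peaked toy distribution: the dominant token already carries more than
# 80% of the mass, so it survives the nucleus filter alone.
dist = filter_logits([5.0, 1.0, 0.5, 0.2, 0.1])
```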
train_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+     "total_flos": 0.0,
+     "train_loss": 0.13680233840041295,
+     "train_runtime": 51841.8905,
+     "train_samples": 72441,
+     "train_samples_per_second": 1.397,
+     "train_steps_per_second": 0.012
+ }
trainer_state.json ADDED
@@ -0,0 +1,1809 @@
+ {
+     "best_metric": null,
+     "best_model_checkpoint": null,
+     "epoch": 0.9987438399845395,
+     "eval_steps": 100,
+     "global_step": 646,
+     "is_hyper_param_search": false,
+     "is_local_process_zero": true,
+     "is_world_process_zero": true,
+     "log_history": [
+         {
+             "completion_length": 398.2607320785522,
+             "epoch": 0.007730215479756498,
+             "grad_norm": 0.6042563694653488,
+             "kl": 0.00012127757072448731,
+             "learning_rate": 1.5384615384615387e-06,
+             "loss": 0.0,
+             "reward": 0.6366071738302708,
+             "reward_std": 0.32451150137931106,
+             "rewards/accuracy_reward": 0.1723214378580451,
+             "rewards/format_reward": 0.46428573289886116,
+             "step": 5
+         },
+         {
+             "completion_length": 357.93305110931396,
+             "epoch": 0.015460430959512996,
+             "grad_norm": 0.7802913073139134,
+             "kl": 0.007133913040161133,
+             "learning_rate": 3.0769230769230774e-06,
+             "loss": 0.0003,
+             "reward": 0.7535714633762837,
+             "reward_std": 0.26516504064202306,
+             "rewards/accuracy_reward": 0.11964286342263222,
+             "rewards/format_reward": 0.6339285995811224,
+             "step": 10
+         },
+         {
+             "completion_length": 291.9857277870178,
+             "epoch": 0.023190646439269495,
+             "grad_norm": 0.4454731956977912,
+             "kl": 0.0273590087890625,
+             "learning_rate": 4.615384615384616e-06,
+             "loss": 0.0011,
+             "reward": 0.8758928969502449,
+             "reward_std": 0.22602162957191468,
+             "rewards/accuracy_reward": 0.08214286118745803,
+             "rewards/format_reward": 0.7937500394880772,
+             "step": 15
+         },
+         {
+             "completion_length": 291.47322788238523,
+             "epoch": 0.03092086191902599,
+             "grad_norm": 0.4324622452746649,
+             "kl": 0.01610870361328125,
+             "learning_rate": 6.153846153846155e-06,
+             "loss": 0.0006,
+             "reward": 0.8642857551574707,
+             "reward_std": 0.25253813322633506,
+             "rewards/accuracy_reward": 0.09285714775323868,
+             "rewards/format_reward": 0.7714286111295223,
+             "step": 20
+         },
+         {
+             "completion_length": 226.4142951965332,
+             "epoch": 0.03865107739878249,
+             "grad_norm": 0.3675938596678119,
+             "kl": 0.026679229736328126,
+             "learning_rate": 7.692307692307694e-06,
+             "loss": 0.0011,
+             "reward": 0.9687500387430191,
+             "reward_std": 0.21592010390013455,
+             "rewards/accuracy_reward": 0.10178571976721287,
+             "rewards/format_reward": 0.8669643253087997,
+             "step": 25
+         },
+         {
+             "completion_length": 197.68036499023438,
+             "epoch": 0.04638129287853899,
+             "grad_norm": 0.3061792577373215,
+             "kl": 0.033779144287109375,
+             "learning_rate": 9.230769230769232e-06,
+             "loss": 0.0014,
+             "reward": 1.04107146859169,
+             "reward_std": 0.16920054908841847,
+             "rewards/accuracy_reward": 0.10803572004660963,
+             "rewards/format_reward": 0.9330357395112514,
+             "step": 30
+         },
+         {
+             "completion_length": 196.36340112686156,
+             "epoch": 0.054111508358295486,
+             "grad_norm": 0.2761290868371635,
+             "kl": 0.0447174072265625,
+             "learning_rate": 1.076923076923077e-05,
+             "loss": 0.0018,
+             "reward": 1.1089286178350448,
+             "reward_std": 0.14647211749106645,
+             "rewards/accuracy_reward": 0.13750000707805157,
+             "rewards/format_reward": 0.9714285843074322,
+             "step": 35
+         },
+         {
+             "completion_length": 233.76518907546998,
+             "epoch": 0.06184172383805198,
+             "grad_norm": 0.6542348704176013,
+             "kl": 0.06229248046875,
+             "learning_rate": 1.230769230769231e-05,
+             "loss": 0.0025,
+             "reward": 1.1125000432133674,
+             "reward_std": 0.1313198298215866,
+             "rewards/accuracy_reward": 0.1401785789988935,
+             "rewards/format_reward": 0.9723214417695999,
+             "step": 40
+         },
+         {
+             "completion_length": 189.32232856750488,
+             "epoch": 0.06957193931780849,
+             "grad_norm": 0.3188732151645114,
+             "kl": 0.0878082275390625,
+             "learning_rate": 1.3846153846153847e-05,
+             "loss": 0.0035,
+             "reward": 1.0857143267989158,
+             "reward_std": 0.14142135493457317,
+             "rewards/accuracy_reward": 0.11339286295697093,
+             "rewards/format_reward": 0.9723214417695999,
+             "step": 45
+         },
+         {
+             "completion_length": 300.7259063720703,
+             "epoch": 0.07730215479756498,
+             "grad_norm": 0.31691534877890665,
+             "kl": 0.0990020751953125,
+             "learning_rate": 1.5384615384615387e-05,
+             "loss": 0.004,
+             "reward": 1.05267860814929,
+             "reward_std": 0.2133947243914008,
+             "rewards/accuracy_reward": 0.15000000735744834,
+             "rewards/format_reward": 0.9026786051690578,
+             "step": 50
+         },
+         {
+             "completion_length": 277.56429901123045,
+             "epoch": 0.08503237027732148,
+             "grad_norm": 0.3063068659110802,
+             "kl": 0.06175537109375,
+             "learning_rate": 1.6923076923076924e-05,
+             "loss": 0.0025,
+             "reward": 1.0723214723169803,
+             "reward_std": 0.21844548601657152,
+             "rewards/accuracy_reward": 0.15982143683359026,
+             "rewards/format_reward": 0.9125000342726708,
+             "step": 55
+         },
+         {
+             "completion_length": 322.78662223815917,
+             "epoch": 0.09276258575707798,
+             "grad_norm": 0.2797133381390773,
+             "kl": 0.0583984375,
+             "learning_rate": 1.8461538461538465e-05,
+             "loss": 0.0023,
+             "reward": 1.1383929125964642,
+             "reward_std": 0.15783633291721344,
+             "rewards/accuracy_reward": 0.1839285811409354,
+             "rewards/format_reward": 0.9544643051922321,
+             "step": 60
+         },
+         {
+             "completion_length": 290.8223346710205,
+             "epoch": 0.10049280123683448,
+             "grad_norm": 0.22020075522546995,
+             "kl": 0.072247314453125,
+             "learning_rate": 2e-05,
+             "loss": 0.0029,
+             "reward": 1.1482143357396126,
+             "reward_std": 0.133845211006701,
+             "rewards/accuracy_reward": 0.16071429401636123,
+             "rewards/format_reward": 0.9875000059604645,
+             "step": 65
+         },
+         {
+             "completion_length": 296.99554920196533,
+             "epoch": 0.10822301671659097,
+             "grad_norm": 0.25171935257044675,
+             "kl": 0.077728271484375,
+             "learning_rate": 1.999634547413886e-05,
+             "loss": 0.0031,
+             "reward": 1.157142909616232,
+             "reward_std": 0.16414978671818972,
+             "rewards/accuracy_reward": 0.19285715268924833,
+             "rewards/format_reward": 0.9642857298254967,
+             "step": 70
+         },
+         {
+             "completion_length": 321.1794790267944,
+             "epoch": 0.11595323219634747,
+             "grad_norm": 0.35630454010296175,
+             "kl": 0.083953857421875,
+             "learning_rate": 1.9985384567667278e-05,
+             "loss": 0.0034,
+             "reward": 1.0901786163449287,
+             "reward_std": 0.22602162901312112,
+             "rewards/accuracy_reward": 0.1794642936438322,
+             "rewards/format_reward": 0.9107143178582191,
+             "step": 75
+         },
+         {
+             "completion_length": 253.01072540283204,
+             "epoch": 0.12368344767610397,
+             "grad_norm": 0.22028353611198634,
+             "kl": 0.090179443359375,
+             "learning_rate": 1.9967125291968495e-05,
+             "loss": 0.0036,
+             "reward": 1.0892857559025289,
+             "reward_std": 0.19950512517243624,
+             "rewards/accuracy_reward": 0.15892858076840638,
+             "rewards/format_reward": 0.9303571678698063,
+             "step": 80
+         },
+         {
+             "completion_length": 221.2348321914673,
+             "epoch": 0.13141366315586048,
+             "grad_norm": 0.23846717001551665,
+             "kl": 0.103912353515625,
+             "learning_rate": 1.9941580992841562e-05,
+             "loss": 0.0042,
+             "reward": 1.1187500461935997,
+             "reward_std": 0.13258252013474703,
+             "rewards/accuracy_reward": 0.14464286463335158,
+             "rewards/format_reward": 0.9741071552038193,
+             "step": 85
+         },
+         {
+             "completion_length": 238.93393840789795,
+             "epoch": 0.13914387863561697,
+             "grad_norm": 0.28253877862072746,
+             "kl": 0.10802001953125,
+             "learning_rate": 1.990877034074683e-05,
+             "loss": 0.0043,
+             "reward": 1.1500000476837158,
+             "reward_std": 0.15909902472048998,
+             "rewards/accuracy_reward": 0.16964286640286447,
+             "rewards/format_reward": 0.980357151478529,
+             "step": 90
+         },
+         {
+             "completion_length": 384.65805397033694,
+             "epoch": 0.14687409411537347,
+             "grad_norm": 0.246515365890744,
+             "kl": 320.1103820800781,
+             "learning_rate": 1.9868717317159617e-05,
+             "loss": 12.8246,
+             "reward": 1.1482143349945546,
+             "reward_std": 0.2323350828140974,
+             "rewards/accuracy_reward": 0.2267857251688838,
+             "rewards/format_reward": 0.9214286006987095,
+             "step": 95
+         },
+         {
+             "completion_length": 430.77234039306643,
+             "epoch": 0.15460430959512997,
+             "grad_norm": 0.21987794238704425,
+             "kl": 0.212005615234375,
+             "learning_rate": 1.9821451197042028e-05,
+             "loss": 0.0085,
+             "reward": 1.0955357640981673,
+             "reward_std": 0.28410540148615837,
+             "rewards/accuracy_reward": 0.21785715371370315,
+             "rewards/format_reward": 0.877678605914116,
+             "step": 100
+         },
+         {
+             "epoch": 0.15460430959512997,
+             "eval_completion_length": 306.96846771240234,
+             "eval_kl": 0.09576416015625,
+             "eval_loss": 0.00375110050663352,
+             "eval_reward": 1.1964286267757416,
+             "eval_reward_std": 0.21465741284191608,
+             "eval_rewards/accuracy_reward": 0.2366071566939354,
+             "eval_rewards/format_reward": 0.9598214477300644,
+             "eval_runtime": 39.1469,
+             "eval_samples_per_second": 2.529,
+             "eval_steps_per_second": 0.102,
+             "step": 100
+         },
+         {
+             "completion_length": 296.79733486175536,
+             "epoch": 0.16233452507488647,
+             "grad_norm": 0.22613704416746724,
+             "kl": 0.101495361328125,
+             "learning_rate": 1.9767006527445728e-05,
+             "loss": 0.0041,
+             "reward": 1.1491071999073028,
+             "reward_std": 0.17551400382071733,
+             "rewards/accuracy_reward": 0.1830357233993709,
+             "rewards/format_reward": 0.9660714447498322,
+             "step": 105
+         },
+         {
+             "completion_length": 231.32143831253052,
+             "epoch": 0.17006474055464296,
+             "grad_norm": 0.2736697715915322,
+             "kl": 0.115789794921875,
+             "learning_rate": 1.9705423102261324e-05,
+             "loss": 0.0046,
+             "reward": 1.128571480512619,
+             "reward_std": 0.1818274561315775,
+             "rewards/accuracy_reward": 0.15892858002334834,
+             "rewards/format_reward": 0.9696428716182709,
+             "step": 110
+         },
+         {
+             "completion_length": 217.31161766052247,
+             "epoch": 0.17779495603439946,
+             "grad_norm": 0.2715660388125083,
+             "kl": 0.134747314453125,
+             "learning_rate": 1.9636745933132807e-05,
+             "loss": 0.0054,
+             "reward": 1.1080357626080513,
+             "reward_std": 0.16793785914778708,
+             "rewards/accuracy_reward": 0.1410714359022677,
+             "rewards/format_reward": 0.9669643007218838,
+             "step": 115
+         },
+         {
+             "completion_length": 264.1339414596558,
+             "epoch": 0.18552517151415596,
+             "grad_norm": 0.28581146278588754,
+             "kl": 0.150201416015625,
+             "learning_rate": 1.956102521655831e-05,
+             "loss": 0.006,
+             "reward": 1.1258929021656514,
+             "reward_std": 0.21339472178369762,
+             "rewards/accuracy_reward": 0.1919642947614193,
+             "rewards/format_reward": 0.9339285977184772,
+             "step": 120
+         },
+         {
+             "completion_length": 292.79108371734617,
+             "epoch": 0.19325538699391245,
+             "grad_norm": 0.7258310743765034,
+             "kl": 0.18944091796875,
+             "learning_rate": 1.9478316297201218e-05,
+             "loss": 0.0076,
+             "reward": 1.1303571976721287,
+             "reward_std": 0.2424366096034646,
+             "rewards/accuracy_reward": 0.2214285818859935,
+             "rewards/format_reward": 0.9089286021888257,
+             "step": 125
+         },
+         {
+             "completion_length": 224.37768850326538,
+             "epoch": 0.20098560247366895,
+             "grad_norm": 0.5660752703191607,
+             "kl": 0.51148681640625,
+             "learning_rate": 1.9388679627438486e-05,
+             "loss": 0.0205,
+             "reward": 1.0053571909666061,
+             "reward_std": 0.33082495592534544,
+             "rewards/accuracy_reward": 0.20267858086153864,
+             "rewards/format_reward": 0.8026786111295223,
+             "step": 130
+         },
+         {
+             "completion_length": 136.95804138183593,
+             "epoch": 0.20871581795342545,
+             "grad_norm": 1.7299812104574703,
+             "kl": 0.520361328125,
+             "learning_rate": 1.9292180723175656e-05,
+             "loss": 0.0208,
+             "reward": 0.9401786185801029,
+             "reward_std": 0.39269680101424453,
+             "rewards/accuracy_reward": 0.17589286603033544,
+             "rewards/format_reward": 0.7642857521772385,
+             "step": 135
+         },
+         {
+             "completion_length": 292.2187627792358,
+             "epoch": 0.21644603343318194,
+             "grad_norm": 0.360874201843721,
+             "kl": 0.4266845703125,
+             "learning_rate": 1.9188890115960967e-05,
+             "loss": 0.0171,
+             "reward": 1.0937500461935996,
+             "reward_std": 0.2032931974157691,
+             "rewards/accuracy_reward": 0.1535714365541935,
+             "rewards/format_reward": 0.9401785939931869,
+             "step": 140
+         },
+         {
+             "completion_length": 203.01786556243897,
+             "epoch": 0.22417624891293844,
+             "grad_norm": 0.3748600306818795,
+             "kl": 0.28094482421875,
+             "learning_rate": 1.9078883301433488e-05,
+             "loss": 0.0112,
+             "reward": 1.090178620815277,
+             "reward_std": 0.20581857804208994,
+             "rewards/accuracy_reward": 0.16607143692672252,
+             "rewards/format_reward": 0.9241071708500386,
+             "step": 145
+         },
+         {
+             "completion_length": 325.5580488204956,
+             "epoch": 0.23190646439269494,
+             "grad_norm": 74.76840594405346,
+             "kl": 0.74801025390625,
+             "learning_rate": 1.8962240684142923e-05,
+             "loss": 0.0299,
+             "reward": 0.8758928962051868,
+             "reward_std": 0.3219861214980483,
+             "rewards/accuracy_reward": 0.12589286426082252,
+             "rewards/format_reward": 0.750000037997961,
+             "step": 150
+         },
+         {
+             "completion_length": 371.35894470214845,
+             "epoch": 0.23963667987245144,
+             "grad_norm": 72.4191179163845,
+             "kl": 0.8891845703125,
+             "learning_rate": 1.883904751878156e-05,
+             "loss": 0.0356,
+             "reward": 0.8571428939700126,
+             "reward_std": 0.36618029810488223,
+             "rewards/accuracy_reward": 0.1589285798370838,
+             "rewards/format_reward": 0.6982143167406321,
+             "step": 155
+         },
+         {
+             "completion_length": 315.3026921272278,
+             "epoch": 0.24736689535220793,
+             "grad_norm": 1.580738805644984,
+             "kl": 0.48006591796875,
+             "learning_rate": 1.8709393847871146e-05,
+             "loss": 0.0192,
+             "reward": 0.8991071823984385,
+             "reward_std": 0.32451150212436913,
+             "rewards/accuracy_reward": 0.16428572265431285,
+             "rewards/format_reward": 0.7348214674741029,
+             "step": 160
+         },
+         {
+             "completion_length": 380.33751621246336,
+             "epoch": 0.25509711083196446,
+             "grad_norm": 155.90661885034666,
+             "kl": 2.71826171875,
+             "learning_rate": 1.857337443595034e-05,
+             "loss": 0.1089,
+             "reward": 0.9276786163449288,
+             "reward_std": 0.3750191332772374,
+             "rewards/accuracy_reward": 0.16696429420262576,
+             "rewards/format_reward": 0.7607143238186836,
+             "step": 165
+         },
+         {
+             "completion_length": 391.8839458465576,
+             "epoch": 0.26282732631172095,
+             "grad_norm": 2.7868948262319475,
+             "kl": 0.80675048828125,
+             "learning_rate": 1.8431088700310846e-05,
+             "loss": 0.0323,
+             "reward": 1.1080357603728772,
+             "reward_std": 0.18814090844243764,
+             "rewards/accuracy_reward": 0.16428572246804835,
+             "rewards/format_reward": 0.9437500201165676,
+             "step": 170
+         },
+         {
+             "completion_length": 242.89018936157225,
+             "epoch": 0.27055754179147745,
+             "grad_norm": 0.42925991415337916,
+             "kl": 0.2980712890625,
+             "learning_rate": 1.8282640638332773e-05,
+             "loss": 0.0119,
+             "reward": 0.9973214760422706,
+             "reward_std": 0.2790546391159296,
+             "rewards/accuracy_reward": 0.16696429438889027,
+             "rewards/format_reward": 0.8303571760654449,
+             "step": 175
+         },
+         {
+             "completion_length": 420.8875186920166,
+             "epoch": 0.27828775727123395,
+             "grad_norm": 5.77238039316987,
+             "kl": 2.53583984375,
+             "learning_rate": 1.8128138751472432e-05,
+             "loss": 0.1013,
+             "reward": 0.6794643143191934,
+             "reward_std": 0.38764603715389967,
+             "rewards/accuracy_reward": 0.13125000707805157,
+             "rewards/format_reward": 0.5482143117114902,
+             "step": 180
+         },
+         {
+             "completion_length": 377.3401969909668,
+             "epoch": 0.28601797275099045,
+             "grad_norm": 46.08902975257459,
+             "kl": 1.4901123046875,
+             "learning_rate": 1.7967695965958044e-05,
+             "loss": 0.0597,
+             "reward": 0.9267857566475868,
+             "reward_std": 0.351028005965054,
+             "rewards/accuracy_reward": 0.1696428656578064,
+             "rewards/format_reward": 0.7571428924798965,
+             "step": 185
+         },
+         {
+             "completion_length": 269.395546913147,
+             "epoch": 0.29374818823074694,
+             "grad_norm": 0.3753752369724839,
+             "kl": 1.59666748046875,
+             "learning_rate": 1.780142955025139e-05,
+             "loss": 0.064,
+             "reward": 1.1169643260538578,
+             "reward_std": 0.22349624745547772,
+             "rewards/accuracy_reward": 0.2044642954133451,
+             "rewards/format_reward": 0.9125000283122062,
+             "step": 190
+         },
+         {
+             "completion_length": 285.7642984390259,
+             "epoch": 0.30147840371050344,
+             "grad_norm": 1.6098957935540266,
+             "kl": 1.1361083984375,
+             "learning_rate": 1.7629461029335683e-05,
+             "loss": 0.0454,
+             "reward": 1.0705357670783997,
+             "reward_std": 0.3017830714583397,
+             "rewards/accuracy_reward": 0.21339286854490638,
+             "rewards/format_reward": 0.8571428984403611,
+             "step": 195
+         },
+         {
+             "completion_length": 360.73573036193847,
+             "epoch": 0.30920861919025994,
+             "grad_norm": 24.77991428569718,
+             "kl": 1.9080322265625,
+             "learning_rate": 1.745191609589231e-05,
+             "loss": 0.0764,
+             "reward": 0.9089286141097546,
+             "reward_std": 0.3737564399838448,
+             "rewards/accuracy_reward": 0.15892858104780316,
+             "rewards/format_reward": 0.7500000361353159,
+             "step": 200
+         },
+         {
+             "epoch": 0.30920861919025994,
+             "eval_completion_length": 388.8497200012207,
+             "eval_kl": 4.3896484375,
+             "eval_loss": 0.16958686709403992,
+             "eval_reward": 1.0178571790456772,
+             "eval_reward_std": 0.404061034321785,
+             "eval_rewards/accuracy_reward": 0.2455357238650322,
+             "eval_rewards/format_reward": 0.7723214700818062,
+             "eval_runtime": 48.8596,
+             "eval_samples_per_second": 2.026,
+             "eval_steps_per_second": 0.082,
+             "step": 200
+         },
+         {
+             "completion_length": 370.7384090423584,
+             "epoch": 0.31693883467001643,
+             "grad_norm": 9.980763942338113,
+             "kl": 2.82935791015625,
+             "learning_rate": 1.7268924518431437e-05,
+             "loss": 0.1132,
+             "reward": 0.9776786178350448,
+             "reward_std": 0.34218916948884726,
+             "rewards/accuracy_reward": 0.19910715324804187,
+             "rewards/format_reward": 0.7785714693367481,
+             "step": 205
+         },
+         {
+             "completion_length": 290.9964431762695,
+             "epoch": 0.32466905014977293,
+             "grad_norm": 1.9760739124872957,
+             "kl": 1.00716552734375,
+             "learning_rate": 1.7080620046443503e-05,
+             "loss": 0.0403,
+             "reward": 1.025000049173832,
+             "reward_std": 0.2752665659412742,
+             "rewards/accuracy_reward": 0.1750000081025064,
+             "rewards/format_reward": 0.8500000417232514,
+             "step": 210
+         },
+         {
+             "completion_length": 204.87858114242553,
+             "epoch": 0.33239926562952943,
+             "grad_norm": 0.6268244531901943,
+             "kl": 0.43160400390625,
+             "learning_rate": 1.6887140312641036e-05,
+             "loss": 0.0173,
+             "reward": 1.0633929058909417,
+             "reward_std": 0.25127544458955525,
+             "rewards/accuracy_reward": 0.16517857881262898,
+             "rewards/format_reward": 0.8982143223285675,
+             "step": 215
+         },
+         {
+             "completion_length": 247.21429824829102,
+             "epoch": 0.3401294811092859,
+             "grad_norm": 0.5949423088731468,
+             "kl": 0.675537109375,
+             "learning_rate": 1.6688626732362192e-05,
+             "loss": 0.027,
+             "reward": 0.9500000357627869,
+             "reward_std": 0.3005203790962696,
+             "rewards/accuracy_reward": 0.12946429271250964,
+             "rewards/format_reward": 0.820535758137703,
+             "step": 220
+         },
+         {
+             "completion_length": 190.4634015083313,
+             "epoch": 0.3478596965890424,
+             "grad_norm": 0.42764857782637755,
+             "kl": 0.33341064453125,
+             "learning_rate": 1.6485224400209557e-05,
+             "loss": 0.0133,
+             "reward": 1.0767857618629932,
+             "reward_std": 0.21213203240185977,
+             "rewards/accuracy_reward": 0.16250000838190318,
+             "rewards/format_reward": 0.9142857469618321,
+             "step": 225
+         },
+         {
+             "completion_length": 210.5357223510742,
+             "epoch": 0.3555899120687989,
+             "grad_norm": 24.683355476887744,
+             "kl": 3.64019775390625,
+             "learning_rate": 1.6277081983999742e-05,
+             "loss": 0.1459,
+             "reward": 1.141071478277445,
+             "reward_std": 0.20708126928657294,
+             "rewards/accuracy_reward": 0.19375001089647412,
+             "rewards/format_reward": 0.9473214522004128,
+             "step": 230
+         },
+         {
+             "completion_length": 314.76697731018066,
+             "epoch": 0.3633201275485554,
+             "grad_norm": 2.2235055363102907,
+             "kl": 0.845751953125,
+             "learning_rate": 1.6064351616101318e-05,
+             "loss": 0.0338,
+             "reward": 1.0339286170899868,
+             "reward_std": 0.27274118475615977,
+             "rewards/accuracy_reward": 0.15892857927829027,
+             "rewards/format_reward": 0.875000037252903,
+             "step": 235
+         },
+         {
+             "completion_length": 349.63215923309326,
+             "epoch": 0.3710503430283119,
+             "grad_norm": 2.116200281477274,
+             "kl": 1.61800537109375,
+             "learning_rate": 1.5847188782240473e-05,
+             "loss": 0.0647,
+             "reward": 1.0000000484287739,
+             "reward_std": 0.31062190532684325,
+             "rewards/accuracy_reward": 0.16339286621659993,
+             "rewards/format_reward": 0.8366071827709675,
+             "step": 240
+         },
+         {
+             "completion_length": 300.8339429855347,
+             "epoch": 0.3787805585080684,
+             "grad_norm": 1.1477379512060555,
+             "kl": 1.49195556640625,
+             "learning_rate": 1.562575220785569e-05,
+             "loss": 0.0597,
+             "reward": 1.0580357618629932,
+             "reward_std": 0.291681545227766,
+             "rewards/accuracy_reward": 0.17767858114093543,
+             "rewards/format_reward": 0.8803571790456772,
+             "step": 245
+         },
+         {
+             "completion_length": 299.05715465545654,
+             "epoch": 0.3865107739878249,
+             "grad_norm": 0.8348365648461077,
+             "kl": 1.25382080078125,
+             "learning_rate": 1.5400203742084508e-05,
+             "loss": 0.0502,
+             "reward": 1.0508929029107095,
+             "reward_std": 0.2841053992509842,
+             "rewards/accuracy_reward": 0.17946429466828703,
+             "rewards/format_reward": 0.8714286103844643,
+             "step": 250
+         },
+         {
+             "completion_length": 263.1178680419922,
+             "epoch": 0.3942409894675814,
+             "grad_norm": 4.699113934464697,
+             "kl": 1.31427001953125,
+             "learning_rate": 1.5170708239467143e-05,
+             "loss": 0.0526,
+             "reward": 1.0598214752972126,
+             "reward_std": 0.2613769697025418,
+             "rewards/accuracy_reward": 0.17321429392322898,
+             "rewards/format_reward": 0.8866071790456772,
+             "step": 255
+         },
+         {
+             "completion_length": 301.30447845458986,
+             "epoch": 0.4019712049473379,
+             "grad_norm": 25.580783149116392,
+             "kl": 4.94813232421875,
+             "learning_rate": 1.4937433439453465e-05,
+             "loss": 0.1981,
+             "reward": 1.0321429051458835,
+             "reward_std": 0.3080965233966708,
+             "rewards/accuracy_reward": 0.18750001015141607,
+             "rewards/format_reward": 0.8446428962051868,
+             "step": 260
+         },
+         {
+             "completion_length": 353.21697826385497,
+             "epoch": 0.4097014204270944,
+             "grad_norm": 21.725814777895923,
+             "kl": 3.99072265625,
+             "learning_rate": 1.4700549843801359e-05,
+             "loss": 0.1599,
+             "reward": 0.9294643267989159,
+             "reward_std": 0.3447145516052842,
+             "rewards/accuracy_reward": 0.1553571513853967,
+             "rewards/format_reward": 0.7741071842610836,
+             "step": 265
+         },
+         {
+             "completion_length": 289.2535852432251,
+             "epoch": 0.4174316359068509,
+             "grad_norm": 1.0129293730640343,
+             "kl": 1.88997802734375,
+             "learning_rate": 1.4460230591956097e-05,
+             "loss": 0.0756,
+             "reward": 1.027678620070219,
+             "reward_std": 0.2714784935116768,
+             "rewards/accuracy_reward": 0.1714285804890096,
+             "rewards/format_reward": 0.8562500402331352,
+             "step": 270
+         },
+         {
+             "completion_length": 251.7080472946167,
+             "epoch": 0.4251618513866074,
+             "grad_norm": 13.1390097339807,
+             "kl": 1.76666259765625,
+             "learning_rate": 1.421665133450184e-05,
+             "loss": 0.0707,
+             "reward": 1.1142857648432254,
+             "reward_std": 0.22223355881869794,
+             "rewards/accuracy_reward": 0.19017858151346445,
+             "rewards/format_reward": 0.9241071723401546,
+             "step": 275
+         },
+         {
+             "completion_length": 222.12590236663817,
+             "epoch": 0.4328920668663639,
+             "grad_norm": 0.28425211793760446,
+             "kl": 0.20849609375,
+             "learning_rate": 1.3969990104777712e-05,
+             "loss": 0.0083,
+             "reward": 1.1330357640981674,
+             "reward_std": 0.18561552856117486,
+             "rewards/accuracy_reward": 0.17589286677539348,
+             "rewards/format_reward": 0.9571428775787354,
+             "step": 280
+         },
+         {
+             "completion_length": 292.66430015563964,
+             "epoch": 0.4406222823461204,
+             "grad_norm": 0.2655799067132994,
+             "kl": 0.1830810546875,
+             "learning_rate": 1.3720427188752306e-05,
+             "loss": 0.0073,
+             "reward": 1.052678619325161,
+             "reward_std": 0.24875006265938282,
+             "rewards/accuracy_reward": 0.17232143683359027,
+             "rewards/format_reward": 0.8803571783006191,
+             "step": 285
+         },
+         {
+             "completion_length": 347.53840770721433,
+             "epoch": 0.4483524978258769,
+             "grad_norm": 0.2414520404968147,
+             "kl": 0.1782958984375,
+             "learning_rate": 1.3468144993251735e-05,
+             "loss": 0.0071,
+             "reward": 1.0098214797675609,
+             "reward_std": 0.28158001936972143,
+             "rewards/accuracy_reward": 0.1848214385099709,
+             "rewards/format_reward": 0.8250000409781932,
+             "step": 290
+         },
+         {
+             "completion_length": 290.5705478668213,
+             "epoch": 0.4560827133056334,
+             "grad_norm": 0.21789909324241594,
+             "kl": 0.162786865234375,
+             "learning_rate": 1.3213327912637563e-05,
799
+ "loss": 0.0065,
800
+ "reward": 1.0812500432133674,
801
+ "reward_std": 0.28410540260374545,
802
+ "rewards/accuracy_reward": 0.19107143776491284,
803
+ "rewards/format_reward": 0.8901786096394062,
804
+ "step": 295
805
+ },
806
+ {
807
+ "completion_length": 231.57590255737304,
808
+ "epoch": 0.4638129287853899,
809
+ "grad_norm": 0.24661177068993628,
810
+ "kl": 0.158062744140625,
811
+ "learning_rate": 1.295616219403197e-05,
812
+ "loss": 0.0063,
813
+ "reward": 1.0991071954369545,
814
+ "reward_std": 0.19319167286157607,
815
+ "rewards/accuracy_reward": 0.1589285796508193,
816
+ "rewards/format_reward": 0.9401785969734192,
817
+ "step": 300
818
+ },
819
+ {
820
+ "epoch": 0.4638129287853899,
821
+ "eval_completion_length": 219.2982234954834,
822
+ "eval_kl": 0.13836669921875,
823
+ "eval_loss": 0.005582114681601524,
824
+ "eval_reward": 1.1339286118745804,
825
+ "eval_reward_std": 0.2020305097103119,
826
+ "eval_rewards/accuracy_reward": 0.17410715483129025,
827
+ "eval_rewards/format_reward": 0.9598214477300644,
828
+ "eval_runtime": 36.5449,
829
+ "eval_samples_per_second": 2.709,
830
+ "eval_steps_per_second": 0.109,
831
+ "step": 300
832
+ },
833
+ {
834
+ "completion_length": 260.4241186141968,
835
+ "epoch": 0.4715431442651464,
836
+ "grad_norm": 0.24760948713769942,
837
+ "kl": 0.17685546875,
838
+ "learning_rate": 1.2696835801188816e-05,
839
+ "loss": 0.0071,
840
+ "reward": 1.0669643342494965,
841
+ "reward_std": 0.2209708673879504,
842
+ "rewards/accuracy_reward": 0.15625000838190317,
843
+ "rewards/format_reward": 0.9107143193483352,
844
+ "step": 305
845
+ },
846
+ {
847
+ "completion_length": 277.2196553230286,
848
+ "epoch": 0.47927335974490287,
849
+ "grad_norm": 0.3195688766813074,
850
+ "kl": 0.1870849609375,
851
+ "learning_rate": 1.2435538277109919e-05,
852
+ "loss": 0.0075,
853
+ "reward": 1.0392857626080514,
854
+ "reward_std": 0.26011427883058785,
855
+ "rewards/accuracy_reward": 0.15089286481961608,
856
+ "rewards/format_reward": 0.8883928909897805,
857
+ "step": 310
858
+ },
859
+ {
860
+ "completion_length": 238.69644145965577,
861
+ "epoch": 0.48700357522465937,
862
+ "grad_norm": 0.23930482317882099,
863
+ "kl": 0.22203369140625,
864
+ "learning_rate": 1.2172460605507126e-05,
865
+ "loss": 0.0089,
866
+ "reward": 1.0633928991854191,
867
+ "reward_std": 0.20834395978599787,
868
+ "rewards/accuracy_reward": 0.14732143683359028,
869
+ "rewards/format_reward": 0.9160714574158192,
870
+ "step": 315
871
+ },
872
+ {
873
+ "completion_length": 206.31072359085084,
874
+ "epoch": 0.49473379070441587,
875
+ "grad_norm": 0.6156205700649886,
876
+ "kl": 0.269940185546875,
877
+ "learning_rate": 1.19077950712113e-05,
878
+ "loss": 0.0108,
879
+ "reward": 1.1491071924567222,
880
+ "reward_std": 0.19824243448674678,
881
+ "rewards/accuracy_reward": 0.19285715268924833,
882
+ "rewards/format_reward": 0.9562500186264515,
883
+ "step": 320
884
+ },
885
+ {
886
+ "completion_length": 208.71608085632323,
887
+ "epoch": 0.5024640061841724,
888
+ "grad_norm": 1.145359964951925,
889
+ "kl": 0.511163330078125,
890
+ "learning_rate": 1.1641735119630373e-05,
891
+ "loss": 0.0204,
892
+ "reward": 1.1169643372297287,
893
+ "reward_std": 0.1603617152199149,
894
+ "rewards/accuracy_reward": 0.1553571513853967,
895
+ "rewards/format_reward": 0.9616071604192257,
896
+ "step": 325
897
+ },
898
+ {
899
+ "completion_length": 228.73393907546998,
900
+ "epoch": 0.5101942216639289,
901
+ "grad_norm": 1.1998052405572637,
902
+ "kl": 0.50406494140625,
903
+ "learning_rate": 1.137447521535908e-05,
904
+ "loss": 0.0202,
905
+ "reward": 1.116964338719845,
906
+ "reward_std": 0.18309014700353146,
907
+ "rewards/accuracy_reward": 0.17767858058214187,
908
+ "rewards/format_reward": 0.9392857380211354,
909
+ "step": 330
910
+ },
911
+ {
912
+ "completion_length": 269.97054691314696,
913
+ "epoch": 0.5179244371436854,
914
+ "grad_norm": 0.8172230478043011,
915
+ "kl": 0.8053955078125,
916
+ "learning_rate": 1.110621070004378e-05,
917
+ "loss": 0.0322,
918
+ "reward": 1.0839286208152772,
919
+ "reward_std": 0.21213203221559523,
920
+ "rewards/accuracy_reward": 0.16071429383009672,
921
+ "rewards/format_reward": 0.923214315623045,
922
+ "step": 335
923
+ },
924
+ {
925
+ "completion_length": 279.7910852432251,
926
+ "epoch": 0.5256546526234419,
927
+ "grad_norm": 1.6253556390659387,
928
+ "kl": 0.6918701171875,
929
+ "learning_rate": 1.0837137649606241e-05,
930
+ "loss": 0.0277,
931
+ "reward": 1.1000000432133674,
932
+ "reward_std": 0.21465741395950316,
933
+ "rewards/accuracy_reward": 0.16875000894069672,
934
+ "rewards/format_reward": 0.9312500268220901,
935
+ "step": 340
936
+ },
937
+ {
938
+ "completion_length": 276.95269145965574,
939
+ "epoch": 0.5333848681031984,
940
+ "grad_norm": 0.2933386914876608,
941
+ "kl": 0.452520751953125,
942
+ "learning_rate": 1.0567452730930743e-05,
943
+ "loss": 0.0181,
944
+ "reward": 1.098214329779148,
945
+ "reward_std": 0.18940359950065613,
946
+ "rewards/accuracy_reward": 0.15625000828877092,
947
+ "rewards/format_reward": 0.9419643089175225,
948
+ "step": 345
949
+ },
950
+ {
951
+ "completion_length": 299.1473344802856,
952
+ "epoch": 0.5411150835829549,
953
+ "grad_norm": 0.5289323764160652,
954
+ "kl": 0.47991943359375,
955
+ "learning_rate": 1.0297353058119209e-05,
956
+ "loss": 0.0192,
957
+ "reward": 1.0937500461935996,
958
+ "reward_std": 0.16793785840272904,
959
+ "rewards/accuracy_reward": 0.14464286472648383,
960
+ "rewards/format_reward": 0.949107164144516,
961
+ "step": 350
962
+ },
963
+ {
964
+ "completion_length": 311.8223342895508,
965
+ "epoch": 0.5488452990627114,
966
+ "grad_norm": 1.3602546085242322,
967
+ "kl": 0.692169189453125,
968
+ "learning_rate": 1.0027036048419514e-05,
969
+ "loss": 0.0277,
970
+ "reward": 1.0892857573926449,
971
+ "reward_std": 0.21465741619467735,
972
+ "rewards/accuracy_reward": 0.16607143813744188,
973
+ "rewards/format_reward": 0.9232143141329289,
974
+ "step": 355
975
+ },
976
+ {
977
+ "completion_length": 313.5830512046814,
978
+ "epoch": 0.5565755145424679,
979
+ "grad_norm": 0.7626171074709401,
980
+ "kl": 0.92459716796875,
981
+ "learning_rate": 9.756699277932196e-06,
982
+ "loss": 0.037,
983
+ "reward": 1.0500000417232513,
984
+ "reward_std": 0.2197081744670868,
985
+ "rewards/accuracy_reward": 0.151785721629858,
986
+ "rewards/format_reward": 0.8982143223285675,
987
+ "step": 360
988
+ },
989
+ {
990
+ "completion_length": 303.96786937713625,
991
+ "epoch": 0.5643057300222244,
992
+ "grad_norm": 0.5235484185987127,
993
+ "kl": 1.150616455078125,
994
+ "learning_rate": 9.486540337201046e-06,
995
+ "loss": 0.046,
996
+ "reward": 1.0473214767873287,
997
+ "reward_std": 0.2841054029762745,
998
+ "rewards/accuracy_reward": 0.1696428654715419,
999
+ "rewards/format_reward": 0.8776786111295223,
1000
+ "step": 365
1001
+ },
1002
+ {
1003
+ "completion_length": 300.03840684890747,
1004
+ "epoch": 0.5720359455019809,
1005
+ "grad_norm": 3.258547728286983,
1006
+ "kl": 1.26881103515625,
1007
+ "learning_rate": 9.216756686793163e-06,
1008
+ "loss": 0.0508,
1009
+ "reward": 1.0267857626080512,
1010
+ "reward_std": 0.29294423535466196,
1011
+ "rewards/accuracy_reward": 0.16964286603033543,
1012
+ "rewards/format_reward": 0.8571428984403611,
1013
+ "step": 370
1014
+ },
1015
+ {
1016
+ "completion_length": 273.17501125335696,
1017
+ "epoch": 0.5797661609817374,
1018
+ "grad_norm": 0.5185689721492496,
1019
+ "kl": 1.03182373046875,
1020
+ "learning_rate": 8.94754551297402e-06,
1021
+ "loss": 0.0413,
1022
+ "reward": 1.1008929088711739,
1023
+ "reward_std": 0.281580020673573,
1024
+ "rewards/accuracy_reward": 0.20982144065201283,
1025
+ "rewards/format_reward": 0.8910714685916901,
1026
+ "step": 375
1027
+ },
1028
+ {
1029
+ "completion_length": 277.25179929733275,
1030
+ "epoch": 0.5874963764614939,
1031
+ "grad_norm": 1.6200792591726878,
1032
+ "kl": 1.022216796875,
1033
+ "learning_rate": 8.67910358358298e-06,
1034
+ "loss": 0.0409,
1035
+ "reward": 1.07053577080369,
1036
+ "reward_std": 0.2462246786803007,
1037
+ "rewards/accuracy_reward": 0.17767858020961286,
1038
+ "rewards/format_reward": 0.8928571790456772,
1039
+ "step": 380
1040
+ },
1041
+ {
1042
+ "completion_length": 242.19554595947267,
1043
+ "epoch": 0.5952265919412504,
1044
+ "grad_norm": 0.40227896921926315,
1045
+ "kl": 0.4446044921875,
1046
+ "learning_rate": 8.411627104214675e-06,
1047
+ "loss": 0.0178,
1048
+ "reward": 1.1410714834928513,
1049
+ "reward_std": 0.20455588828772306,
1050
+ "rewards/accuracy_reward": 0.19107143832370638,
1051
+ "rewards/format_reward": 0.9500000223517417,
1052
+ "step": 385
1053
+ },
1054
+ {
1055
+ "completion_length": 235.43483171463012,
1056
+ "epoch": 0.6029568074210069,
1057
+ "grad_norm": 0.8419867144446279,
1058
+ "kl": 0.502435302734375,
1059
+ "learning_rate": 8.145311574811325e-06,
1060
+ "loss": 0.0201,
1061
+ "reward": 1.1321429118514061,
1062
+ "reward_std": 0.19950512573122978,
1063
+ "rewards/accuracy_reward": 0.1866071516647935,
1064
+ "rewards/format_reward": 0.9455357372760773,
1065
+ "step": 390
1066
+ },
1067
+ {
1068
+ "completion_length": 221.21072397232055,
1069
+ "epoch": 0.6106870229007634,
1070
+ "grad_norm": 0.31467383074598076,
1071
+ "kl": 0.426025390625,
1072
+ "learning_rate": 7.880351646770824e-06,
1073
+ "loss": 0.017,
1074
+ "reward": 1.1294643431901932,
1075
+ "reward_std": 0.195717054232955,
1076
+ "rewards/accuracy_reward": 0.17321429420262574,
1077
+ "rewards/format_reward": 0.9562500178813934,
1078
+ "step": 395
1079
+ },
1080
+ {
1081
+ "completion_length": 241.07501096725463,
1082
+ "epoch": 0.6184172383805199,
1083
+ "grad_norm": 0.41760270752871836,
1084
+ "kl": 0.618927001953125,
1085
+ "learning_rate": 7.616940980675004e-06,
1086
+ "loss": 0.0248,
1087
+ "reward": 1.0866071924567222,
1088
+ "reward_std": 0.2058185778558254,
1089
+ "rewards/accuracy_reward": 0.15267857955768704,
1090
+ "rewards/format_reward": 0.9339286006987095,
1091
+ "step": 400
1092
+ },
1093
+ {
1094
+ "epoch": 0.6184172383805199,
1095
+ "eval_completion_length": 254.63423538208008,
1096
+ "eval_kl": 0.627685546875,
1097
+ "eval_loss": 0.026014825329184532,
1098
+ "eval_reward": 1.1428571939468384,
1099
+ "eval_reward_std": 0.21465741470456123,
1100
+ "eval_rewards/accuracy_reward": 0.20535715389996767,
1101
+ "eval_rewards/format_reward": 0.9375000298023224,
1102
+ "eval_runtime": 44.4534,
1103
+ "eval_samples_per_second": 2.227,
1104
+ "eval_steps_per_second": 0.09,
1105
+ "step": 400
1106
+ },
1107
+ {
1108
+ "completion_length": 248.6437618255615,
1109
+ "epoch": 0.6261474538602764,
1110
+ "grad_norm": 1.2063002170462727,
1111
+ "kl": 0.670556640625,
1112
+ "learning_rate": 7.355272104742132e-06,
1113
+ "loss": 0.0268,
1114
+ "reward": 1.1375000402331352,
1115
+ "reward_std": 0.21718279421329498,
1116
+ "rewards/accuracy_reward": 0.19732143972069024,
1117
+ "rewards/format_reward": 0.9401785954833031,
1118
+ "step": 405
1119
+ },
1120
+ {
1121
+ "completion_length": 265.54465503692626,
1122
+ "epoch": 0.6338776693400329,
1123
+ "grad_norm": 0.8372970894830984,
1124
+ "kl": 0.582720947265625,
1125
+ "learning_rate": 7.095536274107046e-06,
1126
+ "loss": 0.0233,
1127
+ "reward": 1.1276786223053932,
1128
+ "reward_std": 0.22349624708294868,
1129
+ "rewards/accuracy_reward": 0.2008928671479225,
1130
+ "rewards/format_reward": 0.9267857432365417,
1131
+ "step": 410
1132
+ },
1133
+ {
1134
+ "completion_length": 330.5723365783691,
1135
+ "epoch": 0.6416078848197894,
1136
+ "grad_norm": 1.0585292028421611,
1137
+ "kl": 12.741766357421875,
1138
+ "learning_rate": 6.837923331031761e-06,
1139
+ "loss": 0.5087,
1140
+ "reward": 1.0241071917116642,
1141
+ "reward_std": 0.2512754438444972,
1142
+ "rewards/accuracy_reward": 0.16250000838190318,
1143
+ "rewards/format_reward": 0.8616071842610836,
1144
+ "step": 415
1145
+ },
1146
+ {
1147
+ "completion_length": 328.27858657836913,
1148
+ "epoch": 0.6493381002995459,
1149
+ "grad_norm": 0.49066051087453305,
1150
+ "kl": 1.210614013671875,
1151
+ "learning_rate": 6.58262156614881e-06,
1152
+ "loss": 0.0485,
1153
+ "reward": 1.0526786148548126,
1154
+ "reward_std": 0.2916815456002951,
1155
+ "rewards/accuracy_reward": 0.19642858104780317,
1156
+ "rewards/format_reward": 0.8562500372529029,
1157
+ "step": 420
1158
+ },
1159
+ {
1160
+ "completion_length": 308.60983619689944,
1161
+ "epoch": 0.6570683157793024,
1162
+ "grad_norm": 0.9219993337746698,
1163
+ "kl": 1.082733154296875,
1164
+ "learning_rate": 6.3298175808386284e-06,
1165
+ "loss": 0.0433,
1166
+ "reward": 1.0651786200702191,
1167
+ "reward_std": 0.2765292562544346,
1168
+ "rewards/accuracy_reward": 0.18928572442382574,
1169
+ "rewards/format_reward": 0.8758928962051868,
1170
+ "step": 425
1171
+ },
1172
+ {
1173
+ "completion_length": 256.632155418396,
1174
+ "epoch": 0.6647985312590589,
1175
+ "grad_norm": 0.4539816272276013,
1176
+ "kl": 0.761395263671875,
1177
+ "learning_rate": 6.079696150841634e-06,
1178
+ "loss": 0.0305,
1179
+ "reward": 1.1026786148548127,
1180
+ "reward_std": 0.22854701150208712,
1181
+ "rewards/accuracy_reward": 0.17857143841683865,
1182
+ "rewards/format_reward": 0.9241071738302707,
1183
+ "step": 430
1184
+ },
1185
+ {
1186
+ "completion_length": 235.15983219146727,
1187
+ "epoch": 0.6725287467388154,
1188
+ "grad_norm": 0.2974800717301805,
1189
+ "kl": 0.74510498046875,
1190
+ "learning_rate": 5.832440091204698e-06,
1191
+ "loss": 0.0298,
1192
+ "reward": 1.1125000521540642,
1193
+ "reward_std": 0.19445436242967845,
1194
+ "rewards/accuracy_reward": 0.17500000884756445,
1195
+ "rewards/format_reward": 0.9375000268220901,
1196
+ "step": 435
1197
+ },
1198
+ {
1199
+ "completion_length": 245.59376096725464,
1200
+ "epoch": 0.6802589622185718,
1201
+ "grad_norm": 1.0471525533631525,
1202
+ "kl": 0.828253173828125,
1203
+ "learning_rate": 5.588230122660672e-06,
1204
+ "loss": 0.0331,
1205
+ "reward": 1.1330357663333417,
1206
+ "reward_std": 0.23107239231467247,
1207
+ "rewards/accuracy_reward": 0.21160715389996768,
1208
+ "rewards/format_reward": 0.9214286014437676,
1209
+ "step": 440
1210
+ },
1211
+ {
1212
+ "completion_length": 260.4134042739868,
1213
+ "epoch": 0.6879891776983283,
1214
+ "grad_norm": 1.0462583292018497,
1215
+ "kl": 1.00888671875,
1216
+ "learning_rate": 5.347244739538677e-06,
1217
+ "loss": 0.0404,
1218
+ "reward": 1.1080357603728772,
1219
+ "reward_std": 0.24622467998415232,
1220
+ "rewards/accuracy_reward": 0.2017857247032225,
1221
+ "rewards/format_reward": 0.9062500357627868,
1222
+ "step": 445
1223
+ },
1224
+ {
1225
+ "completion_length": 260.433941078186,
1226
+ "epoch": 0.6957193931780848,
1227
+ "grad_norm": 0.6455052674276692,
1228
+ "kl": 0.912286376953125,
1229
+ "learning_rate": 5.109660079301668e-06,
1230
+ "loss": 0.0365,
1231
+ "reward": 1.1223214827477932,
1232
+ "reward_std": 0.2335977738723159,
1233
+ "rewards/accuracy_reward": 0.20089286817237734,
1234
+ "rewards/format_reward": 0.9214286021888256,
1235
+ "step": 450
1236
+ },
1237
+ {
1238
+ "completion_length": 256.63661804199216,
1239
+ "epoch": 0.7034496086578413,
1240
+ "grad_norm": 0.6457302051700157,
1241
+ "kl": 0.8274169921875,
1242
+ "learning_rate": 4.875649793806655e-06,
1243
+ "loss": 0.0331,
1244
+ "reward": 1.1169643312692643,
1245
+ "reward_std": 0.2260216299444437,
1246
+ "rewards/accuracy_reward": 0.19107143823057413,
1247
+ "rewards/format_reward": 0.9258928872644901,
1248
+ "step": 455
1249
+ },
1250
+ {
1251
+ "completion_length": 288.9500138282776,
1252
+ "epoch": 0.7111798241375978,
1253
+ "grad_norm": 0.24130951862078845,
1254
+ "kl": 1.132049560546875,
1255
+ "learning_rate": 4.64538492238166e-06,
1256
+ "loss": 0.0453,
1257
+ "reward": 1.1035714767873288,
1258
+ "reward_std": 0.24748737178742886,
1259
+ "rewards/accuracy_reward": 0.21071429643779993,
1260
+ "rewards/format_reward": 0.8928571783006192,
1261
+ "step": 460
1262
+ },
1263
+ {
1264
+ "completion_length": 274.6125121116638,
1265
+ "epoch": 0.7189100396173543,
1266
+ "grad_norm": 0.6809081363150956,
1267
+ "kl": 0.901507568359375,
1268
+ "learning_rate": 4.4190337668121964e-06,
1269
+ "loss": 0.0361,
1270
+ "reward": 1.1500000551342964,
1271
+ "reward_std": 0.2500127531588078,
1272
+ "rewards/accuracy_reward": 0.22767858253791928,
1273
+ "rewards/format_reward": 0.9223214589059353,
1274
+ "step": 465
1275
+ },
1276
+ {
1277
+ "completion_length": 269.7616189956665,
1278
+ "epoch": 0.7266402550971108,
1279
+ "grad_norm": 0.5074301520662177,
1280
+ "kl": 0.875689697265625,
1281
+ "learning_rate": 4.196761768328599e-06,
1282
+ "loss": 0.0351,
1283
+ "reward": 1.1383929133415223,
1284
+ "reward_std": 0.2209708670154214,
1285
+ "rewards/accuracy_reward": 0.2125000100582838,
1286
+ "rewards/format_reward": 0.9258928872644901,
1287
+ "step": 470
1288
+ },
1289
+ {
1290
+ "completion_length": 265.1625129699707,
1291
+ "epoch": 0.7343704705768673,
1292
+ "grad_norm": 0.6415529481366966,
1293
+ "kl": 0.763165283203125,
1294
+ "learning_rate": 3.978731386684206e-06,
1295
+ "loss": 0.0305,
1296
+ "reward": 1.1428571954369544,
1297
+ "reward_std": 0.25001275185495614,
1298
+ "rewards/accuracy_reward": 0.2160714398138225,
1299
+ "rewards/format_reward": 0.9267857432365417,
1300
+ "step": 475
1301
+ },
1302
+ {
1303
+ "completion_length": 272.5562623977661,
1304
+ "epoch": 0.7421006860566238,
1305
+ "grad_norm": 0.35234126624207923,
1306
+ "kl": 0.82550048828125,
1307
+ "learning_rate": 3.7651019814126656e-06,
1308
+ "loss": 0.033,
1309
+ "reward": 1.1383929066359997,
1310
+ "reward_std": 0.21339472401887177,
1311
+ "rewards/accuracy_reward": 0.21339286882430314,
1312
+ "rewards/format_reward": 0.9250000290572643,
1313
+ "step": 480
1314
+ },
1315
+ {
1316
+ "completion_length": 272.2678701400757,
1317
+ "epoch": 0.7498309015363803,
1318
+ "grad_norm": 0.9100314610541868,
1319
+ "kl": 0.79708251953125,
1320
+ "learning_rate": 3.5560296953512296e-06,
1321
+ "loss": 0.0319,
1322
+ "reward": 1.1250000461935996,
1323
+ "reward_std": 0.23233508188277482,
1324
+ "rewards/accuracy_reward": 0.1973214376717806,
1325
+ "rewards/format_reward": 0.9276786029338837,
1326
+ "step": 485
1327
+ },
1328
+ {
1329
+ "completion_length": 280.9955499649048,
1330
+ "epoch": 0.7575611170161368,
1331
+ "grad_norm": 0.3556399924481661,
1332
+ "kl": 0.9994873046875,
1333
+ "learning_rate": 3.3516673405151546e-06,
1334
+ "loss": 0.04,
1335
+ "reward": 1.1285714760422707,
1336
+ "reward_std": 0.2247589396312833,
1337
+ "rewards/accuracy_reward": 0.2062500107102096,
1338
+ "rewards/format_reward": 0.9223214603960515,
1339
+ "step": 490
1340
+ },
1341
+ {
1342
+ "completion_length": 276.1401895523071,
1343
+ "epoch": 0.7652913324958933,
1344
+ "grad_norm": 0.5581701650895203,
1345
+ "kl": 0.73818359375,
1346
+ "learning_rate": 3.1521642864065905e-06,
1347
+ "loss": 0.0295,
1348
+ "reward": 1.150892909616232,
1349
+ "reward_std": 0.22602162901312112,
1350
+ "rewards/accuracy_reward": 0.21785715483129026,
1351
+ "rewards/format_reward": 0.9330357424914837,
1352
+ "step": 495
1353
+ },
1354
+ {
1355
+ "completion_length": 305.8750144004822,
1356
+ "epoch": 0.7730215479756498,
1357
+ "grad_norm": 0.2985575344762852,
1358
+ "kl": 1.02808837890625,
1359
+ "learning_rate": 2.957666350839663e-06,
1360
+ "loss": 0.0411,
1361
+ "reward": 1.145535769313574,
1362
+ "reward_std": 0.23612315505743026,
1363
+ "rewards/accuracy_reward": 0.22946429708972574,
1364
+ "rewards/format_reward": 0.9160714574158192,
1365
+ "step": 500
1366
+ },
1367
+ {
1368
+ "epoch": 0.7730215479756498,
1369
+ "eval_completion_length": 257.7681636810303,
1370
+ "eval_kl": 0.739013671875,
1371
+ "eval_loss": 0.030957935377955437,
1372
+ "eval_reward": 1.191964328289032,
1373
+ "eval_reward_std": 0.27147849928587675,
1374
+ "eval_rewards/accuracy_reward": 0.263392873108387,
1375
+ "eval_rewards/format_reward": 0.928571455180645,
1376
+ "eval_runtime": 39.965,
1377
+ "eval_samples_per_second": 2.477,
1378
+ "eval_steps_per_second": 0.1,
1379
+ "step": 500
1380
+ },
1381
+ {
1382
+ "completion_length": 284.98126316070557,
1383
+ "epoch": 0.7807517634554063,
1384
+ "grad_norm": 0.2775335619661649,
1385
+ "kl": 0.804058837890625,
1386
+ "learning_rate": 2.768315693361474e-06,
1387
+ "loss": 0.0322,
1388
+ "reward": 1.1401786215603351,
1389
+ "reward_std": 0.2335977740585804,
1390
+ "rewards/accuracy_reward": 0.22589286724105478,
1391
+ "rewards/format_reward": 0.9142857484519482,
1392
+ "step": 505
1393
+ },
1394
+ {
1395
+ "completion_length": 302.475905418396,
1396
+ "epoch": 0.7884819789351628,
1397
+ "grad_norm": 0.4783830317374093,
1398
+ "kl": 0.930975341796875,
1399
+ "learning_rate": 2.5842507113469307e-06,
1400
+ "loss": 0.0373,
1401
+ "reward": 1.1107143394649028,
1402
+ "reward_std": 0.21970817670226098,
1403
+ "rewards/accuracy_reward": 0.1910714373923838,
1404
+ "rewards/format_reward": 0.9196428842842579,
1405
+ "step": 510
1406
+ },
1407
+ {
1408
+ "completion_length": 326.8464429855347,
1409
+ "epoch": 0.7962121944149193,
1410
+ "grad_norm": 0.4587112625836826,
1411
+ "kl": 1.243695068359375,
1412
+ "learning_rate": 2.405605938843416e-06,
1413
+ "loss": 0.0498,
1414
+ "reward": 1.0883929036557674,
1415
+ "reward_std": 0.2916815454140306,
1416
+ "rewards/accuracy_reward": 0.2071428682655096,
1417
+ "rewards/format_reward": 0.881250037252903,
1418
+ "step": 515
1419
+ },
1420
+ {
1421
+ "completion_length": 338.72858772277834,
1422
+ "epoch": 0.8039424098946758,
1423
+ "grad_norm": 0.5814971897829918,
1424
+ "kl": 1.364935302734375,
1425
+ "learning_rate": 2.2325119482391466e-06,
1426
+ "loss": 0.0546,
1427
+ "reward": 1.0830357670783997,
1428
+ "reward_std": 0.3118845963850617,
1429
+ "rewards/accuracy_reward": 0.2160714386962354,
1430
+ "rewards/format_reward": 0.866964328289032,
1431
+ "step": 520
1432
+ },
1433
+ {
1434
+ "completion_length": 354.6250160217285,
1435
+ "epoch": 0.8116726253744323,
1436
+ "grad_norm": 0.5395627446463483,
1437
+ "kl": 1.560418701171875,
1438
+ "learning_rate": 2.065095254827133e-06,
1439
+ "loss": 0.0624,
1440
+ "reward": 1.0625000484287739,
1441
+ "reward_std": 0.2954696161672473,
1442
+ "rewards/accuracy_reward": 0.21517858309671284,
1443
+ "rewards/format_reward": 0.847321467846632,
1444
+ "step": 525
1445
+ },
1446
+ {
1447
+ "completion_length": 354.5348379135132,
1448
+ "epoch": 0.8194028408541888,
1449
+ "grad_norm": 0.5903315470866531,
1450
+ "kl": 1.75145263671875,
1451
+ "learning_rate": 1.9034782243345074e-06,
1452
+ "loss": 0.0701,
1453
+ "reward": 1.0437500521540641,
1454
+ "reward_std": 0.3118845956400037,
1455
+ "rewards/accuracy_reward": 0.20714286724105477,
1456
+ "rewards/format_reward": 0.8366071812808513,
1457
+ "step": 530
1458
+ },
1459
+ {
1460
+ "completion_length": 355.27947940826414,
1461
+ "epoch": 0.8271330563339453,
1462
+ "grad_norm": 1.2208377616122188,
1463
+ "kl": 1.34444580078125,
1464
+ "learning_rate": 1.7477789834847835e-06,
1465
+ "loss": 0.0538,
1466
+ "reward": 1.0562500461935997,
1467
+ "reward_std": 0.3043084528297186,
1468
+ "rewards/accuracy_reward": 0.20982143869623543,
1469
+ "rewards/format_reward": 0.8464286126196384,
1470
+ "step": 535
1471
+ },
1472
+ {
1473
+ "completion_length": 332.879479598999,
1474
+ "epoch": 0.8348632718137018,
1475
+ "grad_norm": 0.5240182175615019,
1476
+ "kl": 1.40631103515625,
1477
+ "learning_rate": 1.5981113336584041e-06,
1478
+ "loss": 0.0563,
1479
+ "reward": 1.0500000439584256,
1480
+ "reward_std": 0.2626396602019668,
1481
+ "rewards/accuracy_reward": 0.1973214391618967,
1482
+ "rewards/format_reward": 0.8526786088943481,
1483
+ "step": 540
1484
+ },
1485
+ {
1486
+ "completion_length": 323.62679920196535,
1487
+ "epoch": 0.8425934872934583,
1488
+ "grad_norm": 0.6433085422004733,
1489
+ "kl": 1.07110595703125,
1490
+ "learning_rate": 1.4545846677147446e-06,
1491
+ "loss": 0.0429,
1492
+ "reward": 1.0964286200702191,
1493
+ "reward_std": 0.23991122860461472,
1494
+ "rewards/accuracy_reward": 0.21250001098960639,
1495
+ "rewards/format_reward": 0.8839286118745804,
1496
+ "step": 545
1497
+ },
1498
+ {
1499
+ "completion_length": 286.84554920196535,
1500
+ "epoch": 0.8503237027732148,
1501
+ "grad_norm": 0.4618904983113792,
1502
+ "kl": 0.94029541015625,
1503
+ "learning_rate": 1.3173038900362977e-06,
1504
+ "loss": 0.0376,
1505
+ "reward": 1.1437500521540642,
1506
+ "reward_std": 0.26390235032886267,
1507
+ "rewards/accuracy_reward": 0.22857144000008703,
1508
+ "rewards/format_reward": 0.9151786006987095,
1509
+ "step": 550
1510
+ },
1511
+ {
1512
+ "completion_length": 281.5741184234619,
1513
+ "epoch": 0.8580539182529713,
1514
+ "grad_norm": 0.4684944225912412,
1515
+ "kl": 1.12393798828125,
1516
+ "learning_rate": 1.1863693398535115e-06,
1517
+ "loss": 0.045,
1518
+ "reward": 1.1348214790225029,
1519
+ "reward_std": 0.2588515877723694,
1520
+ "rewards/accuracy_reward": 0.22589286798611283,
1521
+ "rewards/format_reward": 0.908928605914116,
1522
+ "step": 555
1523
+ },
1524
+ {
1525
+ "completion_length": 291.7946542739868,
1526
+ "epoch": 0.8657841337327278,
1527
+ "grad_norm": 0.3860335559805723,
1528
+ "kl": 1.16566162109375,
1529
+ "learning_rate": 1.0618767179063416e-06,
1530
+ "loss": 0.0466,
1531
+ "reward": 1.13125004991889,
1532
+ "reward_std": 0.253800824098289,
1533
+ "rewards/accuracy_reward": 0.23928572591394187,
1534
+ "rewards/format_reward": 0.8919643238186836,
1535
+ "step": 560
1536
+ },
1537
+ {
1538
+ "completion_length": 288.9223350524902,
1539
+ "epoch": 0.8735143492124843,
1540
+ "grad_norm": 0.5947417112761748,
1541
+ "kl": 1.116168212890625,
1542
+ "learning_rate": 9.439170164960765e-07,
1543
+ "loss": 0.0446,
1544
+ "reward": 1.1133929133415221,
1545
+ "reward_std": 0.26137696839869023,
1546
+ "rewards/accuracy_reward": 0.22410715520381927,
1547
+ "rewards/format_reward": 0.8892857529222965,
1548
+ "step": 565
1549
+ },
1550
+ {
1551
+ "completion_length": 312.45358619689944,
1552
+ "epoch": 0.8812445646922408,
1553
+ "grad_norm": 0.5248985437711838,
1554
+ "kl": 1.397430419921875,
1555
+ "learning_rate": 8.325764529785851e-07,
1556
+ "loss": 0.0559,
1557
+ "reward": 1.0696429111063481,
1558
+ "reward_std": 0.2676904214546084,
1559
+ "rewards/accuracy_reward": 0.20178572395816446,
1560
+ "rewards/format_reward": 0.8678571797907353,
1561
+ "step": 570
1562
+ },
1563
+ {
1564
+ "completion_length": 295.79108448028563,
1565
+ "epoch": 0.8889747801719973,
1566
+ "grad_norm": 0.450179012206087,
1567
+ "kl": 1.271270751953125,
1568
+ "learning_rate": 7.279364067476247e-07,
1569
+ "loss": 0.0509,
1570
+ "reward": 1.0758929073810577,
1571
+ "reward_std": 0.2310723926872015,
1572
+ "rewards/accuracy_reward": 0.19910715306177734,
1573
+ "rewards/format_reward": 0.8767857521772384,
1574
+ "step": 575
1575
+ },
1576
+ {
1577
+ "completion_length": 286.98304786682127,
1578
+ "epoch": 0.8967049956517538,
1579
+ "grad_norm": 0.283165116362945,
1580
+ "kl": 1.03193359375,
1581
+ "learning_rate": 6.300733597542086e-07,
1582
+ "loss": 0.0413,
1583
+ "reward": 1.120535772293806,
1584
+ "reward_std": 0.24117391742765903,
1585
+ "rewards/accuracy_reward": 0.21964286752045153,
1586
+ "rewards/format_reward": 0.9008928939700127,
1587
+ "step": 580
1588
+ },
1589
+ {
1590
+ "completion_length": 301.77501583099365,
1591
+ "epoch": 0.9044352111315103,
1592
+ "grad_norm": 0.46974734005334373,
1593
+ "kl": 1.178570556640625,
1594
+ "learning_rate": 5.390588406055497e-07,
1595
+ "loss": 0.0471,
1596
+ "reward": 1.0892857603728772,
1597
+ "reward_std": 0.25253813378512857,
1598
+ "rewards/accuracy_reward": 0.20178572423756122,
1599
+ "rewards/format_reward": 0.8875000402331352,
1600
+ "step": 585
1601
+ },
1602
+ {
1603
+ "completion_length": 260.1544776916504,
1604
+ "epoch": 0.9121654266112668,
1605
+ "grad_norm": 0.3415945982894548,
1606
+ "kl": 0.71856689453125,
1607
+ "learning_rate": 4.549593722844492e-07,
1608
+ "loss": 0.0287,
1609
+ "reward": 1.1750000528991222,
1610
+ "reward_std": 0.21970817632973194,
1611
+ "rewards/accuracy_reward": 0.2366071536205709,
1612
+ "rewards/format_reward": 0.9383928820490837,
1613
+ "step": 590
+ },
+ {
+ "completion_length": 265.1803674697876,
+ "epoch": 0.9198956420910233,
+ "grad_norm": 0.2708029106865859,
+ "kl": 0.8180908203125,
+ "learning_rate": 3.77836423527278e-07,
+ "loss": 0.0328,
+ "reward": 1.188392923027277,
+ "reward_std": 0.2386485354974866,
+ "rewards/accuracy_reward": 0.255357154738158,
+ "rewards/format_reward": 0.9330357387661934,
+ "step": 595
+ },
+ {
+ "completion_length": 270.0437623023987,
+ "epoch": 0.9276258575707798,
+ "grad_norm": 0.46374471812077694,
+ "kl": 0.837054443359375,
+ "learning_rate": 3.0774636389618196e-07,
+ "loss": 0.0335,
+ "reward": 1.1705357685685158,
+ "reward_std": 0.24875006210058928,
+ "rewards/accuracy_reward": 0.23571429755538703,
+ "rewards/format_reward": 0.9348214536905288,
+ "step": 600
+ },
+ {
+ "epoch": 0.9276258575707798,
+ "eval_completion_length": 265.839298248291,
+ "eval_kl": 0.89306640625,
+ "eval_loss": 0.037714019417762756,
+ "eval_reward": 1.1875000447034836,
+ "eval_reward_std": 0.2904188595712185,
+ "eval_rewards/accuracy_reward": 0.2678571529686451,
+ "eval_rewards/format_reward": 0.9196428880095482,
+ "eval_runtime": 44.0608,
+ "eval_samples_per_second": 2.247,
+ "eval_steps_per_second": 0.091,
+ "step": 600
+ },
+ {
+ "completion_length": 283.5410858154297,
+ "epoch": 0.9353560730505363,
+ "grad_norm": 0.42477771152185795,
+ "kl": 1.058807373046875,
+ "learning_rate": 2.44740422578269e-07,
+ "loss": 0.0423,
+ "reward": 1.1196429051458836,
+ "reward_std": 0.26516504045575856,
+ "rewards/accuracy_reward": 0.21250001108273864,
+ "rewards/format_reward": 0.9071428932249546,
+ "step": 605
+ },
+ {
+ "completion_length": 263.7607263565063,
+ "epoch": 0.9430862885302927,
+ "grad_norm": 0.4567363340156383,
+ "kl": 0.876043701171875,
+ "learning_rate": 1.8886465094192895e-07,
+ "loss": 0.035,
+ "reward": 1.1633929111063481,
+ "reward_std": 0.23359777368605136,
+ "rewards/accuracy_reward": 0.2330357262864709,
+ "rewards/format_reward": 0.9303571738302707,
+ "step": 610
+ },
+ {
+ "completion_length": 270.82501182556155,
+ "epoch": 0.9508165040100492,
+ "grad_norm": 0.28225043605895783,
+ "kl": 0.95989990234375,
+ "learning_rate": 1.401598888776523e-07,
+ "loss": 0.0384,
+ "reward": 1.15535718947649,
+ "reward_std": 0.24748737160116435,
+ "rewards/accuracy_reward": 0.2357142989523709,
+ "rewards/format_reward": 0.9196428880095482,
+ "step": 615
+ },
+ {
+ "completion_length": 283.83126201629636,
+ "epoch": 0.9585467194898057,
+ "grad_norm": 0.32539772814205725,
+ "kl": 1.015838623046875,
+ "learning_rate": 9.866173494794462e-08,
+ "loss": 0.0406,
+ "reward": 1.1410714834928513,
+ "reward_std": 0.2651650408282876,
+ "rewards/accuracy_reward": 0.23214286779984833,
+ "rewards/format_reward": 0.908928606659174,
+ "step": 620
+ },
+ {
+ "completion_length": 272.28661918640137,
+ "epoch": 0.9662769349695622,
+ "grad_norm": 0.40483227452880277,
+ "kl": 0.98214111328125,
+ "learning_rate": 6.440052036815081e-08,
+ "loss": 0.0393,
+ "reward": 1.1419643364846706,
+ "reward_std": 0.23359777443110943,
+ "rewards/accuracy_reward": 0.22410715464502573,
+ "rewards/format_reward": 0.9178571730852128,
+ "step": 625
+ },
+ {
+ "completion_length": 296.1392983436584,
+ "epoch": 0.9740071504493187,
+ "grad_norm": 0.5042793676495506,
+ "kl": 1.05089111328125,
+ "learning_rate": 3.7401286837214224e-08,
+ "loss": 0.042,
+ "reward": 1.1464286252856255,
+ "reward_std": 0.26516504120081663,
+ "rewards/accuracy_reward": 0.2383928676135838,
+ "rewards/format_reward": 0.9080357499420643,
+ "step": 630
+ },
+ {
+ "completion_length": 291.5410856246948,
+ "epoch": 0.9817373659290752,
+ "grad_norm": 0.2778912462099483,
+ "kl": 1.029638671875,
+ "learning_rate": 1.7683768234568745e-08,
+ "loss": 0.0412,
+ "reward": 1.1250000447034836,
+ "reward_std": 0.2575888976454735,
+ "rewards/accuracy_reward": 0.2125000107102096,
+ "rewards/format_reward": 0.9125000350177288,
+ "step": 635
+ },
+ {
+ "completion_length": 278.6035831451416,
+ "epoch": 0.9894675814088317,
+ "grad_norm": 0.4021017187067133,
+ "kl": 1.000689697265625,
+ "learning_rate": 5.262376196544239e-09,
+ "loss": 0.04,
+ "reward": 1.13035718947649,
+ "reward_std": 0.2575888967141509,
+ "rewards/accuracy_reward": 0.21875001173466443,
+ "rewards/format_reward": 0.9116071738302708,
+ "step": 640
+ },
+ {
+ "completion_length": 272.8669765472412,
+ "epoch": 0.9971977968885882,
+ "grad_norm": 0.22818184024650953,
+ "kl": 0.9707275390625,
+ "learning_rate": 1.461895828280824e-10,
+ "loss": 0.0388,
+ "reward": 1.1553571961820126,
+ "reward_std": 0.23486046474426986,
+ "rewards/accuracy_reward": 0.23839286966249346,
+ "rewards/format_reward": 0.9169643193483352,
+ "step": 645
+ },
+ {
+ "completion_length": 259.26786708831787,
+ "epoch": 0.9987438399845395,
+ "kl": 0.823211669921875,
+ "reward": 1.1785714849829674,
+ "reward_std": 0.2399112293496728,
+ "rewards/accuracy_reward": 0.24553572246804833,
+ "rewards/format_reward": 0.9330357387661934,
+ "step": 646,
+ "total_flos": 0.0,
+ "train_loss": 0.13680233840041295,
+ "train_runtime": 51841.8905,
+ "train_samples_per_second": 1.397,
+ "train_steps_per_second": 0.012
+ }
+ ],
+ "logging_steps": 5,
+ "max_steps": 646,
+ "num_input_tokens_seen": 0,
+ "num_train_epochs": 1,
+ "save_steps": 100,
+ "stateful_callbacks": {
+ "TrainerControl": {
+ "args": {
+ "should_epoch_stop": false,
+ "should_evaluate": false,
+ "should_log": false,
+ "should_save": true,
+ "should_training_stop": true
+ },
+ "attributes": {}
+ }
+ },
+ "total_flos": 0.0,
+ "train_batch_size": 2,
+ "trial_name": null,
+ "trial_params": null
+ }
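
The `log_history` entries above can be inspected programmatically. A minimal sketch, with a few entries inlined from the last logged steps for illustration; in practice you would `json.load` the saved `trainer_state.json` from the checkpoint directory:

```python
import json

# Mirror of the structure logged by the Trainer in trainer_state.json
# (values copied from steps 640-646 above).
state = json.loads("""
{
  "log_history": [
    {"step": 640, "reward": 1.13035718947649, "rewards/accuracy_reward": 0.21875001173466443},
    {"step": 645, "reward": 1.1553571961820126, "rewards/accuracy_reward": 0.23839286966249346},
    {"step": 646, "reward": 1.1785714849829674, "rewards/accuracy_reward": 0.24553572246804833}
  ]
}
""")

# Keep only entries that logged a training reward (eval entries use "eval_reward" instead).
rewards = [(e["step"], e["reward"]) for e in state["log_history"] if "reward" in e]
final_step, final_reward = rewards[-1]
print(final_step, round(final_reward, 4))
```

The same filtering works for any of the logged keys (`kl`, `loss`, `completion_length`, …), which is convenient for re-plotting the run without the W&B dashboard.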